Is My Model Using the Right Evidence? Systematic Probes for Examining
Evidence-Based Tabular Reasoning
Vivek Gupta1, Riyaz A. Bhat2, Atreya Ghosal3,
Manish Shrivastava3, Maneesh Singh2, Vivek Srikumar1
1University of Utah, USA, 2Verisk Inc., India, 3IIIT-Hyderabad, India
{vgupta, svivek}@cs.utah.edu, {riyaz.bhat, maneesh.singh}@verisk.com,
{atreyee.ghosal@research, m.shrivastava}@iiit.ac.in
Abstract
Neural models command state-of-the-art performance across NLP tasks, including ones involving ‘‘reasoning’’. Models claiming to reason about the evidence presented to them should attend to the correct parts of the input while avoiding spurious patterns therein, be self-consistent in their predictions across inputs, and be immune to biases derived from their pre-training in a nuanced, context-sensitive fashion. Do the prevalent *BERT-family of models do so? In this paper, we study this question using the problem of reasoning on tabular data. Tabular inputs are especially well-suited for the study: they admit systematic probes targeting the properties listed above. Our experiments demonstrate that a RoBERTa-based model, representative of the current state-of-the-art, fails at reasoning on the following counts: it (a) ignores relevant parts of the evidence, (b) is over-sensitive to annotation artifacts, and (c) relies on the knowledge encoded in the pre-trained language model rather than the evidence presented in its tabular inputs. Finally, through inoculation experiments, we show that fine-tuning the model on perturbed data does not help it overcome the above challenges.
1 Introduction
The problem of understanding tabular or semi-
structured data is a challenge for modern NLP.
Recently, Chen et al. (2020b) and Gupta et al.
(2020) have framed this problem as a natural
language inference question (NLI, Dagan et al.,
2013; Bowman et al., 2015, inter alia) via the
TabFact and the INFOTABS datasets, respectively.
The tabular version of the NLI task seeks to de-
termine whether a tabular premise entails or con-
tradicts a textual hypothesis, or is unrelated to it.
One strategy for such tabular reasoning tasks relies on the successes of contextualized representations (e.g., Devlin et al., 2019; Liu et al., 2019b) for the sentential version of the problem. Tables are flattened into artificial sentences using heuristics to be processed by these models. Surprisingly, even this naïve strategy leads to high predictive accuracy, as shown not only by the introductory papers but also by related lines of recent work (e.g., Eisenschlos et al., 2020; Yin et al., 2020).
In this paper, we ask: Do these seemingly accurate models for tabular inference effectively use and reason about their semi-structured inputs? While ‘‘reasoning’’ can take varied forms, a model that claims to do so should at least ground its outputs on the evidence provided in its inputs. Concretely, we argue that such a model should (a) be self-consistent in its predictions across controlled variants of the input, (b) use the evidence presented to it, and the right parts thereof, and (c) avoid being biased against the given evidence by knowledge encoded in the pre-trained embeddings.
Corresponding to these three properties, we
identify three dimensions on which to evaluate a
tabular NLI system: robustness to annotation arti-
facts, relevant evidence selection, and robustness
to counterfactual changes. We design systematic
probes that exploit the semi-structured nature of
the premises. This allows us to semi-automatically
construct the probes and to unambiguously de-
fine the corresponding expected model response.
These probes either introduce controlled edits to
the premise or the hypothesis, or to both, thereby
also creating counterfactual examples. Experiments reveal that despite seemingly high test set accuracy, a model based on RoBERTa (Liu et al., 2019b), a good representative of BERT derivative models, is far from being reliable. Not only does it ignore relevant evidence from its inputs, it also relies excessively on annotation artifacts, in particular the sentence structure of the hypothesis, and pre-trained knowledge in the embeddings. Finally, we found that attempts to inoculate the model (Liu et al., 2019a) along these dimensions degrade its overall performance.
Transactions of the Association for Computational Linguistics, vol. 10, pp. 659–679, 2022. https://doi.org/10.1162/tacl_a_00482
Action Editor: Katrin Erk. Submission batch: 9/2021; Revision batch: 12/2021; Published 6/2022.
© 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
The rest of the paper is structured as follows.
§2 introduces the Tabular NLI task, while §3
articulates the need for probing evidence-based
tabular reasoning in extant high-performing mod-
els. §4–§6 detail the probes designed for such an
examination and the results thereof while §7 ana-
lyzes the impact of inoculation to aforementioned
challenges through model fine-tuning. §8 presents
the main takeaways and contextualization in the
related art. §9 provides concluding remarks and
indicates future directions of work.¹
2 Preliminaries: Tabular NLI
Tabular natural language inference is a task similar
to standard NLI in that it examines if a natural
language hypothesis can be derived from the given
premise. Unlike standard NLI, where the evidence
is presented in the form of sentences, the premises
in tabular NLI are semi-structured tables that may
contain both text and data.
Dataset Recently, datasets such as TabFact (Chen et al., 2020b) and INFOTABS (Gupta et al., 2020), and also shared tasks such as SemEval 2021 Task 9 (Wang et al., 2021a) and FEVEROUS (Aly et al., 2021), have sparked interest in tabular NLI research. In this study, we use the INFOTABS dataset for our investigations. INFOTABS consists of 23,738 premise-hypothesis pairs, whose premises are based on Wikipedia infoboxes. Unlike TabFact, which only contains ENTAIL and CONTRADICT hypotheses, INFOTABS also includes NEUTRAL ones. Figure 1 shows an example table from the dataset with four hypotheses, which will be our running example.
The dataset contains 2,540 distinct infoboxes
representing a variety of domains. All hypotheses
were written and labeled by Amazon MTurk work-
ers. The tables contain a title and two columns,
as shown in the example. Since each row takes the form of a key-value pair, we will refer to the elements in the left column as the keys and to the elements in the right column as the corresponding values.
¹The dataset and the scripts used for our analysis are available at https://tabprobe.github.io.
Figure 1: An example of a tabular premise from INFOTABS. Hypothesis H1 is entailed by it, H2 contradicts it, and H3 and H4 are neutral (i.e., neither entailed nor contradictory).
In addition to the usual train and development
sets, INFOTABS includes three test sets, α1, α2, and
α3. The α1 set represents a standard test set that is
both topically and lexically similar to the training
data. In the α2 set, hypotheses are designed to be
lexically adversarial, and the α3 tables are drawn
from topics not present in the training set. We will
use all three test sets for our analysis.
Models over Tabular Premises Unlike stan-
dard NLI, which can use off-the-shelf pre-trained
contextualized embeddings, the semi-structured
nature of premises in tabular NLI necessitates a
different modeling approach.
Following Chen et al. (2020b), tabular premises are flattened into token sequences that fit the input interface of such models. While different flattening strategies exist in the literature, we adopt the Table as a Paragraph strategy of Gupta et al. (2020), where each row is converted to a sentence of the form ‘‘The key of title is value’’. This seemingly naïve strategy, with RoBERTa-large embeddings (RoBERTaL henceforth), achieved the highest accuracy in the original work, shown in Table 1.² The table also shows the hypothesis-only baseline (Poliak et al., 2018; Gururangan et al., 2018) and human agreement on the labels.³

Model            dev          α1           α2           α3
Human            79.78        84.04        83.88        79.33
Hypothesis Only  60.51        60.48        48.26        48.89
RoBERTaL         75.55        74.88        65.55        64.94
5xCV             73.59 (2.3)  72.41 (1.4)  63.02 (1.9)  61.82 (1.4)

Table 1: Results of the Table as a Paragraph strategy on INFOTABS subsets with the RoBERTaL model, the hypothesis-only baseline, and majority human agreement. The first three rows are reproduced from Gupta et al. (2020). The last row represents the average performance (with standard deviations in parentheses) using models obtained via five-fold cross validation (5xCV).
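The Table as a Paragraph flattening described above can be illustrated with a minimal sketch. The function name `table_to_paragraph`, the comma-joining of multi-valued cells, and the cell values in the example are assumptions made for illustration; only the sentence template ‘‘The key of title is value’’ comes from the text.

```python
from typing import Dict, List


def table_to_paragraph(title: str, rows: Dict[str, List[str]]) -> str:
    """Flatten a key-value table into one templated sentence per row."""
    sentences = []
    for key, values in rows.items():
        # Assumption: a cell with several values is joined with commas.
        value = ", ".join(values)
        sentences.append(f"The {key} of {title} is {value}.")
    return " ".join(sentences)


# Illustrative rows only; the values are placeholders, not dataset content.
example_rows = {"Genre": ["pop"], "Length": ["46:06"]}
paragraph = table_to_paragraph("Breakfast in America", example_rows)
print(paragraph)
```

The resulting paragraph would then be paired with a hypothesis and passed to a sentence-pair classifier in the usual NLI fashion.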
To study the stability of the models to variations in the training data, we performed 5-fold cross validation (5xCV). An average cross validation accuracy of 73.53% with a standard deviation of 2.73% was observed on the training set, which is close to the performance on the α1 test set (74.88%). In addition, we also evaluated performance on the development and test sets. The penultimate row of Table 1 presents the performance for the model trained on the entire training data, and the last row presents the performance of the 5xCV models. The results demonstrate that model performance is reasonably stable to variations in the training set.
3 Reasoning: An Illusion?
Given the surprisingly high accuracies in Table 1,
especially on the α1 test dataset, can we conclude
that the RoBERTa-based model reasons effec-
tively about the evidence in the tabular input to
make its inference? That is, does it arrive at its answer via a sound logical process that takes into account all available evidence along with common sense knowledge? Merely achieving high accuracy is not sufficient evidence of reasoning: The model may arrive at the right answer for the wrong reasons, leading to improper and inadequate generalization over unseen data. This observation is in line with the recent work pointing out that the high-capacity models we use may be relying on spurious correlations (e.g., Poliak et al., 2018).
²Other flattening strategies have similar performance (Gupta et al., 2020).
³Preliminary experiments on the development set showed that RoBERTaL outperformed other pre-trained embeddings. We found that BERTB, RoBERTaB, BERTL, ALBERTB, and ALBERTL reached development set accuracies of 63.0%, 67.23%, 69.34%, 70.44%, and 70.88%, respectively. While we have not replicated our experiments on these other models due to a prohibitively high computational cost, we expect the conclusions to carry over to these other models as well.
‘‘Reasoning’’ is a multi-faceted phenomenon,
and fully characterizing it is beyond the scope of
this work. However, we can probe for the absence
of evidence-grounded reasoning via model
responses to carefully constructed inputs and their
variants. The guiding premise for this work is:
Any ‘‘evidence-based reasoning’’ system
should demonstrate expected, predict-
able behavior in response to controlled
changes to its inputs.
In other words, ‘‘reasoning failures’’ can be
identified by checking if a model deviates from ex-
pected behavior in response to controlled changes
to inputs. We note that this strategy has been ei-
ther explicitly or implicitly employed in several
lines of recent work (Ribeiro et al., 2020; Gardner
et al., 2020). In this work, we instantiate the above
strategy along three specific dimensions, briefly
introduced here using the running example in Fig-
ure 1. Each dimension is used to define several
concrete probes that subsequent sections detail.
1. Avoiding Annotation Artifacts A model
should not rely on spurious lexical correlations. In
general, it should not be able to infer the label using
only the hypothesis. Lexical differences in closely
related hypotheses should produce predictable
changes in the inferred label. For example, in
the hypothesis H2 of Figure 1, if the token ‘‘end’’
is replaced with ‘‘start’’, the model prediction
should change from CONTRADICT to ENTAIL.
2. Evidence Selection A model should use the
correct evidence in the premise for determining
the hypothesis label. For example, ascertaining
that the hypothesis H1 is entailed requires the
Genre and Length rows of Figure 1. When a
relevant row is removed from a table, a model that
predicts the ENTAIL or the CONTRADICT label should
predict the NEUTRAL label. When an irrelevant row
is removed, it should not change its prediction
from ENTAIL to NEUTRAL or vice versa.
3. Robustness to Counterfactual Changes A
model’s prediction should be grounded in the
provided information even if it contradicts the
real world, that is, counterfactual information.
For example, if the month of the Released date
changed to ‘‘December’’, then the model should
change the label of H2 in Figure 1 to ENTAIL
from CONTRADICT. Since this information about
release date contradicts the real world, the model
cannot rely on its pre-trained knowledge, say,
from Wikipedia. For the model to predict the label
correctly, it needs to reason with the information
in the table as the primary evidence. Although
the importance of pre-trained knowledge cannot
be overlooked, it must not be at the expense of
primary evidence.
Further, there are certain pieces of information in the premise (irrelevant to the hypothesis) that do not impact the outcome, making the outcome invariant to changes in them. For example, deleting irrelevant rows from the premise should not change the model’s predicted label. In contrast, changing the relevant information (‘‘evidence’’) in the premise should vary the outcome in a predictable manner, making the model covariant with these changes. For example, deleting relevant evidence rows should change the model’s predicted label to NEUTRAL.
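The invariant and covariant behaviors under row deletion can be written down as a small rule. The sketch below is only an illustration of the expectations stated in the text; the function names are hypothetical, and whether a row is relevant is assumed to be known from annotation.

```python
ENTAIL, CONTRADICT, NEUTRAL = "ENTAIL", "CONTRADICT", "NEUTRAL"


def expected_after_deletion(original_label: str, row_is_relevant: bool) -> str:
    """Label an evidence-grounded model should predict after one row is deleted."""
    # NEUTRAL predictions should be unaffected by row deletion.
    if original_label == NEUTRAL:
        return NEUTRAL
    # Covariant case: deleting relevant evidence should yield NEUTRAL.
    if row_is_relevant:
        return NEUTRAL
    # Invariant case: deleting an irrelevant row should change nothing.
    return original_label


def is_valid_transition(before: str, after: str, row_is_relevant: bool) -> bool:
    """Check an observed prediction change against the expected behavior."""
    return after == expected_after_deletion(before, row_is_relevant)


print(is_valid_transition(ENTAIL, CONTRADICT, row_is_relevant=True))  # False
```

A probe built this way needs no new hypothesis labels: the expected response follows from the original prediction and the relevance of the deleted row.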
The three dimensions above are not limited to
tabular inference. They can be extended to other
NLP tasks, such as reading comprehension as well
as the standard sentential NLI. However, directly
checking for such properties would require consid-
erable labeled data—a big practical impediment.
Fortunately, in the case of tabular inference, the
(in-/co-)variants associated with these dimensions
allow controlled and semi-automatic edits to the
inputs leading to predictable variation of the ex-
pected output. This insight underlies the design of
probes using which we examine the robustness of
the reasoning employed by a model performing
tabular inference. As we will see in the following
sections, highly effective and precise probes can
be designed without extensive annotation.
4 Probing Annotation Artifacts
Can a model make inferences about a hypothesis without a premise? It is natural to answer in the negative in general. (Of course, certain hypotheses may admit strong priors, e.g., tautologies.) Preliminary experiments by Gupta et al. (2020) on INFOTABS, however, reveal that a model trained just on hypotheses performs surprisingly well on the test data. This phenomenon, an inductive bias entirely predicated on the hypotheses, is called hypothesis bias. Models for other NLI tasks have been similarly shown to exhibit hypothesis bias, whereby the models learn to rely on spurious correlations between patterns in the hypotheses and corresponding labels (Poliak et al., 2018; Gururangan et al., 2018; Geva et al., 2019, and others). For example, negations are observed to be highly correlated with contradictions (Niven and Kao, 2019).
To better characterize a model’s reliance on such artifacts, we perform controlled edits to hypotheses without altering associated premises. Unlike the α2 set, which includes minor changes to function words, we aim to create more sophisticated changes by altering content expressions or noun phrases in a hypothesis. Two possible scenarios arise where a hypothesis alteration, without a change in the premise, either (a) leads to a change in the label (i.e., the label covaries with the variation in the hypothesis), or (b) does not induce a label change (i.e., the label is invariant to the variation in the hypothesis).
In INFOTABS, a set of reasoning categories is identified to characterize the relationship between a tabular premise and a hypothesis. We use a subset of these, listed below, to perform controlled changes in the hypotheses.
• Named Entities: such as Person, Location, Organization;
• Nominal modifiers: nominal phrases or clauses;
• Negation: markers such as no, not;
• Numerical Values: numeric expressions representing weights, percentages, areas;
• Temporal Values: Date and Time; and,
• Quantifiers: like most, many, every.
Although we can easily track these expressions in a hypothesis using tools like entity recognizers and parsers, it is non-trivial to automatically modify them with a predictable change on the hypothesis label. For example, some label changes can only be controlled if the target expression in the hypothesis is correctly aligned with the facts in the premise. Such cases include CONTRADICT to ENTAIL, and NEUTRAL to CONTRADICT or ENTAIL, which are difficult without extensive expression-level annotations. Nonetheless, in several cases, label changes can be deterministically known even with imprecise changes in the hypothesis. For example, we can convert a hypothesis from ENTAIL to CONTRADICT by replacing a named entity in the hypothesis with a random entity of the same type.
Hence we adopt the following strategy: (a) We avoid perturbations involving the NEUTRAL label altogether, as they often need changes in the premise (table) as well. (b) We generate all label-preserving and some label-flipping transformations automatically using the approach described below. (c) We annotate the CONTRADICT to ENTAIL label-flipping perturbations manually.

Preposition  Upward Monotonicity  Downward Monotonicity
over         CONTRADICT           ENTAIL
under        ENTAIL               CONTRADICT
more than    CONTRADICT           ENTAIL
less than    ENTAIL               CONTRADICT
before       ENTAIL               CONTRADICT
after        CONTRADICT           ENTAIL

Table 2: Monotonicity properties of prepositions.
Automatic Generation of Label-preserving Transformations To automatically perturb hypotheses, we leverage the syntactic structure of a hypothesis and the monotonicity properties of function words like prepositions. First, we perform syntactic analysis of a hypothesis to identify named entities and their relations to title expressions via dependency paths.⁴ Then, based on the entity type, we either substitute or modify them. Named entities such as person names and locations are substituted with entities of the same type. Expressions containing numbers are modified using the monotonicity property of the prepositions (or other function words) governing them in their corresponding syntactic trees.
Given the monotonicity property of a preposition (see Table 2), we modify the numerical expression it governs in a hypothesis in the same direction to preserve the hypothesis label. Consider hypothesis H5 in Figure 2, which contains a preposition (over) with upward monotonicity. Because of upward monotonicity, we can increase the number of hours in H5 without altering the label.
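A minimal sketch of the label-preserving numeric edit is given below. It encodes Table 2 as a lookup and, as a simplifying assumption, locates the number with a regular expression rather than the dependency analysis used in the paper. The function names, the step size of 1, and the example sentence are hypothetical.

```python
import re
from typing import Optional

# Table 2: the label that is preserved when the number moves up or down.
MONOTONICITY = {
    "over": {"up": "CONTRADICT", "down": "ENTAIL"},
    "under": {"up": "ENTAIL", "down": "CONTRADICT"},
    "more than": {"up": "CONTRADICT", "down": "ENTAIL"},
    "less than": {"up": "ENTAIL", "down": "CONTRADICT"},
    "before": {"up": "ENTAIL", "down": "CONTRADICT"},
    "after": {"up": "CONTRADICT", "down": "ENTAIL"},
}


def preserving_direction(preposition: str, label: str) -> str:
    """Direction ('up' or 'down') in which the number can move without a label change."""
    for direction, preserved in MONOTONICITY[preposition].items():
        if preserved == label:
            return direction
    raise ValueError(f"no label-preserving direction for {label!r}")


def perturb_number(hypothesis: str, preposition: str, label: str,
                   step: int = 1) -> Optional[str]:
    """Shift the integer governed by `preposition` in the label-preserving direction."""
    # Assumption: the number directly follows the preposition in the text.
    match = re.search(rf"\b{re.escape(preposition)} (\d+)\b", hypothesis)
    if match is None:
        return None
    old = int(match.group(1))
    new = old + step if preserving_direction(preposition, label) == "up" else old - step
    start, end = match.span(1)
    return hypothesis[:start] + str(new) + hypothesis[end:]


# Hypothetical CONTRADICT hypothesis with an upward-monotone preposition.
print(perturb_number("The trip lasted over 5 hours.", "over", "CONTRADICT"))
```

A regular expression will miss numbers that are not adjacent to the preposition, which is one reason a syntactic analysis is the more robust choice.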
Figure 2: Hypothesis H5 contradicts the premise.

Type of Modification  Gold → New  Perturbed Hypothesis
Nominal Modifier      H1: E → E   Breakfast in America which was produced by Pert Handerson is a pop album of 46 minutes length.
Temporal Expression   H1: E → C   Breakfast in America is a pop album with a length of 56 minutes.
Negation              H2: C → E   Breakfast in America was not released towards the end of 1979.
Temporal Expression   H2: C → C   Breakfast in America was released towards the end of 1989.

Table 3: Example hypothesis perturbations for the running example from Figure 1. For each hypothesis, the label pair gives the gold label followed by the new label after perturbation (E: ENTAIL, C: CONTRADICT).
Manual Annotation of Label-flipping Transformations Note that in the above example, modifying the numerical expression in the reverse direction (e.g., decreasing the number of hours) does not guarantee a label flip. We need to know the premise to be accurate. During the experiments, we observed that a large step (half/twice the actual number) suffices in most cases. We used this heuristic and manually curated the erroneous cases. Additionally, all the cases of CONTRADICT to ENTAIL label-flipping perturbations were annotated manually.⁵
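The half/twice heuristic can be sketched as follows. This is an illustration under assumptions: the step is applied to the number appearing in the hypothesis, the label-preserving direction is already known from the preposition's monotonicity, and the function name is hypothetical. The output is only a candidate, since a flip is not guaranteed without consulting the premise.

```python
def flip_candidate(value: float, preserving_direction: str) -> float:
    """Propose a number likely to flip the label by taking a large reverse step."""
    if preserving_direction == "up":
        # Increasing keeps the label, so halve the number instead.
        return value / 2
    if preserving_direction == "down":
        # Decreasing keeps the label, so double the number instead.
        return value * 2
    raise ValueError("preserving_direction must be 'up' or 'down'")


print(flip_candidate(10, "up"))  # 5.0
```

Candidates produced this way would still pass through the manual curation step described above.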
We generated 2,891 perturbed examples from the α1 set, with 1,203 instances preserving the label and 1,688 instances flipping it. We also generated 11,550 examples from the Train set, with 4,275 preserving and 7,275 flipping the label. Some example perturbations using different types of expressions are listed in Table 3. It should be noted that there may not be a one-to-one correspondence between the gold and perturbed examples, as a hypothesis may be perturbed numerous times or not at all. Consequently, in order for the results to be comparable, a single perturbed example must be sampled for each gold example: we sampled 967 from the α1 set and 4,274 from the Train set.
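Sampling a single perturbed example per gold example can be sketched as below. The tuple layout `(gold_id, hypothesis, label)`, the uniform choice, and the fixed seed are assumptions for illustration; the text does not specify how the sample was drawn.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Perturbed = Tuple[str, str, str]  # (gold_id, perturbed_hypothesis, new_label)


def sample_one_per_gold(perturbed: List[Perturbed], seed: int = 0) -> Dict[str, Perturbed]:
    """Keep exactly one perturbed example for every gold example that has any."""
    by_gold = defaultdict(list)
    for item in perturbed:
        by_gold[item[0]].append(item)
    rng = random.Random(seed)
    # Gold examples that were never perturbed simply do not appear in the result.
    return {gold_id: rng.choice(items) for gold_id, items in by_gold.items()}


pool = [("h1", "variant a", "ENTAIL"), ("h1", "variant b", "CONTRADICT"),
        ("h2", "variant c", "CONTRADICT")]
print(len(sample_one_per_gold(pool)))  # 2
```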
⁴We used spaCy v2.3.2 for the syntactic analysis.
⁵Annotation was done by an expert well versed in the NLI task.
Model      Original      Label Preserved  Label Flipped
Train Set (w/o NEUTRAL)
Prem+Hypo  99.44 (0.06)  92.98 (0.20)     53.92 (0.28)
Hypo-Only  96.39 (0.13)  70.23 (0.35)     19.23 (0.27)
α1 Set (w/o NEUTRAL)
Prem+Hypo  68.94 (0.76)  69.56 (0.77)     51.48 (0.86)
Hypo-Only  63.52 (0.75)  60.27 (0.85)     31.02 (0.63)

Table 4: Results of the Hypothesis-only model and the Prem+Hypo model on the gold and perturbed hypotheses, reported as mean (stdev).
Results and Analysis We tested the hypothesis-only and full models (both trained on the original Train set) on the perturbed examples, without subsequent fine-tuning on the perturbed examples.⁶ The results are presented in Table 4, with each cell representing the average accuracy and standard deviation (in parentheses) across 100 samplings, with 80% of the data selected at random in each sampling.
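The repeated-sampling protocol behind each cell of Table 4 can be sketched as follows. The sketch assumes sampling without replacement within each draw and takes a list of per-example correctness flags as input; the function name and the seed are hypothetical.

```python
import random
import statistics
from typing import List, Tuple


def subsampled_accuracy(correct: List[bool], n_samplings: int = 100,
                        fraction: float = 0.8, seed: int = 0) -> Tuple[float, float]:
    """Mean and standard deviation of accuracy (in %) over random subsets."""
    rng = random.Random(seed)
    k = int(len(correct) * fraction)
    accuracies = []
    for _ in range(n_samplings):
        # Draw 80% of the examples at random, without replacement.
        subset = rng.sample(correct, k)
        accuracies.append(100.0 * sum(subset) / k)
    return statistics.mean(accuracies), statistics.stdev(accuracies)


mean_acc, std_acc = subsampled_accuracy([True] * 70 + [False] * 30)
print(f"{mean_acc:.2f} ({std_acc:.2f})")
```

Reporting the spread alongside the mean makes it easier to judge whether a difference between two settings exceeds sampling noise.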
We note that the performance degrades substantially in both label-preserved and flipped settings when the model is trained on just the hypotheses. When labels are flipped after perturbations, the decrease in performance (averaged across both models) is about 25 and 61 percentage points on the α1 set and Train set, respectively. However, for the full model, perturbations that retain the hypothesis label have little effect on model performance.
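The averaged drops quoted above can be recomputed from the mean accuracies in Table 4. The snippet below is a small arithmetic check; the dictionary layout and the key names (`"Train"`, `"alpha1"`) are illustrative.

```python
# Mean accuracies from Table 4 (standard deviations omitted).
TABLE4 = {
    ("Train", "Prem+Hypo"): {"original": 99.44, "preserved": 92.98, "flipped": 53.92},
    ("Train", "Hypo-Only"): {"original": 96.39, "preserved": 70.23, "flipped": 19.23},
    ("alpha1", "Prem+Hypo"): {"original": 68.94, "preserved": 69.56, "flipped": 51.48},
    ("alpha1", "Hypo-Only"): {"original": 63.52, "preserved": 60.27, "flipped": 31.02},
}


def mean_drop(split: str, setting: str) -> float:
    """Average drop from original accuracy over both models, in percentage points."""
    drops = [row["original"] - row[setting]
             for (name, _), row in TABLE4.items() if name == split]
    return sum(drops) / len(drops)


print(round(mean_drop("alpha1", "flipped"), 2))  # 24.98
print(round(mean_drop("Train", "flipped"), 2))   # 61.34
```

The same table gives the hypothesis-only drops under label-preserving edits: 96.39 − 70.23 ≈ 26 points on the Train set and 63.52 − 60.27 ≈ 3 points on α1.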
The contrast in the performance drop between the label-preserved and label-flipped cases suggests that changes to the content expressions have little effect on the model’s original predictions. Interestingly, the predictions are invariant to changes to function words as well, as per results on α2 in Gupta et al. (2020). This suggests that the model might be more sensitive to changes to the template or structure of a hypothesis than to its lexical makeup. Consequently, a model that relies on correlations between the hypothesis structure and the label is expected to suffer on the label-flipped cases. In the case of label-preserving perturbations of a similar kind, structural correlations between the hypothesis and the label are retained, leading to a minimal drop in model performance.
The results of the hypothesis-only model on the Train set may appear slightly surprising at first. However, given that the model was trained on this dataset, it seems reasonable to assume that the model has ‘‘overfit’’ to the training data. Therefore, the model is expected to be vulnerable even to slight label-preserving modifications to the examples it was trained on, leading to the huge drop of 26%. In the same setting, for the α1 set the performance drop is smaller, namely, 3%.
⁶We analyze the impact of fine-tuning on perturbed examples in §7.
Taken together, we can conclude from these re-
sults that the model ignores the information in the
hypotheses (thereby perhaps also the aligned facts
in the premise), and instead relies on irrelevant
structural patterns in the hypotheses.
5 Probing Evidence Selection
Predictions of an NLI model should primarily be
based on the evidence in the premise, that is, on
the facts relevant to the hypothesis. For a tabu-
lar premise, rows containing the evidence neces-
sary to infer the associated hypothesis are called
relevant rows. Short-circuiting the evidence in rel-
evant rows for inference using annotation artifacts
as suggested in §4 or other spurious artifacts in
irrelevant rows of the table is expected to lead to
poor generalization over unseen data.
To better understand the model’s ability to
select evidence in the premise, we use two kinds
of controlled edits: (a) automatic edits without any
information about relevant rows, and (b) semi-automatic
edits using knowledge of relevant rows
via manual annotation. The rest of the section goes
over both scenarios in detail. All experiments in
this section use the full model that is trained on
both premises and their associated hypotheses.
5.1 Automatic Probing
We define four kinds of table modifications that are agnostic to the relevance of rows to a hypothesis: (a) row deletion, (b) row insertion, (c) row-value update, that is, changing existing information, and (d) row permutation, that is, reordering rows. Each modification allows certain desired (valid) changes to model predictions.⁷ We examine below the case of row deletion in detail and refer the reader to the Appendix for the others.
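The four relevance-agnostic modifications can be sketched on a table represented as an ordered list of key-value rows. The representation, the function names, and the example values are assumptions for illustration, and the sketch does not implement the consistency check on the modified table.

```python
import random
from typing import List, Tuple

Row = Tuple[str, str]  # (key, value)


def delete_row(table: List[Row], index: int) -> List[Row]:
    """(a) Row deletion."""
    return table[:index] + table[index + 1:]


def insert_row(table: List[Row], row: Row, index: int) -> List[Row]:
    """(b) Row insertion at a given position."""
    return table[:index] + [row] + table[index:]


def update_value(table: List[Row], index: int, new_value: str) -> List[Row]:
    """(c) Row-value update: keep the key, change the value."""
    key, _ = table[index]
    return table[:index] + [(key, new_value)] + table[index + 1:]


def permute_rows(table: List[Row], seed: int = 0) -> List[Row]:
    """(d) Row permutation: same rows, shuffled order."""
    shuffled = list(table)
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Placeholder rows; values are illustrative, not dataset content.
table = [("Genre", "pop"), ("Length", "46:06"), ("Producer", "X")]
print(delete_row(table, 1))  # [('Genre', 'pop'), ('Producer', 'X')]
```

Each function returns a new list, so the original table stays available for comparing predictions before and after the edit.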
Row deletion should lead to the following desired effects: (a) If the deleted row is relevant to the hypothesis (e.g., Length for H1), the model prediction should change to NEUTRAL. (b) If the deleted row is irrelevant (e.g., Producer for H1), the model should retain its original prediction. NEUTRAL predictions should remain unaffected by row deletion.
⁷In performing these modifications, we made sure that the modified table does not become inconsistent or self-contradicting.
Figure 3: Changes in model predictions after automatic row deletion. Directed edges are labeled with transition percentages from the source node label to the target node label. The number triple corresponds to the α1, α2, and α3 test sets, respectively, and, for each source node, adds up to 100% over the outgoing edges. Red lines represent invalid transitions. Dashed and solid black lines represent valid transitions for irrelevant and relevant row deletion, respectively. * represents valid transitions with either row deletion.
Results and Analysis We studied the impact of row deletion on the α1, α2, and α3 test sets. Figure 3 shows aggregate changes to labels after row deletions as a directed labeled graph. The nodes in this graph represent the three labels in INFOTABS, and the edges denote transitions after row deletion. The source and end nodes of an edge represent predictions before and after the modification.
We see that the model makes invalid transitions in all three datasets. Table 5 summarizes the invalid transitions by aggregating them over the label originally predicted by the model. The percentage of invalid transitions is higher for ENTAIL predictions than for CONTRADICT and NEUTRAL. After row deletion, many ENTAIL examples incorrectly transition to CONTRADICT rather than to NEUTRAL. The opposite trend is observed for the CONTRADICT predictions.
As with row deletion, the model exhibits invalid responses to other row modifications listed in the beginning of this section, like row insertion. Surprisingly, the performance degrades due to row permutations as well, suggesting some form of position bias in the model. On the positive side, the model mostly retains its predictions on row-value update operations. We refer the reader to the Appendix for more details.

Dataset     α1    α2    α3    Average
ENTAIL      5.76  7.26  5.01  6.01
NEUTRAL     4.43  3.91  5.24  4.53
CONTRADICT  3.23  3.70  3.01  3.31
Average     4.47  4.96  4.42  –

Table 5: Percentage of invalid transitions after row deletion. For an ideal model, all these numbers should be zero.
5.2 Manual Probing
Row modification for automatic probing in §5.1 is agnostic to the relevance of the row to a given hypothesis. Since only a few rows (one or two) are relevant to the hypothesis, the probing skews towards hypothesis-unrelated rows, which weakens the investigations into the evidence-grounding capability of the model. Knowing the relevance of rows allows for the creation of stronger probes. For example, if a relevant row is deleted, the ENTAIL and CONTRADICT predictions should change to NEUTRAL. (Recall that after deleting an irrelevant row the model should retain its original label.)
Probing by altering or deleting relevant rows requires human annotation of relevant rows for each table-hypothesis pair. We used MTurk to annotate the relevance of rows in the development and the test sets, with turkers identifying the relevant rows for each table-hypothesis pair.
Inter-annotator Agreement. We employed majority voting to derive ground truth labels from multiple annotations for each row. The inter-annotator agreement macro F1 score for each of the four datasets is over 90% and the average Fleiss’ kappa is 78 (std: 0.22). This suggests good inter-annotator agreement. In 82.4% of cases, at least 3 out of 5 annotators marked the same relevant rows.
Results and Analysis We examined the response of the model when relevant rows are deleted. Figure 4 shows the label transitions. The fact that even after the deletion of relevant rows, ENTAIL and CONTRADICT predictions don’t change to NEUTRAL a large percentage of times (mostly the original label remains unchanged and at other
with ENTAIL and CONTRADICT labels, where deleting a relevant row should change the prediction to NEUTRAL.⁹
Results and Analysis On the human-annotated relevant rows, the model has an average precision of 41.0% and a recall of 40.9%. Further analysis reveals that the model (a) uses all relevant rows in 27% of cases, (b) uses incorrect or no rows as evidence in 52% of cases, and (c) is only partially accurate in identifying relevant rows in the remaining 21% of examples. Upon further analyzing the cases in (b), we observed that the model actually ignores premises completely in 88% (of 52%) of cases. This accounts for 46% (absolute) of all occurrences. In comparison, in the human-annotated data, such cases only amount to less than 2%.
Although the model’s predictions are correct on 70%
of the 4,600 examples, only 21% can be
attributed to using all relevant evidence. In 37%
of the 4,600 examples, the correct label is obtained from
irrelevant rows, and the remaining 12% of correct
predictions use some, but not all, relevant rows.
We can conclude from the findings in this sec-
tion that the model does not seem to need all the
relevant evidence to arrive at its predictions, rais-
ing questions about trust in its predictions.
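The comparison between model-selected and human-annotated rows can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: tables are represented as plain dictionaries from row key to value, `predict` stands for any classifier callable, and the toy table and stub model are hypothetical.

```python
def model_relevant_rows(table, hypothesis, predict):
    """Rows whose deletion changes the prediction returned by `predict`."""
    original = predict(table, hypothesis)
    relevant = set()
    for key in table:
        reduced = {k: v for k, v in table.items() if k != key}
        if predict(reduced, hypothesis) != original:
            relevant.add(key)
    return relevant


def evidence_precision_recall(model_rows, human_rows):
    """Precision and recall of model-selected rows against human annotation."""
    model_rows, human_rows = set(model_rows), set(human_rows)
    overlap = len(model_rows & human_rows)
    precision = overlap / len(model_rows) if model_rows else 0.0
    recall = overlap / len(human_rows) if human_rows else 0.0
    return precision, recall


# Hypothetical stub model: predicts ENTAIL only while the "Length" row exists.
def stub_predict(table, hypothesis):
    return "ENTAIL" if "Length" in table else "NEUTRAL"


toy_table = {"Released": "1979", "Length": "46 minutes", "Label": "A&M"}
rows = model_relevant_rows(toy_table, "The album is under an hour long.", stub_predict)
precision, recall = evidence_precision_recall(rows, {"Length", "Released"})
```

In this toy case the stub model relies on one of the two annotated rows, which gives perfect precision but only half the recall.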
6 Probing with Counterfactual Examples
Since INFOTABS is a dataset of facts based on
Wikipedia, pre-trained language models such as
RoBERTa, trained on Wikipedia and other pub-
licly available text, may have already encountered
information in INFOTABS during pre-training. As a
result, NLI models built on top of RoBERTaL can
learn to infer a hypothesis using the knowledge
of the pre-trained language model. More specifically,
the model may be relying on ‘‘confirmation
bias’’, in which it selects evidence/patterns from
both the premise and the hypothesis that match its prior
knowledge. While world knowledge is necessary
for table NLI (Neeraja et al., 2021), models should
still treat the premise as the primary evidence.
Counterfactual examples can help test whether
the model is grounding its inference on the ev-
idence provided in the tabular premise. In such
examples, the tabular premise is modified such
that the content does not reflect the real world. In
9We did not include the 2,400 NEUTRAL example pairs
and the 200 ambiguous ENTAIL or CONTRADICT examples that
had no relevant rows as per the consensus annotation.
Figure 4: Changes in model predictions after deletion
of relevant rows. Red lines represent invalid transitions
while black lines represent valid transitions. The di-
rected edges are labeled in the same manner as they are
in Figure 3.
Dataset    ENTAIL   NEUTRAL   CONTRADICT   Average
α1         75.41    8.39      77.02        53.60
α2         74.70    6.58      81.10        54.14
α3         77.31    8.01      77.80        54.35
Average    75.80    7.66      78.64
Table 6: Percentage of invalid transitions follow-
ing deletion of relevant rows. For an ideal model,
all these numbers should be zero.
times, it changes incorrectly), indicates that the
model is likely utilizing spurious statistical pat-
terns in the data for making the prediction.
We summarize the combined invalid transitions
for each label in Table 6. We see that the percent-
age of invalid transitions is considerably higher
compared to random row deletion in Figure 3.8
The large percentage of invalid transitions in the
ENTAIL and CONTRADICT cases indicates a rather
high utilization of spurious statistical patterns by
the model to arrive at its answers.
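The percentages in Table 6 can be computed along the following lines. This is a minimal sketch, not the authors' code; it assumes that once a relevant row is deleted, NEUTRAL is the only valid prediction, and the function name and toy predictions are hypothetical.

```python
def invalid_transition_rates(before, after):
    """Percentage of invalid transitions per originally predicted label.

    Assumption: after a relevant row is deleted, any prediction other
    than NEUTRAL counts as an invalid transition.
    """
    totals, invalid = {}, {}
    for src, dst in zip(before, after):
        totals[src] = totals.get(src, 0) + 1
        if dst != "NEUTRAL":
            invalid[src] = invalid.get(src, 0) + 1
    return {src: 100.0 * invalid.get(src, 0) / n for src, n in totals.items()}


# Toy predictions (hypothetical, not taken from INFOTABS).
before = ["ENTAIL", "ENTAIL", "ENTAIL", "CONTRADICT", "NEUTRAL"]
after = ["ENTAIL", "CONTRADICT", "NEUTRAL", "NEUTRAL", "NEUTRAL"]
rates = invalid_transition_rates(before, after)
```

Here two of the three ENTAIL predictions fail to move to NEUTRAL, so the ENTAIL rate is two thirds, while the CONTRADICT and NEUTRAL rates are zero, as they would be for an ideal model.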
5.3 Human vs Model Evidence Selection
We further analyze the model’s capability for se-
lecting relevant evidence by comparing it with
human annotators. All rows that alter the model
predictions during automatic row deletion are con-
sidered as model relevant rows and are compared
to the human-annotated relevant rows. We only
consider the subset of 4,600 (out of the 7,200 annotated
dev/test set pairs) hypothesis-table pairs
8Note that the dashed black lines from Figure 3 are now
red in Figure 4, indicating invalid transitions.
Figure 5: Counterfactual table-hypothesis pair created
from Figure 1 and Figure 2. Only the values of the ‘Length’
rows are swapped; the rest of the rows from Figure 1 are
copied over.
this study, we limit ourselves to modifying only
the ENTAIL and CONTRADICT examples. We omit
the NEUTRAL cases because the majority of them in
INFOTABS involve out-of-table information; pro-
ducing counterfactuals for them is much harder
and involves the laborious creation of new rows
with the right information.
The task of creating counterfactual tables pre-
sents two challenges. First, the modified tables
should not be self-contradictory. Second, we need
to determine the labels of the associated hypotheses
after the table is modified. We employ a
simple approach to generate counterfactuals that
addresses both challenges. We use the evidence se-
lection data (§5.2) to gather all premise-hypothesis
pairs that share relevant keys such as ‘‘Born’’,
‘‘Occupation’’, and so forth. Counterfactual ta-
bles are generated by swapping the values of rel-
evant keys from one table to another.10
Figure 5 shows an example. We create counter-
factuals from the premises in Figure 1 and Figure 2
by swapping their Length rows. We also swap the
hypotheses (H1 and H5) aligned to the Length
rows in both premises by replacing the title ex-
pression Bridesmaids in H5 with Breakfast in
America and vice versa. The simple procedure ensures
that the hypothesis labels are left unchanged
in the process, resulting in high-quality data.
10There may still be a few cases of self-contradiction, but
we expect that such invalid cases would not exist in the rows
that are relevant to the hypothesis.
Figure 6: A counterfactual tabular premise and the associated
hypotheses created from Figures 1 and 2. The
hypothesis H1 is entailed by the premise, H2 contradicts
it, and H3 and H4 are neutral.
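The row-swapping procedure described above can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: tables are assumed to be dictionaries from row key to value, the helper names are hypothetical, and the row values and hypothesis text are placeholders rather than the actual contents of Figures 1 and 2.

```python
import copy


def swap_key_values(table_a, table_b, key):
    """Return copies of two tables with the values of a shared key swapped."""
    if key not in table_a or key not in table_b:
        raise KeyError(f"{key!r} must be present in both tables")
    new_a, new_b = copy.deepcopy(table_a), copy.deepcopy(table_b)
    new_a[key], new_b[key] = table_b[key], table_a[key]
    return new_a, new_b


def retitle_hypothesis(hypothesis, old_title, new_title):
    """Re-target a hypothesis by replacing the title expression."""
    return hypothesis.replace(old_title, new_title)


# Placeholder tables and hypothesis (values are hypothetical).
album = {"Title": "Breakfast in America", "Length": "46 minutes"}
film = {"Title": "Bridesmaids", "Length": "125 minutes"}
cf_album, cf_film = swap_key_values(album, film, "Length")
cf_hypothesis = retitle_hypothesis(
    "Bridesmaids is longer than two hours.", "Bridesmaids", "Breakfast in America"
)
```

Because the swapped value and the hypothesis travel together, a hypothesis that was entailed by the original table remains entailed by the counterfactual one, which is why the labels are preserved.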
In addition, we also generated counterfactuals
by swapping the table title and associated expres-
sions in the hypotheses with the title of another ta-
ble, resulting in a counterfactual table-hypothesis
pair, as in the row swapping strategy. Figure 6
shows an example created from the premises in
Figure 1 and Figure 2 by swapping the title rows
Breakfast in America and Bridesmaids. The title
expression in all hypotheses in Figure 1 is also
replaced by Bridesmaids. This strategy also preserves
the hypothesis labels, as in row swapping.
The above approaches are Label Preserving as
they do not alter the entailment labels. Counter-
factual pairs with flipped labels are important for
filtering out the contribution of artifacts or other
spurious correlations that originate from a hypo-
thesis (see §4). So, in addition, we also created
counterfactual table-hypothesis pairs where the
original labels are flipped. These counterfactual
cases are, however, non-trivial to generate auto-
matically, and are therefore created manually.
To create the Label-Flipped counterfactual data,
three annotators manually modified tables from
Model        Original       Label Preserved   Label Flipped
(all cells: mean(stdev))
Train Set (without NEUTRAL)
Prem+Hypo    94.38(0.39)    78.53(0.65)       48.70(0.72)
Hypo-Only    99.94(0.06)    82.23(0.65)       00.06(0.01)
α1 Set (without NEUTRAL)
Prem+Hypo    71.99(0.69)    69.65(0.78)       44.01(0.72)
Hypo-Only    60.89(0.76)    58.19(0.91)       27.68(0.65)
Table 7: Results of the Hypothesis-only and
Prem+Hypo models on the gold and counterfac-
tual examples.
the Train and α1 datasets corresponding to ENTAIL
and CONTRADICT labels, producing 885 counterfactual
examples from the Train set and 942
from the α1 set. The annotators cross-checked the
labels to determine annotation accuracy, which
was 88.45% for the Train set and 86.57% for
the α1 set.
Results and Analysis We tested both hypothesis-only
and full (Prem+Hypo) models on the
counterfactual examples created above, without
fine-tuning on a subset of these examples. The results
are presented in Table 7, where each cell reports
the average accuracy and standard deviation
(in parentheses) over 100 sets of 80% randomly sampled
counterfactual examples. We see that the
Prem+Hypo model is not robust to counterfactual
perturbations. On the label-flipped counterfactuals,
the performance drops to close
to a random prediction (48.70% for Train and
44.01% for α1). The performance on the label-preserved
counterfactuals is relatively better, which
leads us to conjecture that the model largely exploits
artifacts in hypotheses.
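The evaluation protocol of averaging accuracy over repeated 80% subsamples can be sketched as below. This is a minimal sketch under assumptions: sampling is done without replacement within each set, and the function name, the seed, and the toy labels are hypothetical rather than taken from the paper.

```python
import random
import statistics


def subsampled_accuracy(gold, pred, n_sets=100, fraction=0.8, seed=0):
    """Mean and standard deviation of accuracy (%) over random subsamples."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    rng = random.Random(seed)
    indices = list(range(len(gold)))
    k = max(1, int(fraction * len(indices)))
    accuracies = []
    for _ in range(n_sets):
        sample = rng.sample(indices, k)  # without replacement (assumption)
        correct = sum(gold[i] == pred[i] for i in sample)
        accuracies.append(100.0 * correct / k)
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# Toy labels (hypothetical): 70 of 100 predictions agree with the gold labels.
gold = ["ENTAIL", "CONTRADICT"] * 50
pred = gold[:70] + ["NEUTRAL"] * 30
mean_acc, std_acc = subsampled_accuracy(gold, pred)
```

The standard deviation obtained this way reflects variation across subsamples of one fixed example set, not variation across training runs.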
Due to over-fitting, the Train set has a larger
drop of 15.85%, compared to only 2.70% on the
α1 set on the label-preserved examples. Moreover,
the drop in performance for both Prem+Hypo and
Hypo-Only models is comparable to their performance
drop on the original table-hypothesis pairs.
This shows that, regardless of whether the relevant
information in the premise is accurate, both
models rely substantially on hypothesis artifacts.
On the Label-Flipped counterfactuals, the large
drop in accuracy could be due to ambiguous
hypothesis artifacts, counterfactual information, or both.
To disentangle these two factors, we can take
advantage of the fact that the counterfactual ex-
amples are constructed from, and hence paired
Prem+Hypo        Hypo-Only    Dataset
C-THP   O-THP    O-Hypo       Train    α1
✓       ✗        ✗            0.00     11.43
✗       ✓        ✗            0.00     11.79
✓       ✗        ✓            3.57     6.48
✗       ✓        ✓            49.36    33.12
Table 8: Performance of the full and hypothesis-only
models on the original and counterfactual
examples. O-THP and C-THP represent original
and counterfactual table-hypothesis pairs; O-Hypo
represents hypotheses from the original data; ✓
represents correct predictions and ✗ represents
incorrect predictions.
with, the original examples. This allows us to
examine pairs of examples where the full model
makes an incorrect prediction on one, but not the
other. Especially of interest are the cases where
the full model makes a correct prediction on the
original example, but not on the corresponding
counterfactual example.
Table 8 shows the results of this analysis.
Each row represents a condition corresponding
to whether the full and the hypothesis-only mod-
els are correct on the original example. The two
cases of interest, described above, correspond to
the second and fourth rows of the table. The second
row shows the case where the full model is
correct on the original example (and not on the
counterfactual example), but the hypothesis-only
model is not. Since we can discount the impact of
hypothesis bias in these examples, the error in the
counterfactual version could be attributed to reliance
on pre-trained knowledge. Unsurprisingly,
there are no such examples in the training set. In
the α1 set, we see that a substantial fraction of counterfactual
examples (11.79%) belongs to this category.
The last row considers the case where the
hypothesis-only model is correct. We see that this
accounts for a larger fraction of the counterfac-
tual errors, both in the training and the α1 sets.
Among these examples, despite the (albeit unfor-
tunate) fact that the hypothesis alone can support
a correct prediction, the model’s reliance on its
pre-trained knowledge leads to errors in the coun-
terfactual cases.
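The paired analysis above can be sketched as a tally over correctness patterns. This is an illustrative sketch, not the authors' code; it assumes that percentages are taken over all counterfactual pairs and that only pairs on which the full model is correct on exactly one of the two versions are counted. The function name and toy labels are hypothetical.

```python
from collections import Counter


def paired_breakdown(gold_orig, gold_cf, full_orig, full_cf, hypo_orig):
    """Percentage of pairs per (C-THP, O-THP, O-Hypo) correctness pattern."""
    n = len(gold_orig)
    if n == 0:
        return {}
    patterns = Counter()
    for go, gc, fo, fc, ho in zip(gold_orig, gold_cf, full_orig, full_cf, hypo_orig):
        c_thp = fc == gc   # full model correct on the counterfactual pair
        o_thp = fo == go   # full model correct on the original pair
        o_hypo = ho == go  # hypothesis-only model correct on the original
        if c_thp != o_thp:  # full model correct on exactly one version
            patterns[(c_thp, o_thp, o_hypo)] += 1
    return {pattern: 100.0 * count / n for pattern, count in patterns.items()}


# Toy label-flipped pairs (hypothetical).
gold_orig = ["ENTAIL", "ENTAIL", "CONTRADICT", "ENTAIL"]
gold_cf = ["CONTRADICT", "CONTRADICT", "ENTAIL", "CONTRADICT"]
full_orig = ["ENTAIL", "ENTAIL", "CONTRADICT", "CONTRADICT"]
full_cf = ["ENTAIL", "ENTAIL", "ENTAIL", "CONTRADICT"]
hypo_orig = ["CONTRADICT", "ENTAIL", "CONTRADICT", "ENTAIL"]
breakdown = paired_breakdown(gold_orig, gold_cf, full_orig, full_cf, hypo_orig)
```

The pattern `(False, True, False)` corresponds to the case singled out in the text: the full model is right on the original but wrong on the counterfactual, while the hypothesis-only model is wrong, so hypothesis bias cannot explain the original success.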
The results, taken in aggregate, suggest that the
model produces predictions based on hypothesis
artifacts and pre-trained knowledge rather than
the evidence presented to it, thus impacting its
robustness and generalization.
7 Inoculation by Fine-Tuning
Our probing experiments demonstrate that the
models, trained on the INFOTABS training set, failed
along all three dimensions that we investigated.
This leads us to the following question: Can addi-
tional fine-tuning with perturbed examples help?
Liu et al. (2019a) point out that poor perfor-
mance on challenging datasets can be ascribed to
either a weakness in the model, a lack of diversity
in the dataset used for training, or information
leakage in the form of artifacts.11 They suggest
that models can be further fine-tuned on a few
challenging examples to determine the possible
source of degradation. Inoculation can lead to one
of three outcomes: (a) Outcome 1: The perfor-
mance gap between the challenge and the original
test sets reduces, possibly due to addition of di-
verse examples, (b) Outcome 2: Performance on
both the test sets remains unchanged, possibly be-
cause of the model’s inability to adapt to the new
phenomena or the changed data distribution, or,
(c) Outcome 3: Performance degrades on the test
set, but improves on the challenge set, suggesting
that adding new examples introduces ambiguity
or contradictions.
We conducted two sets of inoculation experi-
ments to help categorize performance degradation
of our models into one of these three categories.
For each experiment described below, we gen-
erated additional inoculation datasets with 100,
200, and 300 examples to inoculate the origi-
nal task-specific RoBERTaL models trained on
both premises and hypotheses. As in the original
inoculation work, we created these adversarial
datasets by sub-sampling inclusively, i.e., the
smaller datasets are subsets of the larger ones. Following
the training protocol in Liu et al. (2019a),
we tried learning rates of 10^-4, 5 × 10^-5, and
10^-5. We performed inoculation for a maximum
of 30 epochs with early stopping based on the
development set accuracy. We found that with the
first two learning rates, the model does not converge
and underperforms on the development set.
The model performance was best with the learning
rate of 10^-5, which we used throughout the
inoculation experiments. The standard deviation
11Model weakness is the inherent inability of a model (or
a model family) to handle certain linguistic phenomena.
#Samples       α1       α2       α3
0 (w/o Ino)    74.88    65.55    64.94
100            67.44    62.17    58.51
200            67.34    61.88    58.61
300            67.24    61.84    58.62
Table 9: Performance of the inoculated
models on the original INFOTABS test sets.
#Samples       Original   Label Preserved   Label Flipped
Train Set (w/o NEUTRAL)
0 (w/o Ino)    99.44      92.98             53.92
100            97.24      95.58             79.25
200            97.24      95.65             78.75
300            97.24      95.64             78.74
α1 Set (w/o NEUTRAL)
0 (w/o Ino)    68.94      69.56             51.48
100            68.05      65.67             57.91
200            68.37      66.29             57.49
300            68.36      66.29             57.49
Table 10: Performance of the inoculated models
on the hypothesis perturbed INFOTABS sets.
over 100 sample splits for all experiments was
≤ 0.91.
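Inclusive sub-sampling, in which each smaller inoculation set is contained in the larger ones, can be obtained by taking prefixes of a single shuffled list. The following is a minimal sketch; the function name, the seed, and the integer stand-ins for examples are assumptions made for illustration.

```python
import random


def nested_subsamples(examples, sizes=(100, 200, 300), seed=0):
    """Draw sets so that each smaller set is a subset of every larger one."""
    sizes = sorted(sizes)
    pool = list(examples)
    if sizes[-1] > len(pool):
        raise ValueError("not enough examples for the largest set")
    rng = random.Random(seed)
    rng.shuffle(pool)
    # Prefixes of one shuffled list are nested by construction.
    return {size: pool[:size] for size in sizes}


# Integers stand in for perturbed examples (hypothetical pool of 1,000).
inoculation_sets = nested_subsamples(range(1000))
```

Repeating the call with different seeds gives different sample splits over which a mean and standard deviation can be reported.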
Annotation Artifacts Table 9 shows the per-
formance of the inoculated models on the original
INFOTABS test sets, and Table 10 shows the re-
sults on the hypothesis-perturbed examples (from
§4). We see that fine-tuning on the hypothesis-
perturbed examples decreases performance on the
original α1, α2, and α3 test sets, but performance
improves on the more difficult label-flipped ex-
amples of the hypothesis-perturbed test set.
Counterfactual Examples Tables 11 and 12
show the performance of the inoculated models on
the original INFOTABS test sets and on the counterfactual
examples from §6, respectively. Once again,
we see that fine-tuning on counterfactual examples
improves performance on the adversarial counterfactual
test set, at the cost of performance
on the original test sets.
Analysis We see that both experiments above
belong to Outcome 3, where the performance im-
proves on the challenge set, but degrades on the
test set(s). The change in the distribution of inputs
hurts the model: we conjecture that this may
be because the RoBERTaL model exploits data
artifacts in the original dataset but fails to do so
for the challenge dataset and vice versa.
#Samples       α1       α2       α3
0 (w/o Ino)    74.88    65.55    64.94
100            69.72    63.88    59.66
200            69.88    63.78    58.89
300            67.34    62.23    57.58
Table 11: Performance after inoculation by
fine-tuning on the original INFOTABS test sets.
#Samples       Original   Label Preserved   Label Flipped
Train Set (w/o NEUTRAL)
0 (w/o Ino)    94.38      78.53             48.70
100            91.82      84.61             57.62
200            92.46      84.92             59.43
300            91.08      83.54             63.58
α1 Set (w/o NEUTRAL)
0 (w/o Ino)    71.99      69.65             44.01
100            66.05      75.03             50.40
200            65.86      75.03             50.57
300            65.59      74.23             52.09
Table 12: Performance after inoculation fine-tuning
on the INFOTABS counterfactual example sets.
We expect our model to handle both original and
challenge datasets, at least after fine-tuning (i.e.,
it should belong to Outcome 1). Its failure points
to the need for better models or training regimes.
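The three inoculation outcomes can be expressed as a simple decision rule over accuracy changes. This is a rough sketch, not a definition from Liu et al. (2019a): the tolerance `tol` for treating accuracy as unchanged and the function name are assumptions, and the example call pairs the α1 accuracy from Table 11 with the label-flipped α1 accuracy from Table 12 purely for illustration.

```python
def inoculation_outcome(orig_before, orig_after, chal_before, chal_after, tol=1.0):
    """Map accuracy changes to one of the three inoculation outcomes.

    `tol` (in accuracy points) is an assumed threshold for "unchanged".
    """
    d_orig = orig_after - orig_before
    d_chal = chal_after - chal_before
    if abs(d_orig) <= tol and abs(d_chal) <= tol:
        return "Outcome 2"  # neither test set moves
    if d_orig < -tol and d_chal > tol:
        return "Outcome 3"  # original degrades, challenge improves
    gap_before = abs(orig_before - chal_before)
    gap_after = abs(orig_after - chal_after)
    if gap_after < gap_before and d_orig >= -tol:
        return "Outcome 1"  # gap closes without hurting the original set
    return "unclassified"


# α1 with 0 vs. 300 inoculation samples: original test set (Table 11)
# and label-flipped counterfactuals (Table 12).
outcome = inoculation_outcome(74.88, 67.34, 44.01, 52.09)
print(outcome)
```

With these numbers the original accuracy falls by about 7.5 points while the challenge accuracy rises by about 8 points, which the rule assigns to Outcome 3.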
8 Discussion and Related Work
What Did We Learn? Firstly, through system-
atic probing, we have shown that despite good
performance on the evaluation sets, the model for
tabular NLI fails at reasoning. From the analysis
of hypothesis perturbations (§4), we show that the
model heavily relies on correlations between a hy-
pothesis’ sentence structure and its label. Models
should be systematically evaluated on adversarial
sets like α2 for robustness and sensitivity. This
observation is concordant with multiple studies
that probe deep learning models on adversarial
examples in a variety of tasks such as question
answering, sentiment analysis, document classification,
natural language inference, and so forth
(e.g., Ribeiro et al., 2020; Richardson et al., 2020;
Goel et al., 2021; Lewis et al., 2021; Tarunesh
et al., 2021).
Secondly, the model does not look at correct
evidence required for reasoning, as is evident
from the evidence-selection probing (§5). Rather,
it leverages spurious patterns and statistical cor-
relations to make predictions. A recent study by
Lewis et al. (2021) on question-answering shows
that models indeed leverage spurious patterns to
answer a large fraction (60–70%) of questions.
Thirdly, from counterfactual probes (§6), we
found that the model relies more on the knowledge of
pre-trained language models than on tabular evidence
as the primary source of knowledge for making
predictions. This is in addition to the spurious
patterns or hypothesis artifacts leveraged by the
model. Similar observations are made by Clark
and Etzioni (2016), Jia and Liang (2017), Kaushik
et al. (2020), Huang et al. (2020), Gardner et al.
(2020), Tu et al. (2020), Liu et al. (2021), Zhang
et al. (2021), and Wang et al. (2021b) for un-
structured text.
Finally, from the inoculation study (§7), we
found that fine-tuning on challenge sets improves
model performance on the challenge sets but degrades
it on the original α1, α2, and α3 test sets. That is,
changes in the data distribution during training
have a negative impact on model performance.
This adds weight to the argument that the model
relies excessively on data artifacts.
Benefit of Tabular Data Unlike unstructured
data, where creating challenge datasets may be
more difficult (e.g., Ribeiro et al., 2020; Goel et al.,
2021; Mishra et al., 2021), we can analyze semi-
structured data more effectively. Although con-
nected with the title, the rows in the table are still
independent, linguistically and otherwise. Thus,
controlled experiments are easier to design and
study. For example, the analysis done for evidence
selection via multiple table perturbation opera-
tions such as row deletion and insertion is possible
mainly due to the tabular nature of the data. Such
granularity and component-independence are generally
absent for raw text at the token, sentence,
and even paragraph level. As a result, designing
suitable probes with sufficient coverage for raw text can be
a challenging task, and can require more manual
effort.
Additionally, probes defined on one tabular
dataset (INFOTABS in our case) can be easily ported
to other tabular datasets such as WikiTableQA
(Pasupat and Liang, 2015), TabFact (Chen et al.,
2020b), HybridQA (Chen et al., 2020c; Zayats
et al., 2021; Oguz et al., 2020), OpenTableQA
(Chen et al., 2021), ToTTo (Parikh et al., 2020),
Turing Tables (Yoran et al., 2021), and Logic-
Table (Chen et al., 2020a). Moreover, such probes
can be used to better understand the behavior
of various tabular reasoning models (e.g., Müller
et al., 2021; Herzig et al., 2020; Yin et al., 2020;
Iida et al., 2021; Pramanick and Bhattacharya,
2021; Glass et al., 2021; and others).
Interpretability for NLI Models For classifi-
cation tasks such as NLI, correct predictions do
not always mean that the underlying model is em-
ploying correct reasoning. More work is needed to
make models interpretable, either through expla-
nations or by pointing to the evidence that is used
for predictions (e.g., Feng et al., 2018; Serrano and
Smith, 2019; Jain and Wallace, 2019; Wiegreffe
and Pinter, 2019; DeYoung et al., 2020; Paranjape
et al., 2020; Hewitt and Liang, 2019; Richardson
and Sabharwal, 2020; Niven and Kao, 2019;
Ravichander et al., 2021). Many recent shared
tasks on reasoning over semi-structured tabular
data (such as SemEval 2021 Task 9 [Wang et al.,
2021a] and FEVEROUS [Aly et al., 2021]) have
highlighted the importance of, and the challenges
associated with, evidence extraction for claim
verification.
Finally, NLI models should be tested on mul-
tiple test sets in adversarial settings (e.g., Ribeiro
et al., 2016, 2018a,b; Alzantot et al., 2018; Iyyer
et al., 2018; Glockner et al., 2018; Naik et al.,
2018; McCoy et al., 2019; Nie et al., 2019;
Liu et al., 2019a) focusing on particular prop-
erties or aspects of reasoning, such as perturbed
premises for evidence selection, zero-shot trans-
fer (α3), counterfactual premises or alternate facts,
and contrasting hypotheses via perturbation (α2).
Such behavioral probing by evaluating on multi-
ple test-only benchmarks and controlled probes is
essential to better understand both the abilities and
the weaknesses of pre-trained language models.
9 Conclusion
This paper presented a targeted probing study
to highlight the limitations of tabular inference
models using a case study on a tabular NLI task
on INFOTABS. Our findings show that despite good
performance on standard splits, a RoBERTa-based
tabular NLI model, fine-tuned on the existing
pre-trained language model, fails to select the
correct evidence, makes incorrect predictions on
adversarial hypotheses, and is not grounded in the
provided evidence, counterfactual or otherwise. We
expect that insights from the study can help in
designing rationale selection techniques based on
structural constraints for tabular inference and
other tasks. While inoculation experiments showed
partial success, diverse data augmentation may
help mitigate challenges. However, annotation of
such data can be expensive. It may also be possi-
ble to train models to satisfy domain-based con-
straints (e.g., Li et al., 2020) to improve model
robustness. Finally, probing techniques described
here may be adapted to other NLP tasks involv-
ing tables such as tabular question answering and
tabular text generation.
Acknowledgments
We thank the reviewing team for their valuable
feedback. This work is partially supported by NSF
grants #1801446 and #1822877.
References
Rami Aly, Zhijiang Guo, Michael Sejr
Schlichtkrull, James Thorne, Andreas Vlachos,
Christos Christodoulopoulos, Oana Cocarascu,
and Arpit Mittal. 2021. The fact extraction and
VERification over unstructured and structured
information (FEVEROUS) shared task. In Pro-
ceedings of the Fourth Workshop on Fact Ex-
traction and VERification (FEVER), pages 1–13,
Dominican Republic. Association for Compu-
tational Linguistics. https://doi.org/10
.18653/v1/2021.fever-1.1
Moustafa Alzantot, Yash Sharma, Ahmed Elgohary,
Bo-Jhang Ho, Mani Srivastava, and Kai-Wei
Chang. 2018. Generating natural language ad-
versarial examples. In Proceedings of the 2018
Conference on Empirical Methods in Natu-
ral Language Processing, pages 2890–2896,
Brussels, Belgium. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/D18-1316
Samuel R. Bowman, Gabor Angeli, Christopher
Potts, and Christopher D. Manning. 2015. A
large annotated corpus for learning natural lan-
guage inference. In Proceedings of the 2015
Conference on Empirical Methods in Natural
Language Processing, pages 632–642, Lisbon,
Portugal. Association for Computational Lin-
guistics. https://doi.org/10.18653/v1
/D15-1075
Wenhu Chen, Ming-Wei Chang, Eva Schlinger,
William Yang Wang, and William W. Cohen.
2021. Open question answering over tables and
text. In International Conference on Learning
Representations.
Wenhu Chen, Jianshu Chen, Yu Su, Zhiyu Chen,
and William Yang Wang. 2020a. Logical natural
language generation from open-domain
tables. In Proceedings of the 58th Annual Meet-
ing of the Association for Computational Lin-
guistics, pages 7929–7942, Online. Association
for Computational Linguistics. https://doi
.org/10.18653/v1/2020.acl-main.708
Wenhu Chen, Hongmin Wang, Jianshu Chen,
Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou
Zhou, and William Yang Wang. 2020b. Tab-
fact: A large-scale dataset for table-based fact
verification. In International Conference on
Learning Representations.
Wenhu Chen, Hanwen Zha, Zhiyu Chen, Wenhan
Xiong, Hong Wang, and William Yang Wang.
2020c. HybridQA: A dataset of multi-hop ques-
tion answering over tabular and textual data. In
Findings of the Association for Computational
Linguistics: EMNLP 2020, pages 1026–1036,
Online. Association for Computational Lin-
guistics. https://doi.org/10.18653
/v1/2020.findings-emnlp.91
Peter Clark and Oren Etzioni. 2016. My computer
is an honor student—but how intelligent is
it? Standardized tests as a measure of AI. AI
Magazine, 37(1):5–12. https://doi.org
/10.1609/aimag.v37i1.2636
Ido Dagan, Dan Roth, Mark Sammons, and Fabio
Massimo Zanzotto. 2013. Recognizing textual
entailment: Models and applications. Synthesis
Lectures on Human Language Technologies,
6(4):1–220. https://doi.org/10.2200
/S00509ED1V01Y201305HLT023
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In Proceedings of the 2019
Conference of the North American Chapter
of the Association for Computational Linguis-
tics: Human Language Technologies, Volume 1
(Long and Short Papers), pages 4171–4186,
Minneapolis, Minnesota. Association for Com-
putational Linguistics.
Jay DeYoung, Sarthak Jain, Nazneen Fatema
Rajani, Eric Lehman, Caiming Xiong, Richard
Socher, and Byron C. Wallace. 2020. ERASER:
A benchmark to evaluate rationalized NLP mod-
els. In Proceedings of the 58th Annual Meeting
of the Association for Computational Linguis-
tics, pages 4443–4458, Online. Association for
Computational Linguistics. https://doi.org
/10.18653/v1/2020.acl-main.408
Julian Eisenschlos, Syrine Krichene, and Thomas
Müller. 2020. Understanding tables with intermediate
pre-training. In Findings of the Association
for Computational Linguistics: EMNLP
2020, pages 281–296, Online. Association for
Computational Linguistics. https://doi.org
/10.18653/v1/2020.findings-emnlp.27
Shi Feng, Eric Wallace, Alvin Grissom II,
Mohit Iyyer, Pedro Rodriguez, and Jordan
Boyd-Graber. 2018. Pathologies of neural
models make interpretations difficult. In Pro-
ceedings of the 2018 Conference on Empirical
Methods in Natural Language Processing,
pages 3719–3728, Brussels, Belgium. Associ-
ation for Computational Linguistics. https://
doi.org/10.18653/v1/D18-1407
Matt Gardner, Yoav Artzi, Victoria Basmov,
Jonathan Berant, Ben Bogin, Sihao Chen,
Pradeep Dasigi, Dheeru Dua, Yanai Elazar,
Ananth Gottumukkala, Nitish Gupta, Hannaneh
Hajishirzi, Gabriel Ilharco, Daniel Khashabi,
Kevin Lin, Jiangming Liu, Nelson F. Liu,
Phoebe Mulcaire, Qiang Ning, Sameer Singh,
Noah A. Smith, Sanjay Subramanian, Reut
Tsarfaty, Eric Wallace, Ally Zhang, and Ben
Zhou. 2020. Evaluating models’ local decision
boundaries via contrast sets. In Findings of
the Association for Computational Linguistics:
EMNLP 2020, pages 1307–1323, Online. Asso-
ciation for Computational Linguistics. https://
doi.org/10.18653/v1/2020.findings
-emnlp.117
Mor Geva, Yoav Goldberg, and Jonathan Berant.
2019. Are we modeling the task or the annotator?
An investigation of annotator bias in natural
language understanding datasets. In Proceedings
of the 2019 Conference on Empirical
Methods in Natural Language Processing and
the 9th International Joint Conference on Nat-
ural Language Processing (EMNLP-IJCNLP),
pages 1161–1166, Hong Kong, China. Associ-
ation for Computational Linguistics. https://
doi.org/10.18653/v1/D19-1107
Michael Glass, Mustafa Canim, Alfio Gliozzo,
Saneem Chemmengath, Vishwajeet Kumar,
Rishav Chakravarti, Avi Sil, Feifei Pan, Samarth
Bharadwaj, and Nicolas Rodolfo Fauceglia.
2021. Capturing row and column semantics
in transformer based question answering over
tables. In Proceedings of the 2021 Conference
of the North American Chapter of the Asso-
ciation for Computational Linguistics: Human
Language Technologies, pages 1212–1224,
Online. Association for Computational Lin-
guistics. https://doi.org/10.18653
/v1/2021.naacl-main.96
Max Glockner, Vered Shwartz, and Yoav
Goldberg. 2018. Breaking NLI systems with
sentences that require simple lexical inferences.
In Proceedings of the 56th Annual Meeting
of the Association for Computational Linguis-
tics (Volume 2: Short Papers), pages 650–655,
Melbourne, Australia. Association for Compu-
tational Linguistics. https://doi.org/10
.18653/v1/P18-2103
Karan Goel, Nazneen Fatema Rajani, Jesse
Vig, Zachary Taschdjian, Mohit Bansal, and
Christopher Ré. 2021. Robustness gym: Unifying
the NLP evaluation landscape. In Proceedings
of the 2021 Conference of the North
American Chapter of the Association for
Computational Linguistics: Human Language
Technologies: Demonstrations, pages 42–55,
Online. Association for Computational Lin-
guistics. https://doi.org/10.18653
/v1/2021.naacl-demos.6
Vivek Gupta, Maitrey Mehta, Pegah Nokhiz,
and Vivek Srikumar. 2020. INFOTABS: In-
ference on tables as semi-structured data. In
Proceedings of the 58th Annual Meeting of
the Association for Computational Linguis-
tics, pages 2309–2324, Online. Association for
Computational Linguistics. https://doi.org
/10.18653/v1/2020.acl-main.210
Suchin Gururangan, Swabha Swayamdipta, Omer
Levy, Roy Schwartz, Samuel Bowman, and
Noah A. Smith. 2018. Annotation artifacts in
natural language inference data. In Proceedings
of the 2018 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
Volume 2 (Short Papers), pages 107–112, New
Orleans, Louisiana. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/N18-2017
Jonathan Herzig, Pawel Krzysztof Nowak,
Thomas Müller, Francesco Piccinno, and Julian
Eisenschlos. 2020. TaPas: Weakly supervised
table parsing via pre-training. In Proceedings of
the 58th Annual Meeting of the Association for
Computational Linguistics, pages 4320–4333,
Online. Association for Computational Linguis-
tics. https://doi.org/10.18653/v1
/2020.acl-main.398
John Hewitt and Percy Liang. 2019. Designing and
interpreting probes with control tasks. In Pro-
ceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and
the 9th International Joint Conference on Nat-
ural Language Processing (EMNLP-IJCNLP),
pages 2733–2743, Hong Kong, China. Associa-
tion for Computational Linguistics. https://
doi.org/10.18653/v1/D19-1275
William Huang, Haokun Liu, and Samuel R.
Bowman. 2020. Counterfactually-augmented
SNLI training data does not yield better gener-
alization than unaugmented data. In Proceed-
ings of the First Workshop on Insights from
Negative Results in NLP, pages 82–87, Online.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020
.insights-1.13
Hiroshi Iida, Dung Thai, Varun Manjunatha, and
Mohit Iyyer. 2021. TABBIE: Pretrained rep-
resentations of tabular data. In Proceedings
of the 2021 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
pages 3446–3456, Online. Association for
Computational Linguistics. https://doi.org
/10.18653/v1/2021.naacl-main.270
Mohit Iyyer, John Wieting, Kevin Gimpel, and
Luke Zettlemoyer. 2018. Adversarial example
generation with syntactically controlled para-
phrase networks. In Proceedings of the 2018
Conference of the North American Chapter
of the Association for Computational Linguis-
tics: Human Language Technologies, Volume 1
(Long Papers), pages 1875–1885, New Orleans,
Louisiana. Association for Computational Lin-
guistics. https://doi.org/10.18653
/v1/N18-1170
Sarthak Jain and Byron C. Wallace. 2019. At-
tention is not explanation. In Proceedings of
the 2019 Conference of the North American
Chapter of
the Association for Computa-
tional Linguistics: Human Language Tech-
nologies, Volume 1 (Long and Short Papers),
pages 3543–3556, Minneapolis, Minnesota.
Association for Computational Linguistics.
Robin Jia and Percy Liang. 2017. Adversarial ex-
amples for evaluating reading comprehension
systems. In Proceedings of the 2017 Conference
on Empirical Methods in Natural Language
Processing, pages 2021–2031, Copenhagen,
Denmark. Association for Computational Lin-
guistics. https://doi.org/10.18653
/v1/D17-1215
Divyansh Kaushik, Eduard Hovy, and Zachary
Lipton. 2020. Learning the difference that makes
a difference with counterfactually-augmented
data. In International Conference on Learning
Representations.
Patrick Lewis, Pontus Stenetorp, and Sebastian
Riedel. 2021. Question and answer test-train
overlap in open-domain question answering
datasets. In Proceedings of the 16th Conference
of the European Chapter of the Association
for Computational Linguistics: Main Volume,
pages 1000–1008, Online. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/2021.eacl-main.86
Tao Li, Parth Anand Jawale, Martha Palmer, and
Vivek Srikumar. 2020. Structured tuning for
semantic role labeling. In Proceedings of the
58th Annual Meeting of the Association for
Computational Linguistics, pages 8402–8412,
Online. Association for Computational Lin-
guistics. https://doi.org/10.18653/v1
/2020.acl-main.744
Nelson F. Liu, Roy Schwartz, and Noah A. Smith.
2019a. Inoculation by fine-tuning: A method
for analyzing challenge datasets. In Proceedings
of the 2019 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
Volume 1 (Long and Short Papers),
pages 2171–2179, Minneapolis, Minnesota.
Association for Computational Linguistics.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei
Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin
Stoyanov. 2019b. RoBERTa: A robustly op-
timized BERT pretraining approach. arXiv
preprint arXiv:1907.11692. Version 1.
Zeyu Liu, Yizhong Wang, Jungo Kasai, Hannaneh
Hajishirzi, and Noah A. Smith. 2021. Probing
across time: What does RoBERTa know and
when? In Findings of
the Association for
Computational Linguistics: EMNLP 2021,
pages 820–842, Punta Cana, Dominican Republic.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2021
.findings-emnlp.71
Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019.
Right for the wrong reasons: Diagnosing syn-
tactic heuristics in natural language inference.
In Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics,
pages 3428–3448, Florence, Italy. Association
for Computational Linguistics. https://doi
.org/10.18653/v1/P19-1334
Anshuman Mishra, Dhruvesh Patel, Aparna
Vijayakumar, Xiang Lorraine Li, Pavan
Kapanipathi, and Kartik Talamadupula. 2021.
Looking beyond sentence-level natural language
inference for question answering and text sum-
marization. In Proceedings of the 2021 Con-
ference of the North American Chapter of the
Association for Computational Linguistics: Hu-
man Language Technologies, pages 1322–1336,
Online. Association for Computational Lin-
guistics. https://doi.org/10.18653/v1
/2021.naacl-main.104
Thomas Müller, Julian Eisenschlos, and Syrine
Krichene. 2021. TAPAS at SemEval-2021 task
9: Reasoning over tables with intermediate
pre-training. In Proceedings of the 15th In-
ternational Workshop on Semantic Evalua-
tion (SemEval-2021), pages 423–430, Online.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2021
.semeval-1.51
Aakanksha Naik, Abhilasha Ravichander, Norman
Sadeh, Carolyn Rose, and Graham Neubig.
2018. Stress test evaluation for natural lan-
guage inference. In Proceedings of the 27th
International Conference on Computational
Linguistics, pages 2340–2353, Santa Fe, New
Mexico, USA. Association for Computational
Linguistics.
J. Neeraja, Vivek Gupta, and Vivek Srikumar.
2021. Incorporating external knowledge to en-
hance tabular reasoning. In Proceedings of
the 2021 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
pages 2799–2809, Online. Association for
Computational Linguistics. https://doi.org
/10.18653/v1/2021.naacl-main.224
Yixin Nie, Yicheng Wang, and Mohit Bansal. 2019.
Analyzing compositionality-sensitivity of NLI
models. Proceedings of the
AAAI Conference on Artificial Intelligence,
33(01):6867–6874. https://doi.org/10
.1609/aaai.v33i01.33016867
Timothy Niven and Hung-Yu Kao. 2019. Prob-
ing neural network comprehension of natural
language arguments. In Proceedings of the
57th Annual Meeting of the Association for
Computational Linguistics, pages 4658–4664,
Florence, Italy. Association for Computational
Linguistics. https://doi.org/10.18653
/v1/P19-1459
Barlas Oguz, Xilun Chen, Vladimir Karpukhin,
Stan Peshterliev, Dmytro Okhonko, Michael
Schlichtkrull, Sonal Gupta, Yashar Mehdad,
and Scott Yih. 2020. Unified open-domain
question answering with structured and un-
structured knowledge. arXiv preprint arXiv:
2012.14610, Version 2.
Bhargavi Paranjape, Mandar Joshi, John
Thickstun, Hannaneh Hajishirzi, and Luke
Zettlemoyer. 2020. An information bottleneck
approach for controlling conciseness in rationale
extraction. In Proceedings of the 2020 Con-
ference on Empirical Methods in Natural Lan-
guage Processing (EMNLP), pages 1938–1952,
Online. Association for Computational Lin-
guistics. https://doi.org/10.18653/v1
/2020.emnlp-main.153
Ankur Parikh, Xuezhi Wang, Sebastian
Gehrmann, Manaal Faruqui, Bhuwan Dhingra,
Diyi Yang, and Dipanjan Das. 2020. ToTTo:
A controlled table-to-text generation dataset.
In Proceedings of the 2020 Conference on
Empirical Methods in Natural Language
Processing (EMNLP), pages 1173–1186, Online.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020.emnlp-main.89
Panupong Pasupat and Percy Liang. 2015. Com-
positional semantic parsing on semi-structured
tables. In Proceedings of the 53rd Annual
Meeting of the Association for Computational
Linguistics and the 7th International Joint
Conference on Natural Language Processing
(Volume 1: Long Papers), pages 1470–1480,
Beijing, China. Association for Computational
Linguistics. https://doi.org/10.3115
/v1/P15-1142
Adam Poliak, Jason Naradowsky, Aparajita
Haldar, Rachel Rudinger, and Benjamin Van
Durme. 2018. Hypothesis only baselines in
natural language inference. In Proceedings of
the Seventh Joint Conference on Lexical and
Computational Semantics, pages 180–191, New
Orleans, Louisiana. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/S18-2023
Aniket Pramanick and Indrajit Bhattacharya.
2021. Joint learning of representations for web-
tables, entities and types using graph convo-
lutional network. In Proceedings of the 16th
Conference of the European Chapter of the
Association for Computational Linguistics:
Main Volume, pages 1197–1206, Online.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2021
.eacl-main.102
Abhilasha Ravichander, Yonatan Belinkov, and
Eduard Hovy. 2021. Probing the probing
paradigm: Does probing accuracy entail task
relevance? In Proceedings of the 16th Confer-
ence of the European Chapter of the Asso-
ciation for Computational Linguistics: Main
Volume, pages 3363–3377, Online. Association
for Computational Linguistics. https://doi
.org/10.18653/v1/2021.eacl-main.295
Marco Tulio Ribeiro, Sameer Singh, and Carlos
Guestrin. 2016. “Why should I trust you?”:
Explaining the predictions of any classifier. In
Proceedings of the 22nd ACM SIGKDD Inter-
national Conference on Knowledge Discovery
and Data Mining, KDD ’16, pages 1135–1144,
New York, NY, USA. Association for Com-
puting Machinery. https://doi.org/10
.1145/2939672.2939778
Marco Tulio Ribeiro, Sameer Singh, and Carlos
Guestrin. 2018a. Anchors: High-precision model-
agnostic explanations. Proceedings of the AAAI
Conference on Artificial Intelligence, 32(1).
Marco Tulio Ribeiro, Sameer Singh, and Carlos
Guestrin. 2018b. Semantically equivalent ad-
versarial rules for debugging NLP models.
In Proceedings of the 56th Annual Meeting
of the Association for Computational Linguis-
tics (Volume 1: Long Papers), pages 856–865,
Melbourne, Australia. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/P18-1079
Marco Tulio Ribeiro, Tongshuang Wu, Carlos
Guestrin, and Sameer Singh. 2020. Beyond accuracy:
Behavioral testing of NLP models with CheckList.
In Proceedings of the 58th Annual Meeting of the
Association for Computational Linguistics,
pages 4902–4912, Online. Association for
Computational Linguistics.
https://doi.org/10.18653/v1/2020.acl-main.442
Kyle Richardson, Hai Hu, Lawrence Moss, and
Ashish Sabharwal. 2020. Probing natural lan-
guage inference models through semantic frag-
ments. Proceedings of the AAAI Conference
on Artificial Intelligence, 34(05):8713–8721.
https://doi.org/10.1609/aaai.v34i05
.6397
Kyle Richardson and Ashish Sabharwal. 2020.
What does my QA model know? Devising con-
trolled probes using expert knowledge. Trans-
actions of the Association for Computational
Linguistics, 8:572–588. https://doi.org
/10.1162/tacl_a_00331
Sofia Serrano and Noah A. Smith. 2019. Is at-
tention interpretable? In Proceedings of the
57th Annual Meeting of the Association for
Computational Linguistics, pages 2931–2951,
Florence, Italy. Association for Computational
Linguistics. https://doi.org/10.18653
/v1/P19-1282
Ishan Tarunesh, Somak Aditya, and Monojit
Choudhury. 2021. Trusting RoBERTa over
BERT: Insights from checklisting the natural
language inference task. arXiv preprint arXiv:
2107.07229. Version 1.
Lifu Tu, Garima Lalwani, Spandana Gella, and
He He. 2020. An empirical study on robustness
to spurious correlations using pre-trained
language models. Transactions of the Association
for Computational Linguistics, 8:621–633.
https://doi.org/10.1162/tacl_a_00335
Nancy Xin Ru Wang, Diwakar Mahajan,
Marina Danilevsky, and Sara Rosenthal. 2021a.
SemEval-2021 task 9: Fact verification and
evidence finding for tabular data in scientific
documents (sem-tab-facts). In SemEval@ACL/
IJCNLP, pages 317–326.
Siyuan Wang, Wanjun Zhong, Duyu Tang,
Zhongyu Wei, Zhihao Fan, Daxin Jiang, Ming
Zhou, and Nan Duan. 2021b. Logic-driven
context extension and data augmentation for
logical reasoning of text. arXiv preprint
arXiv:2105.03659. Version 1.
Sarah Wiegreffe and Yuval Pinter. 2019.
Attention is not not explanation. In Proceedings of
the 2019 Conference on Empirical Methods in
Natural Language Processing and the 9th
International Joint Conference on Natural Language
Processing (EMNLP-IJCNLP), pages 11–20,
Hong Kong, China. Association for Computational
Linguistics. https://doi.org/10.18653/v1/D19-1002
Pengcheng Yin, Graham Neubig, Wen-tau Yih,
and Sebastian Riedel. 2020. TaBERT: Pretraining
for joint understanding of textual and tabular
data. In Proceedings of the 58th Annual Meeting
of the Association for Computational Linguistics,
pages 8413–8426, Online. Association for
Computational Linguistics.
Ori Yoran, Alon Talmor, and Jonathan Berant.
2021. Turning tables: Generating examples
from semi-structured tables for endowing
language models with reasoning skills. arXiv
preprint arXiv:2107.07261. Version 1.
Vicky Zayats, Kristina Toutanova, and Mari
Ostendorf. 2021. Representations for question
answering from documents with tables and
text. In Proceedings of the 16th Conference
of the European Chapter of the Association
for Computational Linguistics: Main Volume,
pages 2895–2906, Online. Association for
Computational Linguistics.
https://doi.org/10.18653/v1/2021.eacl-main.253
Chong Zhang, Jieyu Zhao, Huan Zhang, Kai-Wei
Chang, and Cho-Jui Hsieh. 2021. Double per-
turbation: On the robustness of robustness and
counterfactual bias evaluation. In Proceedings
of the 2021 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
pages 3899–3916, Online. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/2021.naacl-main.305
Appendix
In Section 5.1, we defined four types of
row-agnostic table modifications: (a) row deletion,
(b) row insertion, (c) row-value update, and (d) row
permutation, and presented the first one there. We
present details of the rest here, along with their
respective impact on the α1, α2, and α3 test sets.
Row Insertion. When we insert new information
that does not contradict an existing table,12
original predictions should be retained in almost
all cases. Very rarely, NEUTRAL labels may change
to ENTAIL or CONTRADICT. For example, adding
the Singles row below to our running example
table does not change the label of any hypothesis
except H4 (see Figure 1), whose label changes
to CONTRADICT with the additional information.
Singles The Logical Song; Breakfast in
America; Goodbye Stranger; Take
the Long Way Home
Figure 7 shows the possible label changes after
new row insertion as a directed labeled graph, and
the results are summarized in Table 13. Note that
all transitions from NEUTRAL are valid upon row
insertion, although not all may be accurate.
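The insertion probe described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes a table is a Python dict mapping each key to a list of values, and the names `insert_row`, `VALID_AFTER_INSERTION`, and `is_valid_insertion_transition` are hypothetical. The set of valid transitions is read off the description above (labels are retained, except that NEUTRAL may move to any label), and only rows with new keys are inserted, as in footnote 12.

```python
import random

# Transitions (before, after) treated as valid after row insertion, per the
# description above: predictions are retained, and NEUTRAL may also change
# to ENTAIL or CONTRADICT. This encoding is an assumption of the sketch.
VALID_AFTER_INSERTION = {
    ("ENTAIL", "ENTAIL"),
    ("CONTRADICT", "CONTRADICT"),
    ("NEUTRAL", "NEUTRAL"),
    ("NEUTRAL", "ENTAIL"),
    ("NEUTRAL", "CONTRADICT"),
}


def insert_row(table, donor_rows, rng):
    """Return a copy of `table` with one donor row whose key is new.

    Rows whose keys already exist in `table` are never used, so the
    inserted information cannot contradict an existing row.
    """
    new_table = {key: list(values) for key, values in table.items()}
    candidates = [(k, v) for k, v in donor_rows.items() if k not in table]
    if candidates:
        key, values = rng.choice(candidates)
        new_table[key] = list(values)
    return new_table


def is_valid_insertion_transition(before, after):
    """Check whether a label change is allowed after row insertion."""
    return (before, after) in VALID_AFTER_INSERTION


if __name__ == "__main__":
    # Toy table; the Genre values are illustrative only.
    table = {"Genre": ["pop", "rock"]}
    donor = {"Genre": ["jazz"], "Singles": ["The Logical Song"]}
    print(insert_row(table, donor, random.Random(0)))
    print(is_valid_insertion_transition("ENTAIL", "CONTRADICT"))
```

A prediction pair such as ENTAIL to CONTRADICT is flagged as invalid by this check, which is the kind of transition counted in Table 13.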
Row Update. In the case of a row update, we only
change a portion of a row value. Whole row value
substitutions are examined separately as compos-
ite operations of deletion followed by insertion.
Unlike a whole row update, changing only a por-
tion of a row is non-trivial. We must ensure that
the updated value is appropriate for the key in
question and also avoid self-contradictions. To
satisfy these constraints, we update a row with a
value from a random table with the same key and
only update values in multi-valued rows. A row
update operation may have an effect on all labels.
12To ensure that the information added is not contradictory
to existing rows, we only add rows with new keys instead of
changing values for the existing keys.
Figure 7: Changes in model predictions after new row
insertion. (Notation similar to Figure 3).
Dataset    ENTAIL    NEUTRAL    CONTRADICT    Average
α1         2.81      0          6.77          3.19
α2         4.99      0          6.54          3.84
α3         2.51      0          6.35          2.95
Average    3.44      0          6.55          –
Table 13: Percentage of invalid transitions after
new row insertion. For an ideal model, all these
numbers should be zero.
For row updates, we treat transitions from
CONTRADICT to ENTAIL as prohibited, even though they are feasible. Unlike
ENTAIL to CONTRADICT transitions, these transitions
would be extremely rare as values are updated
randomly, regardless of their semantics. For ex-
ample, if we substitute pop in the multi-valued
key Genre in our running example with another
genre, the hypothesis H1 is likely to change to
CONTRADICT.
Since we are updating a single value from a
multi-valued key, the changes to the table are
minimal and may not be perceived by the model.
As a result, we should expect row updates to have
lower impact on model predictions. This appears
to be the case, as evidenced by the results in
Figure 8, which show that the labels do not change
drastically after update. The results in Figure 8
are summarized in Table 14.
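The row-update procedure described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: tables are dicts from keys to lists of values, `update_row_value` is a hypothetical name, and skipping donor values that already occur in the row is an added assumption meant to avoid trivial or self-contradictory updates.

```python
import random


def update_row_value(table, other_tables, rng):
    """Replace one value in a multi-valued row of `table`.

    The replacement comes from a randomly chosen other table that has the
    same key, so the new value is appropriate for that key. Returns a new
    table, or None when no suitable update exists.
    """
    # Only multi-valued rows are eligible for an update.
    multi_keys = [key for key, values in table.items() if len(values) > 1]
    rng.shuffle(multi_keys)
    for key in multi_keys:
        donors = [other for other in other_tables if key in other]
        rng.shuffle(donors)
        for donor in donors:
            # Assumption: ignore donor values already present in the row.
            fresh = [v for v in donor[key] if v not in table[key]]
            if fresh:
                updated = {k: list(vals) for k, vals in table.items()}
                position = rng.randrange(len(updated[key]))
                updated[key][position] = rng.choice(fresh)
                return updated
    return None


if __name__ == "__main__":
    # Toy tables; all values are illustrative only.
    table = {"Genre": ["pop", "rock"], "Label": ["A&M"]}
    others = [{"Genre": ["jazz"]}, {"Length": ["46:06"]}]
    print(update_row_value(table, others, random.Random(0)))
```

Because only one value of a multi-valued row changes, the perturbed table differs minimally from the original, which matches the small effect on predictions reported for this operation.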
Row Permutation. By design of the premises,
the order of their rows should have no effect on
hypothesis labels. In other words, the labels should
be invariant to row permutation. However, from
Figure 9, it is evident that even a simple shuffling
of rows, where no information has been tampered
with, can have a notable effect on performance.
Dataset    ENTAIL    NEUTRAL    CONTRADICT    Average
α1         9.25      7.1        11.6          9.34
α2         12.2      6.8        8.76          9.26
α3         14.6      12.5       13.7          13.6
Average    12.02     8.79       11.36         –
Table 15: Percentage of invalid transitions after
row permutations. For an ideal model, all these
numbers should be zero.
Figure 8: Changes in model predictions after row value
update. (Notation similar to Figure 3).
Dataset    ENTAIL    NEUTRAL    CONTRADICT    Average
α1         0.08      0.12       0.49          0.23
α2         0.22      0.11       0.30          0.21
α3         0.12      0.09       0.19          0.13
Average    0.14      0.11       0.33          –
Table 14: Percentage of invalid transitions after
row value update. For an ideal model, all these
numbers should be zero.
Figure 9: Changes in model predictions after shuffling
of table rows. (Notation similar to Figure 3.)
This shows that the model incorrectly relies on row
positions, even though the semantics of a table
are invariant to row order. We summarize the combined
invalid transitions from Figure 9 in Table 15.
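A row-permutation check of this kind can be sketched as below. The sketch assumes a table is a dict whose insertion order stands in for row order, and that `predict(table, hypothesis)` is a hypothetical stand-in for the model under test; `permute_rows`, `permutation_flip_rate`, and the `n_shuffles` parameter are likewise illustrative names, not part of the original experiments.

```python
import random


def permute_rows(table, rng):
    """Return a copy of `table` with the same rows in shuffled order."""
    keys = list(table)
    rng.shuffle(keys)
    return {key: list(table[key]) for key in keys}


def permutation_flip_rate(examples, predict, rng, n_shuffles=5):
    """Percentage of shuffled inputs whose predicted label changes.

    `examples` is an iterable of (table, hypothesis) pairs. Since no
    information is altered by shuffling, any change counts as invalid.
    """
    flips = 0
    total = 0
    for table, hypothesis in examples:
        original = predict(table, hypothesis)
        for _ in range(n_shuffles):
            shuffled = permute_rows(table, rng)
            if predict(shuffled, hypothesis) != original:
                flips += 1
            total += 1
    return 100.0 * flips / total if total else 0.0


if __name__ == "__main__":
    # A dummy order-insensitive predictor, used only for illustration.
    def predict(table, hypothesis):
        return "ENTAIL" if "Genre" in table else "NEUTRAL"

    data = [({"Genre": ["pop"], "Label": ["A&M"]}, "some hypothesis")]
    print(permutation_flip_rate(data, predict, random.Random(0)))
```

An order-invariant predictor yields a flip rate of zero under this measure, which is the ideal behavior the tables compare against.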
Irrelevant Row Deletion. Ideally, deletion of
an irrelevant row should have no effect on a
hypothesis label. The results in Figure 10 and in
Table 16 show that even irrelevant rows have an
effect on model predictions. This further illustrates
that the seemingly accurate model predictions are
not appropriately grounded on evidence.
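One plausible way to compute per-label percentages of invalid transitions, such as those reported in the tables of this appendix, is sketched below. The normalization (dividing by the number of examples that originally received each label) is an assumption of this sketch, and `invalid_transition_rates` and `IDENTITY_ONLY` are hypothetical names. For irrelevant row deletion, only unchanged labels are treated as valid.

```python
from collections import Counter

LABELS = ("ENTAIL", "NEUTRAL", "CONTRADICT")

# For irrelevant row deletion, a label should never change.
IDENTITY_ONLY = {(label, label) for label in LABELS}


def invalid_transition_rates(pairs, valid):
    """Per-label percentage of invalid (before, after) prediction pairs.

    `pairs` holds the predicted label before and after a perturbation;
    `valid` is the set of allowed (before, after) transitions.
    """
    totals = Counter()
    invalid = Counter()
    for before, after in pairs:
        totals[before] += 1
        if (before, after) not in valid:
            invalid[before] += 1
    return {
        label: (100.0 * invalid[label] / totals[label]) if totals[label] else 0.0
        for label in LABELS
    }


if __name__ == "__main__":
    # Made-up prediction pairs, for illustration only.
    pairs = [
        ("ENTAIL", "ENTAIL"),
        ("ENTAIL", "NEUTRAL"),
        ("NEUTRAL", "NEUTRAL"),
        ("CONTRADICT", "ENTAIL"),
    ]
    print(invalid_transition_rates(pairs, IDENTITY_ONLY))
```

The same function can be reused for the other operations by passing a different `valid` set.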
Figure 10: Change in model predictions after deletion
of an irrelevant row. (Notation similar to Figure 3.)
Dataset    ENTAIL    NEUTRAL    CONTRADICT    Average
α1         5.14      3.9        5.94          4.99
α2         6.97      3.54       5.09          5.2
α3         6.09      5.01       6.91          6.01
Average    6.07      4.15       5.98          –
Table 16: Percentage of invalid transitions after
deletion of irrelevant rows. For an ideal model,
all these numbers should be zero.
Composition of Perturbation Operations. In
addition to probing individual operations, we can
also study their compositions. For example, we
could delete a row, and insert a different row, and
so on. The composition of these operations has
interesting properties with respect to the allowed
transitions. For example, when an operation is
composed with itself (e.g., two deletions), the
set of valid label changes is the same as for the
operation. A particularly interesting composition
is deletion followed by an insertion, since this can
be viewed as a row update. In Figure 11, we show
the transition graph for the composition operation
of row deletion followed by insertion and the
summary of the possible transitions is presented
in Table 17.
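The self-composition property mentioned above can be checked with a small sketch that treats each operation's valid label changes as a relation over labels. The `INSERTION` set below is this sketch's reading of the Row Insertion description (labels are retained, and NEUTRAL may change to any label); the valid set for deletion is defined in Section 5.1 and is not reproduced here. The function name `compose` is hypothetical.

```python
# Valid (before, after) label changes for row insertion, as read from the
# Row Insertion paragraph. This encoding is an assumption of the sketch.
INSERTION = {
    ("ENTAIL", "ENTAIL"),
    ("CONTRADICT", "CONTRADICT"),
    ("NEUTRAL", "NEUTRAL"),
    ("NEUTRAL", "ENTAIL"),
    ("NEUTRAL", "CONTRADICT"),
}


def compose(first, second):
    """Valid transitions when `first` is applied and then `second`.

    A change a -> c is valid for the composition if some intermediate
    label b makes a -> b valid for `first` and b -> c valid for `second`.
    """
    return {
        (a, c)
        for (a, b) in first
        for (b2, c) in second
        if b == b2
    }


if __name__ == "__main__":
    # Two insertions allow exactly the same label changes as one insertion.
    print(compose(INSERTION, INSERTION) == INSERTION)
```

Composing two different operations works the same way: pass the valid set of the first operation and then that of the second.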
Dataset    ENTAIL    NEUTRAL    CONTRADICT    Average
α1         3.02      0.00       9.81          4.28
α2         6.53      0.00       7.88          4.80
α3         4.16      0.00       6.71          3.63
Average    4.57      0.00       8.13          –
Table 17: Percentage of invalid transitions after
deletion followed by an insertion operation.
For an ideal model, all these numbers should
be zero.
Figure 11: Changes in model predictions after deletion
followed by an insert operation. (Notation similar to
Figure 3.)