An Empirical Study on Robustness to Spurious Correlations using
Pre-trained Language Models

Lifu Tu1 ∗ Garima Lalwani2 Spandana Gella2 He He3 ∗

1Toyota Technological Institute at Chicago 2Amazon AI

3New York University

lifu@ttic.edu, {glalwani, sgella}@amazon.com, hehe@cs.nyu.edu

Abstract

Recent work has shown that pre-trained language models such as BERT improve robustness to spurious correlations in the dataset. Intrigued by these results, we find that the key to their success is generalization from a small amount of counterexamples where the spurious correlations do not hold. When such minority examples are scarce, pre-trained models perform as poorly as models trained from scratch. In the case of extreme minority, we propose to use multi-task learning (MTL) to improve generalization. Our experiments on natural language inference and paraphrase identification show that MTL with the right auxiliary tasks significantly improves performance on challenging examples without hurting the in-distribution performance. Further, we show that the gain from MTL mainly comes from improved generalization from the minority examples. Our results highlight the importance of data diversity for overcoming spurious correlations.1

1 Introduction

A key challenge in building robust NLP models is the gap between the limited linguistic variation in the training data and the diversity of real-world language. Thus models trained on a specific dataset are likely to rely on spurious correlations: prediction rules that work for the majority of examples but do not hold in general. For example, in natural language inference (NLI) tasks, previous work has found that models learned on notable benchmarks achieve high accuracy by associating high word overlap between the premise and the hypothesis with entailment (Dasgupta et al., 2018; McCoy et al., 2019). Consequently, these models perform poorly on the so-called challenging or adversarial datasets, where such correlations no longer hold (Glockner et al., 2018; McCoy et al., 2019; Nie et al., 2019; Zhang et al., 2019). This issue has also been referred to as annotation artifacts (Gururangan et al., 2018), dataset bias (He et al., 2019; Clark et al., 2019), and group shift (Oren et al., 2019; Sagawa et al., 2020) in the literature.

∗Most work was done during the first author's internship and the last author's work at Amazon AI.

Most current methods rely on prior knowledge of spurious correlations in the dataset and tend to suffer from a trade-off between in-distribution accuracy on the independent and identically distributed (i.i.d.) test set and robust accuracy2 on the challenging dataset. Nevertheless, recent empirical results have suggested that self-supervised pre-training improves robust accuracy, while not using any task-specific knowledge nor incurring an in-distribution accuracy drop (Hendrycks et al., 2019, 2020).

In this paper, we aim to investigate how and when pre-trained language models such as BERT improve performance on challenging datasets. Our key finding is that pre-trained models are more robust to spurious correlations because they can generalize from a minority of training examples that counter the spurious pattern, e.g., non-entailment examples with high premise-hypothesis word overlap. Specifically, removing these counterexamples from the training set significantly hurts their performance on the challenging datasets. Furthermore, larger model size, more pre-training data, and longer fine-tuning further improve robust accuracy. Nevertheless, pre-trained models still suffer from spurious correlations when there are too few counterexamples. In the case of extreme

1Code is available at https://github.com/lifu-tu/Study-NLP-Robustness.

2We use the term ‘‘robust accuracy’’ from now on to refer to the accuracy on challenging datasets.


Dataset            Size   Heuristic                         Input                                                      Label

Natural language inference
Train  MNLI        393k   high word overlap ⇒ entailment    P: The doctor mentioned the manager who ran.               entailment
                                                            H: The doctor mentioned the manager.
Test   HANS        30k    high word overlap ⇏ entailment    P: The actors who advised the manager saw the tourists.    non-entailment
                                                            H: The manager saw the tourists.

Paraphrase identification
Train  QQP         364k   same bag-of-words ⇒ paraphrase    S1: Bangkok vs Shanghai?                                   paraphrase
                                                            S2: Shanghai vs Bangkok?
Test   PAWSQQP     677    same bag-of-words ⇏ paraphrase    S1: Are all dogs smart or can some be dumb?                non-paraphrase
                                                            S2: Are all dogs dumb or can some be smart?

Table 1: Representative examples from the training datasets (MNLI and QQP) and the challenging/test datasets (HANS and PAWSQQP). Overlapping text spans are highlighted for NLI examples and swapped words are highlighted for paraphrase identification examples. The word overlap-based heuristic that works for typical training examples fails on the test data.

minority, we empirically show that multi-task learning (MTL) improves robust accuracy by improving generalization from the minority examples, even though previous work has suggested that MTL has limited advantage in i.i.d. settings (Søgaard and Goldberg, 2016; Hashimoto et al., 2017).

This work sheds light on the effectiveness of pre-training on robustness to spurious correlations. Our results highlight the importance of data diversity (even if the variations are imbalanced). The improvement from MTL also suggests that traditional techniques that improve generalization in the i.i.d. setting can also improve out-of-distribution generalization through the minority examples.

2 Challenging Datasets

In a typical supervised learning setting, we test the model on held-out examples drawn from the same distribution as the training data, i.e., the in-distribution or i.i.d. test set. To evaluate whether the model latches onto known spurious correlations, challenging examples are drawn from a different distribution where such correlations do not hold. In practice, these examples are usually adapted from the in-distribution examples to counter known spurious correlations on notable benchmarks. Poor performance on the challenging dataset is considered an indicator of a problematic model that relies on spurious correlations between inputs and labels. Our goal is to develop robust models that have good performance on both the i.i.d. test set and the challenging test set.

2.1 Datasets

We focus on two natural language understanding tasks, NLI and paraphrase identification (PI). Both have large-scale benchmarking datasets with around 400k examples. Although recent models have achieved near-human performance on these benchmarks,3 the challenging datasets exploiting spurious correlations bring the performance of state-of-the-art models down below random guessing. We summarize the datasets used for our analysis in Table 1.

NLI. Given a premise sentence and a hypothesis sentence, the task is to predict whether the hypothesis is entailed by, neutral with, or contradicts the premise. MultiNLI (MNLI) (Williams et al., 2017) is the most widely used benchmark for NLI, and it is also the most thoroughly studied in terms of spurious correlations. It was collected using the same crowdsourcing protocol as its predecessor SNLI (Bowman et al., 2015) but covers more domains. Recently, McCoy et al. (2019) exploited the high word overlap between the premise and the hypothesis in entailment examples to construct a challenging dataset called HANS. They use syntactic rules to generate non-entailment (neutral or contradicting) examples with high premise-hypothesis overlap. The dataset is further split into three categories depending on the rules used: lexical overlap, subsequence, and constituent.

3See the leaderboard at https://gluebenchmark.com.


                            Trained on MNLI                  Trained on QQP
Model                       MNLI-m        HANS               QQP           PAWSQQP

Non-pre-trained baselines
BERTscratch                 67.9 (0.5)    49.9 (0.2)         83.0 (0.7)    40.6 (1.9)
ESIM                        78.1a         49.1a              85.3b         38.9b

Pre-trained models
BERTBASE (prior)            84.0c         53.8c              90.5d         33.5d
BERTBASE (ours)             84.5 (0.1)    62.5 (3.4)         90.8 (0.3)    36.1 (0.8)
BERTLARGE                   86.2 (0.2)    71.4 (0.6)         91.3 (0.3)    40.1 (1.8)
RoBERTaBASE                 87.4 (0.2)    74.1 (0.9)         91.5 (0.2)    42.6 (1.9)
RoBERTaLARGE                89.1 (0.1)    77.1 (1.6)         89.0 (3.1)    39.5 (4.8)

Table 2: Accuracies (with standard deviation) on the in-distribution datasets, MNLI-matched (MNLI-m) and QQP dev sets, as well as the challenging datasets, HANS and PAWSQQP. Pre-trained transformers improve accuracies on both the in-distribution and challenging datasets over non-pre-trained models, except on PAWSQQP. Our models, fine-tuned for more epochs, further improve prior results on the challenging data. Results taken from prior work: a He et al. (2019), b Zhang et al. (2019), c McCoy et al. (2019), d Zhang et al. (2019).

PI. Given two sentences, the task is to predict whether they are paraphrases or not. On Quora Question Pairs (QQP) (Iyer et al., 2017), one of the largest PI datasets, Zhang et al. (2019) show that very few non-paraphrase pairs have high word overlap. They then created a challenging dataset called PAWS that contains sentence pairs with high word overlap but different meanings, obtained through word swapping and back-translation. In addition to PAWSQQP, which is created from sentences in QQP, they also released PAWSWiki, created from Wikipedia sentences.

3 Pre-training Improves Robust Accuracy

Recent results have shown that pre-trained models
appear to improve performance on challenging
examples over models trained from scratch
(Yaghoobzadeh et al., 2019; He et al., 2019;
Kaushik et al., 2020). In this section, we confirm
this observation by thorough experiments on
different pre-trained models and motivate our
inquiries.

Models. We compare pre-trained models of different sizes and using different amounts of pre-training data. Specifically, we use the BERTBASE (110M parameters) and BERTLARGE (340M parameters) models implemented in GluonNLP (Guo et al., 2020), pre-trained on 16GB of text (Devlin et al., 2019).4 To investigate the effect of the size of the pre-training data, we also experiment with the RoBERTaBASE and RoBERTaLARGE models (Liu et al., 2019d),5 which have the same architecture as BERT but were trained on ten times as much text (about 160GB). To ablate the effect of pre-training, we also include a BERTBASE model with random initialization, BERTscratch.

Fine-Tuning. We fine-tuned all models for 20 epochs and selected the best model based on the in-distribution dev set. We used the Adam optimizer with a learning rate of 2e-5, L2 weight decay of 0.01, and batch sizes of 32 and 16 for base and large models, respectively. The weights of BERTscratch and of the last layer (classifier) of pre-trained models are initialized from a normal distribution with zero mean and 0.02 variance. All experiments are run with 5 random seeds and the average values are reported.
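For concreteness, the sketch below mirrors this fine-tuning recipe in PyTorch with the Hugging Face transformers library. This is an illustrative assumption, not the authors' implementation (the paper uses GluonNLP); the `train_loader` and `evaluate` helpers for MNLI/QQP batches and dev accuracy are assumed and not shown.

```python
import torch
from transformers import AutoModelForSequenceClassification


def build_model(num_labels: int = 3):
    """BERT-base with a freshly initialized classifier head (zero mean, 0.02 scale)."""
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=num_labels)
    torch.nn.init.normal_(model.classifier.weight, mean=0.0, std=0.02)
    torch.nn.init.zeros_(model.classifier.bias)
    return model


def fine_tune(model, train_loader, evaluate, num_epochs: int = 20):
    """Fine-tune for 20 epochs; keep the checkpoint with the best i.i.d. dev accuracy.

    `train_loader` yields tokenized batches containing a `labels` key and
    `evaluate(model)` returns dev-set accuracy; both are assumed helpers.
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
    best_acc, best_state = -1.0, None
    for _ in range(num_epochs):
        model.train()
        for batch in train_loader:          # batch size 32 for base models
            optimizer.zero_grad()
            loss = model(**batch).loss      # cross-entropy computed internally
            loss.backward()
            optimizer.step()
        acc = evaluate(model)               # in-distribution dev accuracy
        if acc > best_acc:
            best_acc = acc
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
    model.load_state_dict(best_state)
    return model, best_acc
```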

Observations and Inquiries. In Table 2, we show results for NLI and PI, respectively. As

4The book_corpus_wiki_en_uncased model from https://gluon-nlp.mxnet.io/model_zoo/bert/index.html.

5The openwebtext_ccnews_stories_books_cased model from https://gluon-nlp.mxnet.io/model_zoo/bert/index.html.


Figure 1: Accuracy on the in-distribution data (MNLI dev and QQP dev) and the challenging data (HANS and PAWSQQP) after each fine-tuning epoch using BERTBASE. The performance on the in-distribution data plateaus quickly, around epoch 3, whereas accuracy on the challenging data keeps increasing.

expected, they improve performance on the in-distribution test sets significantly.6 On the challenging datasets, we make two key observations.

First, although pre-trained models improve the performance on challenging datasets, the improvement is not consistent across datasets. Specifically, the improvement on PAWSQQP is less promising than that on HANS. Whereas larger models (large vs. base) and more training data (RoBERTa vs. BERT) yield a further improvement of 5 to 10 accuracy points on HANS, the improvement on PAWSQQP is marginal.

Second, even though three to four epochs of fine-tuning is typically sufficient for in-distribution data, we observe that longer fine-tuning improves results on challenging examples significantly (see BERTBASE ours vs. prior in Table 2). As shown in Figure 1, although the accuracy on the MNLI and QQP dev sets saturates after three epochs, the performance on the corresponding challenging datasets keeps increasing until around the tenth epoch, with more than 30% improvement.
The above observations motivate us to ask the following questions:

1. How do pre-trained models generalize to out-of-distribution data?

2. When do they generalize well, given the inconsistent improvements?

3. What role does longer fine-tuning play?

We provide empirical answers to these questions in the next section and show that the answers are all related to a small number of counterexamples in the training data.

4 Generalization from Minority Examples

4.1 Pre-training Improves Robustness to Data Imbalance

One common impression is that the diversity in large amounts of pre-training data allows pre-trained models to generalize better to out-of-distribution data. Here we show that although pre-training improves generalization, pre-trained models do not extrapolate to unseen patterns. Instead, they generalize better from minority patterns in the training set.

Importantly, we notice that examples in HANS and PAWS are not completely uncovered by the training data, but belong to minority groups.7 For example, in MNLI, there are 727 HANS-like non-entailment examples where all words in the hypothesis also occur in the premise; in QQP, there are 247 PAWS-like non-paraphrase examples where the two sentences have the same bag of words. We refer to these examples that counter the spurious correlations as minority examples. We hypothesize that pre-trained models are more robust to group imbalance, thus generalizing well from the minority groups.

6The lower performance of RoBERTaLARGE compared with RoBERTaBASE is partly due to its high variance in our experiments.

7Following Sagawa et al. (2020), we loosely define a group as a distribution of examples with similar patterns, e.g., high premise-hypothesis overlap and non-entailment.


Figure 2: Accuracy on HANS when a small fraction of the training data is removed. Removing non-entailment examples with high premise-hypothesis overlap significantly hurts performance compared with removing examples uniformly at random.

To verify our hypothesis, we remove minority examples during training and observe the effect on robust accuracy. Specifically, for NLI we sort non-entailment (contradiction and neutral) examples in MNLI by their premise-hypothesis overlap, which is defined as the percentage of hypothesis words that also appear in the premise. We then remove increasing amounts of these examples in the sorted order.
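A minimal sketch of this overlap-based removal is shown below. The example representation (dicts with "premise", "hypothesis", and "label" keys) and the whitespace tokenization are simplifying assumptions, not the authors' exact preprocessing.

```python
def word_overlap(premise: str, hypothesis: str) -> float:
    """Fraction of hypothesis words that also appear in the premise."""
    premise_words = set(premise.lower().split())
    hypothesis_words = hypothesis.lower().split()
    if not hypothesis_words:
        return 0.0
    return sum(w in premise_words for w in hypothesis_words) / len(hypothesis_words)


def remove_high_overlap_counterexamples(examples, num_to_remove):
    """Drop the non-entailment examples with the highest premise-hypothesis overlap.

    The control condition in Figure 2 instead removes the same number of
    examples uniformly at random.
    """
    counterexamples = sorted(
        (ex for ex in examples if ex["label"] != "entailment"),
        key=lambda ex: word_overlap(ex["premise"], ex["hypothesis"]),
        reverse=True)
    dropped = {id(ex) for ex in counterexamples[:num_to_remove]}
    return [ex for ex in examples if id(ex) not in dropped]
```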

As shown in Figure 2, all models have significantly worse accuracy on HANS as more counterexamples are removed, while maintaining the original accuracy when the same amounts of random training examples are removed. With 6.4% of counterexamples removed, the performance of most pre-trained models is near-random, as poor as non-pre-trained models. Interestingly, larger models with more pre-training data (RoBERTaLARGE) appear to be slightly more robust with an increased level of imbalance.

Takeaway. These results reveal that pre-training improves robust accuracy by improving the i.i.d. accuracy on minority groups, highlighting the importance of increasing data diversity when creating benchmarks. Further, pre-trained models still suffer from spurious correlations when the minority examples are scarce. To enable extrapolation, we might need additional inductive bias (Nye et al., 2019) or new learning algorithms (Arjovsky et al., 2019).

4.2 Minority Patterns Require Varying Amounts of Training Data

Given that pre-trained models generalize better from minority examples, why do we not see a similar improvement on PAWSQQP even though QQP also contains counterexamples? Unlike HANS examples, which are generated from a handful of templates, PAWS examples are generated by swapping words in a sentence followed by human inspection. They often require recognizing nuanced syntactic differences between two sentences with a small edit distance. For example, compare ‘‘What’s classy if you’re poor, but trashy if you’re rich?’’ and ‘‘What’s classy if you’re rich, but trashy if you’re poor?’’. Therefore, we posit that more samples are needed to reach good performance on PAWS-like examples.

Figure 3: Learning curves of models trained on HANS and PAWSQQP. Accuracy on PAWSQQP increases slowly, whereas all models quickly reach 100% accuracy on HANS.

To test this hypothesis, we plot learning curves by fine-tuning pre-trained models on the challenging datasets directly (Liu et al., 2019b). Specifically, we take 11,990 training examples from PAWSQQP and randomly sample the same number of training examples from HANS;8 the rest is used as a dev/test set for evaluation. In Figure 3, we see that all models reach 100% accuracy rapidly on HANS. However, on PAWS, accuracy increases slowly and the models struggle to reach around 90% accuracy even with the full training set. This suggests that the amount of minority examples in QQP might not be sufficient for reliably estimating the model parameters.

8HANS has more examples in total (30,000); therefore, we sub-sample it to control for the data size.


Figure 4: Accuracy of BERTBASE and RoBERTaBASE on PAWSQQP decreases with increasing sentence length and parse tree height.

Figure 5: The average losses and accuracies of the examples in the training and dev sets when fine-tuning BERTBASE on MNLI. We show plots for the whole training set and for the minority examples separately. The minority examples are non-entailment examples with at least 80% premise-hypothesis overlap. Accuracy on the minority examples takes longer to plateau.

To gain a qualitative understanding of why PAWS examples are difficult to learn, we compare the sentence length and constituency parse tree height of examples in HANS and PAWS.9 We find that PAWS contains longer and syntactically more complex sentences, with an average length of 20.7 words and parse tree height of 11.4, compared to 9.2 and 7.5 on HANS. Figure 4 shows that the accuracy of BERTBASE and RoBERTaBASE on PAWSQQP decreases as the example length and the parse tree height increase.
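As a small sketch of these complexity statistics, the function below computes the maximum sentence length and parse tree height over a sentence pair. It assumes the bracketed parses have already been produced by an external constituency parser (Stanford CoreNLP in the paper's setup, per footnote 9); NLTK is used here only to read the trees, and the example parse in the comment is hand-written for illustration.

```python
from nltk import Tree


def pair_complexity(sentences, bracketed_parses):
    """Maximum sentence length (in words) and parse tree height over a sentence pair.

    `bracketed_parses` are constituency parses in bracketed (S-expression) form,
    e.g. produced by Stanford CoreNLP; running the parser is outside this sketch.
    """
    max_length = max(len(s.split()) for s in sentences)
    max_height = max(Tree.fromstring(p).height() for p in bracketed_parses)
    return max_length, max_height


# Illustrative usage with a hand-written parse string:
# pair_complexity(["The manager ran ."],
#                 ["(ROOT (S (NP (DT The) (NN manager)) (VP (VBD ran)) (. .)))"])
```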

Takeaway. We have shown that the inconsistent improvement on different challenging datasets results from the same mechanism: Pre-trained models improve robust accuracy by generalizing from minority examples, although, perhaps unsurprisingly, different minority patterns may require varying amounts of training data. This also poses a potential challenge in using data augmentation to tackle spurious correlations.

9We use the off-the-shelf constituency parser from Stanford CoreNLP (Manning et al., 2014). For each example, we compute the maximum length (number of words) and parse tree height of the two sentences.

4.3 Minority Examples Require Longer Fine-Tuning

In the previous section, we showed in Figure 1 that longer fine-tuning improves accuracy on challenging examples, even though the in-distribution accuracy saturates quickly. To understand this result from the perspective of minority examples, we compare the loss on all examples and on the minority examples during fine-tuning. Figure 5 shows the loss and accuracy at each epoch on all examples and on HANS-like examples in MNLI separately.

First, we see that the training loss of minority examples decreases more slowly than the average


loss, taking more than 15 epochs to reach near-zero loss. Second, the dev accuracy curves show that the accuracy of minority examples plateaus later, around epoch 10, whereas the average accuracy stops increasing around epoch 5. In addition, it appears that BERT does not overfit with additional fine-tuning, based on the accuracy curves.10 Similarly, concurrent work (Zhang et al., 2020) has found that longer fine-tuning improves few-sample performance.

Takeaway. Although longer fine-tuning does
not help in-distribution accuracy, we find that it
improves performance on the minority groups.
This suggests that selecting models or early
stopping based on the i.i.d. dev set performance
is insufficient, and we need new model selection
criteria for robustness.
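One simple alternative criterion, sketched below, is to select the checkpoint by accuracy on the minority slice of the dev set rather than on the full i.i.d. dev set. This is an illustrative assumption, not the authors' procedure; `is_minority` (e.g., flagging high-overlap non-entailment examples) and `accuracy` are assumed helpers.

```python
def select_robust_checkpoint(checkpoints, dev_examples, is_minority, accuracy):
    """Pick the fine-tuning checkpoint by accuracy on the minority slice of the dev set.

    `is_minority(ex)` flags counterexamples to the spurious correlation and
    `accuracy(ckpt, examples)` evaluates a checkpoint on a list of examples.
    """
    minority_dev = [ex for ex in dev_examples if is_minority(ex)]
    return max(checkpoints, key=lambda ckpt: accuracy(ckpt, minority_dev))
```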

5 Improving Generalization Through Multi-task Learning

Our results on minority examples show that increasing the number of counterexamples to spurious correlations helps to improve model robustness. An obvious solution is therefore data augmentation; indeed, both McCoy et al. (2019) and Zhang et al. (2019) show that adding a small number of challenging examples to the training set significantly improves performance on HANS and PAWS. However, these methods often require task-specific knowledge of spurious correlations and heavy use of rules to generate the counterexamples. Instead of adding examples with specific patterns, we investigate the effect of aggregating generic data from various sources through MTL. It has been shown that MTL reduces the sample complexity of individual tasks compared to single-task learning (Caruana, 1997; Baxter, 2000; Maurer et al., 2016); thus it may further improve the generalization capability of pre-trained models, especially on the minority groups.

5.1 Multi-task Learning

We learn from datasets from different sources jointly, where one is the target dataset to be evaluated on, and the rest are auxiliary datasets. The target dataset and the auxiliary dataset can belong to either the same task, e.g., MNLI and SNLI, or different but related tasks, e.g., MNLI and QQP.

10We find that the average accuracy stays almost the same while the dev loss is increasing. Guo et al. (2017) had similar observations. One possible explanation is that the model prediction becomes less confident (hence larger log loss), but the argmax prediction is correct.

All datasets share the representation given by the pre-trained model, and we use a separate linear classification layer for each dataset. The learning objective is a weighted sum of the average losses on each dataset. We set the weight to 1 for all datasets, equivalent to sampling examples from each dataset proportionally to its size.11 During training, we sample mini-batches from each dataset sequentially and use the same optimization hyperparameters as in single-task fine-tuning (Section 3), except for smaller batch sizes due to memory constraints.12
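A minimal sketch of this setup is given below, written in PyTorch with the Hugging Face transformers API as an illustrative assumption (the paper's implementation is in GluonNLP). The `batches` iterator that interleaves mini-batches from the target and auxiliary datasets is an assumed helper.

```python
import torch
from transformers import AutoModel


class MultiTaskModel(torch.nn.Module):
    """Shared pre-trained encoder with a separate linear classifier per dataset."""

    def __init__(self, encoder_name: str, num_labels: dict):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.heads = torch.nn.ModuleDict(
            {name: torch.nn.Linear(hidden, n) for name, n in num_labels.items()})

    def forward(self, dataset_name: str, **inputs):
        # Classify from the representation of the first ([CLS]) token.
        cls_vec = self.encoder(**inputs).last_hidden_state[:, 0]
        return self.heads[dataset_name](cls_vec)


def mtl_epoch(model, optimizer, batches, loss_fn=torch.nn.CrossEntropyLoss()):
    """One pass that takes mini-batches from each dataset sequentially.

    `batches` yields (dataset_name, batch) pairs, e.g. cycling over the target
    and auxiliary datasets; every dataset's loss has weight 1.
    """
    model.train()
    for name, batch in batches:
        optimizer.zero_grad()
        labels = batch.pop("labels")
        logits = model(name, **batch)
        loss = loss_fn(logits, labels)
        loss.backward()
        optimizer.step()
```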

Auxiliary Datasets. We consider NLI and PI as related tasks because they both require understanding and comparing the meaning of two sentences. Therefore, we use both benchmark datasets and challenging datasets for NLI and PI as our auxiliary datasets. The hope is that benchmark data from related tasks helps transfer useful knowledge across tasks, thus improving generalization on minority examples, and that the challenging datasets countering specific spurious correlations further improve generalization on the corresponding minority examples. We analyze the contribution of the two types of auxiliary data in Section 5.2. The MTL training setup is shown in Table 4.13 Details on the auxiliary datasets are described in Section 2.1.

11Prior work has shown that the mixing weights may impact the final results in MTL, especially when there is a risk of overfitting to low-resource tasks (Raffel et al., 2019). Given the relatively large dataset sizes in our experiments (Table 4), we did not see a significant change in the results when varying the mixing weights.

12The minibatch size of the target dataset is 16. For the auxiliary dataset, it is proportional to the dataset size and not larger than 16, such that the total number of examples in a batch is at most 32.

13For MNLI, we did not include other PI datasets such as STS-B (Cer et al., 2017) and MRPC (Dolan and Brockett, 2005) because their sizes (3.7k and 7k) are too small compared with QQP and the other auxiliary tasks.

5.2 Results


MTL Improves Robust Accuracy. Our main MTL results are shown in Table 3. MTL increases accuracies on the challenging datasets across tasks without hurting the in-distribution performance, especially when the minority examples in the target dataset are scarce (e.g., PAWS). Whereas prior work has shown limited success of MTL when tested on in-distribution data (Søgaard and Goldberg, 2016; Hashimoto et al., 2017; Raffel et al., 2019), our results demonstrate its value for out-of-distribution generalization.

                       Trained on MNLI              Trained on QQP
Model         Algo.    MNLI-m        HANS           QQP           PAWSQQP       PAWSWiki
BERTBASE      STL      84.5 (0.1)    62.5 (0.2)     90.8 (0.3)    36.1 (0.8)    46.9 (0.3)
BERTBASE      MTL      83.7 (0.3)    68.2 (1.8)     91.3 (.07)    45.9 (2.1)    52.0 (1.9)
RoBERTaBASE   STL      87.4 (0.2)    74.1 (0.9)     91.5 (0.2)    42.6 (1.9)    49.6 (1.9)
RoBERTaBASE   MTL      86.4 (0.2)    72.8 (2.4)     91.7 (.04)    51.7 (1.2)    57.7 (1.5)

Table 3: Comparison between models fine-tuned with multi-task (MTL) and single-task (STL) learning. MNLI-m and QQP are the in-distribution dev sets; HANS, PAWSQQP, and PAWSWiki are the challenging test sets. MTL improves robust accuracy on challenging datasets. We ran t-tests for the mean accuracies of STL and MTL on five runs and the larger number is bolded when they differ significantly with p < 0.001.

Auxiliary dataset   Size   Target: NLI (MNLI)   Target: PI (QQP)
MNLI                393k                        X
SNLI                549k   X                    X
QQP                 364k   X
PAWSQQP+Wiki        60k    X
HANS                30k                         X

Table 4: Auxiliary dataset sizes for the different target datasets from two tasks: NLI and PI.

Model         Algo.   HANS-O       HANS-C       HANS-S
BERTBASE      STL     75.8 (4.9)   59.1 (4.8)   52.7 (1.2)
BERTBASE      MTL     89.5 (1.9)   61.9 (2.3)   53.1 (1.1)
RoBERTaBASE   STL     88.5 (2.0)   70.0 (2.3)   63.9 (1.4)
RoBERTaBASE   MTL     90.3 (1.2)   64.8 (3.1)   63.5 (4.9)

Table 5: MTL results on the different categories of HANS: lexical overlap (O), constituent (C), and subsequence (S). Both auxiliary data (MTL) and larger pre-training data (RoBERTa) improve accuracies mainly on lexical overlap.

On HANS, MTL improves the accuracy significantly for BERTBASE but not for RoBERTaBASE. To confirm the result, we additionally experimented with RoBERTaLARGE and obtained consistent results: MTL achieves an accuracy of 75.7 (2.1) on HANS, similar to the STL result, 77.1 (1.6). One potential explanation is that RoBERTa is already sufficient for providing good generalization from minority examples in MNLI.

In addition, both MTL and RoBERTaBASE yield the biggest improvement on lexical overlap, as shown in the results on HANS by category (Table 5). We believe the reason is that lexical overlap is the most representative pattern among high-overlap non-entailment training examples. In fact, 85% of the 727 HANS-like examples belong to lexical overlap. This suggests that further improvement on HANS may require better data coverage of the other categories.

On PAWS, MTL consistently yields large improvements across pre-trained models. Given that QQP has fewer minority examples resembling the patterns in PAWS, which are also harder to learn (Section 4.2), the results show that MTL is an effective way to improve generalization when the minority examples are scarce. Next, we investigate why MTL is helpful.

Improved Generalization from Minority Examples. We are interested in finding out how MTL helps generalization from minority examples. One possible explanation is that the challenging data in the auxiliary datasets prevent the model from learning spurious patterns. However, the ablation studies on the auxiliary datasets in Table 6 and Table 7 show that the challenging datasets are not much more helpful than the benchmark datasets. The other possible explanation is that MTL reduces the sample complexity of learning from the minority examples in the target dataset. To verify this, we remove minority examples from both the auxiliary and the target datasets, and compare their effect on the robust accuracy. We focus on PI because MTL shows the largest improvement there.

Removed        MNLI-m       HANS         ∆
None           83.7 (0.3)   68.2 (1.8)   −
PAWSQQP+Wiki   83.5 (0.3)   64.6 (3.5)   −3.6
QQP            83.2 (0.3)   63.2 (3.7)   −5.0
SNLI           84.3 (0.2)   66.9 (1.5)   −1.3

Table 6: Results of the ablation study on auxiliary datasets using BERTBASE on MNLI (the target task). While the in-distribution performance is hardly affected when a specific auxiliary dataset is excluded, performance on the challenging data varies (difference shown in ∆).

Removed   QQP          PAWSQQP      ∆
None      91.3 (.07)   45.9 (2.1)   −
HANS      91.5 (.06)   45.3 (1.8)   −0.6
MNLI      91.2 (.11)   42.3 (1.8)   −3.6
SNLI      91.3 (.09)   44.2 (1.3)   −1.7

Table 7: Results of the ablation study on auxiliary datasets using BERTBASE on QQP (the target task). While the in-distribution performance is hardly affected when a specific auxiliary dataset is excluded, performance on the challenging data varies (difference shown in ∆).

In Table 8, we show the results after removing minority examples in the target dataset, QQP, and in the auxiliary dataset, MNLI, respectively. We also add a control baseline where the same amounts of randomly sampled examples are removed. The results confirm our hypothesis: without the minority examples in the target dataset, MTL is only marginally better than STL on PAWSQQP. In contrast, removing minority examples in the auxiliary dataset has a similar effect to removing random examples; neither causes a significant performance drop. Therefore, we conclude that MTL improves robust accuracy by improving generalization from minority examples in the target dataset.

Removed                   QQP          PAWSQQP      ∆
None                      91.3 (.07)   45.9 (2.1)   −
random examples, QQP      91.3 (.03)   44.3 (.31)   −1.6
random examples, MNLI     91.4 (.02)   45.0 (1.5)   −0.9
minority examples, QQP    91.3 (.09)   38.2 (.73)   −7.7
minority examples, MNLI   91.3 (.08)   44.3 (2.0)   −1.6

Table 8: Ablation study on the effect of minority examples in the auxiliary (MNLI) and the target (QQP) datasets in MTL with BERTBASE. For MNLI, we removed 727 non-entailment examples with 100% overlap. For QQP, we removed 228 non-paraphrase examples with 100% overlap. We also removed equal amounts of random examples in the control experiments. We ran t-tests for the mean accuracies after minority removal and random removal based on five runs, and numbers with a significant difference (p < 0.001) are bolded. The improvement from MTL mainly comes from better generalization from minority examples in the target dataset.

Takeaway. These results suggest that both pre-training and MTL do not enable extrapolation; instead, they improve generalization from minority examples in the (target) training set. Thus it is important to increase the coverage of diverse patterns in the data to improve robustness to spurious correlations.

6 Related Work

Pre-training and Robustness. Recently, there has been an increasing amount of interest in studying the effect of pre-training on robustness. Hendrycks et al. (2019, 2020) show that pre-training improves model robustness to label noise, class imbalance, and out-of-distribution detection. In cross-domain question answering, Li et al. (2019) show that an ensemble of different pre-trained models significantly improves performance on out-of-domain data. In this work, we answer why pre-trained models appear to improve out-of-distribution robustness and point out the importance of minority examples in the training data.

Data Augmentation. The most straightforward way to improve model robustness to out-of-distribution data is to augment the training set with examples from the target distribution. Recent work has shown that augmenting syntactically rich examples improves robust accuracy on NLI (Min et al., 2020). Similarly, counterfactual
augmentation aims to identify parts of the input that impact the label when intervened upon, thus avoiding learning spurious features (Goyal et al., 2019; Kaushik et al., 2020). Finally, data recombination has been used to achieve compositional generalization (Jia and Liang, 2016; Andreas, 2020). However, data augmentation techniques largely rely on prior knowledge of the spurious correlations or on human effort. In addition, as shown in Section 4.2 and in concurrent work (Jha et al., 2020), it is often unclear how much augmented data is needed for learning a pattern. Our work shows promise in adding generic pre-training data or related auxiliary data (through MTL) without assumptions on the target distribution.

Robust Learning Algorithms. Several recent papers propose new learning algorithms that are robust to spurious correlations in NLI datasets (He et al., 2019; Clark et al., 2019; Yaghoobzadeh et al., 2019; Zhou and Bansal, 2020; Sagawa et al., 2020; Mahabadi et al., 2020). They rely on prior knowledge to focus on ‘‘harder’’ examples that do not enable shortcuts during training. One weakness of these methods is their arguably strong assumption of knowing the spurious correlations a priori. Our work provides evidence that large amounts of generic data can be used to improve out-of-distribution generalization. Similarly, recent work has shown that semi-supervised learning with generic auxiliary data improves model robustness to adversarial examples (Schmidt et al., 2018; Carmon et al., 2019).

Transfer Learning. Robust learning is also related to domain adaptation and transfer learning because both aim to learn from one distribution and achieve good performance on a different but related target distribution. Data selection and reweighting are common techniques used in domain adaptation. Similar to our findings on minority examples, source examples similar to the target data have been found to be helpful for transfer (Ruder and Plank, 2017; Liu et al., 2019a). In addition, much work has shown that MTL improves model performance on out-of-domain datasets (Ruder, 2017; Li et al., 2019; Liu et al., 2019c). Concurrent work (Akula et al., 2020) shows that MTL improves robustness to adversarial examples in visual grounding. In this work, we further connect the effectiveness of MTL to generalization from minority examples.

7 Discussion and Conclusion

Our study is motivated by recent observations on the robustness of large-scale pre-trained transformers. Specifically, we focus on robust accuracy on challenging datasets, which are designed to expose spurious correlations learned by the model. Our analysis reveals that pre-training improves robustness by better generalizing from a minority of examples that counter dominant spurious patterns in the training set. In addition, we show that more pre-training data, larger model size, and additional auxiliary data through MTL further improve robustness, especially when the amount of minority examples is scarce.

Our work suggests that it is possible to go beyond the robustness-accuracy trade-off with more data. However, the amount of improvement is still limited by the coverage of the training data, because current models do not extrapolate to unseen patterns. Thus, an important future direction is to increase data diversity through new crowdsourcing protocols or efficient human-in-the-loop augmentation.

Although our work provides new perspectives on pre-training and robustness, it only scratches the surface of the effectiveness of pre-trained models and leaves many questions open, for example: why pre-trained models do not overfit to the minority examples, and how different initialization (from different pre-trained models) influences optimization and generalization. Understanding these questions is key to designing better pre-training methods for robust models.

Finally, the difference between results on HANS and PAWS calls for more careful thinking on the formulation and evaluation of out-of-distribution generalization. Semi-manually constructed challenging data often cover only a specific type of distribution shift, thus the results may not generalize to other types. A more comprehensive evaluation will drive the development of principled methods for out-of-distribution generalization.

Acknowledgments

We would like to thank the Lex and Comprehend groups at Amazon Web Services AI for helpful discussions, and the reviewers for their insightful comments. We would also like to thank the GluonNLP team for the infrastructure support.

References

Arjun R. Akula, Spandana Gella, Yaser Al-Onaizan, Song-Chun Zhu, and Siva Reddy. 2020. Words aren't enough, their order matters: On the robustness of grounding visual referring expressions. In Association for Computational Linguistics (ACL).

J. Andreas. 2020. Good-enough compositional data augmentation. In Association for Computational Linguistics (ACL).

M. Arjovsky, L. Bottou, I. Gulrajani, and D. Lopez-Paz. 2019. Invariant risk minimization. arXiv preprint arXiv:1907.02893v2.

J. Baxter. 2000. A model of inductive bias learning. Journal of Artificial Intelligence Research (JAIR), 12:149–198.

S. Bowman, G. Angeli, C. Potts, and C. D. Manning. 2015. A large annotated corpus for learning natural language inference. In Empirical Methods in Natural Language Processing (EMNLP).

Y. Carmon, A. Raghunathan, L. Schmidt, P. Liang, and J. C. Duchi. 2019. Unlabeled data improves adversarial robustness. In Advances in Neural Information Processing Systems (NeurIPS).

Rich Caruana. 1997. Multitask learning. Machine Learning, 28(1):41–75.

D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia. 2017. SemEval-2017 task 1: Semantic textual similarity - multilingual and cross-lingual focused evaluation. In Proceedings of the Eleventh International Workshop on Semantic Evaluations.

C. Clark, M. Yatskar, and L. Zettlemoyer. 2019. Don't take the easy way out: Ensemble based methods for avoiding known dataset biases. In Empirical Methods in Natural Language Processing (EMNLP).

Ishita Dasgupta, Demi Guo, Andreas Stuhlmüller, Samuel Gershman, and Noah Goodman. 2018. Evaluating compositionality in sentence embeddings. In Annual Meeting of the Cognitive Science Society, CogSci 2018.

J. Devlin, M. Chang, K. Lee, and K. Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Association for Computational Linguistics (NAACL).

W. B. Dolan and C. Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the International Workshop on Paraphrasing.

M. Glockner, V. Shwartz, and Y. Goldberg. 2018. Breaking NLI systems with sentences that require simple lexical inferences. In Association for Computational Linguistics (ACL).

Y. Goyal, Z. Wu, J. Ernst, D. Batra, D. Parikh, and S. Lee. 2019. Counterfactual visual explanations. In International Conference on Machine Learning (ICML).

C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. 2017. On calibration of modern neural networks. In International Conference on Machine Learning (ICML).

J. Guo, H. He, T. He, L. Lausen, M. Li, H. Lin, X. Shi, C. Wang, J. Xie, S. Zha, A. Zhang, H. Zhang, Z. Zhang, Z. Zhang, S. Zheng, and Y. Zhu. 2020. GluonCV and GluonNLP: Deep learning in computer vision and natural language processing. Journal of Machine Learning Research (JMLR), 21:1–7.

S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. R. Bowman, and N. A. Smith. 2018. Annotation artifacts in natural language inference data. In North American Association for Computational Linguistics (NAACL).

K. Hashimoto, C. Xiong, Y. Tsuruoka, and R. Socher. 2017. A joint many-task model: Growing a neural network for multiple NLP tasks. In Empirical Methods in Natural Language Processing (EMNLP).

H. He, S. Zha, and H. Wang. 2019. Unlearn dataset bias for natural language inference by fitting the residual. In Proceedings of the EMNLP Workshop on Deep Learning for Low-Resource NLP.

D. Hendrycks, K. Lee, and M. Mazeika. 2019. Using pre-training can improve model robustness and uncertainty. In International Conference on Machine Learning (ICML).

D. Hendrycks, X. Liu, E. Wallace, A. Dziedzic, R. Krishnan, and D. Song. 2020. Pretrained transformers improve out-of-distribution robustness. In Association for Computational Linguistics (ACL).

S. Iyer, N. Dandekar, and K. Csernai. 2017. First Quora dataset release: Question pairs. Accessed online at https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs.

R. Jha, C. Lovering, and E. Pavlick. 2020. When does data augmentation help generalization in NLP? In Association for Computational Linguistics (ACL).

R. Jia and P. Liang. 2016. Data recombination for neural semantic parsing. In Association for Computational Linguistics (ACL).

D. Kaushik, E. Hovy, and Z. C. Lipton. 2020. Learning the difference that makes a difference with counterfactually-augmented data. In International Conference on Learning Representations (ICLR).

Hongyu Li, Xiyuan Zhang, Yibing Liu, Yiming Zhang, Quan Wang, Xiangyang Zhou, Jing Liu, Hua Wu, and Haifeng Wang. 2019. D-NET: A pre-training and fine-tuning framework for improving the generalization of machine reading comprehension. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 212–219.

Miaofeng Liu, Yan Song, Hongbin Zou, and Tong Zhang. 2019a. Reinforced training data selection for domain adaptation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1957–1968.

N. F. Liu, R. Schwartz, and N. A. Smith. 2019b. Inoculation by fine-tuning: A method for analyzing challenge datasets. In North American Association for Computational Linguistics (NAACL).

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019c. Multi-task deep neural networks for natural language understanding. In Association for Computational Linguistics (ACL).

Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. 2019d. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

R. K. Mahabadi, Y. Belinkov, and J. Henderson. 2020. End-to-end bias mitigation by modelling biases in corpora. In Association for Computational Linguistics (ACL).

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations.

A. Maurer, M. Pontil, and B. Romera-Paredes. 2016. The benefit of multitask representation learning. Journal of Machine Learning Research (JMLR), 17:1–32.

R. T. McCoy, E. Pavlick, and T. Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. arXiv preprint arXiv:1902.01007.

J. Min, R. T. McCoy, D. Das, E. Pitler, and T. Linzen. 2020. Syntactic data augmentation increases robustness to inference heuristics. In Association for Computational Linguistics (ACL).

Yixin Nie, Yicheng Wang, and Mohit Bansal. 2019. Analyzing compositionality-sensitivity of NLI models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6867–6874.

M. I. Nye, A. Solar-Lezama, J. B. Tenenbaum, and B. M. Lake. 2019. Learning compositional rules via neural program synthesis. In Advances in Neural Information Processing Systems (NeurIPS).

Y. Oren, S. Sagawa, T. B. Hashimoto, and P. Liang. 2019. Distributionally robust language modeling. In Empirical Methods in Natural Language Processing (EMNLP).

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

Sebastian Ruder and Barbara Plank. 2017. Learning to select data for transfer learning with Bayesian optimization. arXiv preprint arXiv:1707.05246.

S. Ruder. 2017. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098.

S. Sagawa, P. W. Koh, T. B. Hashimoto, and P. Liang. 2020. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In International Conference on Learning Representations (ICLR).

L. Schmidt, S. Santurkar, D. Tsipras, K. Talwar, and A. Madry. 2018. Adversarially robust generalization requires more data. In Advances in Neural Information Processing Systems (NeurIPS), pages 5014–5026.

A. Søgaard and Y. Goldberg. 2016. Deep multi-task learning with low level tasks supervised at lower layers. In Association for Computational Linguistics (ACL).

A. Williams, N. Nangia, and S. R. Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.

Yadollah Yaghoobzadeh, Remi Tachet des Combes, Timothy J. Hazen, and Alessandro Sordoni. 2019. Robust natural language inference models with example forgetting. CoRR, abs/1911.03861.

T. Zhang, F. Wu, A. Katiyar, K. Q. Weinberger, and Y. Artzi. 2020. Revisiting few-sample BERT fine-tuning. arXiv preprint arXiv:2006.05987.

Y. Zhang, J. Baldridge, and L. He. 2019. PAWS: Paraphrase adversaries from word scrambling. In North American Association for Computational Linguistics (NAACL).

X. Zhou and M. Bansal. 2020. Towards robustifying NLI models against lexical dataset biases. In Association for Computational Linguistics (ACL).