Cross-functional Analysis of Generalization in Behavioral Learning

Pedro Henrique Luz de Araujo1,2 and Benjamin Roth1,3
1Faculty of Computer Science, University of Vienna, Vienna, Austria
2UniVie Doctoral School Computer Science, Vienna, Austria
3Faculty of Philological and Cultural Studies, University of Vienna, Vienna, Austria
{pedro.henrique.luz.de.araujo, benjamin.roth}@univie.ac.at

Abstract

In behavioral testing, system functionalities
underrepresented in the standard evaluation
setting (with a held-out test set) are validated
through controlled input-output pairs. Opti-
mizing performance on the behavioral tests
during training (behavioral learning) would
improve coverage of phenomena not suffi-
ciently represented in the i.i.d. data and could
lead to seemingly more robust models. How-
ever, there is the risk that the model narrowly
captures spurious correlations from the behav-
ioral test suite, leading to overestimation and
misrepresentation of model performance—
one of the original pitfalls of traditional eval-
uation.

In this work, we introduce BELUGA, an anal-
ysis method for evaluating behavioral learning
considering generalization across dimensions
of different granularity levels. We optimize
behavior-specific loss functions and evaluate
models on several partitions of the behav-
ioral test suite controlled to leave out specific
phenomena. An aggregate score measures gen-
eralization to unseen functionalities (or over-
fitting). We use BELUGA to examine three
representative NLP tasks (sentiment analysis,
paraphrase identification, and reading compre-
hension) and compare the impact of a diverse
set of regularization and domain generalization
methods on generalization performance.1

1 Introduction

The standard paradigm for evaluating natural lan-
guage processing (NLP) models is to compute
correctness metrics on a held-out test set from
the same distribution as the training set (Linzen,
2020). If the test set is large and diverse, this may
be a good measure of average performance, but
it fails to account for the worst-case performance
(Sagawa et al., 2020). By exploiting correlations

1Our code is available on https://github.com/peluz/beluga.

in the training data, models work well in most
cases but fail in those where the correlations do
not hold (Niven and Kao, 2019; McCoy et al.,
2019; Zellers et al., 2019), leading to overestima-
tion of model performance in the wild (Ribeiro
et al., 2020). Furthermore, standard evaluation
does not indicate the sources of model failure (Wu
et al., 2019) and disregards important model prop-
erties such as fairness (Ma et al., 2021).

Behavioral testing (Röttger et al., 2021; Ribeiro
et al., 2020) has been proposed as a complemen-
tary evaluation framework, where model capa-
bilities are systematically validated by examining
the model's responses to specific stimuli. This is done
through test suites composed of input-output pairs
where the input addresses specific linguistic or
social phenomena and the output is the expected
behavior given the input. The suites can be seen
as controlled challenge datasets (Belinkov and
Glass, 2019) aligned with human intuitions about
how the agent should perform the task (Linzen,
2020).

In this work, we understand test suites as a
hierarchy of functionality classes, functionalities,
and test cases (Röttger et al., 2021). Functionality
classes stand at the highest level, capturing sys-
tem capabilities like fairness, robustness and ne-
gation. They are composed of functionalities that
target finer-grained facets of the capability. For
example, a test suite for sentiment analysis can
include the functionality ‘‘negation of positive
statement should be negative’’ inside the Negation
class. Finally, each functionality is composed of
test cases, the input-output pairs used to validate
model behavior. For the functionality above, an
example test case could be the input ‘‘The movie
was not good’’ and the expected output ‘‘nega-
tive’’, under the assumption that the non-negated
sentence is positive.
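To make this hierarchy concrete, the sketch below shows one possible in-memory representation of a suite; the class and field names are our own illustration and not part of any existing test-suite library.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TestCase:
    text: str      # input, e.g., "The movie was not good"
    expected: str  # expected behavior, e.g., the label "negative"


@dataclass
class Functionality:
    name: str  # e.g., "negation of positive statement should be negative"
    cases: List[TestCase] = field(default_factory=list)


@dataclass
class FunctionalityClass:
    name: str  # e.g., "Negation"
    functionalities: List[Functionality] = field(default_factory=list)


# A test suite is then simply a collection of functionality classes.
suite = [
    FunctionalityClass(
        name="Negation",
        functionalities=[
            Functionality(
                name="negation of positive statement should be negative",
                cases=[TestCase(text="The movie was not good",
                                expected="negative")],
            )
        ],
    )
]
```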

Though behavioral test suites identify model
weaknesses, the question of what to do with such


feedback is not trivial. While test suite creators
argue that these tools can aid the development of
better models (Röttger et al., 2021) and lead to
improvements in the tested tasks (Ribeiro et al.,
2020), how to act on the feedback concretely is
not discussed.

One common approach is fine-tuning on data
targeting the failure cases, which previous work
has shown can improve performance in these
same cases (Malon et al., 2022; Liu et al., 2019;
McCoy et al., 2019). But this practice overlooks
the possibility of models overfitting to the cov-
ered tests and consequently overestimates model
performance. Even if one takes care to split the
behavioral test cases into disjoint sets for train-
ing and testing, models can still leverage data
artifacts such as word-label co-occurrences to
achieve seemingly good performance that is over-
optimistic and does not align with out-of-
distribution (OOD) performance.

This creates the following dilemma: Either one
does not use the feedback from test suites for
model development and loses the chance to im-
prove model trustworthiness; or one uses it to
address model shortcomings (e.g., by training on
similar data)—and runs the risk of overfitting to
the covered cases. Prior work (Luz de Araujo
and Roth, 2022; Rozen et al., 2019) has addressed
this in part by employing structured cross-
validation, where a model is trained and evalu-
ated on different sets of phenomena. However,
the analyses have been so far restricted to limited
settings where only one task, training configura-
tion, and test type is examined. Moreover, these
studies have not examined how different regular-
ization and generalization mechanisms influence
generalization.

In this paper, we introduce BELUGA, a gen-
eral method for Behavioral Learning Unified
Generalization Analysis. By training and evaluat-
ing on several partitions of test suite and i.i.d. data,
we measure model performance on unseen phe-
nomena, such as held-out functionality and func-
tionality classes. This structured cross-validation
approach yields scores that better characterize
model performance on uncovered behavioral tests
than the ones obtained by over-optimistic i.i.d.
evaluation.
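As a minimal sketch of this structured cross-validation (assuming hypothetical `train_model` and `pass_rate` helpers, and test cases tagged with the name of their functionality), the loop below trains one model per held-out group of functionalities and records pass rates only on functionalities that model never saw:

```python
def cross_functional_scores(d_train, t_train, t_test, partition,
                            train_model, pass_rate):
    """partition: list of sets of functionality names held out from training.

    Returns one pass rate per functionality, each computed by a model
    that was never trained on that functionality.
    """
    scores = {}
    for held_out in partition:
        # Train only on suite cases whose functionality is NOT held out
        # (optionally mixed with i.i.d. data, depending on the configuration).
        seen_cases = [c for c in t_train if c.functionality not in held_out]
        model = train_model(d_train, seen_cases)
        # Evaluate only on the held-out functionalities.
        for func in held_out:
            unseen_cases = [c for c in t_test if c.functionality == func]
            scores[func] = pass_rate(model, unseen_cases)
    return scores
```

Running this once per partition (held-out functionalities, held-out functionality classes, or held-out test types) yields one unseen pass rate per functionality, which can then be aggregated.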

Our main contributions are:

(1) We design BELUGA, an analysis method
to measure the effect of behavioral learn-

ing. It handles different kinds of behav-
ior measures, operationalized by labeled
or perturbation-based tests. To that end we
propose loss functions that optimize the ex-
pected behavior of three test types: Mini-
mum functionality, invariance, and directional
expectation tests (Ribeiro et al., 2020).

(2) We extend previous work on behavioral
learning by exploring two training configura-
tions in addition to fine-tuning on suite data
(Luz de Araujo and Roth, 2022; Liu et al.,
2019): Training on a mixture of i.i.d. and
suite data; and training on i.i.d. data followed
by fine-tuning on the data mixture.

(3) We design aggregate metrics that measure
generalization across axes of different levels
of granularity. From finer to coarser: Gen-
eralization within functionalities, to different
functionalities and to different functionality
classes.

(4) We compare the generalization capabilities
of a range of regularization techniques and
domain generalization algorithms for three
representative NLP tasks (sentiment anal-
ysis, paraphrase identification, and reading
comprehension).

This work is not a recommendation to train on
behavioral test data, but an exploration of what
happens if data targeting the same set of phe-
nomena as the tests is used for model training.
We find that naive optimization and evaluation
do yield over-optimistic scenarios: Fine-tuning
on suite data results in large improvements for
seen functionalities, though at the same time per-
formance on i.i.d. data and unseen functionalities can
degrade, with some models adopting degenerate
solutions that pass the tests but lead to catastrophic
i.i.d. performance. Including i.i.d. as well as test
suite samples was found to prevent this, miti-
gating i.i.d. performance degradation—with even
improvements in particular cases—and yielding
higher scores for unseen functionalities as well.

2 Background

2.1 Behavioral Testing

We consider a joint distribution p over an input
space X , corresponding label space Y, and as-
sume access to an i.i.d. dataset D composed of


n examples D = {(x_i, y_i) ∼ p}_{i=1}^{n}, x_i ∈ X,
y_i ∈ Y, split into disjoint train, validation, and
test sets Dtrain, Dval, and Dtest. We also assume
access to a behavioral test suite T, composed of
m test cases {t_i}_{i=1}^{m} partitioned into nfunc
disjoint functionalities {F_i}_{i=1}^{nfunc}. Each
functionality belongs to one of nclass functionality
classes {C_i}_{i=1}^{nclass}, such that nclass <
nfunc < m.

Each test case belongs to a functionality, t ∈ Fi, and is described by a pair (X, b), where X is a list with |X| inputs. The expectation function b : R^{|X|×|Y|} → {0, 1} takes a model's predictions for all |X| inputs and outputs 1 if the model behaves as expected and 0 otherwise.

The above taxonomy, by Röttger et al. (2021), describes the hierarchy of concepts in behavioral testing: Functionality classes correspond to coarse properties (e.g., negation) and are composed of finer-grained functionalities; these assess facets of the coarse property (e.g., negation of positive sentiment should be negative) and are operationalized by individual input-output pairs, the test cases. These concepts align with two of the generalization axes we explore in this work, functionality and functionality class generalization (§ 3.3).

We additionally follow the terminology created by Ribeiro et al. (2020), which defines three test types according to their evaluation mechanism: Minimum Functionality, Invariance, and Directional Expectation tests. When used for model training, each of them requires a particular optimization strategy (§ 3.2).

Minimum Functionality Test (MFT): MFTs are input-label pairs designed to check specific system behavior: X has only one element, x, and the expectation function checks if the model output given x is equal to some label y. Thus, they have the same form as the i.i.d. examples.

Invariance Test (INV): INVs are designed to check for invariance to certain input transformations. The input list X consists of an original input x_o and |X|−1 perturbed inputs x_i obtained by applying label-preserving transformations on x_o. Given model predictions \hat{Y} := [\hat{y}_i]_{i=0}^{|X|-1} for all inputs in X, then b(\hat{Y}) = 1 if:

\operatorname{argmax} \hat{y}_0 = \operatorname{argmax} \hat{y}_i   (1)

for all i ∈ {1, . . . , |X| − 1}. That is, the expectation function checks if model predictions are invariant to the perturbations.

Directional Expectation Test (DIR): The form for input X is similar to the INV case, but instead of label-preserving transformations, x_o is perturbed in a way that changes the prediction in a task-dependent predictable way, e.g., prediction confidence should not increase. Given a task-dependent comparison function δ : R^{|Y|} × R^{|Y|} → {0, 1}, b(\hat{Y}) = 1 if:

\delta(\hat{y}_0, \hat{y}_1) \wedge \delta(\hat{y}_0, \hat{y}_2) \wedge \cdots \wedge \delta(\hat{y}_0, \hat{y}_{|X|-1})   (2)

For example, if the expectation is that prediction confidence should not increase, then δ(ŷ0, ŷi) = 1 if ŷi[c∗] ≤ ŷ0[c∗], where c∗ := argmax ŷ0 and ŷ[c∗] denotes the predicted probability for class c∗.

Evaluation: Given a model family Θ and a loss function ℓ : Θ × (X × Y) → R+, the standard learning goal is to find the model θ̂ ∈ Θ that minimizes the loss over the training examples:

\hat{\theta} := \operatorname*{argmin}_{\theta \in \Theta} \frac{1}{|D_{\text{train}}|} \sum_{(x,y) \in D_{\text{train}}} \ell(\theta, (x, y))   (3)

Then, general model correctness is evaluated using one or more metrics over the examples in Dtest. The model can be additionally evaluated using test suite T, which gives a finer-grained performance measure over each functionality.

2.2 Behavioral Learning

In behavioral learning, samples from T are used for training in a two-step approach: A pre-trained language model (PLM) (Devlin et al., 2019) is first fine-tuned on examples from Dtrain, and then fine-tuned further on examples from T (Luz de Araujo and Roth, 2022; Liu et al., 2019).
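As a rough illustration of the three expectation functions from Section 2.1, the snippet below checks MFT, INV, and DIR behavior given a list of prediction vectors. It is a sketch of our reading of Eqs. (1) and (2), not code from the CheckList library.

```python
import numpy as np


def b_mft(preds, label):
    # preds: list with one prediction vector; pass if the predicted
    # class equals the expected label.
    return int(np.argmax(preds[0]) == label)


def b_inv(preds):
    # preds: one prediction vector per input in X; pass if every perturbed
    # prediction keeps the original argmax (Eq. 1).
    original = np.argmax(preds[0])
    return int(all(np.argmax(p) == original for p in preds[1:]))


def b_dir(preds, delta):
    # preds: one prediction vector per input in X; pass if the comparison
    # function delta holds between the original prediction and every
    # perturbed one (Eq. 2).
    return int(all(delta(preds[0], p) for p in preds[1:]))


# Example comparison function: prediction confidence should not increase.
def not_more_confident(y0, yi):
    c_star = np.argmax(y0)
    return yi[c_star] <= y0[c_star]
```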
3 BELUGA

BELUGA is an analysis method to estimate how training on test suite data impacts generalization to seen and unseen phenomena. Given an i.i.d. dataset D, a test suite T, and a training configuration χ (§ 3.1), BELUGA trains on several controlled splits of suite data and outputs scores that use performance on unseen phenomena as a proxy measure (§ 3.3) for generalization.

That is, BELUGA can be formalized as a function f parametrized by D, T, and χ that returns a set of metrics M:

M = f(D, T, \chi)   (4)

By including measures of performance on i.i.d. data and on seen and unseen sets of phenomena, these metrics offer a more comprehensive and realistic view of how the training data affected model capabilities and shed light on failure cases that would be obfuscated by other evaluation schemes.

Below we describe the examined training configurations (§ 3.1), how BELUGA optimizes the expected behaviors encoded in T (§ 3.2), how it estimates generalization (§ 3.3), and the metrics it outputs (§ 3.4).

3.1 Training Configurations

We split T into three disjoint splits Ttrain, Tval, and Ttest, such that each split contains cases from all functionalities, and define four training configurations regarding whether and how we use Ttrain:

IID: The standard training approach that uses only i.i.d. data for training (Dtrain). It serves as a baseline to contrast performance of the three following suite-augmented configurations.

IID→T: A two-step approach where first the PLM is fine-tuned on Dtrain and then on Ttrain. This is the setting examined in prior work on behavioral learning (§ 2.2), which has been shown to lead to deterioration of i.i.d. dataset (Dtest) performance (Luz de Araujo and Roth, 2022).

To assess the impact of including i.i.d. samples in the behavioral learning procedure, we define two additional configurations:

IID+T: The PLM is fine-tuned on a mixture of suite and i.i.d. data (Dtrain ∪ Ttrain).

IID→(IID+T): The PLM is first fine-tuned on Dtrain and then on Dtrain ∪ Ttrain.

By contrasting the performance on Dtest and Ttest of these configurations, we assess the impact of behavioral learning on both i.i.d. and test suite data distributions.

3.2 Behavior Optimization

Since each test type describes and expects different behavior, BELUGA optimizes type-specific loss functions:

MFT: As MFTs are formally equivalent to i.i.d. data (input-label pairs), they are treated as such: We randomly divide them into mini-batches and optimize the cross-entropy between model predictions and labels.

INV: We randomly divide INVs into mini-batches composed of unperturbed-perturbed input pairs. For each training update, we randomly select one perturbed version (of several possible) for each original input.2 We enforce invariance by minimizing the cross-entropy between model predictions over perturbed-unperturbed input pairs:

\ell(\hat{y}_0, \hat{y}_i) := -\sum_{k=1}^{c} \hat{y}_0[k] \cdot \log(\hat{y}_i[k])   (5)

where c is the number of classes. This penalizes models that are not invariant to the perturbations (Eq. 1), since the global minimum of the loss is the point where the predictions are the same.
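A minimal PyTorch-style sketch of the invariance loss in Eq. (5); the function and tensor names are ours, and whether gradients should also flow through the unperturbed prediction is a design choice that the equation leaves open.

```python
import torch
import torch.nn.functional as F


def invariance_loss(logits_orig: torch.Tensor,
                    logits_pert: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the predicted distributions for the original
    and the perturbed input (Eq. 5), averaged over the mini-batch.

    logits_orig, logits_pert: [batch, num_classes] raw model outputs.
    """
    p_orig = F.softmax(logits_orig, dim=-1)          # \hat{y}_0
    log_p_pert = F.log_softmax(logits_pert, dim=-1)  # log \hat{y}_i
    return -(p_orig * log_p_pert).sum(dim=-1).mean()
```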
DIR: Batch construction follows the INV procedure: The DIRs are randomly divided into mini-batches of unperturbed-perturbed input pairs, and the perturbed input is randomly sampled during training.

The optimization objective depends on the comparison function δ. For a given δ, we define a corresponding error measure ε_δ : R^{|Y|} × R^{|Y|} → [0, 1]. For example, if the expectation is that prediction confidence should not increase, then ε_δ(ŷ0, ŷi) = max(0, ŷi[c∗] − ŷ0[c∗]). This way, ε_δ increases with confidence increase and is zero otherwise. We minimize the following loss:

\ell(\hat{y}_0, \hat{y}_i, \delta) := -\log(1 - \epsilon_\delta(\hat{y}_0, \hat{y}_i))   (6)

Intuitively, if ε_δ = 0, the loss is zero. Conversely, the loss increases with the error measure (as ε_δ gets closer to 1).

3.3 Cross-functional Analysis

Test suites have limited coverage: The set of covered functionalities is only a subset of the phenomena of interest: T ⊂ P, where P is the hypothetical set of all functionalities. For example, the test suite for sentiment analysis provided by Ribeiro et al. (2020) has a functionality that tests for invariance to people's names—the sentiment of the sentence ''I do not like Mary's favourite movie'' should not change if ''Mary'' is changed to ''Maria''. However, the equally valid functionality that tests for invariance to organizations' names is not in the suite. Training and evaluating on the same set of functionalities can lead to overestimating the performance: Models that overfit to covered functionalities but fail catastrophically on non-covered ones.

BELUGA computes several measures of model performance that address generalization from Ttrain to Ttest and from Ttrain to P. We do not assume access to test cases for non-covered phenomena, so we use held-out sets of functionalities as proxies for generalization to P.

i.i.d. Data: To score performance on Dtest, we use the canonical evaluation metric for the specific dataset. We detail the metrics used for each examined task3 in Section 4.1. We denote the i.i.d. score as siid.

Test Suite Data: We compute the pass rate sFi of each functionality Fi ∈ T:

s_{F_i} := \frac{1}{|F_{\text{test}_i}|} \sum_{(X, b) \in F_{\text{test}_i}} b(\hat{Y})   (7)

where Ŷ are the model predictions given the inputs in X. In other words, the pass rate is simply the proportion of successful test cases.

We vary the set of functionalities used for training and testing to construct different evaluation scenarios:

Unseen Evaluation: No test cases are seen during training. This is equivalent to the use of behavioral test suites without behavioral learning: We compute the pass rates using the predictions of an IID model.

Seen Evaluation: Ttrain is used for training. We compute the pass rate on Ttest using the predictions of suite-augmented models. This score measures how well the fine-tuning procedure generalizes to test cases of covered functionalities: Even though all functionalities are seen during training, the particular test cases evaluated ({t | t ∈ Ttest}) are not the same as the ones used for training (Ttrain ∩ Ttest = ∅).

2Note that any number of perturbed inputs could be used, but using only one allows fitting more test cases in a mini-batch if its size is kept constant.
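For concreteness, a small sketch of the pass-rate computation in Eq. (7); `expectation` stands for the test-type-specific function b, and `model.predict` is an assumed helper returning one prediction vector per input.

```python
def functionality_pass_rate(test_cases, model):
    """test_cases: iterable of (inputs, expectation) pairs for one functionality.

    `inputs` is the list X of a test case and `expectation` its function b,
    which returns 1 if the model behaves as expected and 0 otherwise.
    """
    passed = 0
    for inputs, expectation in test_cases:
        predictions = [model.predict(x) for x in inputs]
        passed += expectation(predictions)
    return passed / len(test_cases)
```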
Generalization to Non-Covered Phenomena: To estimate performance on non-covered phe- nomena, we construct a l-subset partition of the set of functionalities U := {Ui}l i=1. For each Ui, we use Ttrain \ Ui for training and then compute the pass rates for Ttest ∩ Ui: {sFunseen|F ∈ Ui}. That is, we fine-tune it on a set of functionalities 3We refer to the i.i.d. data as the dataset as opposed to the task. The task is more abstract, and it comes with a corresponding behavioral test suite. and evaluate it on the remaining (unseen) func- tionalities. Since U is a partition of T , by the end of the procedure there will be a pass rate for each functionality. We consider three different partitions, depend- ing on the considered generalization proxy: (1) Functionality generalization: A partition with nfunc subsets, each corresponding to a held- out functionality: Ui = {Fi}, i ∈ {1, . . . , nfunc}. We consider this a proxy of performance on non-covered functionalities: F ∈ P \ T . (2) Functionality class generalization: A par- tition with nclass subsets, each corresponding to a held-out functionality class: Ui = {Ci}, i ∈ {1, . . . , nclass}. We consider this to be a proxy of performance on non-covered functionality classes: C ⊂ P \ T . (3) Test type generalization: A partition with three subsets, each corresponding to a held-out test type: Ui = {F|F has type i}, i ∈ {MFT, INV, DIR}. We use this measure to examine gen- eralization across different test types. 3.4 Metrics For model comparison purposes, BELUGA out- puts the average pass rate (the arithmetic mean of the nfunc pass rates) as the aggregated metric for test suite correctness. Since one of the moti- vations for behavioral testing is its fine-grained results, BELUGA also reports the individual pass rates. In total, BELUGA computes five aggregated suite scores, each corresponding to an evaluation scenario: sT standard: The baseline score of a model only trained on i.i.d. data: If the other scores are lower, then fine-tuning on test suite data degraded overall model performance. sT seen: Performance on seen functionalities. This score can give a false sense of model per- formance since it does not account for model overfitting to the seen functionalities: Spurious correlations within functionalities and function- ality classes can be exploited to get deceivingly high scores. sT func: Measure of generalization to unseen functionalities. It is a more realistic measure of model quality, but since functionalities correlate within a functionality class, the score may still offer a false sense of quality. sT class: Measure of generalization to unseen functionality classes. This is the most challenging 1070 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 5 9 0 2 1 5 4 4 7 0 / / t l a c _ a _ 0 0 5 9 0 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Table 1: Examples for each i.i.d. dataset. The number of train/validation/test samples is 67k/436/436, 363k/20k/20k and 87k/5k/5k for SST-2, QQP and SQuAD, respectively. generalization setting, as the model cannot exploit correlations within functionalities and function- ality classes. sT type: Measure of generalization to unseen test types. This score is of a more technical interest: It can offer insights into how different training signals affect each other (e.g., if training with MFTs supports performance on INVs and vice- versa). Comprehensive Generalization Score: Since performance on i.i.d. 
data and passing the behav- ioral tests are both important, BELUGA provides the harmonic mean of the aggregated pass rates and the i.i.d. score as an additional metric for model comparison: G := 2 sT · siid sT + siid . (8) There are five G scores (Gstandard, Gseen, Gfunc, Gclass, and Gtype), each corresponding to plugging either sT standard, sT seen, sT func, sT class, or sT type into Eq. (8). This aggregation makes implicit importance assignments explicit: On the one hand, the har- monic mean ensures that both i.i.d. and suite per- formance are important due to its sensitivity to low scores; on the other, different phenomena are weighted differently, as i.i.d. performance has a bigger influence on the final score than each sin- gle functionality pass rate. 4 Experiments on Cross-functional Analysis 4.1 Tasks We experiment with three classification tasks that correspond to the test suites made available4 by Ribeiro et al. (2020): Sentiment analysis (SENT), 4https://github.com/marcotcr/checklist. paraphrase identification (PARA), and reading comprehension (READ).5 Tables 1 and 2 sum- marize and show representative examples from the i.i.d. and test suite datasets, respectively. Sentiment Analysis (SENT): As the i.i.d. data- set for sentiment analysis, we use the Stanford Sentiment Treebank (SST-2) (Socher et al., 2013). We use the version made available in the GLUE benchmark (Wang et al., 2018), where the task is to assign binary labels (negative/positive sen- timent) to sentences. The test set labels are not publicly available, so we split the original valida- tion set in half as our validation and test sets. The canonical metric for the dataset is accuracy. The SENT suite contains 68k MFTs, 9k DIRs, and 8k INVs. It covers functionality classes such as semantic role labeling (SRL), named entity recognition (NER), and fairness. The MFTs were template-generated, while the DIRs and INVs were either template-generated or obtained from perturbing a dataset of unlabeled airline tweets. Therefore, there is a domain mismatch between the i.i.d. data (movie reviews) and the suite data (tweets about airlines). There are also label mismatches between the two datasets: The suite contains an additional class for neutral sentiment and the MFTs have the ‘‘not negative’’ label, which admits both positive and neutral predictions. We follow Ribeiro et al. (2020) and consider predictions with probability of positive sentiment within [1/3, 2/3] as neutral.6 5These test suites were originally proposed for model evaluation. Every design choice we describe regarding opti- mization (e.g., loss functions and label encodings) is ours. 6When training, we encode ‘‘neutral’’ and ‘‘not negative’’ labels as [1/2, 1/2] and [1/3, 2/3], respectively. One alterna- tive is to create two additional classes for such cases, but this would prevent the use of the classification head fine-tuned on i.i.d. data (which is annotated with binary labels). 1071 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 5 9 0 2 1 5 4 4 7 0 / / t l a c _ a _ 0 0 5 9 0 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 5 9 0 2 1 5 4 4 7 0 / / t l a c _ a _ 0 0 5 9 0 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Table 2: Examples for each test suite. 
We color-code perturbations as red/green for deletions/additions. The number of train/validation/test samples is 89k/44k/44k, 103k/51k/51k, and 35k/17k/17k for the SENT, PARA and READ test suites, respectively. There are two types of comparison for DIRs, regarding either sentiment or prediction confi- dence. In the former case, the prediction for a perturbed input is expected to be either not more negative or not more positive when compared with the prediction for the original input. In the latter, the confidence of the original prediction is expected to either not increase or not decrease, regardless of the sentiment. For example, when adding an intensifier (‘‘really’’, ‘‘very’’) or a reducer (‘‘a little’’, ‘‘somewhat’’), the confidence of the original prediction should not decrease in the first case and not increase in the second. On the other hand, if a perturbation adds a positive or negative phrase to the original input, the posi- tive probability should not go down (up) for the first (second) case. More formally, each prediction ˆy is a two- dimensional vector where the first and second components are the confidence for negative (ˆy[0]) and positive (ˆy[1]) sentiment, respectively. Let c∗ denote the component with highest confidence in the original prediction: c∗ := argmax ˆy0. Then, the comparison function δ can take one of four forms (not more negative, not more positive, not more confident, and not less confident): δ↑p(ˆy0, ˆyi) = 1 if ˆyi[0] ≤ ˆy0[0] δ↑n(ˆy0, ˆyi) = 1 if ˆyi[1] ≤ ˆy0[1] δ↓c(ˆy0, ˆyi) = 1 if ˆyi[c∗] ≤ ˆy0[c∗] δ↑c(ˆy0, ˆyi) = 1 if ˆyi[c∗] ≥ ˆy0[c∗] Each corresponding to an error measure (cid:6): (cid:6)δ↑p(ˆy0, ˆyi) := max (0, ˆyi[0] − ˆy0[0]) (cid:6)δ↑n(ˆy0, ˆyi) := max (0, ˆyi[1] − ˆy0[1]) (cid:6)δ↓c(ˆy0, ˆyi) := max (0, ˆyi[c∗] − ˆy0[c∗]) (cid:6)δ↑c(ˆy0, ˆyi) := max (0, ˆy0[c∗] − ˆyi[c∗]) We compute the max because only test viola- tions should be penalized. Paraphrase Identification (PARA): We use Quora Question Pairs (QQP) (Iyer et al., 2017) as the i.i.d. dataset. It is composed of question pairs from the website Quora with annotation for whether a pair of questions is semantically equiv- alent (duplicates or not duplicates). The test set labels are not available, hence we split the origi- nal validation set into two sets for validation and testing. The canonical metrics are accuracy and the F1 score of the duplicate class. The PARA suite contains 46k MFTs, 13k DIRs, and 3k INVs, with functionality classes such as co-reference resolution, logic, and negation. All MFTs are template generated,7 while the INVs and DIRs are obtained from perturbing QQP data. The DIRs are similar to MFTs: Perturbed ques- tion pairs are either duplicate or not duplicate. 7The test cases from functionality ‘‘Order does matter for asymmetric relations’’ (e.g., Q1: Is Rachel faithful to Christian?, Q2: Is Christian faithful to Rachel?) were origi- nally labeled as duplicates. This seems to be unintended, so we change their label to not duplicates. 1072 For example, if two questions mention the same location and the perturbation changes the location in one of them, then the new pair is guaranteed not to be semantically equivalent. Thus, the compari- son function δ checks if the perturbed predictions correspond to the expected label; the original prediction is not used for evaluation. So during training, we treat them as MFTs: We construct mini-batches of perturbed samples and corre- sponding labels and minimize the cross-entropy between predictions and labels. 
Reading Comprehension (READ): The i.i.d. dataset for READ is the Stanford Question An- swering Dataset (SQuAD) (Rajpurkar et al., 2016), composed of excerpts from Wikipedia articles with crowdsourced questions and answers. The task is to, given a text passage (context) and a question about it, extract the context span that contains the answer. Once again, the test set la- bels are not publicly available and we repeat our splitting approach for SENT and PARA. The canonical metrics are exact string match (EM) (percentage of predictions that match ground truth answers exactly) and the more lenient F1 score, which measures average token overlap between predictions and ground truth answers. The READ suite contains 10k MFTs and 2k INVs, with functionality classes such as vocab- ulary and taxonomy. The MFTs are template generated, while the INVs are obtained from perturbing SQuAD data. Invariance training in READ has one compli- cation, since the task is to extract the answer span by predicting the start and end positions. Naively using the originally predicted positions would not work because the answer position may have changed after the perturbation. For exam- ple, let us take the original context-question pair (C: Paul traveled from Chicago to New York, Q: Where did Paul travel to?) and perturb it so that Chicago is changed to Los Angeles. The correct answer for the original input is (5, 6) as the start and end (word) positions, yielding the span ‘‘New York’’. Applying these positions to the perturbed input would extract ‘‘to New’’. In- stead, we only compare the model outputs for the positions that correspond to the common ground of original and perturbed inputs. In the example, the outputs for the tokens ‘‘Paul’’, ‘‘traveled’’, ‘‘from’’, ‘‘to’’, ‘‘New’’ and ‘‘York’’. We mini- mize the cross-entropy between this restricted set of outputs for the original and perturbed inputs. This penalizes changes in prediction for equiva- lent tokens (e.g., the probability of ‘‘Paul’’ being the start of the answer is 0.1 for the original in- put but 0.15 for the perturbed). 4.2 Generalization Methods We use BELUGA to compare several techniques used to improve generalization: L2: We apply a stronger-than-typical (cid:3)2-penalty coefficient of λ = 0.1. Dropout: We triple the dropout rate for all fully connected layers and attention probabilities from the default value of 0.1 to 0.3. LP: Instead of fine-tuning on suite data, we apply linear probing (LP), where the encoder parameters are frozen, and only the classifica- tion head parameters are updated. Previous work (Kumar et al., 2022) has found this to generalize better than full fine-tuning. LP-FT: We experiment with linear probing fol- lowed by fine-tuning, which Kumar et al. (2022) have shown to combine the benefits of fine-tuning (in-distribution performance) and linear-probing (out-of-distribution performance). Invariant Risk Minimization (IRM) (Arjovsky et al., 2019), a framework for OOD generali- zation that leverages different training environ- ments to learn feature-label correlations that are invariant across the environments, under the as- sumption that such features are not spuriously correlated with the labels. Group Distributionally Robust Optimization (Group-DRO) (Sagawa et al., 2020), an algo- rithm that minimizes not the average training loss, but the highest loss across the different train- ing environments. 
This is assumed to prevent the model from adopting spurious correlations as long as such correlations do not hold on one of the environments. Fish (Shi et al., 2022), an algorithm for do- main generalization that maximises the inner product between gradients from different train- ing environments, under the assumption that this leads models to learn features invariant across environments. For the last three methods, we treat the differ- ent functionalities as different environments. For the IID+T and IID→(IID+T) settings, we con- sider the i.i.d. data as an additional environment. In the multi-step training configurations (IID→T and IID→(IID+T)), we only apply the techniques 1073 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 5 9 0 2 1 5 4 4 7 0 / / t l a c _ a _ 0 0 5 9 0 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 5 9 0 2 1 5 4 4 7 0 / / t l a c _ a _ 0 0 5 9 0 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Table 3: i.i.d. test set performance and generalization measures (in %) of each examined method for all tasks and training configurations. The Avg. column shows the average G score across all tasks and generalization measures. We show scores significantly above and below the IID baseline (first row, suite scores are Gstandard) in green and red, respectively, and write the best score for each column in bold weight. When the score is not significantly different from the baseline counterpart, we show it in black. We use two-tailed binomial testing when comparing the i.i.d. performances, and randomization testing (Yeh, 2000) when comparing G scores, setting 0.05 as the significance level. during the second step: When training only with i.i.d. data we employ vanilla gradient descent, since we are interested in the generalization effect of using suite data. 4.3 Experimental Setting We use pre-trained BERT models (Devlin et al., 2019) for all tasks. We follow Ribeiro et al. (2020) and use BERT-base for SENT and PARA and BERT-large for READ. All our experiments use AdamW (Loshchilov and Hutter, 2019) as the op- timizer. When fine-tuning on i.i.d. data, we use the same hyper-parameters as the ones reported for models available on Hugging Face’s model zoo.8 When fine-tuning on test suite data, we run a grid search over a range of values for batch 8Available on https://huggingface.co/. The model names are textattack/bert-base-uncased-SST-2 (SENT), size, learning rate and number of epochs.9 We se- lect the configuration that performed best on Tval. To maintain the same compute budget across all methods, we do not tune method-specific hyper- parameters. We instead use values shown to work well in the original papers and previous work (Dranker et al., 2021). 5 Results and Observations 5.1 i.i.d. and Generalization Scores Table 3 exhibits i.i.d. and aggregate G scores for all tasks, training configurations, and generalization textattack/bert-base-uncased-QQP (PARA), and bert-large- uncased-whole-word-masking-finetuned-squad (READ). 9Batch size:{2, 3} for READ and {8, 16} for the others; learning rate: {2e − 5, 3e − 5, 5e − 5}; number of epochs: {1, 2, 3}. 1074 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 
1 0 1 1 6 2 / t l a c _ a _ 0 0 5 9 0 2 1 5 4 4 7 0 / / t l a c _ a _ 0 0 5 9 0 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Figure 1: Average and individual pass rates for all tasks, methods, and training configurations. From first to third row: Results for SENT, PARA, and READ. From first to fourth column: Seen evaluation, functionality generalization, functionality class generalization, and test type generalization scores. The y-axis correspond to all training configuration-method pairs; the x-axis shows the average functionality pass rate followed by the individual pass rates. The blue horizontal and vertical lines demarcate different training configurations and functionality classes, respectively. The colors in the x-axis designate the different test types: Blue for MFTs, red for INVs, and green for DIRs. methods. Figure 1 presents pass rates of indi- vidual functionalities. Seen Performance: Fine-tuning on test suite data led to improvements for all tasks: The Gseen scores are generally higher than the baseline scores (first row in Table 3). That is, models were able to generalize across test cases from covered functionalities (from Ttrain to Ttest) while retaining reasonable i.i.d. data per- formance. In some specific training configuration- method combinations this was not the case. We discuss this below when we compare methods and report the degenerate solutions. Generalization Performance: For any given configuration-method pair, Gseen is higher than Gfunc, Gclass, and Gtype, indicating a generaliza- tion gap between seen and unseen functionalities. Furthermore, for all tasks, average (across meth- ods) Gfunc is higher than average Gclass, which is higher than average Gtype,10 indicating that gen- eralization gets harder as one moves from unseen functionalities to unseen functionality classes and test types. This aligns with previous work (Luz de Araujo and Roth, 2022), in which hate speech detection models are found to generalize within— but not across—functionality classes. Improvements over the IID baseline were task-dependent. Almost all configuration-method pairs achieved Gfunc (22 of 24) and Gclass (20 of 24) scores significantly higher that the IID base- line for SENT, with improvements over the base- line as high as 18.44 and 12.84 percentage points (p.p.) for each metric, respectively. For PARA, improving over Gclass proved much harder—only seven configuration-method pairs could do so. 10SENT: 85.97/78.15/69.54, PARA: 75.04/72.22/71.55, READ: 49.23/46.66/43.46. 1075 Increases in score were also less pronounced, the best Gfunc and Gclass scores being 6.91 and 2.19 p.p. above the baseline. READ was the one with both rarer and subtler improvements, with a third of the approaches significantly improving functionality and none significantly improving functionality class generalization. Improvements in each case were as high as 4.70 and 0.51 per- centage points over the baseline. i.i.d. Performance: Fine-tuning on test suite data only (IID→T configuration) reduced perfor- mance for all tasks’ i.i.d. test sets. Fine-tuning on both suite and i.i.d. examples (IID+T and IID→ (IID+T)) helped retain—or improve—performance in some cases, but decreases were still more common. The IID→(IID+T) configuration was the most robust regarding i.i.d. scores, with an average change (compared to the IID baseline) of −1.43/−0.50/−1.73 for SENT/PARA/READ. 5.2 Training Configuration and Method Comparison Using a mixture of i.i.d. and suite samples proved essential to retain i.i.d. 
performance: The over- all scores (average over methods and i.i.d. test sets) for each configuration are 67.52, 76.33, and 87.98 for IID→T, IID+T, and IID→(IID+T), respectively. That said, the environment-based generaliza- tion algorithms (IRM, DRO, and Fish) struggled in the IID+T configuration, underperforming when compared with the other methods. We hypothe- size that in these scenarios models simply do not see enough i.i.d. data, as we treat it as just one more environment among many others (reaching as much as 54 in PARA). LP also achieves subpar scores, even though i.i.d. data is not undersam- pled. The problem here is the frozen feature en- coder, as BERT features are not good enough without fine-tuning on i.i.d. task data—as was done in the other configurations, with clear ben- efits for LP. No individual method performed best for all scores and tasks. That said, IID→(IID+T) with L2, LP, LP-FT or Fish was able to achieve Gfunc and Gclass scores higher or not significantly dif- ferent from the baseline in all though IID→(IID+T) with dropout was the best when tasks and general- score is averaged over all ization measures. Considering this same metric, IID→(IID+T) was the most consistently good tasks, configuration, with all methods improving over the average IID baseline. 5.3 DIR Applicability We have found that DIRs, as used for SENT, have limited applicability for both testing and training. The reason for that is that models are generally very confident about their predictions: The average prediction confidence for the test suite predictions is 0.97 for the IID model. On the evaluation side, this makes some DIRs impossible to fail: The confidence cannot get higher and fail ‘‘not more confident’’ expectations. On the training side, DIRs do not add much of a training signal, as the training loss is near zero from the very beginning.11 We see an additional problem with DIRs in the SENT setting: They confuse prediction con- fidence with sentiment intensity. Though pre- diction confidence may correlate with sentiment intensity, uncertainty also signals difficulty and ambiguousness (Swayamdipta et al., 2020). Con- sequently, sentiment intensity tests may not be measuring the intended phenomena. One alterna- tive would be to disentangle the two factors: Using prediction values only for confidence-based tests, and sentiment intensity tests only for sentiment analysis tasks with numeric or fine-grained labels. 5.4 Negative Transfer Though Gclass scores are generally lower than Gfunc scores, this is not always the case for the pass rates of individual functionalities. When there are contrastive functionalities within a class—those whose test cases have similar surface form but entirely different expected behaviors—it is very difficult to generalize from one to the other. For example, the SRL class in PARA contains the functionalities ‘‘order does not matter for symmetric relations’’ and ‘‘order does matter for asymmetric relations’’ (functionalities 41 and 42 in the second row of Figure 1). Their test cases are generated by nearly identical templates where the only change is the relation placeholder. Ex- amples from the first and second functionalities would include (Q1: Is Natalie dating Sophia? Q2: Is Sophia dating Natalie?) and (Q1: Is Matthew lying to Nicole? Q2: Is Nicole lying to Matthew?) respectively. Though their surface forms are 11Confidence regularization (Yu et al., 2021) could poten- tially increase DIR’s usefulness for training and evaluation purposes. 
1076 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 5 9 0 2 1 5 4 4 7 0 / / t l a c _ a _ 0 0 5 9 0 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 similar, they have opposite labels: duplicate and not duplicate. To compute sT func, a model is trained with samples from one functionality and evaluated on samples from the other. Consequently, the surface form will be spuriously correlated with the label seen during training and models may blindly as- sign it to the question pairs that fit the template. This would work well for the seen functionality, but samples from the unseen one would be en- tirely misclassified. Conversely, when computing the sT class score, the model will not have been trained on either of the functionalities and will not have the chance to adopt the heuristic, lead- ing to better unseen pass rates. 5.5 Degenerate Solutions Settings where the Gtype score is higher than the baseline are much rarer than for the other measures, happening only in one case for SENT (IID→T with dropout) and never for READ. One explanation is that training only on perturbation- based tests (with no MFTs) can lead to degener- ate solutions, such as passing all tests by always predicting the same class. To assess if that was the case, we examined the predictions on the SST-2 test set of the IID→T vanilla model fine-tuned only on DIRs and INVs. We have found that 95.18% of the i.i.d. data points were predicted as negative, though the ground truth frequency for that la- bel is 47.25%. When examining the predictions for MFTs, the results are even more contrasting: 0.29% of the predictions were negative, with the ground truth frequency being 43.42%. These re- sults show that the model has, indeed, adopted the degenerate solution. Interestingly, it predicts different classes depending on the domain, al- most always predicting negative for i.i.d. data and positive for suite data. The gap between Gclass and Gtype scores in PARA is not as severe, possibly due to the super- vised signal in its DIRs. Since these tests expect inputs to correspond to specific labels—as op- posed to DIRs for SENT, which check for changes in prediction confidence—always predicting the same class would not be a good solution. In- deed, when examining the predictions on the QQP test set of the vanilla IID→T model fine-tuned with no MFT data, we see that 58.70% of ques- tion pairs are predicted as not duplicate, which is similar to the ground truth frequency, 63.25%. The same is true when checking the predictions for MFTs: 64.47% of the data points are predicted as not duplicate, against a ground truth frequency of 52.46%. The READ scenario is more complex—instead of categories, spans are extracted. Manual inspec- tion showed that some IID→T models adopted degenerate solutions (e.g., extracting the first word, a full stop or the empty span as the answer), even when constrained by the MFT supervised sig- nal. Interestingly, the degenerate solutions were applied only for INV tests (where such invariant predictions work reasonably) and i.i.d. examples (where they do not). On the other hand, these models were able to handle the MFTs well, ob- taining near perfect scores and achieving high sT seen scores even though i.i.d. performance is catastrophic. 
The first grid of the third row in Figure 1 illustrates this: The high sT seen scores are shown on the first column, and the MFT pass rates on the columns with blue x-axis numbers. 5.6 Summary Interpretation of the Results Figure 1 Figure 1 supports fine-grained anal- yses that consider performance on individual functionalities in each generalization scenario. One can interpret it horizontally to assess the functionality pass rates for a particular method. For example, the bottom left grid, representing seen results for READ, shows that IID+T with LP behaves poorly on almost all functionalities, confirming the importance of fine-tuning BERT pre-trained features (§ 5.2). Alternatively, one can interpret it vertically to assess performance and generalization trends for individual functionalities. For example, mod- els generalized well to functionality 21 of the READ suite (second grid of the bottom row), with most methods improving over the IID baseline. However, under the functionality class evaluation scenario (third grid of the bottom row), improve- ments for functionality 21 are much rarer. That is, the models were able to generalize to functionality 21 as long as they were fine-tuned on cases from functionalities from the same class (20 and 22).12 Such fine-grained analyses show the way for more targeted explorations of generalization (e.g., 12These functionalities assess co-reference resolution ca- pabilities: 20 and 21 have test cases with personal and pos- sessive pronouns, respectively; 22 tests whether the model distinguishes ‘‘former’’ from ‘‘latter’’. 1077 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 5 9 0 2 1 5 4 4 7 0 / / t l a c _ a _ 0 0 5 9 0 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 why do models generalize to functionality 21 but not to functionality 20?), which can guide subsequent data annotation, selection and crea- tion efforts, and shed light on model limitations. Table 3 For i.i.d. results, we refer to the SST2, QQP, and SQuAD columns. These show that the suite-augmented configuration and methods (all rows below and including IID→T Vanilla) generally hurt i.i.d. performance. However, im- provements can be found for some methods in the IID+T and IID→(IID+T). Takeaway: Fine- tuning on behavioral tests degrades model general performance, which can be mitigated by jointly fine-tuning on i.i.d. samples and behavioral tests. For performance concerning seen functionali- ties, we refer to the Gseen columns. Generalization scores concerning unseen functionalities, func- tionality classes, and test types can be found in the Gfunc, Gclass, and Gtype columns. Across all tasks, training configurations, and methods, the Gseen scores are higher than the others. Takeaway: Evaluating only on the seen func- tionalities (Liu et al., 2019; Malon et al., 2022) is overoptimistic—improving performance on seen cases may come at the expense of degra- dation on unseen cases. This is detected by the underperforming generalization scores. Previous work on generalization in behav- ioral learning (Luz de Araujo and Roth, 2022; Rozen et al., 2019) corresponds to the IID→T Vanilla row. It shows deterioration of i.i.d. scores, poor generalization in some cases, and lower average performance compared with the IID base- line. However, our experiments with additional methods (all rows below IID→T Vanilla), show that some configuration-method combinations improve the average performance. 
Takeaway: While naive behavioral learning generalizes poorly, more sophisticated algorithms can lead to improvements. BELUGA is a method that detects and measures further algorithmic improvements. 6 Related Work Traditional NLP benchmarks (Wang et al., 2018, 2019) are composed of text corpora that re- flect the naturally occurring language distribution, which may fail to sufficiently capture rarer, but important phenomena (Belinkov and Glass, 2019). Moreover, since these benchmarks are commonly split into identically distributed train and test sets, spurious correlations in the former will gener- ally hold for the latter. This may lead to the ob- fuscation of unintended behaviors, such as the adoption of heuristics that work well for the data distribution but not in general (Linzen, 2020; McCoy et al., 2019). To account for these short- comings, complementary evaluations methods have been proposed, such as using dynamic bench- marks (Kiela et al., 2021) and behavioral test suites (Kirk et al., 2022; R¨ottger et al., 2021; Ribeiro et al., 2020). A line of work has explored how training on challenge and test suite data affects model perfor- mance by fine-tuning on examples from specific linguistic phenomena and evaluating on other sam- ples from the same phenomena (Malon et al., 2022; Liu et al., 2019). This is equivalent to our seen evaluation scenario, and thus cannot distin- guish between models with good generalization and those that have overfitted to the seen phe- nomena. We account for that with our additional generalization measures, computed using only data from held-out phenomena. Other efforts have also used controlled data splits to examine generalization: McCoy et al. (2019) have trained and evaluated on data from disjoints sets of phenomena relevant for Natural Language Inference (NLI); Rozen et al. (2019) have split challenge data according to sentence length and constituency parsing tree depth, cre- ating a distribution shift between training and evaluation data; Luz de Araujo and Roth (2022) employ a cross-functional analysis of generaliza- tion in hate speech detection. Though these works address the issue of overfitting to seen phenom- ena, their analyses are restricted to specific tasks and training configurations. Our work gives a more comprehensive view of generalization of behavioral learning by examining different tasks, training configurations, test types, and metrics. Additionally, we use this setting as an opportunity to compare the generalization impact of both sim- ple regularization mechanisms and state-of-the- art domain generalization algorithms. 7 Conclusion We have presented BELUGA, a framework for cross-functional analysis of generalization in NLP systems that both makes explicit the desired system traits and allows for quantifying and 1078 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 5 9 0 2 1 5 4 4 7 0 / / t l a c _ a _ 0 0 5 9 0 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 examining several axes of generalization. While in this work we have used BELUGA to analyze data from behavioral suites, it can be applied in any setting where one has access to data structured into meaningful groups (e.g., demographic data, linguistic phenomena, domains). 
We have shown that, while model perfor- mance for seen phenomena greatly improves after fine-tuning on test suite data, the generalization scores reveal a more nuanced view, in which the actual benefit is less pronounced and depends on the task and training configuration-method combination. We have found the IID→(IID+T) configuration to result in the most consistent im- provements. Conversely, some methods struggle in the IID→T and IID+T settings by overfitting to the suite or underfitting i.i.d. data, respectively. In these cases, a model both practically aces all tests and fails badly for i.i.d. data, which rein- forces the importance of considering both i.i.d. and test suite performance when comparing sys- tems, which is accounted for by BELUGA’s ag- gregate scores. These results show that naive behavioral learn- ing has unintended consequences, which the IID→(IID+T) configuration mitigates to some degree. There is still much room for improvement, though, especially if generalization to unseen types of behavior is desired. Through BELUGA, progress in that direction is measurable, and further algorithmic improvements might make be- havioral learning an option to ensure desirable behaviors and preserve general performance and generalizability of the resulting models. We do not recommend training on behavioral tests in the current technological state. Instead, we show a way to improve research on reconciling the qual- itative guidance of behavioral tests with desired generalization in NLP models. Acknowledgments We thank the anonymous reviewers and action editors for the helpful suggestions and detailed comments. We also thank Matthias Aßenmacher, Luisa M¨arz, Anastasiia Sedova, Andreas Stephan, Lukas Thoma, Yuxi Xia, and Lena Zellinger for the valuable discussions and feedback. This re- search has been funded by the Vienna Science and Technology Fund (WWTF) [10.47379/VRG19008] ‘‘Knowledge-infused Deep Learning for Natural Language Processing’’. References Mart´ın Arjovsky, L´eon Bottou, Ishaan Gulrajani, and David Lopez-Paz. 2019. Invariant risk min- imization. CoRR, abs/1907.02893v3. https:// doi.org/10.48550/arXiv.1907.02893 Yonatan Belinkov and James Glass. 2019. Anal- ysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7:49–72. https:// doi.org/10.1162/tacla00254 Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguis- tics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Com- putational Linguistics. https://doi.org /10.18653/v1/N19-1423 Yana Dranker, He He, and Yonatan Belinkov. 2021. IRM—when it works and when it doesn’t: A test case of natural language inference. In Advances in Neural Information Process- ing Systems, volume 34, pages 18212–18224. Curran Associates, Inc. Shankar Iyer, Nikhil Dandekar, and Korn´el Csernai. 2017. First quora dataset release: Question pairs. Available online at https:// quoradata.quora.com/First-Quora -Dataset-Release-Question-Pairs. 
Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia, Zhiyi Ma, Tristan Thrush, Sebastian Riedel, Zeerak Waseem, Pontus Stenetorp, Robin Jia, Mohit Bansal, Christopher Potts, and Adina Williams. 2021. Dynabench: Rethinking benchmarking in NLP. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4110–4124, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.naacl-main.324

Hannah Kirk, Bertie Vidgen, Paul Rottger, Tristan Thrush, and Scott Hale. 2022. Hatemoji: A test suite and adversarially-generated dataset for benchmarking and detecting emoji-based hate. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1352–1368, Seattle, United States. Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.naacl-main.97

Ananya Kumar, Aditi Raghunathan, Robbie Matthew Jones, Tengyu Ma, and Percy Liang. 2022. Fine-tuning can distort pretrained features and underperform out-of-distribution. In Proceedings of the 10th International Conference on Learning Representations, Online. OpenReview.net.

Tal Linzen. 2020. How can we accelerate progress towards human-like linguistic generalization? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5210–5217, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.465

Nelson F. Liu, Roy Schwartz, and Noah A. Smith. 2019. Inoculation by fine-tuning: A method for analyzing challenge datasets. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2171–2179, Minneapolis, Minnesota. Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1225

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In Proceedings of the 7th International Conference on Learning Representations, New Orleans, LA, USA. OpenReview.net.

Pedro Henrique Luz de Araujo and Benjamin Roth. 2022. Checking HateCheck: A cross-functional analysis of behaviour-aware learning for hate speech detection. In Proceedings of NLP Power! The First Workshop on Efficient Benchmarking in NLP, pages 75–83, Dublin, Ireland. Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.nlppower-1.8

Zhiyi Ma, Kawin Ethayarajh, Tristan Thrush, Somya Jain, Ledell Wu, Robin Jia, Christopher Potts, Adina Williams, and Douwe Kiela. 2021. Dynaboard: An evaluation-as-a-service platform for holistic next-generation benchmarking. In Advances in Neural Information Processing Systems, volume 34, pages 10351–10367. Curran Associates, Inc.

Christopher Malon, Kai Li, and Erik Kruus. 2022. Fast few-shot debugging for NLU test suites. In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pages 79–86, Dublin, Ireland and Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.deelio-1.8

Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1334

Timothy Niven and Hung-Yu Kao. 2019. Probing neural network comprehension of natural language arguments. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4658–4664, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1459

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics. https://doi.org/10.18653/v1/D16-1264

Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.442

Paul Röttger, Bertie Vidgen, Dong Nguyen, Zeerak Waseem, Helen Margetts, and Janet Pierrehumbert. 2021. HateCheck: Functional tests for hate speech detection models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 41–58, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-long.4

Ohad Rozen, Vered Shwartz, Roee Aharoni, and Ido Dagan. 2019. Diversify your datasets: Analyzing generalization via controlled variance in adversarial datasets. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 196–205, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/K19-1019

Shiori Sagawa, Pang Wei Koh, Tatsunori B. Hashimoto, and Percy Liang. 2020. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In Proceedings of the 8th International Conference on Learning Representations, Addis Ababa, Ethiopia. OpenReview.net.

Yuge Shi, Jeffrey Seely, Philip H. S. Torr, N. Siddharth, Awni Hannun, Nicolas Usunier, and Gabriel Synnaeve. 2022. Gradient matching for domain generalization. In Proceedings of the 10th International Conference on Learning Representations, Virtual. OpenReview.net.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.
Swabha Swayamdipta, Roy Schwartz, Nicholas Lourie, Yizhong Wang, Hannaneh Hajishirzi, Noah A. Smith, and Yejin Choi. 2020. Dataset cartography: Mapping and diagnosing datasets with training dynamics. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9275–9293, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.746

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-5446

Tongshuang Wu, Marco Tulio Ribeiro, Jeffrey Heer, and Daniel Weld. 2019. Errudite: Scalable, reproducible, and testable error analysis. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 747–763, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1073

Alexander Yeh. 2000. More accurate tests for the statistical significance of result differences. In COLING 2000 Volume 2: The 18th International Conference on Computational Linguistics.

Yue Yu, Simiao Zuo, Haoming Jiang, Wendi Ren, Tuo Zhao, and Chao Zhang. 2021. Fine-tuning pre-trained language model with weak supervision: A contrastive-regularized self-training approach. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1063–1077, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.naacl-main.84

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1472
