Sensitivity as a Complexity Measure for Sequence Classification Tasks
Michael Hahn
Stanford University,
USA
mhahn2@stanford.edu
Dan Jurafsky
Stanford University,
USA
jurafsky@stanford.edu
Richard Futrell
University of California, Irvine,
USA
rfutrell@uci.edu
Abstract
We introduce a theoretical framework for
understanding and predicting the complexity
of sequence classification tasks, using a novel
extension of the theory of Boolean function
sensitivity. The sensitivity of a function, given
a distribution over input sequences, quan-
tifies the number of disjoint subsets of the
input sequence that can each be individually
changed to change the output. We argue that
standard sequence classification methods are
biased towards learning low-sensitivity functions,
so that tasks requiring high sensitivity
are more difficult. To that end, we show ana-
lytically that simple lexical classifiers can only
express functions of bounded sensitivity, and
we show empirically that low-sensitivity func-
tions are easier to learn for LSTMs. We then
estimate sensitivity on 15 NLP tasks, finding
that sensitivity is higher on challenging tasks
collected in GLUE than on simple text classi-
fication tasks, and that sensitivity predicts the
performance both of simple lexical classifiers
and of vanilla BiLSTMs without pretrained
contextualized embeddings. Within a task,
sensitivity predicts which inputs are hard for
such simple models. Our results suggest that
the success of massively pretrained contextual
representations stems in part from the fact that they provide
representations from which information
can be extracted by low-sensitivity decoders.
1 Introduction
What makes some tasks harder and others easier
for modern machine learning methods?1 In NLP,
simple models based on lexical classifiers provide
good performance on some tasks, while strong
performance on other tasks has been attained only
recently with massive pretrained models. However,
there is no unified theoretical framework
for understanding these difficulty differences
between tasks, or what models might be more or
less effective.
1Code: https://github.com/m-hahn/sensitivity.
Existing complexity metrics provide limited
practical insight. The Chomsky Hierarchy (Chomsky,
1956) is a prominent classification of formal lan-
guages by complexity, but it describes asymptotic
worst-case complexity and does not provide a
measure of how hard it is to achieve high accuracy
on realistic task distributions. Kolmogorov com-
plexity (Li and Vitányi, 1993) is uncomputable
and becomes well-defined only in the asymptotic
limit. Psycholinguistic complexity metrics such
as surprisal (Hale, 2001) and dependency length
(Gibson, 1998) only capture formal features of
the input, without regard to the task.
We propose sensitivity as a theory of complexity
for sequence classification tasks, that
is, any task involving learning a function from
sequences to labels. The sensitivity of a function,
given a distribution over input sequences, quan-
tifies the number of disjoint subsets of the input
sequence that can each be individually changed
in such a way as to change the output. Intuitively,
high-sensitivity functions are complex because
a single change in the input, in many differ-
ent places, can completely change the output;
low-sensitivity functions are simpler because the
output is predictable from redundant information
in many subsets of the input. We will argue that
sensitivity predicts what tasks are easy or hard for
modern machine learning methods to learn.
Our notion of sensitivity is grounded in a well-
studied theory for Boolean functions (O’Donnell,
2014), which we generalize to natural language.
Unlike measures like Kolmogorov complexity,
sensitivity can be estimated on real datasets and
single inputs without asymptotic approximations,
only requiring a generalized language model such
as XLNet (Yang et al., 2019) and a strong model
of the task.
In this paper, we argue that sensitivity captures
informal notions of complexity both at the level
of architectures and on the level of tasks. First, we
show that sensitivity quantifies architectural lim-
itations and inductive biases of various machine
Transactions of the Association for Computational Linguistics, vol. 9, pp. 891–908, 2021. https://doi.org/10.1162/tacl_a_00403
Action Editor: Annie Louis. Submission batch: 2/2021; Revision batch: 4/2021; Published 8/2021.
© 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
learning architectures used in NLP, including both
lexical classifiers and vanilla LSTMs without
pretrained contextualized embeddings (Section 3).
Second, in a survey of 15 major NLP tasks, we
find that sensitivity quantitatively predicts how
difficult a task is for simple lexical classifiers
and neural models, both across tasks and across
different inputs for a single task (Section 4). The
validity of our methods for quantifying sensitivity
is verified using human experiments in Section 5.
Section 6 discusses the relationship of sensitivity
to previous theories of complexity and brittleness
in neural networks, and implications for NLP
practice. Section 7 concludes.
2 Sensitivity
2.1 Analysis of Boolean Functions
We build on notions of sensitivity developed for
Boolean functions (Kahn et al., 1988; Hatami
et al., 2010; O’Donnell, 2014). Analysis of
Boolean functions is a powerful and rigorous
theory with wide-ranging applications in theoreti-
cal computer science (O’Donnell, 2014). We first
introduce the relevant notions, and then explain
how these concepts can be generalized to the setting
of fully general sequence classification. The
sensitivity of a Boolean function f : {−1, 1}^n →
{−1, 1} at a bitstring x ∈ {−1, 1}^n is defined as:
\[ s(f, x) = \sum_{i=1}^{n} \mathbb{1}_{f(x) \neq f(x^{\oplus i})}, \tag{1} \]
where x⊕i is the result of flipping the i-th bit of x.
This describes how many bits of x can be flipped
individually to change f, or equivalently, how
many Hamming neighbors of x have the opposite
value of f .
The highest possible sensitivity is attained
by the PARITY function fParity(x) := \prod_{i=1}^{n} x_i.
Given a string of ‘‘1’’s and ‘‘−1’’s, this function
counts whether the number of negative inputs
is even (output +1) or odd (output −1). For
instance, fParity(1, 1, 1) = fParity(1, −1, −1) = 1
and fParity(1, −1, 1) = fParity(−1, 1, 1) = −1. The
function fParity has the property that flipping any
individual bit flips the output. For example, given
the string ‘‘1 1 1’’, changing any of the three input
symbols to ‘‘−1’’ flips the parity of the string
from +1 to −1. Therefore, for every bitstring
x ∈ {−1, 1}^n, we have s(fParity, x) = n. It is
impossible to approximate fParity beyond chance
level with linear functions (Minsky and Papert,
1969), or with linear combinations of functions
that contain nonlinear interactions between fewer
than n input bits (O’Donnell, 2014). In this
sense, the function fParity is maximally nonlinear.
On the other hand, low-sensitivity functions can
be approximated with linear functions or linear
combinations of functions that each only com-
bine a few input bits (O’Donnell, 2014, Thm.
2.38). Sensitivity also has close connections with
other complexity measures such as decision tree
depth (Nisan, 1991) and the degree of a Boolean
function when written as a polynomial.
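As a concrete illustration of Equation (1), the following minimal Python sketch computes sensitivity by flipping one bit at a time and confirms that PARITY attains sensitivity n at every input. The majority function does not appear in the text; it is included here only as an assumed low-sensitivity contrast, and all function names are illustrative.

```python
from itertools import product
from typing import Callable, Tuple

Bits = Tuple[int, ...]  # each entry is -1 or +1


def flip(x: Bits, i: int) -> Bits:
    """Return x with the i-th bit negated."""
    return x[:i] + (-x[i],) + x[i + 1:]


def sensitivity(f: Callable[[Bits], int], x: Bits) -> int:
    """Count the positions whose individual flip changes f(x), as in Eq. (1)."""
    return sum(f(x) != f(flip(x, i)) for i in range(len(x)))


def parity(x: Bits) -> int:
    """Product of the bits: +1 for an even number of -1 entries, else -1."""
    out = 1
    for b in x:
        out *= b
    return out


def majority(x: Bits) -> int:
    """Sign of the sum of the bits (odd lengths only, so there are no ties)."""
    return 1 if sum(x) > 0 else -1


n = 5
inputs = list(product((-1, 1), repeat=n))
parity_sens = [sensitivity(parity, x) for x in inputs]
majority_sens = [sensitivity(majority, x) for x in inputs]
avg_majority = sum(majority_sens) / len(inputs)

print("parity sensitivities:", set(parity_sens))
print("majority: max", max(majority_sens), "average", avg_majority)
```

For n = 5, every input of PARITY has sensitivity 5, whereas majority is sensitive only at inputs that are one flip away from a tie, so its average sensitivity is well below n.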
2.2 Application to Sequence Classification
We argue that this theory can be brought to bear
to quantify the complexity of sequence classifi-
cation tasks. In this setting, sensitivity measures
the nonlinearity of the decision boundary. Low
sensitivity tasks are those where simple methods
based on linear combinations of local features
are most successful. For example, low-sensitivity
tasks can be solved by bag-of-words classifiers
and linear classifiers based on n-gram features,
which have bounded sensitivity (as we will make
precise in Proposition 1 below). On the other
hand, high-sensitivity tasks require more sophisticated
methods. We expect that tasks that have
proven empirically difficult in the literature, such
as those requiring reasoning, correspond to those
with high sensitivity, which means that changing
different substrings in an input can easily flip the
label (e.g., ENTAILMENT ⇒ NONENTAILMENT).
Testing these ideas requires generalizing sen-
sitivity to functions more akin to those relevant
in NLP along several aspects. One aspect can
be dealt with without major changes: NLP tasks
are defined on alphabets Σ with more than two
elements, such as the words of a language. The
theory can be accommodated to such alphabets,
leading to a generalized definition of sensitivity
applicable when the symbols Xi are distributed
independently and uniformly (rephrased based on
O’Donnell, 2014, Def. 8.22):
\[ s(f, x) := \sum_{i=1}^{n} \mathrm{Var}\left( f(X) \mid \forall j \neq i : X_j = x_j \right), \tag{2} \]
where the variance measures how much f varies
across strings X ∈ Σn that agree with x on
all except possibly the i-th input. Definition (2)
Figure 1: Subset sensitivity (3) for sentiment analysis, for two inputs from the SST-2 dev set. For each input, we
select a one-word subsequence (marked in blue, corresponding to sets {2} for Sentence 1, and {3} for Sentence
2), and show 10 possible substitutions sampled using XLNet (see Section 4; ‘‘2×’’ indicates samples appearing
twice). We show sentiment prediction (between −1.0 for negative and +1.0 for positive sentiment), obtained
using RoBERTa (see Section 4), both for the original sentence and each version arising from substituting any
of the other adjectives. In Sentence 1, due to the presence of positive adjectives in the context, the distribution
is concentrated on positive adjectives; f(x′) = +1 for each sampled x′ ∈ x⊕P. Therefore, subset sensitivity
s(f, x, P) is estimated as 0.0. In Sentence 2, both positive and negative adjectives are plausible substitutions, and
s(f, x, P) = 0.58.
reduces to (1) if Σ = {−1, 1} and f : {−1, 1}n →
{−1, 1}.
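The reduction from Definition (2) to Definition (1) can be checked directly. The sketch below is a minimal illustration, assuming symbols are independent and uniform over a small alphabet as Definition (2) requires; the three-symbol example function is hypothetical and serves only to show a non-binary alphabet.

```python
from itertools import product
from typing import Callable, Sequence, Tuple


def variance(values: Sequence[float]) -> float:
    """Population variance of equally likely values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def generalized_sensitivity(f: Callable, x: Tuple, alphabet: Sequence) -> float:
    """Eq. (2): sum over positions of Var(f) when only that position is resampled."""
    total = 0.0
    for i in range(len(x)):
        neighbors = [x[:i] + (a,) + x[i + 1:] for a in alphabet]
        total += variance([f(z) for z in neighbors])
    return total


def flip_count(f: Callable, x: Tuple) -> int:
    """Eq. (1): number of single-bit flips that change f(x)."""
    return sum(f(x) != f(x[:i] + (-x[i],) + x[i + 1:]) for i in range(len(x)))


def parity(x: Tuple) -> int:
    out = 1
    for b in x:
        out *= b
    return out


def majority(x: Tuple) -> int:
    return 1 if sum(x) > 0 else -1


# On the binary alphabet, both definitions agree for {-1, 1}-valued functions.
binary_agrees = all(
    generalized_sensitivity(f, x, (-1, 1)) == flip_count(f, x)
    for f in (parity, majority)
    for x in product((-1, 1), repeat=3)
)


def starts_with_a(x: Tuple) -> int:
    """Hypothetical classifier over a three-symbol alphabet."""
    return 1 if x[0] == "a" else -1


ternary_value = generalized_sensitivity(starts_with_a, ("a", "b"), ("a", "b", "c"))
print(binary_agrees, ternary_value)
```

On a binary alphabet the conditional variance of a {−1, 1}-valued function is 1 when the flip changes the output and 0 otherwise, which is why the two quantities coincide.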
More challenging is the fact that symbol
sequences in language are not distributed uniformly.
For example, in movie review sentiment
classification, most inputs will sound like movie
reviews (rather than tweets or Wikipedia articles),
and almost all will respect the grammatical and
statistical properties of the underlying language.
When defining a generalization of s(f, x) to natural
language, we want to focus on those strings
x and their Hamming neighbors x′ that are typical
instances of the problem. We next describe an
adaptation of Equations (1) and (2) taking this
into account.
2.3 Formal Definitions
In order to adapt the idea of sensitivity to the
setting of NLP tasks, we introduce a generalized
notion called block sensitivity. Block sensitiv-
ity is the maximum sensitivity over all possible
partitions of the input bits. Block sensitivity has
been studied for Boolean functions as an upper
bound on (1) (Nisan, 1991; Bernasconi, 1996;
Hatami et al., 2010); we construct a probabilistic
version of this notion as a sensitivity measure
appropriate to more sequence classification tasks.
Consider a set Σ (e.g., the words of a language),
with an arbitrary distribution Π over
the set Σ∗ of finite sequences of symbols from
Σ. We formalize classification tasks as functions
f : Σ∗ → [−1, 1].2 Such functions could be binary
classifiers f mapping to {−1, 1}, or they could
output a continuous score. We take the output
space to be [−1, 1] instead of [0, 1] to make our
definitions consistent with those from the analysis
of Boolean functions.
2For multi-class problems, we take a family of functions
f corresponding to the classes; see Section 4.
The subset sensitivity of the function f : Σ∗ →
R on the point x ∈ Σ^n and the set P ⊂ {1, . . . , n}
is
\[ s(f, x, P) := \mathrm{Var}\left( f(X) \mid X \in x^{\oplus P} \right), \tag{3} \]
where x⊕P denotes the set of all strings x′ that
agree with x on all indices outside of P:
\[ x^{\oplus P} := \{ x' \in \Sigma^n : x'_j = x_j \text{ for all } j \in \{1, \dots, n\} \setminus P \}, \tag{4} \]
and the variance is computed with respect to Π.
If P is a singleton {i}, we recover the term inside
the sum in (2): s(f, x) = \sum_{i=1}^{n} s(f, x, {i}).
We illustrate this definition in Figure 1 with
examples from the Stanford Sentiment Treebank
(Socher et al., 2013). Here, the function f maps
movie reviews to the probability that the review is
positive, scaled to [−1, 1]. For each sentence, we
select a singleton subset P and show 10 samples
from Π, the distribution over possible substitutions.
In Sentence 1, due to the positive adjectives
in the context, the distribution is concentrated
on positive adjectives, and so the sensitivity
s(f, x, P) ≈ 0. In Sentence 2, both positive and
negative adjectives are plausible substitutions,
and s(f, x, P) ≈ 0.6.
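The estimate of Equation (3) from a handful of sampled substitutions can be sketched as follows. This is a minimal illustration under stated assumptions: the lexicon-based scorer stands in for a trained model such as RoBERTa, the substitution lists stand in for samples drawn from a language model such as XLNet, the sentences are invented rather than taken from SST-2, and the use of the population variance over equally weighted samples is our assumption.

```python
from typing import Callable, List, Sequence

# Hypothetical word lists for a toy sentiment scorer (not from the paper).
POSITIVE = {"gorgeous", "witty", "seductive", "good", "great"}
NEGATIVE = {"boring", "dull", "bad", "tedious"}


def toy_sentiment(tokens: Sequence[str]) -> float:
    """Return +1.0 if positive words are at least as frequent as negative ones."""
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return 1.0 if score >= 0 else -1.0


def substitute(tokens: Sequence[str], positions: Sequence[int],
               replacement: Sequence[str]) -> List[str]:
    """Copy `tokens`, overwriting the given positions with `replacement`."""
    out = list(tokens)
    for pos, tok in zip(positions, replacement):
        out[pos] = tok
    return out


def estimate_subset_sensitivity(f: Callable[[Sequence[str]], float],
                                tokens: Sequence[str],
                                positions: Sequence[int],
                                samples: Sequence[Sequence[str]]) -> float:
    """Variance of f over sampled substitutions at `positions` (cf. Eq. 3)."""
    values = [f(substitute(tokens, positions, s)) for s in samples]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


# Context that only admits positive adjectives: the samples never flip f.
sentence_1 = "a gorgeous witty seductive movie".split()
samples_1 = [["gorgeous"], ["great"], ["witty"], ["good"]]
low = estimate_subset_sensitivity(toy_sentiment, sentence_1, [1], samples_1)

# Context that admits both polarities: half of the samples flip f.
sentence_2 = "the movie was good".split()
samples_2 = [["good"], ["bad"], ["great"], ["dull"]]
high = estimate_subset_sensitivity(toy_sentiment, sentence_2, [3], samples_2)

print(low, high)
```

The estimate depends on both ingredients: a sampler concentrated on one polarity yields zero variance even when the model itself would react to a different word.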
This example shows how (3) differs from the
vanilla definition (1) by accounting for the sta-
tistical dependencies between words in natural
language: It takes into account that the choice
of possible completions for a set P is often con-
strained by the context given by x. Inputs violating
these statistical dependencies (e.g., ‘a boring,
witty, seductive movie’ for Figure 1) are unlikely
to occur in naturalistic input, and the behavior of f
on such unlikely inputs may not impact the diffi-
culty of representing f with high average fidelity.
This motivates considering the variance of f over
neighboring strings, instead of, say, the entire
range of f over all possible neighboring strings.
Based on subset sensitivity, we introduce the
block sensitivity at x as an analogue to (1):
\[ bs(f, x) := \max_{k,\; P_1 \dot\cup \dots \dot\cup P_k} \sum_{i=1}^{k} s(f, x, P_i), \tag{5} \]
where the maximization ranges over all
partitionings of {1, . . . , n} into disjoint subsets
P1 ˙∪ . . . ˙∪ Pk ( ˙∪ denoting disjoint union). We
recover the quantity s(f, x) in (1) and (2) by
restricting subsets Pi to the singletons {i}; thus,
we have
\[ bs(f, x) \geq s(f, x). \tag{6} \]
Intuitively, bs(F, X) measures the following:
Given an input x, how many disjoint subse-
quences can be changed individually so as to flip
the label? The formal definition modifies this
logic by considering, for each subsequence, not only
whether changing it to flip the label is possible
in principle, but also the probabilities of the different
changes. A useful summary statistic is the
average block sensitivity:
\[ \widetilde{bs}(f) = \mathbb{E}_{x \sim \Pi} \left[ bs(f, x) \right]. \tag{7} \]
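For very short inputs, Equations (5) and (7) can be evaluated exactly by enumerating all partitions. The sketch below is a minimal illustration, assuming independent uniform bits as the distribution Π; the function `all_ones` is a hypothetical example chosen to show a case where block sensitivity strictly exceeds the singleton-based sensitivity.

```python
from itertools import product
from typing import Callable, Iterator, List, Tuple

Bits = Tuple[int, ...]


def set_partitions(items: List[int]) -> Iterator[List[List[int]]]:
    """Yield every partition of `items` into nonempty disjoint blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Add `first` to each existing block in turn ...
        for j in range(len(partition)):
            yield partition[:j] + [[first] + partition[j]] + partition[j + 1:]
        # ... or give it a block of its own.
        yield [[first]] + partition


def subset_sensitivity(f: Callable[[Bits], float], x: Bits, block: List[int]) -> float:
    """Eq. (3) under uniform bits: variance of f when only `block` is resampled."""
    values = []
    for bits in product((-1, 1), repeat=len(block)):
        z = list(x)
        for pos, b in zip(block, bits):
            z[pos] = b
        values.append(f(tuple(z)))
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def block_sensitivity(f: Callable[[Bits], float], x: Bits) -> float:
    """Eq. (5): best total subset sensitivity over all partitions of the positions."""
    return max(
        sum(subset_sensitivity(f, x, block) for block in partition)
        for partition in set_partitions(list(range(len(x))))
    )


def average_block_sensitivity(f: Callable[[Bits], float], n: int) -> float:
    """Eq. (7) under the uniform distribution over {-1, 1}^n."""
    inputs = list(product((-1, 1), repeat=n))
    return sum(block_sensitivity(f, x) for x in inputs) / len(inputs)


def parity(x: Bits) -> int:
    out = 1
    for b in x:
        out *= b
    return out


def all_ones(x: Bits) -> int:
    """Hypothetical example: +1 only when every bit is +1."""
    return 1 if all(b == 1 for b in x) else -1


print(average_block_sensitivity(parity, 3))
print(block_sensitivity(all_ones, (-1, -1, -1)))
```

At the input (−1, −1, −1), no single flip changes `all_ones`, so the singleton partition contributes nothing, while the block containing all three positions has nonzero variance. The enumeration grows with the Bell numbers and is only practical for a handful of positions.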
Why Consider Subsets? By considering sub-
sets P instead of single indices i, block sensitivity
takes into account that words are composed into
phrases, and that changing a phrase might change
the meaning when changing any individual word
cannot. For example, exchanging the entire phrase
‘a gorgeous, witty, seductive’ (see Figure 1) with
something negative can make the review negative,
whereas exchanging any of the individual adjectives
cannot, due to the statistical dependencies
between the different words. This definition also
makes the sensitivity measure robust against tokenization:
a more fine-grained tokenization (e.g.,
into characters) cannot decrease bs(f, x).
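The effect of statistical dependencies on single-word versus phrase-level substitutions can be made concrete with a toy distribution. In the sketch below, the two-string distribution and the classifier are invented for illustration: the two adjectives always share the same polarity, so fixing one of them pins down the other.

```python
from typing import Callable, Dict, Sequence, Tuple

Seq = Tuple[str, ...]

# Hypothetical distribution: the two adjectives always agree in polarity.
DIST: Dict[Seq, float] = {
    ("witty", "seductive"): 0.5,
    ("dull", "tedious"): 0.5,
}


def polarity(z: Seq) -> float:
    """Toy classifier: +1 if the first adjective is positive, else -1."""
    return 1.0 if z[0] == "witty" else -1.0


def subset_sensitivity(f: Callable[[Seq], float], x: Seq,
                       block: Sequence[int], dist: Dict[Seq, float]) -> float:
    """Variance of f under `dist`, conditioned on agreeing with x outside `block`."""
    outside = [j for j in range(len(x)) if j not in block]
    neighbors = {
        z: p for z, p in dist.items()
        if len(z) == len(x) and all(z[j] == x[j] for j in outside)
    }
    mass = sum(neighbors.values())
    mean = sum(p * f(z) for z, p in neighbors.items()) / mass
    return sum(p * (f(z) - mean) ** 2 for z, p in neighbors.items()) / mass


x = ("witty", "seductive")
single_word = [subset_sensitivity(polarity, x, [i], DIST) for i in range(len(x))]
whole_phrase = subset_sensitivity(polarity, x, [0, 1], DIST)

print(single_word, whole_phrase)
```

Each single-word subset has zero sensitivity because the context leaves only one plausible completion, whereas replacing the whole phrase flips the label half of the time.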
3 Sensitivity Bounds for NLP Methods
Many statistical NLP methods proposed over
the past decades involve linear combinations of
features that look at individual words or groups
of a few words. Proposition 1 shows that such
methods can only express functions of bounded
block sensitivity, with an upper bound quadratic
in the number k of inputs the model looks at
simultaneously, independently of input length n.
Proposition 1. Let f be any function Σ∗ → R
parameterized as follows:
\[ f(x) := h\left( \frac{1}{n} \sum_{i=1}^{n-k} f_{i,n}(x_i, \dots, x_{i+k}) \right), \quad (x \in \Sigma^n) \tag{8} \]
where f_{1,n}, . . . , f_{n−k,n} are functions Σ^k →
R^d such that max_{x∈Σ^k} ‖f_{i,n}(x)‖_2 ≤ C, and
h : R^d → R is L-Lipschitz continuous. Then,
independently of input length n, we have
\[ bs(f, x) \leq 2 L^2 C^2 k^2. \tag{9} \]
Proof. Fix a partition P1 ˙∪ . . . ˙∪ Pl = {1, . . . , n}.
Write g(x) for the average inside h(·) in (8).
Changing inputs in Pi affects up to k|Pi| of the
summands in g. The ℓ2 norm of the sum of these
affected terms is bounded by Ck|Pi|/n, and thus
\[ \mathrm{Var}(f \mid X \in x^{\oplus P_i}) = \frac{1}{2} \mathbb{E}_{X, Y \in x^{\oplus P_i}} |f(X) - f(Y)|^2 \leq \frac{L^2}{2} \mathbb{E}_{X, Y \in x^{\oplus P_i}} \|g(X) - g(Y)\|_2^2 \leq 2 \frac{L^2 C^2 k^2 |P_i|^2}{n^2}. \]
Given \sum_{i=1}^{l} |P_i|^2 ≤ (\sum_{i=1}^{l} |P_i|)^2 = n^2, we find
\[ \sum_{i=1}^{l} s(f, x, P_i) \leq 2 \frac{L^2 C^2 k^2}{n^2} \sum_{i=1}^{l} |P_i|^2 \leq 2 L^2 C^2 k^2. \qquad \Box \]
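The bound (9) can be checked numerically on a small instance. The sketch below is an illustration rather than part of the proof, and rests on several assumptions of ours: an averaged-embedding model with k = 1, random unit-norm embeddings (so C = 1), h(v) = tanh(w · v) with ‖w‖ = 1 (so L = 1), a three-symbol alphabet, and a uniform distribution over strings of length 4.

```python
import math
import random
from itertools import product

ALPHABET = ("a", "b", "c")
DIM, C, L, K = 2, 1.0, 1.0, 1  # embedding size, norm bound, Lipschitz constant, window

rng = random.Random(0)


def random_unit_vector(dim):
    """Sample a direction uniformly and normalize it to length 1."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


EMBED = {s: [C * c for c in random_unit_vector(DIM)] for s in ALPHABET}
W = random_unit_vector(DIM)


def f(x):
    """Averaged embeddings followed by a 1-Lipschitz readout."""
    g = [sum(EMBED[s][d] for s in x) / len(x) for d in range(DIM)]
    return math.tanh(sum(w * c for w, c in zip(W, g)))


def set_partitions(items):
    """Yield every partition of `items` into nonempty disjoint blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        for j in range(len(partition)):
            yield partition[:j] + [[first] + partition[j]] + partition[j + 1:]
        yield [[first]] + partition


def subset_sensitivity(x, block):
    """Variance of f when the positions in `block` are resampled uniformly."""
    values = []
    for symbols in product(ALPHABET, repeat=len(block)):
        z = list(x)
        for pos, s in zip(block, symbols):
            z[pos] = s
        values.append(f(tuple(z)))
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def block_sensitivity(x):
    """Maximum total subset sensitivity over all partitions of the positions."""
    return max(
        sum(subset_sensitivity(x, block) for block in partition)
        for partition in set_partitions(list(range(len(x))))
    )


n = 4
worst = max(block_sensitivity(x) for x in product(ALPHABET, repeat=n))
bound = 2 * L ** 2 * C ** 2 * K ** 2
print("largest block sensitivity:", worst, "bound:", bound)
```

In this setting the observed block sensitivity stays below the bound at every input; the bound itself does not depend on n, which is the point of the proposition.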
This result has direct bearing on a wide vari-
ety of methods used in NLP, such as averaging
word embeddings to construct sentence embed-
dings (Wieting et al., 2016; Arora et al., 2017;
Ethayarajh, 2018), CNNs (Kim, 2014) with aver-
age pooling, and log-linear models with n-gram
features. The parameter k equals 1 for models
averaging word embeddings, the kernel width for
CNNs with average pooling, and n for models
using n-gram features. C describes the norm of
word embeddings, of the output of a CNN kernel,
or of the weights of a linear model. Lipschitz
functions h include the sigmoid function σ used in
Figure 2: LSTMs are biased towards low-sensitivity functions. (1) Left: Distribution of sensitivity of Boolean
functions defined by randomly initialized LSTMs (green and blue) and by the uniform distribution (red) over
functions f : {−1, 1}^7 → {−1, 1}. (2) Right: Losses for an LSTM (128 hidden units) fitting random functions
f : {−1, 1}^N → R (N = 7, 10, 15) with given sensitivities, after 10^2, 10^3, 10^4, 10^5 iterations of training.
logistic regression and its generalization softmax,
which are 1-Lipschitz, and feedforward networks
with Lipschitz activations.
RNNs and LSTMs (Hochreiter and Schmidhuber,
1997) can express functions of any sensitivity,
such as fParity, because they can express all regular
languages (Horne and Hush, 1994). On the other
hand, transformers (Vaswani et al., 2017) have
asymptotically bounded sensitivity as the input
length n increases (Hahn, 2020, Lemma 5).
We show that even LSTMs have a learning bias
towards low-sensitivity functions, despite their
theoretical capacity to represent high-sensitivity
functions. We consider functions f : {−1, 1}^n →
R where inputs are uniformly distributed over
{−1, 1}^n. We first evaluated average block
sensitivity both for randomly initialized LSTMs
and for the uniform distribution over Boolean
functions {−1, 1}^7 → {−1, 1}. We constructed
Boolean functions from a randomly initialized
LSTM by obtaining a scalar output and making
this a binary output f based on a threshold
chosen to maximize Var(f). We initialized the
LSTM’s weights uniformly from [−d^{−0.5}, d^{−0.5}]
or from a Gaussian with σ^2 = d^{−1}, where d is
the number of hidden units. Results are shown
in Figure 2, for d = 128 and d = 256. Random
Boolean functions have block sensitivity tightly
concentrated around 4.5, whereas the randomly
initialized LSTMs consistently show lower block
sensitivity. This suggests that low-sensitivity
functions are ‘overrepresented’ in the LSTM
parameter space, echoing a theoretical result for
feedforward networks (Palma et al., 2019).
Second, we directly examined learnability on
functions of different sensitivities. As randomly
chosen functions have tightly clustered sensitivity,
we sampled3 functions with a specific targeted
average sensitivity as(f) = \frac{1}{2^n} \sum_{x ∈ {−1,1}^n} s(f, x).
We did this for sequence lengths n = 7, 10, 15.
For each i = 1, . . . , n, we constructed five such
functions, and then trained an LSTM (128 hidden
units) for 10^5 iterations with Adam (learning rate
0.003, batch size 32), and recorded average mean
squared error after 10^2, 10^3, 10^4, 10^5 training
iterations. Training batches and test examples
are sampled uniformly from the 2^n elements of
{−1, 1}^n, without consideration of out-of-sample
generalization. Results are shown in Figure 2.
For n = 7, we arrange functions by average block sensitivity \widetilde{bs}(f); for
n = 10, 15 we take as(f) instead, as it can be computed
efficiently and is strongly correlated with
\widetilde{bs}(f) at n = 7 (R = 0.95). Low-sensitivity functions
are learned perfectly with fewer iterations,
whereas high-sensitivity functions are not approximated
much better than chance even after 10^5
training iterations. We note that this is a result on
the ability to simply fit a function of 2^n inputs, not
the (harder) task of generalizing to unseen input.
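The link between Fourier degree and average sensitivity can be illustrated on small inputs. The sketch below is one hypothetical way to obtain real-valued functions with a targeted average sensitivity and is not claimed to be the sampling procedure used for the experiments: it places all Fourier weight on monomials of exactly one degree d (a simplification of our own) and normalizes the coefficients to unit squared norm, in which case the average sensitivity, computed with the variance-based Definition (2), equals d. The resulting functions are not restricted to the range [−1, 1].

```python
import math
import random
from itertools import combinations, product


def monomial(x, subset):
    """Product of the bits of x indexed by `subset` (a Fourier character)."""
    out = 1
    for i in subset:
        out *= x[i]
    return out


def sample_degree_d_function(n, d, rng):
    """Random unit-norm combination of all degree-d monomials on n bits."""
    subsets = list(combinations(range(n), d))
    coeffs = [rng.gauss(0.0, 1.0) for _ in subsets]
    norm = math.sqrt(sum(c * c for c in coeffs))
    coeffs = [c / norm for c in coeffs]

    def f(x):
        return sum(c * monomial(x, s) for c, s in zip(coeffs, subsets))

    return f


def average_sensitivity(f, n):
    """as(f): mean over uniform x of the summed per-bit variances of Eq. (2)."""
    inputs = list(product((-1, 1), repeat=n))
    total = 0.0
    for x in inputs:
        for i in range(n):
            flipped = x[:i] + (-x[i],) + x[i + 1:]
            # Variance of f over the two equally likely values of bit i.
            total += ((f(x) - f(flipped)) / 2.0) ** 2
    return total / len(inputs)


n = 6
rng = random.Random(0)
results = {
    d: average_sensitivity(sample_degree_d_function(n, d, rng), n)
    for d in range(1, n + 1)
}
for d, value in results.items():
    print(d, round(value, 6))
```

The brute-force average visits all 2^n inputs, so this check is feasible only for small n; spreading weight over neighboring degrees would give values near, rather than exactly at, the target.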
4 Sensitivity and Difficulty of NLP Tasks
In Section 3, we provided evidence that sensi-
tivity describes how hard a function is to learn
and represent for simple machine learning archi-
tectures that do not include pretrained contextual
embeddings. In this section, we argue empirically
that sensitivity is successful at capturing intuitive
notions of task difficulty: Low-sensitivity tasks
are those on which simple classifiers as described
in Proposition 1, and vanilla LSTMs without
3For each i = 1, . . . , N, we sampled functions f where
the Fourier spectrum is entirely concentrated on degrees
{i − 1, i, i + 1}. By O’Donnell (2014, Thm. 2.38), as(f) ≈ i.
pretraining, are relatively successful. More chal-
lenging tasks such as those collected in the GLUE
suite (Wang et al., 2019b) have higher sensitivity.
Estimating block sensitivity (5) requires two
ingredients: an estimate of the distributions of
neighboring strings Π(X|X ∈ x⊕P ), and an
estimate of f on this set. We approximate Π via a
language model, and f via a trained model that is
known to attain strong performance on the task.
That is, we estimate the sensitivity of a task f
by measuring the sensitivity of a model f′ that
is known to provide a close fit to f on the task’s
input distribution. In Section 5, we report human
annotation studies that justify this approximation.
Sampling Neighboring Strings For estimating
Π(X|X ∈ x⊕P ), we leverage the ability of XLNet
(Yang et al., 2019) and u-PMLM (Liao et al., 2020)
to model prediction in any order. We use the pre-
trained xlnet-large-cased model provided
in Wolf et al. (2019), and a pretrained u-PMLM
model trained on the 1 Billion Word benchmark
(Chelba et al., 2014). As these models take input
on the level of subword tokenizations, we require
all samples to consist of the same number of
subword symbols as in the span covered by P .
To enable meaningful comparison with traditional
tokenization and with human intuitions, we only
consider subsets P that respect whitespace. We
take 10 samples for each P. For tasks with short
inputs (text classification and CoLA), we finetune
XLNet on the training set to produce completions
more in line with the task-specific input distri-
bution. Due to compute availability, we did not
apply this procedure to other tasks. Finetuning
XLNet slightly increased estimated sensitivity;
as we applied it to those tasks expected to have
low sensitivity, this procedure potentially makes
comparison between tasks more conservative.
Tasks First, we consider four text classifica-
tion tasks: movie review sentiment (MR, Pang
and Lee, 2005), sentence subjectivity (SUBJ,
Pang and Lee, 2004), customer reviews sen-
timent (CR, Hu and Liu, 2004), and opinion
polarity (MPQA, Wiebe et al., 2005). On these
tasks, low-sensitivity models such as CNNs
are known to achieve good performance (Kim,
2014). To approximate the functions f , we fine-
tune roberta.large.mnli using fairseq
for each of the tasks using a single set of
hyperparameters.
Second, we selected all tasks of the GLUE
challenge suite (Wang et al., 2019b), designed
to require a good amount of nontrivial language
understanding. GLUE contains inference, similarity,
and paraphrase tasks (MNLI, Williams et al.
(2018); MRPC, Dolan and Brockett (2005); QNLI,
Rajpurkar et al. (2016); QQP; STS-B, Cer et al.
(2017); RTE, Dagan et al. (2009)), an NLI version
of the Winograd schema challenge (Levesque
et al., 2012), linguistic acceptability judgments
(CoLA, Warstadt et al., 2019), and the Stanford
sentiment treebank (SST-2, Socher et al., 2013).
On many of these tasks, simple BOW baselines
perform essentially at chance (Wang et al., 2019b).
We obtain predictions by finetuning RoBERTa
(roberta.large.mnli) using fairseq
(Ott et al., 2019) with the provided hyperparameters.4
RoBERTa provides performance close
to or exceeding estimated human performance
on all GLUE tasks. For the Winograd schema
challenge, we took the WSC version from SuperGLUE
(Wang et al., 2019a) instead of the NLI
reformulation (WNLI) used in GLUE; we used
the pretrained model roberta.large.wsc.
Unlike WNLI, WSC is a single-span task, reducing
the number of subsets P considered.
Third, we considered sequence classification
formulations of POS tagging and syntactic parsing.
For 150 dev sentences in the English Web
Treebank (Silveira et al., 2014), we considered the
word at the median position of the sentence, and
estimated sensitivity of identifying (1) its POS
tag in the universal tagset (Petrov et al., 2012),
(2) its Universal Dependencies label (Nivre et al.,
2016), and (3) the relative position of its head, as
an integer. All three tasks are formalized as multi-
class classification problems. We estimated all
three computations using the pretrained English
dependency parser provided in Stanza (Qi et al.,
2018; Qi et al., 2020).
Fourth, we considered two datasets probing syn-
tactic knowledge of anaphor licensing (Marvin
and Linzen, 2018; Hu et al., 2020), namely, tasks
248 and 260 in SyntaxGym (Gauthier et al.,
2020). These tasks ask a model to choose a singular
(himself) or plural (themselves) reflexive
after a context where only one is grammatical, but
identifying the right reflexive requires syntactic
4https://github.com/pytorch/fairseq/blob
/master/examples/roberta/README.glue.md,
retrieved June 1, 2020.
knowledge. We modeled f using the medium-sized
GPT2 model (Radford et al., 2019). We
chose this task because it could be formalized
as a binary classification problem, and because
GPT2 performed better on this task than on the
more familiar subject-verb agreement (and on the
feminine version with herself ).
For each task, we estimated sensitivity for at
least 150 dev examples, determined by compute
availability. For the syntactic tasks, we estimated
sensitivity on the full dataset, as language models
are evaluated on these tasks without finetuning.
We considered continuous predictions in
[−1, 1] for binary classification tasks, and in
[−1, 1]^d for multiclass tasks with d classes,
obtained from the sigmoid or softmax layer of the
relevant models. For STS-B, we rescale continuous
similarity scores to [−1, 1]. For parsing and
WSC, we used the discrete output labels provided
by the pretrained models, represented as one-hot
vectors ∈ {−1, 1}^d or binary labels ∈ {−1, 1}.
For multivariate output f(x) ∈ [−1, 1]^d, we define
s(f, x, P) by computing it for each of the d coordinates
of f(x), and taking the maximum value over
these. The resulting sensitivity estimates describe
the behavior of the coordinate of f that has the
most nonlinear decision boundary around x.
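The coordinate-wise maximum for multivariate outputs can be sketched as follows. This is a minimal illustration: the one-hot output vectors are invented, and the equal weighting of samples with a population variance is our assumption.

```python
from typing import List, Sequence


def coordinate_variances(outputs: Sequence[Sequence[float]]) -> List[float]:
    """Population variance of each output coordinate across the sampled neighbors."""
    num_samples = len(outputs)
    num_classes = len(outputs[0])
    variances = []
    for c in range(num_classes):
        column = [o[c] for o in outputs]
        mean = sum(column) / num_samples
        variances.append(sum((v - mean) ** 2 for v in column) / num_samples)
    return variances


def multiclass_subset_sensitivity(outputs: Sequence[Sequence[float]]) -> float:
    """Subset sensitivity for vector-valued f: the largest per-coordinate variance."""
    return max(coordinate_variances(outputs))


# Hypothetical one-hot predictions in {-1, 1}^3 for four sampled neighbors:
# three keep class 0 and one switches to class 1; class 2 is never predicted.
sampled_outputs = [
    [1, -1, -1],
    [1, -1, -1],
    [-1, 1, -1],
    [1, -1, -1],
]
per_class = coordinate_variances(sampled_outputs)
sens = multiclass_subset_sensitivity(sampled_outputs)
print(per_class, sens)
```

Taking the maximum rather than the sum keeps the estimate on the same scale as the binary case, where a single coordinate is available.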
Lower Bound Approximation Calculating
block sensitivity (5) requires calculating the vari-
ance for each of the exponentially many subparts
P of the input, intractable for all but short inputs.
We restrict consideration to a polynomial number
of subparts, thus obtaining a lower bound on full
block sensitivity. We only consider (1) subsets
of 1, . . . , 8 adjacent tokens, and (2) unions of sets
{x_{in/7}, . . . , x_{(i+1)n/7−1}} for i = 1, . . . , 7. For the
parsing tasks, we additionally consider all subsets
in a window of 7 tokens around the relevant word.
This bounds the number of subsets by 8n + 256,
compared to 2^n for full block sensitivity.
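The restricted family of candidate subsets can be enumerated as in the following sketch. It is an illustration under assumptions of ours: positions are 0-indexed, segment boundaries are obtained by integer division, and the number of equal-length segments is exposed as a parameter because the stated count of 256 equals 2^8, which suggests eight segments, while the index range is left as printed in the text.

```python
from itertools import combinations
from typing import List, Tuple

Subset = Tuple[int, ...]


def contiguous_spans(n: int, max_len: int = 8) -> List[Subset]:
    """All runs of 1..max_len adjacent positions in a sequence of length n."""
    return [
        tuple(range(start, start + length))
        for length in range(1, max_len + 1)
        for start in range(0, n - length + 1)
    ]


def segment_unions(n: int, num_segments: int) -> List[Subset]:
    """All nonempty unions of `num_segments` roughly equal consecutive segments."""
    bounds = [(i * n) // num_segments for i in range(num_segments + 1)]
    segments = [tuple(range(bounds[i], bounds[i + 1])) for i in range(num_segments)]
    segments = [s for s in segments if s]  # drop empty segments on short inputs
    unions = []
    for r in range(1, len(segments) + 1):
        for combo in combinations(segments, r):
            unions.append(tuple(pos for seg in combo for pos in seg))
    return unions


n = 20
spans = contiguous_spans(n)
unions = segment_unions(n, num_segments=8)  # assumed segment count, see lead-in
candidates = set(spans) | set(unions)

print(len(spans), len(unions), len(candidates), 8 * n + 256)
```

For a 20-token input the candidate family stays within the 8n + 256 budget, compared with 2^20 subsets for the unrestricted maximization. How the candidates are then combined into disjoint collections is not specified here and is left out of the sketch.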
4.1 Results
Across the 15 tasks, XLNet and u-PMLM yielded
very similar estimates of average block sensitivity
(R = 0.87, p = 7 · 10^{−6}). In Figure 3, we
show block sensitivity across tasks as estimated
by XLNet. The left panels show kernel density
estimates of the distribution over bs(f, x) over
the inputs x from the dev sets. The right panels
show estimated average block sensitivity \widetilde{bs}(f).
Text classification tasks have low estimated block
Figure 3: Block sensitivity: For each task, we provide
a smoothed histogram of the block sensitivity per input
(left), and average block sensitivity (right). Estimates
obtained using XLNet; compare Figure 6 for u-PMLM.
sensitivity, with bs(f, x) being concentrated on
values lower than three. For the two syntactic
tasks, sensitivity is slightly higher; in comparison
to the text classification tasks, the histograms show
that these tasks have no datapoints with very low
sensitivity. For parsing, we see a substantial difference
between POS tagging and relation labeling
on the one hand, and head identification on the
other hand. Identifying tags and relations has
lower sensitivity comparable to text classification
tasks, whereas identifying the relative position of
the head has higher sensitivity. This makes sense:
The relative position of the head is sensitive to
intervening words that, while not changing the
syntactic relation, change the numerical distance
between head and dependent. Finally, for GLUE,
we observe a wide range of sensitivity scores.
SST-2, a sentiment analysis task, has sensitiv-
ity very similar to the (other) text classification
tasks, as do STS-B (semantic similarity) and QQP
(identifying redundant Quora questions). Other
tasks show substantially higher scores; the highest
estimated average block sensitivities are attained
Figure 4: Two inputs from SST-2. The first one has low
block sensitivity (0.93), as our models find only one
sensitive subset P . We show one completion sampled
from x⊕P that flips the label predicted by RoBERTa
from POSITIVE to NEGATIVE. The second input has higher
block sensitivity (1.88), with three disjoint sensitive
subsets. For each subset, we show a completion
sampled using XLNet that flips the predicted label.
by RTE, MRPC, and WSC, three tasks designed
to require nontrivial reasoning.
To provide insight into these results, we show
examples from SST-2 and RTE, with samples
from XLNet. In Figure 4, we show two examples
from SST-2. The first example has low sensitivity,
as our models find only one sensitive subset. On
the second example, our models find three disjoint
sensitive subsets, leading to higher sensitivity. In
Figure 7, we show an example from RTE, con-
sisting of a premise and a hypothesis. The models
identify five highly sensitive subsequences, such
that changing the input on any of these subse-
quences can flip the label from ENTAILMENT to
NOENTAILMENT.
Sensitivity and Sentence Length Sensitivity
might be higher on longer sentences, because
they can be partitioned into more sets P. Does
this explain away the differences between tasks?
Figure 5 shows per-sentence sensitivity (estimated
using XLNet) as a function of sentence length.
The left panel compares sensitivity on simple text
classification tasks and on CoLA, a GLUE task
consisting of short sentences. For the simple text
classification tasks, sensitivity increases sharply
for very short sentences, but then plateaus. For
CoLA, it increases with length. The right panel
shows averaged values bs(f, x) across the tasks
in each of the four categories. Again, sensitivity
increases for GLUE and dependency parsing,
while it plateaus for text classification. The two
Figure 5: Per-example block sensitivity as a function
of sentence length. Left: Comparing text classification
tasks with CoLA, a single-span GLUE task. Right:
Block sensitivity across task groups.
Figure 6: Sensitivity and simple models: Aver-
age block sensitivity as estimated using XLNet
(top) and u-PMLM (bottom) against error reduction
(in % of previously misclassified examples) of a
Bag-of-Embeddings (BoE) model, a vanilla BiLSTM,
and RoBERTa against the majority class baseline on
the dev set.
syntactic tasks consist of short and tightly con-
trolled sentences; in relation to their lengths, their
sensitivities are particularly high.
Average Block Sensitivity and Simple Mod-
els Based on Section 3, we hypothesized that
tasks with low sensitivity correspond to those
for which bag-of-words models can meaning-
fully outperform the majority class baseline, and
those on which vanilla LSTM models do best. In
Figure 6, we plot average block sensitivity against
error reduction (in % of previously misclassified
examples) of a bag-of-embeddings (BoE) model,5
5 This model averages GloVe (Pennington et al., 2014)
embeddings and applies a one-layer MLP to derive a
Figure 7: An example from RTE, consisting of a premise and a hypothesis. In this example, the premise entails
the hypothesis. We show sensitive subsets Pi identified by the models; for each of them, we show one of those
completions created by XLNet that flip the label predicted by RoBERTa from ENTAILMENT to NOENTAILMENT. In
this example, five highly sensitive subsequences (two in the premise and three in the hypothesis) were identified.
a vanilla BiLSTM,6 and RoBERTa against the
majority class baseline, on the development sets.
BoE instantiates the model described in Propo-
sition 1 with k = 1; thus, we expect the top
right of this graph to be empty for BoE: There
can be no high-sensitivity task on which the BoE
model provides strong quantitative performance.
For both BoE and the vanilla BiLSTM, average
sensitivity was negatively associated with error
reduction (XLNet: R = −0.71, p = 0.001 for
BoE; R = −0.82, p = 0.0002 for BiLSTM.
u-PMLM: R = −0.66, p = 0.005 for BoE;
R = −0.76, p = 0.002 for BiLSTM), while no
association was observed for RoBERTa (XLNet:
R = −0.05, p = 0.87; u-PMLM: R = −0.07,
p = 0.84). We compared sensitivity as a predictor
with label entropy, which showed little association
with error reduction of either BoE or the vanilla
BiLSTM (both p > 0.1).
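The error reduction plotted in Figure 6 can be made concrete with a minimal sketch, assuming accuracies are given as fractions in [0, 1]. The function names `error_reduction` and `pearson_r` and the toy numbers below are illustrative assumptions; they are not values or code from the experiments reported here.

```python
from math import sqrt


def error_reduction(model_accuracy: float, majority_accuracy: float) -> float:
    """Percentage of the majority-class baseline's errors removed by the model."""
    baseline_error = 1.0 - majority_accuracy
    if baseline_error <= 0.0:
        raise ValueError("majority baseline makes no errors; reduction is undefined")
    model_error = 1.0 - model_accuracy
    return 100.0 * (baseline_error - model_error) / baseline_error


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical tasks: (average block sensitivity, model acc., majority acc.)
    toy_tasks = [(1.2, 0.85, 0.50), (1.8, 0.70, 0.50), (3.0, 0.55, 0.50)]
    sensitivities = [s for s, _, _ in toy_tasks]
    reductions = [error_reduction(acc, maj) for _, acc, maj in toy_tasks]
    print(reductions, pearson_r(sensitivities, reductions))
```

A negative correlation on such data corresponds to the pattern described above: the higher the task's sensitivity, the smaller the share of baseline errors a simple model removes.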
Which Inputs Have High Sensitivity? We
used the Stanford Sentiment Treebank (SST-2,
Socher et al., 2013) to investigate which inputs
have high sensitivity in sentiment classification.
We extracted the 445 dev inputs for which we
had estimated sensitivity (determined by com-
pute availability). The dataset contains syntactic
parses, with human sentiment annotation for each
constituent. We hypothesized that inputs have
high sensitivity when different constituents have
different sentiment. We focus on estimates from
prediction. This model is called CBOW in Wang et al.
(2019b); however, we apply BoE to concatenated spans in
the case of multi-span tasks, in line with the definition of
sensitivity.
6The syntax tasks have no training sets and we thus do
not report BiLSTM results; we deduced necessarily at-chance
performance for BoE from the design of the task. We excluded
STS-B because it cannot be evaluated with accuracy.
Figure 8: Left: Block sensitivity and dispersion
(see text) of sentiment labels of constituents. Right:
Accuracy as a function of sensitivity in sentiment
analysis.
XLNet for simplicity; results from u-PMLM are
qualitatively identical. We measured the disper-
sion of sentiment
labels over constituents by
enumerating positive (+1) and negative (−1)
labels of all constituents, and computing the stan-
dard deviation of this resulting distribution; this is
1 if as many constituents have positive sentiment
as there are constituents with negative sentiment.
Figure 8 (left) shows this dispersion measure as a
function of sensitivity. High-sensitivity examples
have higher dispersion. In a linear regression
with dispersion and sentence length as predictors
of sensitivity, dispersion was highly significant
(β = 0.53, p < 1.95 · 10^−10), while length was
not (β = −0.00, p = 0.49). This is illustrated
by the examples in Figure 4 discussed above,
where dispersion correlates with sensitivity: The
first example has low block sensitivity (0.93) and
low label dispersion (0.0); the sentence is labeled
positive and no constituent is labeled negative.
The second example has higher block sensitivity
(1.88) and very high label dispersion (0.94): while
the sentence is labeled positive, three constituents
are labeled positive and five negative.
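The dispersion measure described above can be sketched as follows. This is a minimal illustration that assumes the population standard deviation over +1/−1 constituent labels; the function name `label_dispersion` is hypothetical, and the exact numerical values depend on which constituents are counted and on the normalization used, so the sketch is not expected to reproduce the reported figures digit for digit.

```python
from statistics import pstdev


def label_dispersion(constituent_labels):
    """Population standard deviation of +1 / -1 constituent sentiment labels.

    Returns 0.0 when all constituents share one polarity and 1.0 when
    positive and negative constituents are exactly balanced.
    """
    if not constituent_labels:
        raise ValueError("need at least one labeled constituent")
    if any(label not in (1, -1) for label in constituent_labels):
        raise ValueError("labels must be +1 (positive) or -1 (negative)")
    return pstdev(constituent_labels)


if __name__ == "__main__":
    print(label_dispersion([1, 1, 1]))        # uniform polarity
    print(label_dispersion([1, 1, -1, -1]))   # balanced polarity
```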
Second, we hypothesized that a BoE clas-
sifier and a vanilla LSTM perform better on
low-sensitivity examples, whereas RoBERTa
should provide better performance also on
higher-sensitivity examples. This is confirmed
by Figure 8 (right), where we show the accuracy
of BoE, BiLSTM, and RoBERTa as a function
of sensitivity. In a logistic regression with sen-
sitivity and sentence length as predictors of BoE
accuracy, sensitivity was again highly significant
(β = −1.22, p = 4.1 · 10^−10). Findings were simi-
lar for the BiLSTM (β = −1.16, p = 1.41 · 10^−9).
When predicting the accuracy of RoBERTa,
there was still a measurable effect of sensitivity
(β = −1.37, p = 1.6 · 10^−5), but overall Figure 8
shows that RoBERTa provides more accurate
predictions on higher-sensitivity input. Sentence
length was not a significant predictor for accuracy
of any of the three models (all p > 0.05).
If we choose s(f, x) instead of bs(f, x), namely,
restricting to singletons P, there is still a signif-
icant effect of s(f, x) on BoE accuracy (β =
−1.24, p = 1.1 · 10^−6), but with inferior model fit
compared to bs(f, x) (ΔDeviance = 20.0), con-
firming block sensitivity as the more appropriate
difficulty measure for simple NLP models.
Role of Task Model We have estimated sensi-
tivity of GLUE and text classification tasks using
a large pretrained transformer model (RoBERTa).
What would happen if we used a model outside
of the family of massive pretrained contex-
tual embeddings? To answer this, we estimated
bs(f, x) on SST-2 and RTE using the vanilla
BiLSTM to represent f. On SST-2, sensitivity
estimated with the BiLSTM correlated with sen-
sitivity estimated with RoBERTa on those inputs
where the BiLSTM provides correct predictions
(R = 0.36, p = 2 · 10^−11), but not on those (typi-
cally higher-sensitivity ones) where its predictions
are incorrect (R = 0.15, p = 0.21); a linear
regression confirmed that RoBERTa’s sensitivity
was more predictive of the BiLSTM’s sensitivity
in those cases that the LSTM labeled correctly
(β = 0.2, p = 0.004). On RTE (where the BiL-
STM’s accuracy is at chance), the BiLSTM’s sen-
sitivity was at a constant low value (≈ 0.5) for all
inputs. This illustrates that automatic estimation
of sensitivity requires a strong model that is able
to achieve the sensitivity levels required by a task.
Role of Lower Bound Approximation We
evaluated the role of the lower bound approxi-
mation on 20 inputs from SST-2 of between 8
and 11 words each, long enough to make the
approximation inexact but still allowing consid-
eration of all 2^n subsets. We compared estimates
of bs(f, x) based on the approximation (≤ 216
subsets) and the full power set (≤ 2^11 = 2048 sub-
sets). On average, the approximation decreased
estimates of bs(f, x) from 1.59 to 1.35. However,
the two estimates were almost perfectly correlated
(R = 0.95, p < 10^−10). Even when restricting to
singletons P (up to 11 subsets), the correlation
remained high (R = 0.81, p < 0.0001). Thus,
while the approximation may underestimate the
numerical values of bs(f, x),
it preserves the
relative sensitivities of different inputs.
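To illustrate why exhaustive evaluation is only feasible for short inputs, the sketch below enumerates every subset of word positions for an input of n words. It is a hedged illustration only: the helper names `all_subsets` and `singleton_subsets` are assumptions, and the code does not implement the approximation scheme itself.

```python
from itertools import chain, combinations


def all_subsets(n):
    """Yield every subset of positions 0..n-1 (including the empty set) as a tuple."""
    positions = range(n)
    return chain.from_iterable(combinations(positions, k) for k in range(n + 1))


def singleton_subsets(n):
    """Subsets consisting of exactly one position, as used for s(f, x)."""
    return [(i,) for i in range(n)]


if __name__ == "__main__":
    for n in (8, 11, 20):
        # The count doubles with each added word, so n = 11 is still tractable
        # while sentences of typical length quickly are not.
        print(n, sum(1 for _ in all_subsets(n)), len(singleton_subsets(n)))
```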
5 Human Validation
In Section 4, we estimated the sensitivity of NLP
tasks by plugging a model f′ of the task f into
equation 5. This methodology requires that the
model f′ provides good labels on the samples
from x⊕P obtained using the language models. As
the language models only approximate the input
distribution, their samples could fall outside of
the data distribution on which f′ approximates the
true task f at high accuracy. If this were the case,
high estimated sensitivity on tasks such as RTE
might reflect brittleness of large models rather
than true high sensitivity. Here, we show that this
is not the case: Reasoning tasks like RTE have
higher sensitivity than text classification tasks like
SST-2, even when using human labels.
5.1 Experiment 1: Validating Oracle Model
For 60 items from SST-2 and 30 items from
RTE each, we collected the subsets P1, . . . , Pk
achieving the maximum in (5), with 6 samples
from XLNet for each subset (we collected fewer
items from RTE because they typically have more
sensitive subsets Pi, making annotation more
expensive). We then recruited naive participants
who labeled these samples; each sample was
labeled by two or three annotators. In addition
to the appropriate labels (‘‘positive’’ and ‘‘nega-
tive’’ for SST-2, ‘‘entails’’ and ‘‘does not entail’’
for RTE), participants were also provided with a
‘‘makes no sense’’ option. We repeated the study
for SST-2 both with and without finetuning.
The rate of ‘‘makes no sense’’ responses on
SST-2 was 18% without finetuning and 11%
with finetuning; it was 12% on RTE. The agree-
ment between RoBERTa and the modal human
label was 80% (without finetuning) and 85%
(with finetuning) on SST-2, and 72% on RTE;
compared to 87%, 92%, and 79%, respectively,
average agreement between a single annotator
and the modal label. Interannotator agreement is
below the human accuracies reported by Nangia
and Bowman (2019); we note that the creators
of RTE specifically excluded items where human
annotators did not agree (Dagan et al., 2009)
and that SST-2 excludes reviews labeled as
neutral (Socher et al., 2013); we thus expect
lower agreement on other strings from the same
domain.
The key question is whether these levels of
agreement guarantee consistent sensitivity esti-
mates. Figure 9 (left) compares block sensitivity
estimated using RoBERTa with values obtained
by plugging in average human labels for the
function f(·). On both SST-2 and RTE, the values
are strongly correlated (SST-2 with and without
finetuning both R = 0.85; RTE: R = 0.91; all
p < 2.2 · 10^−16). On RTE, human estimates are
numerically lower than automatic estimates, but
the difference in average sensitivity between SST-
2 and RTE was strongly replicated by the human
estimates (β = 1.3, p = 1.3 · 10^−14 in a linear
regression). These results indicate that a strong
model of a task leads to results similar to a human
oracle. In particular, the qualitative difference in
sensitivity between SST-2 and RTE is replicated
when using human labels.
Figure 9: Results of Experiments 1 and 2: Left:
Sensitivity on SST-2 and RTE calculated using
RoBERTa’s labels (x-axis) and using human labels
(y-axis). On both tasks, both versions are highly
correlated (R > 0.8 in both tasks). Right: Results
of Experiment 2: Average number of disjoint subsets
on which participants change inputs to flip the label, as
a function of estimated sensitivity on SST-2 and RTE.
5.2 Experiment 2: Manual Approximation
Experiment 1 showed that human and model
labels yield similar results in estimating sensi-
tivity. However, we still relied on the subsets Pi
generated by the models. Here, we show that sen-
sitivity, both on the level of individual inputs and
on the level of tasks, relates to human intuitions
about the number of disjoint subsequences that
can be changed to flip the label, which can be
easily estimated without any model.
We asked 30 naive individuals to find disjoint
subsets in inputs from SST-2 and RTE such that
changing the words in any one of them would flip
the label. Each participant worked on 30 items
from one of the tasks. They rewrote sentences
by clicking on words they wanted to replace and
entering text replacing those. After submitting a
rewrite, participants had the option of identify-
ing another subset disjoint from the previously
selected words. They changed at least one subset
for every input, and were provided a bonus for
every additional subset, incentivizing them to find
as many disjoint subsequences as possible. For
both SST-2 and RTE, participants were shown an
example with instructions guiding them to change
three disjoint subsequences. For RTE, we only
allowed participants to modify the premise, as
similar changes are often possible in premise and
hypothesis.
We interpreted the number of disjoint sub-
sets found by participants as a proxy for block
sensitivity. This quantity is different from block
sensitivity (5), as it does not weight the relative
probabilities of all possible changes, and we thus
do not expect the same numerical values for both
quantities. An exact human estimate of block
sensitivity would rely on asking humans both to
create multiple samples from x⊕P for different
subsets P and to then label these, infeasible given
the large number of possible subsets of each input.
In contrast, the task described here only requires
annotation from a few annotators for every input.
Figure 9 (right) shows the average number of
changes made on each input, as a function of the
sensitivity estimated by XLNet+RoBERTa. We
conducted a mixed-effects Poisson regression of
the number of changes made on the inputs, with
random effects for items and subjects. Sensitivity
predicted the number of changes (β = 0.061,
SE = 0.02, p = 0.0023), and there were overall
more changes for RTE than for SST-2 (β = 0.39,
SE = 0.097, p = 6 · 10^−5). Input length was
not predictive (β = −0.0015, SE = 0.002,
p = 0.32). This result shows that a fully manual
annotation task can approximate differences in
sensitivity both between inputs (effect of sensi-
tivity) and between tasks (effects of the contrast
between RTE and SST-2).
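The manual proxy used in Experiment 2 amounts to counting pairwise disjoint sets of word positions per input and averaging over participants. A minimal sketch under stated assumptions: rewrites are represented as sets of word indices, and the names `count_disjoint_rewrites` and `mean_changes` are hypothetical helpers, not the analysis code used for the study.

```python
def count_disjoint_rewrites(selected_subsets):
    """Count the rewritten subsets for one participant and one input.

    Raises ValueError if a subset is empty or overlaps an earlier one,
    since only disjoint subsets count toward the proxy.
    """
    seen_positions = set()
    for subset in selected_subsets:
        positions = set(subset)
        if not positions:
            raise ValueError("each rewrite must cover at least one word")
        if seen_positions & positions:
            raise ValueError("rewrites must be pairwise disjoint")
        seen_positions |= positions
    return len(selected_subsets)


def mean_changes(responses_for_item):
    """Average number of disjoint rewrites across participants for one input."""
    counts = [count_disjoint_rewrites(r) for r in responses_for_item]
    return sum(counts) / len(counts)


if __name__ == "__main__":
    # Two hypothetical participants working on the same input.
    responses = [[{0, 1}, {4}, {7, 8}], [{2}]]
    print(mean_changes(responses))
```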
6 Discussion
We have proposed sensitivity as a theoreti-
cal framework for studying the complexity of
sequence classification tasks, arguing that it cap-
tures complexity both across several machine
learning architectures (Section 3) and across NLP
tasks (Section 4).
Prior work has studied the ability of RNNs
and transformers to represent and learn languages
in different classes of the Chomsky hierarchy
(e.g., Merrill, 2019). Sensitivity is orthogonal to
the Chomsky hierarchy: The maximally sensitive
function fParity has a two-state finite automaton,
but there are also low-sensitivity functions that are
not even computable. Sensitivity is also distinct
from Kolmogorov complexity and similar descrip-
tion length measures (Li and Vitányi, 1993):
fParity has high sensitivity but very low descrip-
tion length. Whereas Kolmogorov complexity
is uncomputable and can only be approxi-
mated asymptotically, sensitivity can be calculated
for individual inputs, enabling us to explicitly
evaluate it as a predictor of difficulty on NLP tasks.
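The contrast between parity and a low-sensitivity function can be checked by brute force on short bit strings. The sketch below uses the classical notion of sensitivity under the uniform distribution over bit strings (the number of single-position flips that change the output); it is an illustration of that classical notion, not of the distribution-weighted block sensitivity defined in this paper, and the function names are assumptions.

```python
from itertools import product


def parity(bits):
    """1 if the number of ones is odd, else 0; computable by a two-state automaton."""
    return sum(bits) % 2


def first_bit(bits):
    """A low-sensitivity function: the output depends on a single position."""
    return bits[0]


def sensitivity_at(f, x):
    """Number of positions whose individual flip changes f(x)."""
    count = 0
    for i in range(len(x)):
        flipped = x[:i] + (1 - x[i],) + x[i + 1:]
        if f(flipped) != f(x):
            count += 1
    return count


def average_sensitivity(f, n):
    """Mean of sensitivity_at over all 2**n bit strings of length n."""
    inputs = list(product((0, 1), repeat=n))
    return sum(sensitivity_at(f, x) for x in inputs) / len(inputs)


if __name__ == "__main__":
    n = 6
    print(average_sensitivity(parity, n), average_sensitivity(first_bit, n))
```

For parity, every single flip changes the output, so the average equals the input length, whereas the single-position function stays at 1 regardless of length.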
Implications for NLP Practice Our results in
Section 4 suggest that pretrained contextualized
embeddings have been so successful
in NLP
because they make it possible to learn high-
sensitivity functions with modest amounts of
task-specific training data. We conjecture that,
through large-scale pretraining, models implicitly
learn high-sensitivity operations that are generally
useful for language understanding. Finetuning
such models for classification tasks (Howard and
Ruder, 2018; Peters et al., 2018; Devlin et al.,
2019) amounts to composing a high-sensitivity
model with a low-sensitivity classifier. Some
classical techniques can also be interpreted in this
light, such as aligning parse trees (a potentially
high-sensitivity computation)
and extracting
features from these alignments that then are fed
into an SVM (a low-sensitivity classifier) as an
approach to tasks like RTE (Dagan et al., 2009).
Decision Boundaries in NLP The decision
boundaries of NLP models are commonly stud-
ied to understand their
linguistic knowledge
(e.g., Linzen et al., 2016; Marvin and Linzen,
2018; Futrell et al., 2019; Jeretic et al., 2020).
Kaushik et al. (2020) and Gardner et al. (2020)
propose to improve NLP models and their eval-
uation by specifically considering input pairs
that differ in some part and in their (true) label.
Datta et al. (2020) propose to quantify the dif-
ficulty of an input by the largest eigenvalue of
the Fisher information matrix of a task model,
finding that it predicts how sensitive classifiers
are to word substitutions.
Sensitivity is different from widely studied phe-
nomena of adversarial brittleness (Szegedy et al.,
2014; Jia and Liang, 2017): The existence of
adversarial examples typically means that natural
examples have some neighbors, possibly outside
of the input distribution, on which model output
changes even though the true label does not. In
contrast, high sensitivity means that there are
many neighboring inputs within the data distribu-
tion on which the true label changes. Sensitivity
may be related to the observation that models often
rely on spurious statistical patterns, such as simple
lexical correlates of the label observed in read-
ing comprehension datasets (e.g., Kaushik and
Lipton, 2018; Gururangan et al., 2018); we expect
that such artifacts decrease task sensitivity as they
make the gold labels correlated with the output of
simple lexical classifiers. Similarly, if the premise
alone is predictive of the label in an entailment
task (Poliak et al., 2018), changing the hypothesis
while staying within the task distribution is less
likely to flip the label, again decreasing sensitivity.
Inductive Biases in Neural Networks There
is empirical and theoretical evidence that the
generalization capabilities of neural networks are
in part due to a bias towards ‘‘simple’’ functions,
with different formal notions of simplicity (e.g.,
Franco, 2006; Palma et al., 2019; Valle-Perez
et al., 2019). A few studies explicitly propose
notions similar to low sensitivity as describing
simplicity (Franco, 2006; Palma et al., 2019;
Novak et al., 2018). Relatedly, empirical work
shows that neural networks learn low frequen-
cies in the Fourier spectrum of functions first
(Rahaman et al., 2019; Xu et al., 2019; Cao et al.,
2019). As low average sensitivity corresponds
to concentration of Fourier spectrum on low
frequencies (O’Donnell, 2014, Prop. 3.2), this can
be understood as a bias towards low sensitivity.
One aspect distinguishing our results here from
these prior studies is that we measure sensitivity
of realistic functions arising as NLP tasks and on
distributions reflecting the nontrivial statistics of
natural language. Measuring sensitivity or Fourier
spectra on other machine learning tasks is an
interesting problem for future research.
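The link between average sensitivity and the Fourier spectrum can be verified numerically on small Boolean functions. The sketch below assumes the standard setting of functions from {−1, +1}^n to {−1, +1} under the uniform distribution, where the average sensitivity equals the sum over subsets S of |S| times the squared Fourier coefficient of S. The function names are illustrative, and the brute-force enumeration is only practical for very small n.

```python
from itertools import combinations, product
from math import prod


def fourier_coefficient(f, n, subset):
    """E_x[f(x) * prod_{i in subset} x_i] under the uniform distribution."""
    total = sum(f(x) * prod(x[i] for i in subset)
                for x in product((-1, 1), repeat=n))
    return total / 2 ** n


def spectral_average_sensitivity(f, n):
    """Sum over nonempty subsets S of |S| * (Fourier coefficient of S) ** 2."""
    total = 0.0
    for k in range(1, n + 1):
        for subset in combinations(range(n), k):
            total += k * fourier_coefficient(f, n, subset) ** 2
    return total


def direct_average_sensitivity(f, n):
    """Average number of single-coordinate flips that change the output."""
    changes = 0
    for x in product((-1, 1), repeat=n):
        for i in range(n):
            flipped = x[:i] + (-x[i],) + x[i + 1:]
            if f(flipped) != f(x):
                changes += 1
    return changes / 2 ** n


def majority3(x):
    return 1 if sum(x) > 0 else -1


def parity3(x):
    return prod(x)


if __name__ == "__main__":
    for name, f in (("majority", majority3), ("parity", parity3)):
        print(name, direct_average_sensitivity(f, 3),
              spectral_average_sensitivity(f, 3))
```

In this toy setting, parity places all of its Fourier weight on the highest-degree coefficient, while majority keeps most of its weight on degree-one coefficients, which is the sense in which low average sensitivity goes with low-frequency concentration.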
7 Conclusion
We proposed block sensitivity as a complex-
ity measure for functions from sequences to
labels, applying the measure to quantify the com-
plexity of sequence classification tasks in NLP.
Block sensitivity generalizes well-understood
complexity measures from the theory of Boolean
functions to the setting of natural language. We
showed both theoretically and empirically that
low sensitivity characterizes tasks on which sim-
ple models without massive pretraining provide
reasonable performance, and that, in such tasks,
more difficult inputs correspond to those with high
sensitivity. Our results show that pretrained con-
textual embeddings enable models to learn tasks
with higher sensitivity, and suggest designing
challenging tasks by maximizing sensitivity.
Acknowledgments
We thank Judith Degen, Kawin Ethayarajh, Mike
Frank, Noah Goodman, and the members of the
Stanford NLP group for helpful discussion and
feedback. We also thank the anonymous TACL
reviewers for their insightful feedback that helped
improve the paper. We are also grateful to Yi
Liao for providing code and models for u-PMLM.
This work was supported by NSF grant #1947307
to RF.
References
Sanjeev Arora, Yingyu Liang, and Tengyu Ma.
2017. A simple but tough-to-beat baseline
for sentence embeddings. In ICLR 2017:
International Conference on Learning Repre-
sentations 2017.
A. Bernasconi. 1996. Sensitivity vs. block sen-
sitivity (an average-case study). Information
Processing Letters, 59(3):151–157.
https://doi.org/10.1016/0020-0190(96)00105-6
Yuan Cao, Zhiying Fang, Yue Wu, Ding-Xuan
Zhou, and Quanquan Gu. 2019. Towards
understanding the spectral bias of deep learning.
arXiv preprint arXiv:1912.01198.
Daniel M. Cer, Mona T. Diab, Eneko Agirre,
Iñigo Lopez-Gazpio, and Lucia Specia. 2017.
Semeval-2017 task 1: Semantic textual
similarity multilingual and crosslingual
focused evaluation. In Proceedings of the
11th International Workshop on Semantic
Evaluation (SemEval-2017), pages 1–14.
Ciprian Chelba, Tomas Mikolov, Mike Schuster,
Qi Ge, Thorsten Brants, Phillipp Koehn, and
Tony Robinson. 2014. One billion word bench-
mark for measuring progress in statistical
language modeling. In INTERSPEECH,
pages 2635–2639.
Noam Chomsky. 1956. Three models for the
description of language. IEEE Transactions on
Information Theory, 2(3):113–124.
https://doi.org/10.1109/TIT.1956.1056813
Ido Dagan, Bill Dolan, Bernardo Magnini, and
Dan Roth. 2009. Recognizing textual entail-
ment: Rational, evaluation and approaches. Nat-
ural Language Engineering, 15(4).
https://doi.org/10.1017/S1351324909990209
Debajyoti Datta, Shashwat Kumar, Laura E.
Barnes, and Tom Fletcher. 2020. Geometry
matters: Exploring language examples at the
decision boundary. CoRR, abs/2010.07212.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In NAACL-HLT 2019: Annual
Conference of the North American Chapter of
the Association for Computational Linguistics,
pages 4171–4186.
William B. Dolan and Chris Brockett. 2005.
Automatically constructing a corpus of sen-
tential paraphrases. In Proceedings of the
Third International Workshop on Paraphras-
ing, IWP@IJCNLP 2005, Jeju Island, Korea,
October 2005, 2005. Asian Federation of Nat-
ural Language Processing.
Kawin Ethayarajh. 2018. Unsupervised random
walk sentence embeddings: A strong but simple
baseline. In Proceedings of The Third Work-
shop on Representation Learning for NLP,
pages 91–100.
https://doi.org/10.18653/v1/W18-3012
Leonardo Franco. 2006. Generalization ability
of boolean functions implemented in feedfor-
ward neural networks. Neurocomputing, 70(1):
351–361.
https://doi.org/10.1016/j.neucom.2006.01.025
Richard Futrell, Ethan Wilcox, Takashi Morita,
Peng Qian, Miguel Ballesteros, and Roger
Levy. 2019. Neural language models as psycho-
linguistic subjects: Representations of syntactic
state. In Proceedings of the 2019 Conference
of the North American Chapter of the Asso-
ciation for Computational Linguistics: Human
Language Technologies, Volume 1 (Long and
Short Papers), pages 32–42, Minneapolis,
Minnesota. Association for Computational
Linguistics.
https://doi.org/10.18653/v1/N19-1004
Matt Gardner, Yoav Artzi, Victoria Basmova,
Jonathan Berant, Ben Bogin, Sihao Chen,
Pradeep Dasigi, Dheeru Dua, Yanai Elazar,
Ananth Gottumukkala, Nitish Gupta, Hanna
Hajishirzi, Gabriel Ilharco, Daniel Khashabi,
Kevin Lin, Jiangming Liu, Nelson F. Liu,
Phoebe Mulcaire, Qiang Ning, Sameer Singh,
Noah A. Smith, Sanjay Subramanian, Reut
Tsarfaty, Eric Wallace, Ally Zhang, and Ben
Zhou. 2020. Evaluating NLP models via
contrast sets. In Findings of the Association
for Computational Linguistics: EMNLP 2020,
pages 1307–1323, Online. Association for
Computational Linguistics.
Jon Gauthier, Jennifer Hu, Ethan Wilcox, Peng
Qian, and Roger Levy. 2020. SyntaxGym:
An online platform for targeted evaluation
of language models. In Proceedings of
the
Association for Computational Linguistics:
System Demonstrations (ACL 2020).
Edward Gibson. 1998. Linguistic complexity:
Locality of syntactic dependencies. Cognition,
68(1):1–76.
https://doi.org/10.1016/S0010-0277(98)00034-1
Suchin Gururangan, Swabha Swayamdipta, Omer
Levy, Roy Schwartz, Samuel Bowman, and
Noah A. Smith. 2018. Annotation artifacts in
natural language inference data. In NAACL
HLT 2018: 16th Annual Conference of the
North American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, volume 2, pages 107–112.
https://doi.org/10.18653/v1/N18-2017
Michael Hahn. 2020. Theoretical limitations
of self-attention in neural sequence models.
Transactions of the Association for Compu-
tational Linguistics, 8:156–171.
https://doi.org/10.1162/tacl_a_00306
John T. Hale. 2001. A probabilistic Earley parser
as a psycholinguistic model. In Proceedings
of the Second Meeting of the North American
Chapter of the Association for Computational
Linguistics and Language Technologies,
pages 1–8.
https://doi.org/10.3115/1073336.1073357
Pooya Hatami, Raghav Kulkarni, and Denis
Pankratov. 2010. Variations on the sensitivity
conjecture. Theory of Computing, 4:1–27.
Sepp Hochreiter and Jürgen Schmidhuber. 1997.
Long short-term memory. Neural Computation,
9(8):1735–1780.
Bill G. Horne and Don R. Hush. 1994. Bounds
on the complexity of recurrent neural network
implementations of finite state machines. In
Advances in Neural Information Processing
Systems, pages 359–366.
Jeremy Howard and Sebastian Ruder. 2018.
Universal language model fine-tuning for
text classification. In ACL 2018: 56th Annual
Meeting of the Association for Computa-
tional Linguistics, volume 1, pages 328–339.
https://doi.org/10.18653/v1/P18-1031
Jennifer Hu, Sherry Y. Chen, and Roger P. Levy.
2020. A closer look at the performance of
neural language models on reflexive anaphor
licensing. Proceedings of
the Society for
Computation in Linguistics, 3(1):382–392.
Minqing Hu and Bing Liu. 2004. Mining and
summarizing customer reviews. In Proceedings
of
the Tenth ACM SIGKDD International
Conference on Knowledge Discovery and Data
Mining, pages 168–177.
Paloma Jeretic, Alex Warstadt, Suvrat Bhooshan,
and Adina Williams. 2020. Are natural language
inference models IMPPRESsive? Learning IM-
Plicature and PRESupposition. In Proceedings
of the 58th Annual Meeting of the Association
for Computational Linguistics, ACL 2020,
Online, July 5-10, 2020, pages 8690–8705.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020.acl-main.768
Robin Jia and Percy Liang. 2017. Adversarial
examples for evaluating reading comprehen-
sion systems. In Proceedings of the 2017
Conference on Empirical Methods in Natural
Language Processing, pages 2021–2031.
https://doi.org/10.18653/v1/D17-1215
Jeff Kahn, Gil Kalai, and Nathan Linial.
1988. The influence of variables on Boolean
functions. In [Proceedings 1988] 29th Annual
Symposium on Foundations of Computer
Science, pages 68–80.
https://doi.org/10.1109/SFCS.1988.21923
Divyansh Kaushik, Eduard Hovy, and Zachary
Lipton. 2020. Learning the difference that
makes a difference with counterfactually-
augmented data. In ICLR 2020: Eighth
International Conference on Learning Repre-
sentations.
Divyansh Kaushik and Zachary C. Lipton. 2018.
How much reading does reading compre-
hension require? A critical investigation of
popular benchmarks. In EMNLP 2018: 2018
Conference on Empirical Methods in Natural
Language Processing, pages 5010–5015.
https://doi.org/10.18653/v1/D18-1546
Yoon Kim. 2014. Convolutional neural
networks for sentence classification. In
Proceedings of the 2014 Conference on
Empirical Methods in Natural Language
Processing (EMNLP), pages 1746–1751.
https://doi.org/10.3115/v1/D14-1181
Hector J. Levesque, Ernest Davis, and
Leora Morgenstern. 2012. The winograd
schema challenge. In KR’12 Proceedings of
the Thirteenth International Conference on
Principles of Knowledge Representation and
Reasoning, pages 552–561.
Ming Li and Paul Vitányi. 1993. An Introduction
to Kolmogorov Complexity and its Applications,
Springer.
Yi Liao, Xin Jiang, and Qun Liu. 2020.
Probabilistically masked language model
capable of autoregressive generation in
arbitrary word order. In Proceedings of the
58th Annual Meeting of the Association for
Computational Linguistics, pages 263–274.
https://doi.org/10.18653/v1/2020.acl-main.24
Tal Linzen, Emmanuel Dupoux, and Yoav
Goldberg. 2016. Assessing the ability of
LSTMs to learn syntax-sensitive depen-
dencies. Transactions of the Association
for Computational Linguistics, 4:521–535.
https://doi.org/10.1162/tacl_a_00115
Rebecca Marvin and Tal Linzen. 2018. Targeted
syntactic evaluation of language models. arXiv
preprint arXiv:1808.09031.
https://doi.org/10.18653/v1/D18-1151
William Merrill. 2019. Sequential neural networks
as automata. arXiv preprint arXiv:1906.01615.
https://doi.org/10.18653/v1/W19-3901
Marvin Minsky and Seymour A. Papert. 1969.
Perceptrons: An Introduction to Computational
Geometry. The MIT Press.
Nikita Nangia and Samuel R. Bowman. 2019.
Human vs. muppet: A conservative estimate of
human performance on the glue benchmark.
In Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics,
pages 4566–4575.
https://doi.org/10.18653/v1/P19-1449
Noam Nisan. 1991. CREW PRAMs and decision
trees. SIAM Journal on Computing, 20(6):
999–1007. https://doi.org/10.1137/0220062
Joakim Nivre, Marie-Catherine de Marneffe,
Filip Ginter, Yoav Goldberg, Jan Hajic,
Christopher D. Manning, Ryan T. McDonald,
Slav Petrov, Sampo Pyysalo, Natalia Silveira,
Reut Tsarfaty, and Daniel Zeman. 2016.
Universal dependencies v1: A multilingual
treebank collection. In Tenth International
Conference on Language Resources and
Evaluation (LREC 2016).
Roman Novak, Yasaman Bahri, Daniel A.
Abolafia, Jeffrey Pennington, and Jascha Sohl-
Dickstein. 2018. Sensitivity and generalization
in neural networks: an empirical study. In
6th International Conference on Learning
Representations, ICLR 2018, Vancouver, BC,
Canada, April 30 - May 3, 2018, Conference
Track Proceedings. OpenReview.net.
Ryan O’Donnell. 2014. Analysis of Boolean
Functions. Cambridge University Press.
Myle Ott, Sergey Edunov, Alexei Baevski, Angela
Fan, Sam Gross, Nathan Ng, David Grangier,
and Michael Auli. 2019. fairseq: A fast,
extensible toolkit for sequence modeling. In
NAACL-HLT 2019: Annual Conference of the
North American Chapter of the Association for
Computational Linguistics, pages 48–53.
Giacomo De Palma, Bobak Toussi Kiani, and Seth
Lloyd. 2019. Random deep neural networks are
biased towards simple functions. In Advances
in Neural Information Processing Systems 32:
Annual Conference on Neural Information
Processing Systems 2019, NeurIPS 2019,
December 8-14, 2019, Vancouver, BC, Canada,
pages 1962–1974.
Bo Pang and Lillian Lee. 2004. A sentimental
education: Sentiment analysis using subjec-
tivity summarization based on minimum cuts.
In Proceedings of the 42nd Meeting of the
Association for Computational Linguistics
(ACL’04), Main Volume, pages 271–278.
https://doi.org/10.3115/1218955.1218990
Bo Pang and Lillian Lee. 2005. Seeing stars:
Exploiting class relationships for sentiment
categorization with respect to rating
scales. In Proceedings of the 43rd Annual
Meeting of the Association for Computational
Linguistics (ACL’05), pages 115–124.
https://doi.org/10.3115/1219840.1219855
Jeffrey Pennington, Richard Socher, and
Christopher Manning. 2014. GloVe: Global
vectors for word representation. In
Proceedings of the 2014 Conference on
Empirical Methods in Natural Language
Processing (EMNLP), pages 1532–1543.
https://doi.org/10.3115/v1/D14-1162
Matthew E. Peters, Mark Neumann, Mohit Iyyer,
Matt Gardner, Christopher Clark, Kenton
Lee, and Luke Zettlemoyer. 2018. Deep
contextualized word representations. In NAACL
HLT 2018: 16th Annual Conference of the
North American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, volume 1, pages 2227–2237.
https://doi.org/10.18653/v1/N18-1202
Slav Petrov, Dipanjan Das, and Ryan McDonald.
2012. A universal part-of-speech tagset.
In Proceedings of
the Eighth International
Conference on Language Resources and
Evaluation (LREC-2012), pages 2089–2096.
Adam Poliak, Jason Naradowsky, Aparajita
Haldar, Rachel Rudinger, and Benjamin Van
Durme. 2018. Hypothesis only baselines in
natural language inference. In Proceedings
of the Seventh Joint Conference on Lexical
and Computational Semantics, pages 180–191.
https://doi.org/10.18653/v1/S18-2023
Peng Qi, Timothy Dozat, Yuhao Zhang, and
Christopher D. Manning. 2018. Universal de-
pendency parsing from scratch. In Proceedings
of the CoNLL 2018 Shared Task: Multilingual
Parsing from Raw Text to Universal Depen-
dencies, pages 160–170.
Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason
Bolton, and Christopher D. Manning. 2020.
Stanza: A Python natural language process-
ing toolkit for many human languages. In
Proceedings of the 58th Annual Meeting of
the Association for Computational Linguistics:
System Demonstrations, ACL 2020, Online,
July 5-10, 2020, pages 101–108. Association
for Computational Linguistics.
Alec Radford, Jeffrey Wu, Rewon Child, David
Luan, Dario Amodei, and Ilya Sutskever. 2019.
Language models are unsupervised multitask
learners. OpenAI Blog, 1(8):9.
Nasim Rahaman, Aristide Baratin, Devansh Arpit,
Felix Draxler, Min Lin, Fred Hamprecht,
Yoshua Bengio, and Aaron Courville. 2019. On
the spectral bias of neural networks. In ICML
2019: Thirty-sixth International Conference on
Machine Learning, pages 5301–5310.
Pranav Rajpurkar, Jian Zhang, Konstantin
Lopyrev, and Percy Liang. 2016. Squad:
100,000+ questions for machine comprehension
of text. In Proceedings of the 2016
Conference on Empirical Methods in Natural
Language Processing, pages 2383–2392.
https://doi.org/10.18653/v1/D16-1264
Natalia Silveira, Timothy Dozat, Marie-Catherine
de Marneffe, Samuel Bowman, Miriam Connor,
John Bauer, and Chris Manning. 2014. A
gold standard dependency corpus for English.
In Proceedings of the Ninth International
Conference on Language Resources and
Evaluation (LREC’14), pages 2897–2904.
Richard Socher, Alex Perelygin, Jean Wu, Jason
Chuang, Christopher D. Manning, Andrew Ng,
and Christopher Potts. 2013. Recursive deep
models for semantic compositionality over a
sentiment treebank. In Proceedings of the 2013
Conference on Empirical Methods in Natural
Language Processing, pages 1631–1642.
Christian Szegedy, Wojciech Zaremba, Ilya
Sutskever, Joan Bruna, Dumitru Erhan,
Ian J. Goodfellow, and Rob Fergus. 2014.
Intriguing properties of neural networks. In
2nd International Conference on Learning
Representations, ICLR 2014, Banff, AB,
Canada, April 14-16, 2014, Conference Track
Proceedings.
Guillermo Valle-Perez, Chico Q. Camargo, and
Ard Louis. 2019. Deep learning generalizes
because the parameter-function map is biased
towards simple functions. In ICLR 2019:
7th International Conference on Learning
Representations.
Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Łukasz Kaiser, and Illia Polosukhin. 2017.
Attention is all you need. In Advances
in Neural Information Processing Systems,
pages 5998–6008.
Alex Wang, Yada Pruksachatkun, Nikita Nangia,
Amanpreet Singh, Julian Michael, Felix Hill,
Omer Levy, and Samuel R. Bowman. 2019a.
SuperGLUE: A stickier benchmark for general-
purpose language understanding systems. In
Advances in Neural Information Processing
Systems, pages 3266–3280.
Alex Wang, Amanpreet Singh, Julian Michael,
Felix Hill, Omer Levy, and Samuel R.
Bowman. 2019b. GLUE: A multi-task bench-
mark and analysis platform for natural language
understanding. In ICLR 2019: 7th International
Conference on Learning Representations.
https://doi.org/10.1162/tacl_a_00290
Alex Warstadt, Amanpreet Singh, and Samuel R.
Bowman. 2019. Neural network acceptability
judgments. Transactions of the Association for
Computational Linguistics, 7:625–641.
Janyce Wiebe, Theresa Wilson, and Claire Cardie.
2005. Annotating expressions of opinions and
emotions in language. Language Resources and
Evaluation, 39(2):165–210.
John Wieting, Mohit Bansal, Kevin Gimpel,
and Karen Livescu. 2016. Towards univer-
sal paraphrastic sentence embeddings. In 4th
International Conference on Learning Repre-
sentations, ICLR 2016, San Juan, Puerto Rico,
May 2-4, 2016, Conference Track Proceedings.
Adina Williams, Nikita Nangia, and Samuel
Bowman. 2018. A broad-coverage challenge
corpus for sentence understanding
through inference. In NAACL HLT 2018:
16th Annual Conference of the North
American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, volume 1, pages 1112–1122.
https://doi.org/10.18653/v1/N18-1101
Thomas Wolf, Lysandre Debut, Victor Sanh,
Julien Chaumond, Clement Delangue, Anthony
Moi, Pierric Cistac, Tim Rault, Rémi
Louf, Morgan Funtowicz, and Jamie Brew.
2019. Huggingface’s transformers: State-of-the-art
natural language processing. arXiv preprint
arXiv:1910.03771.
https://doi.org/10.18653/v1/2020.emnlp-demos.6
Zhi-Qin John Xu, Yaoyu Zhang, and Yanyang
Xiao. 2019. Training behavior of deep
neural network in frequency domain. In
International Conference on Neural
Information Processing, pages 264–274. Springer.
https://doi.org/10.1007/978-3-030-36708-4_22
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime
Carbonell, Ruslan Salakhutdinov, and Quoc V.
Le. 2019. XLNet: Generalized autoregressive
pretraining for language understanding. In
NeurIPS 2019: Thirty-third Conference on
Neural Information Processing Systems,
pages 5753–5763.