Sensitivity as a Complexity Measure for Sequence Classification Tasks

Michael Hahn
Stanford University, USA
mhahn2@stanford.edu

Dan Jurafsky
Stanford University, USA
jurafsky@stanford.edu

Richard Futrell
University of California, Irvine, USA
rfutrell@uci.edu

Abstract

We introduce a theoretical framework for understanding and predicting the complexity of sequence classification tasks, using a novel extension of the theory of Boolean function sensitivity. The sensitivity of a function, given a distribution over input sequences, quantifies the number of disjoint subsets of the input sequence that can each be individually changed to change the output. We argue that standard sequence classification methods are biased towards learning low-sensitivity functions, so that tasks requiring high sensitivity are more difficult. To that end, we show analytically that simple lexical classifiers can only express functions of bounded sensitivity, and we show empirically that low-sensitivity functions are easier to learn for LSTMs. We then estimate sensitivity on 15 NLP tasks, finding that sensitivity is higher on challenging tasks collected in GLUE than on simple text classification tasks, and that sensitivity predicts the performance both of simple lexical classifiers and of vanilla BiLSTMs without pretrained contextualized embeddings. Within a task, sensitivity predicts which inputs are hard for such simple models. Our results suggest that the success of massively pretrained contextual representations stems in part from the fact that they provide representations from which information can be extracted by low-sensitivity decoders.

1 Introduction

What makes some tasks harder and others easier for modern machine learning methods?1 In NLP, simple models based on lexical classifiers provide good performance on some tasks, while strong performance on other tasks has been attained only recently with massive pretrained models. However, there is no unified theoretical framework for understanding these difficulty differences between tasks, or what models might be more or less effective.

1Code: https://github.com/m-hahn/sensitivity.


Existing complexity metrics provide limited practical insight. The Chomsky Hierarchy (Chomsky, 1956) is a prominent classification of formal languages by complexity, but it describes asymptotic worst-case complexity and does not provide a measure of how hard it is to achieve high accuracy on realistic task distributions. Kolmogorov complexity (Li and Vitányi, 1993) is uncomputable and becomes well-defined only in the asymptotic limit. Psycholinguistic complexity metrics such as surprisal (Hale, 2001) and dependency length (Gibson, 1998) only capture formal features of the input, without regard to the task.

We propose sensitivity as a theory of complexity for sequence classification tasks, that is, any task involving learning a function from sequences to labels. The sensitivity of a function, given a distribution over input sequences, quantifies the number of disjoint subsets of the input sequence that can each be individually changed in such a way as to change the output. Intuitively, high-sensitivity functions are complex because a single change in the input, in many different places, can completely change the output; low-sensitivity functions are simpler because the output is predictable from redundant information in many subsets of the input. We will argue that sensitivity predicts which tasks are easy or hard for modern machine learning methods to learn.

Our notion of sensitivity is grounded in a well-studied theory for Boolean functions (O'Donnell, 2014), which we generalize to natural language. Unlike measures like Kolmogorov complexity, sensitivity can be estimated on real datasets and single inputs without asymptotic approximations, requiring only a generalized language model such as XLNet (Yang et al., 2019) and a strong model of the task.

In this paper, we argue that sensitivity captures informal notions of complexity both at the level of architectures and at the level of tasks. First, we show that sensitivity quantifies architectural limitations and inductive biases of various machine
learning architectures used in NLP, including both lexical classifiers and vanilla LSTMs without pretrained contextualized embeddings (Section 3). Second, in a survey of 15 major NLP tasks, we find that sensitivity quantitatively predicts how difficult a task is for simple lexical classifiers and neural models, both across tasks and across different inputs for a single task (Section 4). The validity of our methods for quantifying sensitivity is verified using human experiments in Section 5. Section 6 discusses the relationship of sensitivity to previous theories of complexity and brittleness in neural networks, and implications for NLP practice. Section 7 concludes.

2 Sensitivity

2.1 Analysis of Boolean Functions

We build on notions of sensitivity developed for Boolean functions (Kahn et al., 1988; Hatami et al., 2010; O'Donnell, 2014). The analysis of Boolean functions is a powerful and rigorous theory with wide-ranging applications in theoretical computer science (O'Donnell, 2014). We first introduce the relevant notions, and then explain how these concepts can be generalized to the setting of fully general sequence classification. The sensitivity of a Boolean function f : {−1, 1}^n → {−1, 1} at a bitstring x ∈ {−1, 1}^n is defined as:

s(f, x) = Σ_{i=1}^{n} 1[f(x) ≠ f(x⊕i)],    (1)

where x⊕i is the result of flipping the i-th bit of x.
This describes how many bits of x can be flipped
individually to change f , or equivalently, 如何
many Hamming neighbors of x have the opposite
value of f .

The highest possible sensitivity is attained by the PARITY function fParity(x) := Π_{i=1}^{n} x_i. Given a string of ''1''s and ''−1''s, this function computes whether the number of negative inputs is even (output +1) or odd (output −1). For instance, fParity(1, 1, 1) = fParity(1, −1, −1) = 1 and fParity(1, −1, 1) = fParity(−1, 1, 1) = −1. The function fParity has the property that flipping any individual bit flips the output. For example, given the string ''1 1 1'', changing any of the three input symbols to ''−1'' flips the parity of the string from +1 to −1. Therefore, for every bitstring x ∈ {−1, 1}^n, we have s(fParity, x) = n. It is impossible to approximate fParity beyond chance level with linear functions (Minsky and Papert, 1969), or with linear combinations of functions that contain nonlinear interactions between fewer than n input bits (O'Donnell, 2014). In this sense, the function fParity is maximally nonlinear. On the other hand, low-sensitivity functions can be approximated with linear functions or linear combinations of functions that each only combine a few input bits (O'Donnell, 2014, Thm. 2.38). Sensitivity also has close connections with other complexity measures such as decision tree depth (Nisan, 1991) and the degree of a Boolean function when written as a polynomial.
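
To make the definition concrete, the following minimal Python sketch (not part of the paper's code release) computes s(f, x) for a Boolean function by flipping one bit at a time, and checks that PARITY attains the maximal value s(f, x) = n while a function depending on a single bit has sensitivity 1.

from functools import reduce

def sensitivity(f, x):
    """Number of positions i such that flipping x_i changes f(x) (Equation 1)."""
    flips = 0
    for i in range(len(x)):
        neighbor = list(x)
        neighbor[i] = -neighbor[i]          # x with the i-th bit flipped
        if f(tuple(neighbor)) != f(tuple(x)):
            flips += 1
    return flips

def f_parity(x):
    """PARITY of a +1/-1 string: the product of its entries."""
    return reduce(lambda a, b: a * b, x, 1)

print(sensitivity(f_parity, (1, 1, 1)))        # 3: every single-bit flip changes the output
print(sensitivity(lambda x: x[0], (1, 1, 1)))  # 1: only flipping the first bit matters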

2.2 Application to Sequence Classification

We argue that this theory can be brought to bear to quantify the complexity of sequence classification tasks. In this setting, sensitivity measures the nonlinearity of the decision boundary. Low-sensitivity tasks are those where simple methods based on linear combinations of local features are most successful. For example, low-sensitivity tasks can be solved by bag-of-words classifiers and linear classifiers based on n-gram features, which have bounded sensitivity (as we will make precise in Proposition 1 below). On the other hand, high-sensitivity tasks require more sophisticated methods. We expect that tasks that have proven empirically difficult in the literature, such as those requiring reasoning, correspond to those with high sensitivity, which means that changing different substrings in an input can easily flip the label (e.g., ENTAILMENT ⇒ NONENTAILMENT).

Testing these ideas requires generalizing sensitivity to functions more akin to those relevant in NLP along several aspects. One aspect can be dealt with without major changes: NLP tasks are defined on alphabets Σ with more than two elements, such as the words of a language. The theory can be accommodated to such alphabets, leading to a generalized definition of sensitivity applicable when the symbols X_i are distributed independently and uniformly (rephrased based on O'Donnell, 2014, Def. 8.22):

s(f, x) := Σ_{i=1}^{n} Var(f(X) | ∀j ≠ i : X_j = x_j),    (2)

where the variance measures how much f varies across strings X ∈ Σ^n that agree with x on all except possibly the i-th input.

Figure 1: Subset sensitivity (3) for sentiment analysis, for two inputs from the SST-2 dev set. For each input, we select a one-word subsequence (marked in blue, corresponding to the sets {2} for Sentence 1 and {3} for Sentence 2), and show 10 possible substitutions sampled using XLNet (see Section 4; ''2×'' indicates samples appearing twice). We show the sentiment prediction (between −1.0 for negative and +1.0 for positive sentiment), obtained using RoBERTa (see Section 4), both for the original sentence and for each version arising from substituting any of the other adjectives. In Sentence 1, due to the presence of positive adjectives in the context, the distribution is concentrated on positive adjectives; f(x′) = +1 for each sampled x′ ∈ x⊕P. Therefore, the subset sensitivity s(f, x, P) is estimated as 0.0. In Sentence 2, both positive and negative adjectives are plausible substitutions, and s(f, x, P) = 0.58.

Definition (2) reduces to (1) if Σ = {−1, 1} and f : {−1, 1}^n → {−1, 1}.

More challenging is the fact that symbol sequences in language are not distributed uniformly. For example, in movie review sentiment classification, most inputs will sound like movie reviews (rather than tweets or Wikipedia articles), and almost all will respect the grammatical and statistical properties of the underlying language. When defining a generalization of s(f, x) to natural language, we want to focus on those strings x and their Hamming neighbors x′ that are typical instances of the problem. We next describe an adaptation of Equations (1) and (2) taking this into account.

2.3 Formal Definitions

In order to adapt the idea of sensitivity to the setting of NLP tasks, we introduce a generalized notion called block sensitivity. Block sensitivity is the maximum sensitivity over all possible partitions of the input bits. Block sensitivity has been studied for Boolean functions as an upper bound on (1) (Nisan, 1991; Bernasconi, 1996; Hatami et al., 2010); we construct a probabilistic version of this notion as a sensitivity measure appropriate to general sequence classification tasks.

Consider a set Σ (e.g., the words of a language), with an arbitrary distribution Π over the set Σ* of finite sequences of symbols from Σ. We formalize classification tasks as functions f : Σ* → [−1, 1].2 Such functions could be binary classifiers f mapping to {−1, 1}, or they could output a continuous score. We take the output space to be [−1, 1] rather than [0, 1] to make our definitions consistent with those from the analysis of Boolean functions.

2For multi-class problems, we take a family of functions f corresponding to the classes, see Section 4.

The subset sensitivity of the function f : Σ* → R on the point x ∈ Σ^n and the set P ⊆ {1, . . . , n} is

s(f, x, P) := Var(f(X) | X ∈ x⊕P),    (3)

where x⊕P denotes the set of all strings x′ that agree with x on all indices outside of P:

x⊕P := {x′ ∈ Σ^n : x′_j = x_j for all j ∈ {1, . . . , n} − P},    (4)

and the variance is computed with respect to Π. If P is a singleton {i}, we recover the term inside the sum in (2): s(f, x) = Σ_{i=1}^{n} s(f, x, {i}).
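
As a minimal sketch of how (3) can be estimated in practice (under the assumption that one can draw samples from Π restricted to x⊕P, as done with XLNet in Section 4), the subset sensitivity is simply the variance of the task model's score over those sampled neighbors:

import statistics

def subset_sensitivity(f, neighbor_samples):
    """Monte Carlo estimate of s(f, x, P) = Var(f(X) | X in x^{+P}).

    neighbor_samples: strings drawn from Pi restricted to x^{+P};
    f: a task model returning a score in [-1, 1]."""
    return statistics.pvariance([f(s) for s in neighbor_samples])

# If every sampled substitution keeps the prediction at +1 (as in Sentence 1 of
# Figure 1), the estimate is 0; mixed predictions give a positive variance.
print(subset_sensitivity(lambda s: 1.0, ["sample"] * 10))   # 0.0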

We illustrate this definition in Figure 1 with examples from the Stanford Sentiment Treebank (Socher et al., 2013). Here, the function f maps movie reviews to the probability that the review is positive, scaled to [−1, 1]. For each sentence, we select a singleton subset P and show 10 samples from Π, the distribution over possible substitutions. In Sentence 1, due to the positive adjectives in the context, the distribution is concentrated on positive adjectives, and so the sensitivity s(f, x, P) ≈ 0. In Sentence 2, both positive and negative adjectives are plausible substitutions, and s(f, x, P) ≈ 0.6.

This example shows how (3) differs from the vanilla definition (1) by accounting for the statistical dependencies between words in natural language: It takes into account that the choice of possible completions for a set P is often constrained by the context given by x. Inputs violating
these statistical dependencies (e.g., 'a boring, witty, seductive movie' for Figure 1) are unlikely to occur in naturalistic input, and the behavior of f on such unlikely inputs may not impact the difficulty of representing f with high average fidelity. This motivates considering the variance of f over neighboring strings, rather than, say, the entire range of f over all possible neighboring strings.

Based on subset sensitivity, we introduce the block sensitivity at x as an analogue to (1):

bs(f, x) := max_{k, P_1 ∪̇ ... ∪̇ P_k} Σ_{i=1}^{k} s(f, x, P_i),    (5)

where the maximization ranges over all partitionings of {1, . . . , n} into disjoint subsets P_1 ∪̇ . . . ∪̇ P_k (∪̇ denoting disjoint union). We recover the quantity s(f, x) in (1) and (2) by restricting the subsets P_i to the singletons {i}; hence, we have

bs(f, x) ≥ s(f, x).    (6)

Intuitively, bs(f, x) measures the following: Given an input x, how many disjoint subsequences can be changed individually so as to flip the label? The formal definition modifies this logic by considering, for each subsequence, not only whether changing it to flip the label is possible in principle, but also the probabilities of the different changes. A useful summary statistic is the average block sensitivity:

b̂s(f) = E_{x∼Π} [bs(f, x)].    (7)

Why Consider Subsets?  By considering subsets P instead of single indices i, block sensitivity takes into account that words are composed into phrases, and that changing a phrase might change the meaning when changing any individual word cannot. For example, exchanging the entire phrase 'a gorgeous, witty, seductive' (see Figure 1) with something negative can make the review negative, whereas exchanging any of the individual adjectives cannot, due to the statistical dependencies between the different words. This definition also makes the sensitivity measure robust against tokenization: a more fine-grained tokenization (e.g., into characters) cannot decrease bs(f, x).
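
The maximization in (5) can be carried out explicitly once subset sensitivities have been estimated for a list of candidate subsets. The following minimal sketch (an illustration, not the paper's implementation) exhaustively searches for the family of pairwise disjoint candidate subsets with the largest total sensitivity:

def block_sensitivity(candidates):
    """candidates: dict mapping frozensets of positions to estimated s(f, x, P).
    Returns the best total sensitivity over families of pairwise disjoint subsets."""
    subsets = list(candidates)

    def best(i, used):
        if i == len(subsets):
            return 0.0
        skip = best(i + 1, used)             # do not include subsets[i]
        take = 0.0
        if not (subsets[i] & used):          # include it only if disjoint from the chosen ones
            take = candidates[subsets[i]] + best(i + 1, used | subsets[i])
        return max(skip, take)

    return best(0, frozenset())

# {2} and {5, 6} are disjoint, so both count; {2, 3} overlaps with {2} and is dropped.
print(block_sensitivity({frozenset({2}): 0.58,
                         frozenset({2, 3}): 0.40,
                         frozenset({5, 6}): 0.30}))   # 0.88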

3 Sensitivity Bounds for NLP Methods

Many statistical NLP methods proposed over the past decades involve linear combinations of features that look at individual words or groups of a few words. Proposition 1 shows that such methods can only express functions of bounded block sensitivity, with an upper bound quadratic in the number k of inputs the model looks at simultaneously, independently of input length n.

Proposition 1. Let f be any function Σ* → R parameterized as follows:

f(x) := h( (1/n) Σ_{i=1}^{n−k} f_{i,n}(x_i, . . . , x_{i+k}) ),    (x ∈ Σ^n)    (8)

where f_{1,n}, . . . , f_{n−k,n} are functions Σ^k → R^d such that max_{x∈Σ^k} ||f_{i,n}(x)||_2 ≤ C, and h : R^d → R is L-Lipschitz continuous. Then, independently of input length n, we have

bs(f, x) ≤ 2 L^2 C^2 k^2.    (9)

Proof. Fix a partition P_1 ∪̇ . . . ∪̇ P_l = {1, . . . , n}. Write g(x) for the average inside h(·) in (8). Changing the inputs in P_i affects up to k|P_i| of the summands in g. The ℓ_2 norm of the sum of these affected terms is bounded by C k |P_i| / n, hence

Var(f | X ∈ x⊕P_i) = (1/2) E_{X,Y ∈ x⊕P_i} |f(X) − f(Y)|^2 ≤ (L^2/2) E_{X,Y ∈ x⊕P_i} ||g(X) − g(Y)||_2^2 ≤ 2 L^2 C^2 k^2 |P_i|^2 / n^2.

Given Σ_{i=1}^{l} |P_i|^2 ≤ (Σ_{i=1}^{l} |P_i|)^2 = n^2, we find

Σ_{i=1}^{l} s(f, x, P_i) ≤ (2 L^2 C^2 k^2 / n^2) Σ_{i=1}^{l} |P_i|^2 ≤ 2 L^2 C^2 k^2.  □
This result has direct bearing on a wide variety of methods used in NLP, such as averaging word embeddings to construct sentence embeddings (Wieting et al., 2016; Arora et al., 2017; Ethayarajh, 2018), CNNs (Kim, 2014) with average pooling, and log-linear models with n-gram features. The parameter k equals 1 for models averaging word embeddings, the kernel width for CNNs with average pooling, and n for models using n-gram features. C describes the norm of word embeddings, of the output of a CNN kernel, or of the weights of a linear model. Lipschitz functions h include the sigmoid function σ used in logistic regression and its generalization softmax, which are 1-Lipschitz, and feedforward networks with Lipschitz activations.
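
As an illustration of the model family covered by Proposition 1, the following minimal sketch implements the k = 1 case, a bag-of-embeddings classifier; the embedding table and weights here are random stand-ins rather than trained parameters. Because the per-word features are averaged and passed through a 1-Lipschitz sigmoid, (9) bounds its block sensitivity independently of input length.

import numpy as np

def boe_classifier(tokens, emb, w, b):
    """f(x) = sigmoid(w . mean_i emb(x_i) + b), an instance of (8) with k = 1."""
    vectors = np.stack([emb[t] for t in tokens])   # each ||emb(x_i)||_2 <= C
    pooled = vectors.mean(axis=0)                  # the averaging step (1/n) sum_i f_i(x_i)
    logit = float(w @ pooled + b)                  # linear readout, Lipschitz in the average
    return 1.0 / (1.0 + np.exp(-logit))            # h = sigmoid, which is 1-Lipschitz

rng = np.random.default_rng(0)
emb = {word: rng.normal(size=8) for word in ["a", "gorgeous", "witty", "seductive", "movie"]}
print(boe_classifier(["a", "gorgeous", "witty", "seductive", "movie"],
                     emb, rng.normal(size=8), 0.0))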

Figure 2: LSTMs are biased towards low-sensitivity functions. Left: Distribution of the sensitivity of Boolean functions defined by randomly initialized LSTMs (green and blue) and by the uniform distribution (red) over functions f : {−1, 1}^7 → {−1, 1}. Right: Losses for an LSTM (128 hidden units) fitting random functions f : {−1, 1}^N → R (N = 7, 10, 15) with given sensitivities, after 10^2, 10^3, 10^4, 10^5 iterations of training.


RNNs and LSTMs (Hochreiter and Schmidhuber, 1997) can express functions of any sensitivity, such as fParity, because they can express all regular languages (Horne and Hush, 1994). On the other hand, transformers (Vaswani et al., 2017) have asymptotically bounded sensitivity as the input length n increases (Hahn, 2020, Lemma 5).

We show that even LSTMs have a learning bias towards low-sensitivity functions, despite their theoretical capacity to represent high-sensitivity functions. We consider functions f : {−1, 1}^n → R where inputs are uniformly distributed over {−1, 1}^n. We first evaluated average block sensitivity both for randomly initialized LSTMs and for the uniform distribution over Boolean functions {−1, 1}^7 → {−1, 1}. We constructed Boolean functions from a randomly initialized LSTM by obtaining a scalar output and making this a binary output f based on a threshold chosen to maximize Var(f). We initialized the LSTM's weights uniformly from [−d^{−0.5}, d^{−0.5}] or from a Gaussian with σ^2 = d^{−1}, where d is the number of hidden units. Results are shown in Figure 2, for d = 128 and d = 256. Random Boolean functions have block sensitivity tightly concentrated around ≈ 4.5, whereas the randomly initialized LSTMs consistently show lower block sensitivity. This suggests that low-sensitivity functions are 'overrepresented' in the LSTM parameter space, echoing a theoretical result for feedforward networks (Palma et al., 2019).

Second, we directly examined the learnability of functions of different sensitivities. As randomly chosen functions have tightly clustered sensitivity, we sampled3 functions with a specific targeted average sensitivity as(f) = (1/2^n) Σ_{x∈{−1,1}^n} s(f, x). We did this for sequence lengths n = 7, 10, 15. For each i = 1, . . . , n, we constructed five such functions, and then trained an LSTM (128 hidden units) for 10^5 iterations with Adam (learning rate 0.003, batch size 32), and recorded the average mean squared error after 10^2, 10^3, 10^4, 10^5 training iterations. Training batches and test examples are sampled uniformly from the 2^n elements of {−1, 1}^n, without consideration of out-of-sample generalization. Results are shown in Figure 2. For n = 7, we arrange functions by b̂s(f); for n = 10, 15 we take as(f) instead, as it can be computed efficiently and is strongly correlated with b̂s(f) at n = 7 (R = 0.95). Low-sensitivity functions are learned perfectly with fewer iterations, whereas high-sensitivity functions are not approximated much better than chance even after 10^5 training iterations. We note that this is a result on the ability to simply fit a function of 2^n inputs, not the (harder) task of generalizing to unseen input.
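
A minimal PyTorch sketch of this fitting experiment is given below; the hyperparameters are simplified relative to the setup above, and the random target values stand in for a sampled function with a chosen average sensitivity.

import itertools
import torch
import torch.nn as nn

n, hidden = 7, 128
inputs = torch.tensor(list(itertools.product([-1.0, 1.0], repeat=n)))   # all 2^n strings, shape (2^n, n)
target = torch.randn(len(inputs))                                        # stand-in for a sampled target function

lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
readout = nn.Linear(hidden, 1)
opt = torch.optim.Adam(list(lstm.parameters()) + list(readout.parameters()), lr=0.003)

for step in range(1000):                         # the runs above use up to 10^5 iterations
    idx = torch.randint(0, len(inputs), (32,))   # batch size 32, sampled uniformly
    batch = inputs[idx].unsqueeze(-1)            # shape (32, n, 1): one +/-1 symbol per timestep
    _, (h, _) = lstm(batch)                      # final hidden state summarizes the sequence
    pred = readout(h[-1]).squeeze(-1)            # scalar prediction per sequence
    loss = ((pred - target[idx]) ** 2).mean()    # mean squared error against the target function
    opt.zero_grad()
    loss.backward()
    opt.step()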

4 Sensitivity and Difficulty of NLP Tasks

In Section 3, we provided evidence that sensitivity describes how hard a function is to learn and represent for simple machine learning architectures that do not include pretrained contextual embeddings. In this section, we argue empirically that sensitivity is successful at capturing intuitive notions of task difficulty: Low-sensitivity tasks are those on which simple classifiers as described in Proposition 1, and vanilla LSTMs without pretraining, are relatively successful. More challenging tasks such as those collected in the GLUE suite (Wang et al., 2019b) have higher sensitivity.

3For each i = 1, . . . , n, we sampled functions f whose Fourier spectrum is entirely concentrated on degrees {i − 1, i, i + 1}. By O'Donnell (2014, Thm. 2.38), as(f) ≈ i.

Estimating block sensitivity (5) requires two ingredients: an estimate of the distribution of neighboring strings Π(X | X ∈ x⊕P), and an estimate of f on this set. We approximate Π via a language model, and f via a trained model that is known to attain strong performance on the task. That is, we estimate the sensitivity of a task f by measuring the sensitivity of a model f′ that is known to provide a close fit to f on the task's input distribution. In Section 5, we report human annotation studies that justify this approximation.

Sampling Neighboring Strings  For estimating Π(X | X ∈ x⊕P), we leverage the ability of XLNet (Yang et al., 2019) and u-PMLM (Liao et al., 2020) to model prediction in any order. We use the pretrained xlnet-large-cased model provided in Wolf et al. (2019), and a pretrained u-PMLM model trained on the 1 Billion Word benchmark (Chelba et al., 2014). As these models take input on the level of subword tokenizations, we require all samples to consist of the same number of subword symbols as in the span covered by P. To enable meaningful comparison with traditional tokenization and with human intuitions, we only consider subsets P that respect whitespace. We take 10 samples for each P. For tasks with short inputs (text classification and CoLA), we finetune XLNet on the training set to produce completions more in line with the task-specific input distribution. Due to compute availability, we did not apply this procedure to other tasks. Finetuning XLNet slightly increased estimated sensitivity; as we applied it to those tasks expected to have low sensitivity, this procedure potentially makes comparisons between tasks more conservative.
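
The following minimal sketch illustrates the neighbor-sampling step for a singleton subset P. For brevity it uses a standard fill-mask pipeline with a masked language model rather than the arbitrary-order XLNet / u-PMLM sampling described above (an assumption made for this illustration); multi-token subsets would require sampling the masked positions one at a time.

from transformers import pipeline

fill = pipeline("fill-mask", model="roberta-base")   # any masked LM with a mask token

def sample_neighbors(tokens, position, top_k=10):
    """Return up to top_k variants of `tokens` that differ only at `position`."""
    masked = list(tokens)
    masked[position] = fill.tokenizer.mask_token
    candidates = fill(" ".join(masked), top_k=top_k)
    return [" ".join(tokens[:position] + [c["token_str"].strip()] + tokens[position + 1:])
            for c in candidates]

print(sample_neighbors("a gorgeous , witty , seductive movie".split(), position=5))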

Tasks  First, we consider four text classification tasks: movie review sentiment (MR, Pang and Lee, 2005), sentence subjectivity (SUBJ, Pang and Lee, 2004), customer review sentiment (CR, Hu and Liu, 2004), and opinion polarity (MPQA, Wiebe et al., 2005). On these tasks, low-sensitivity models such as CNNs are known to achieve good performance (Kim, 2014). To approximate the functions f, we finetune roberta.large.mnli using fairseq for each of the tasks using a single set of hyperparameters.

Second, we selected all tasks of the GLUE challenge suite (Wang et al., 2019b), designed to require a good amount of nontrivial language understanding. GLUE contains inference, similarity, and paraphrase tasks (MNLI, Williams et al. (2018); MRPC, Dolan and Brockett (2005); QNLI, Rajpurkar et al. (2016); QQP; STS-B, Cer et al. (2017); RTE, Dagan et al. (2009)), an NLI version of the Winograd schema challenge (Levesque et al., 2012), linguistic acceptability judgments (CoLA, Warstadt et al., 2019), and the Stanford Sentiment Treebank (SST-2, Socher et al., 2013). On many of these tasks, simple BOW baselines perform essentially at chance (Wang et al., 2019b). We obtain predictions by finetuning RoBERTa (roberta.large.mnli) using fairseq (Ott et al., 2019) with the provided hyperparameters.4 RoBERTa provides performance close to or exceeding estimated human performance on all GLUE tasks. For the Winograd schema challenge, we took the WSC version from SuperGLUE (Wang et al., 2019a) instead of the NLI reformulation (WNLI) used in GLUE; we used the pretrained model roberta.large.wsc. Unlike WNLI, WSC is a single-span task, reducing the number of subsets P considered.

Third, we considered sequence classification formulations of POS tagging and syntactic parsing. For 150 dev sentences in the English Web Treebank (Silveira et al., 2014), we considered the word at the median position of the sentence, and estimated the sensitivity of identifying (1) its POS tag in the universal tagset (Petrov et al., 2012), (2) its Universal Dependencies label (Nivre et al., 2016), and (3) the relative position of its head, as an integer. All three tasks are formalized as multi-class classification problems. We estimated all three computations using the pretrained English dependency parser provided in Stanza (Qi et al., 2018; Qi et al., 2020).

Fourth, we considered two datasets probing syntactic knowledge of anaphor licensing (Marvin and Linzen, 2018; Hu et al., 2020), namely, tasks 248 and 260 in SyntaxGym (Gauthier et al., 2020). These tasks ask a model to choose a singular (himself) or plural (themselves) reflexive after a context where only one is grammatical, but identifying the right reflexive requires syntactic knowledge.

4https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.glue.md, retrieved June 1, 2020.

We modeled f using the medium-sized GPT2 model (Radford et al., 2019). We chose this task because it could be formalized as a binary classification problem, and because GPT2 performed better on this task than on the more familiar subject-verb agreement (and on the feminine version with herself).

For each task, we estimated sensitivity for at least 150 dev examples, determined by compute availability. For the syntactic tasks, we estimated sensitivity on the full dataset, as language models are evaluated on these tasks without finetuning.

We considered continuous predictions in [−1, 1] for binary classification tasks, and in [−1, 1]^d for multiclass tasks with d classes, obtained from the sigmoid or softmax layer of the relevant models. For STS-B, we rescale continuous similarity scores to [−1, 1]. For parsing and WSC, we used the discrete output labels provided by the pretrained models, represented as one-hot vectors in {−1, 1}^d or binary labels in {−1, 1}. For multivariate output f(x) ∈ [−1, 1]^d, we define s(f, x, P) by computing it for each of the d coordinates of f(x), and taking the maximum value over these. The resulting sensitivity estimates describe the behavior of the coordinate of f that has the most nonlinear decision boundary around x.

Lower Bound Approximation  Calculating block sensitivity (5) requires calculating the variance for each of the exponentially many subparts P of the input, which is intractable for all but short inputs. We restrict consideration to a polynomial number of subparts, thus obtaining a lower bound on full block sensitivity. We only consider (1) subsets of 1, . . . , 8 adjacent tokens, and (2) unions of the sets {x_{in/7}, . . . , x_{(i+1)n/7−1}} for i = 1, . . . , 7. For parsing tasks, we additionally consider all subsets in a window of 7 tokens around the relevant word. This bounds the number of subsets by 8n + 256, compared to 2^n for full block sensitivity.
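
A minimal sketch of this candidate set is given below (assuming 0-based token positions; whether the unions of sevenths are restricted to adjacent sevenths is an implementation detail not fixed above, and the sketch uses adjacent ones):

def candidate_subsets(n, max_window=8, parts=7):
    """Candidate subsets for the lower-bound approximation (0-based positions)."""
    cands = set()
    # (1) all windows of 1..max_window adjacent tokens
    for width in range(1, max_window + 1):
        for start in range(0, n - width + 1):
            cands.add(frozenset(range(start, start + width)))
    # (2) unions of adjacent sevenths of the input
    bounds = [round(i * n / parts) for i in range(parts + 1)]
    sevenths = [frozenset(range(bounds[i], bounds[i + 1])) for i in range(parts)]
    for i in range(parts):
        for j in range(i, parts):
            cands.add(frozenset().union(*sevenths[i:j + 1]))
    cands.discard(frozenset())
    return cands

print(len(candidate_subsets(20)))   # polynomial in n, versus 2^20 subsets for full block sensitivity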

4.1 Results

Across the 15 tasks, XLNet and u-PMLM yielded very similar estimates of average block sensitivity (R = 0.87, p = 7 · 10^{-6}). In Figure 3, we show block sensitivity across tasks as estimated by XLNet. The left panels show kernel density estimates of the distribution of bs(f, x) over the inputs x from the dev sets. The right panels show the estimated average block sensitivity b̂s(f).
Figure 3: Block sensitivity: For each task, we provide a smoothed histogram of the block sensitivity per input (left), and the average block sensitivity (right). Estimates obtained using XLNet; compare Figure 6 for u-PMLM.

Text classification tasks have low estimated block sensitivity, with bs(f, x) concentrated on values lower than three. For the two syntactic tasks, sensitivity is slightly higher; in comparison to the text classification tasks, the histograms show that these tasks have no datapoints with very low sensitivity. For parsing, we see a substantial difference between POS tagging and relation labeling on the one hand, and head identification on the other hand. Identifying tags and relations has lower sensitivity, comparable to the text classification tasks, whereas identifying the relative position of the head has higher sensitivity. This makes sense: The relative position of the head is sensitive to intervening words that, while not changing the syntactic relation, change the numerical distance between head and dependent. Finally, for GLUE, we observe a wide range of sensitivity scores. SST-2, a sentiment analysis task, has sensitivity very similar to the (other) text classification tasks, as do STS-B (semantic similarity) and QQP (identifying redundant Quora questions).

Figure 4: Two inputs from SST-2. The first one has low block sensitivity (0.93), as our models find only one sensitive subset P. We show one completion sampled from x⊕P that flips the label predicted by RoBERTa from POSITIVE to NEGATIVE. The second input has higher block sensitivity (1.88), with three disjoint sensitive subsets. For each subset, we show a completion sampled using XLNet that flips the predicted label.

Other tasks show substantially higher scores; the highest estimated average block sensitivities are attained by RTE, MRPC, and WSC, three tasks designed to require nontrivial reasoning.

To provide insight into these results, we show examples from SST-2 and RTE, with samples from XLNet. In Figure 4, we show two examples from SST-2. The first example has low sensitivity, as our models find only one sensitive subset. In the second example, our models find three disjoint sensitive subsets, leading to higher sensitivity. In Figure 7, we show an example from RTE, consisting of a premise and a hypothesis. The models identify five highly sensitive subsequences, such that changing the input on any of these subsequences can flip the label from ENTAILMENT to NONENTAILMENT.

Sensitivity and Sentence Length  Sensitivity might be higher on longer sentences, because they can be partitioned into more sets P. Does this explain away the differences between tasks? Figure 5 shows per-sentence sensitivity (estimated using XLNet) as a function of sentence length. The left panel compares sensitivity on simple text classification tasks and on CoLA, a GLUE task consisting of short sentences. For the simple text classification tasks, sensitivity increases sharply for very short sentences, but then plateaus. For CoLA, it increases with length. The right panel shows averaged values of bs(f, x) across the tasks in each of the four categories. Again, sensitivity increases for GLUE and dependency parsing, while it plateaus for text classification.

Figure 5: Per-example block sensitivity as a function of sentence length. Left: Comparing text classification tasks with CoLA, a single-span GLUE task. Right: Block sensitivity across task groups.


Figure 6: Sensitivity and simple models: Average block sensitivity as estimated using XLNet (top) and u-PMLM (bottom) against the error reduction (in % of previously misclassified examples) of a Bag-of-Embeddings (BoE) model, a vanilla BiLSTM, and RoBERTa against the majority class baseline on the dev set.

The two syntactic tasks consist of short and tightly controlled sentences; in relation to their lengths, their sensitivities are particularly high.

Average Block Sensitivity and Simple Models  Based on Section 3, we hypothesized that tasks with low sensitivity correspond to those for which bag-of-words models can meaningfully outperform the majority class baseline, and those on which vanilla LSTM models do best. In Figure 6, we plot average block sensitivity against the error reduction (in % of previously misclassified examples) of a bag-of-embeddings (BoE) model,5 a vanilla BiLSTM,6 and RoBERTa against the majority class baseline, on the development sets.

5This model averages GloVe (Pennington et al., 2014) embeddings and applies a one-layer MLP to derive a prediction. This model is called CBOW in Wang et al. (2019b); however, we apply BoE to concatenated spans in the case of multi-span tasks, in line with the definition of sensitivity.

6The syntax tasks have no training sets and we thus do not report BiLSTM results; we deduced necessarily at-chance performance for BoE from the design of the task. We excluded STS-B because it cannot be evaluated with accuracy.

Figure 7: An example from RTE, consisting of a premise and a hypothesis. In this example, the premise entails the hypothesis. We show the sensitive subsets Pi identified by the models; for each of them, we show one of the completions created by XLNet that flip the label predicted by RoBERTa from ENTAILMENT to NONENTAILMENT. In this example, five highly sensitive subsequences (two in the premise and three in the hypothesis) were identified.

BoE instantiates the model described in Proposition 1 with k = 1; thus, we expect the top right of this graph to be empty for BoE: there can be no high-sensitivity task on which the BoE model provides strong quantitative performance. For both BoE and the vanilla BiLSTM, average sensitivity was negatively associated with error reduction (XLNet: R = −0.71, p = 0.001 for BoE; R = −0.82, p = 0.0002 for BiLSTM. u-PMLM: R = −0.66, p = 0.005 for BoE; R = −0.76, p = 0.002 for BiLSTM), while no association was observed for RoBERTa (XLNet: R = −0.05, p = 0.87; u-PMLM: R = −0.07, p = 0.84). We compared sensitivity as a predictor with label entropy, which showed little association with the error reduction of either BoE or the vanilla BiLSTM (both p > 0.1).

Which Inputs Have High Sensitivity?  We used the Stanford Sentiment Treebank (SST-2, Socher et al., 2013) to investigate which inputs have high sensitivity in sentiment classification. We extracted the 445 dev inputs for which we had estimated sensitivity (determined by compute availability). The dataset contains syntactic parses, with human sentiment annotation for each constituent. We hypothesized that inputs have high sensitivity when different constituents have different sentiment.


Figure 8: Left: Block sensitivity and dispersion (see text) of the sentiment labels of constituents. Right: Accuracy as a function of sensitivity in sentiment analysis.

We focus on estimates from XLNet for simplicity; results from u-PMLM are qualitatively identical. We measured the dispersion of sentiment labels over constituents by enumerating the positive (+1) and negative (−1) labels of all constituents and computing the standard deviation of the resulting distribution; it is 1 if as many constituents have positive sentiment as have negative sentiment.
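
A minimal sketch of this dispersion measure (assuming the constituent labels have already been read off the treebank parse):

import statistics

def label_dispersion(constituent_labels):
    """Standard deviation of the +1/-1 sentiment labels of an input's constituents."""
    return statistics.pstdev(constituent_labels)

print(label_dispersion([+1, +1, +1]))          # 0.0: all constituents agree
print(label_dispersion([+1, +1, -1, -1]))      # 1.0: maximal disagreement
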
Figure 8 (left) shows this dispersion measure as a function of sensitivity. High-sensitivity examples have higher dispersion. In a linear regression with dispersion and sentence length as predictors of sensitivity, dispersion was highly significant (β = 0.53, p < 1.95 · 10^{-10}), while length was not (β = −0.00, p = 0.49). This is illustrated by the examples in Figure 4 discussed above, where dispersion correlates with sensitivity: The first example has low block sensitivity (0.93) and low label dispersion (0.0); the sentence is labeled positive and no constituent is labeled negative. The second example has higher block sensitivity (1.88) and very high label dispersion (0.94): while the sentence is labeled positive, three constituents are labeled positive and five negative.

Second, we hypothesized that a BoE classifier and a vanilla LSTM perform better on low-sensitivity examples, whereas RoBERTa should provide better performance also on higher-sensitivity examples. This is confirmed by Figure 8 (right), where we show the accuracy of BoE, BiLSTM, and RoBERTa as a function of sensitivity. In a logistic regression with sensitivity and sentence length as predictors of BoE accuracy, sensitivity was again highly significant (β = −1.22, p = 4.1 · 10^{-10}). Findings were similar for the BiLSTM (β = −1.16, p = 1.41 · 10^{-9}). When predicting the accuracy of RoBERTa, there was still a measurable effect of sensitivity (β = −1.37, p = 1.6 · 10^{-5}), but overall Figure 8 shows that RoBERTa provides more accurate predictions on higher-sensitivity input. Sentence length was not a significant predictor of accuracy for any of the three models (all p > 0.05).

If we choose s(F, X) instead of bs(F, X), 即,
restricting to singletons P , there is still a signif-
icant effect of s(F, X) on BoE accuracy (β =
−1.24, p = 1.1 · 10−6), but with inferior model fit
compared to bs(F, X) (ΔDeviance = 20.0), 骗局-
firming block sensitivity as the more appropriate
difficulty measure for simple NLP models.

Role of Task Model  We have estimated the sensitivity of GLUE and text classification tasks using a large pretrained transformer model (RoBERTa). What would happen if we used a model outside the family of massive pretrained contextual embeddings? To answer this, we estimated bs(f, x) on SST-2 and RTE using the vanilla BiLSTM to represent f. On SST-2, sensitivity estimated with the BiLSTM correlated with sensitivity estimated with RoBERTa on those inputs where the BiLSTM provides correct predictions (R = 0.36, p = 2 · 10^{-11}), but not on those (typically higher-sensitivity ones) where its predictions are incorrect (R = 0.15, p = 0.21); a linear regression confirmed that RoBERTa's sensitivity was more predictive of the BiLSTM's sensitivity in those cases that the BiLSTM labeled correctly (β = 0.2, p = 0.004). On RTE (where the BiLSTM's accuracy is at chance), the BiLSTM's sensitivity was at a constant low value (≈ 0.5) for all inputs. This illustrates that automatic estimation of sensitivity requires a strong model that is able to achieve the sensitivity levels required by a task.

Role of Lower Bound Approximation  We evaluated the role of the lower bound approximation on 20 inputs from SST-2 of between 8 and 11 words each, long enough to make the approximation inexact but still allowing consideration of all 2^n subsets. We compared estimates of bs(f, x) based on the approximation (≤ 216 subsets) and the full power set (≤ 2^11 = 2048 subsets). On average, the approximation decreased estimates of bs(f, x) from 1.59 to 1.35. However, the two estimates were almost perfectly correlated
(R = 0.95, p < 10^{-10}). Even when restricting to singletons P (up to 11 subsets), the correlation remained high (R = 0.81, p < 0.0001). Thus, while the approximation may underestimate the numerical values of bs(f, x), it preserves the relative sensitivities of different inputs.

5 Human Validation

In Section 4, we estimated the sensitivity of NLP tasks by plugging a model f′ of the task f into equation (5). This methodology requires that the model f′ provides good labels on the samples from x⊕P obtained using the language models. As the language models only approximate the input distribution, their samples could fall outside of the data distribution on which f′ approximates the true task f at high accuracy. If this were the case, high estimated sensitivity on tasks such as RTE might reflect brittleness of large models rather than true high sensitivity. Here, we show that this is not the case: Reasoning tasks like RTE have higher sensitivity than text classification tasks like SST-2, even when using human labels.

5.1 Experiment 1: Validating Oracle Model

For 60 items from SST-2 and 30 items from RTE each, we collected the subsets P_1, . . . , P_k achieving the maximum in (5), with 6 samples from XLNet for each subset (we collected fewer items from RTE because they typically have more sensitive subsets P_i, making annotation more expensive). We then recruited naive participants who labeled these samples; each sample was labeled by two or three annotators. In addition to the appropriate labels (''positive'' and ''negative'' for SST-2, ''entails'' and ''does not entail'' for RTE), participants were also provided with a ''makes no sense'' option. We repeated the study for SST-2 both with and without finetuning. The rate of ''makes no sense'' responses on SST-2 was 18% without finetuning and 11% with finetuning; it was 12% on RTE.

The agreement between RoBERTa and the modal human label was 80% (without finetuning) and 85% (with finetuning) on SST-2, and 72% on RTE; compared to 87%, 92%, and 79%, respectively, for the average agreement between a single annotator and the modal label. Interannotator agreement is below the human accuracies reported by Nangia and Bowman (2019); we note that the creators of RTE specifically excluded items where human annotators did not agree (Dagan et al., 2009) and that SST-2 excludes reviews labeled as neutral (Socher et al., 2013); we thus expect lower agreement on other strings from the same domain.

The key question is whether these levels of agreement guarantee consistent sensitivity estimates. Figure 9 (left) compares block sensitivity estimated using RoBERTa with values obtained by plugging in average human labels for the function f(·). On both SST-2 and RTE, the values are strongly correlated (SST-2 with and without finetuning both R = 0.85; RTE: R = 0.91; all
p < 2.2 · 10^{-16}). On RTE, human estimates are numerically lower than automatic estimates, but the difference in average sensitivity between SST-2 and RTE was strongly replicated by the human estimates (β = 1.3, p = 1.3 · 10^{-14} in a linear regression). These results indicate that a strong model of a task leads to results similar to a human oracle. In particular, the qualitative difference in sensitivity between SST-2 and RTE is replicated when using human labels.

Figure 9: Results of Experiments 1 and 2: Left: Sensitivity on SST-2 and RTE calculated using RoBERTa's labels (x-axis) and using human labels (y-axis). On both tasks, both versions are highly correlated (R > 0.8 in both tasks). Right: Results of Experiment 2: Average number of disjoint subsets on which participants change inputs to flip the label, as a function of estimated sensitivity on SST-2 and RTE.

5.2 Experiment 2: Manual Approximation

Experiment 1 showed that human and model labels yield similar results in estimating sensitivity. However, we still relied on the subsets P_i generated by the models. Here, we show that sensitivity, both on the level of individual inputs and on the level of tasks, relates to human intuitions about the number of disjoint subsequences that can be changed to flip the label, which can be easily estimated without any model. We asked 30 naive individuals to find disjoint subsets in inputs from SST-2 and RTE such that changing the words in any one of them would flip the label. Each participant worked on 30 items from one of the tasks. They rewrote sentences by clicking on words they wanted to replace and entering text replacing those. After submitting a rewrite, participants had the option of identifying another subset disjoint from the previously selected words. They changed at least one subset for every input, and were provided a bonus for every additional subset, incentivizing them to find as many disjoint subsequences as possible. For both SST-2 and RTE, participants were shown an example with instructions guiding them to change three disjoint subsequences. For RTE, we only allowed participants to modify the premise, as similar changes are often possible in premise and hypothesis.

We interpreted the number of disjoint subsets found by participants as a proxy for block sensitivity. This quantity is different from block sensitivity (5), as it does not weight the relative probabilities of all possible changes, and we thus do not expect the same numerical values for both quantities. An exact human estimate of block sensitivity would rely on asking humans both to create multiple samples from x⊕P for different subsets P and to then label these, which is infeasible given the large number of possible subsets of each input. In contrast, the task described here only requires annotation from a few annotators for every input. Figure 9 (right) shows the average number of changes made on each input, as a function of the sensitivity estimated by XLNet+RoBERTa. We conducted a mixed-effects Poisson regression of the number of changes made on the inputs, with random effects for items and subjects. Sensitivity predicted the number of changes (β = 0.061, SE = 0.02, p = 0.0023), and there were overall more changes for RTE than for SST-2 (β = 0.39, SE = 0.097, p = 6 · 10^{-5}). Input length was not predictive (β = −0.0015, SE = 0.002, p = 0.32). This result shows that a fully manual annotation task can approximate differences in sensitivity both between inputs (effect of sensitivity) and between tasks (effect of the contrast between RTE and SST-2).

6 Discussion

We have proposed sensitivity as a theoretical framework for studying the complexity of sequence classification tasks, arguing that it captures complexity both across several machine learning architectures (Section 3) and across NLP tasks (Section 4).

Prior work has studied the ability of RNNs and transformers to represent and learn languages in different classes of the Chomsky hierarchy (e.g., Merrill, 2019). Sensitivity is orthogonal to the Chomsky hierarchy: The maximally sensitive function fParity has a two-state finite automaton, but there are also low-sensitivity functions that are not even computable. Sensitivity is also distinct from Kolmogorov complexity and similar description length measures (Li and Vitányi, 1993): fParity has high sensitivity but very low description length. Whereas Kolmogorov complexity is uncomputable and can only be approximated asymptotically, sensitivity can be calculated for individual inputs, enabling us to explicitly evaluate it as a predictor of difficulty on NLP tasks.

Implications for NLP Practice  Our results in Section 4 suggest that pretrained contextualized embeddings have been so successful in NLP because they make it possible to learn high-sensitivity functions with modest amounts of task-specific training data. We conjecture that, through large-scale pretraining, models implicitly learn high-sensitivity operations that are generally useful for language understanding. Finetuning such models for classification tasks (Howard and Ruder, 2018; Peters et al., 2018; Devlin et al., 2019) amounts to composing a high-sensitivity model with a low-sensitivity classifier. Some classical techniques can also be interpreted in this light, such as aligning parse trees (a potentially high-sensitivity computation) and extracting features from these alignments that are then fed into an SVM (a low-sensitivity classifier) as an approach to tasks like RTE (Dagan et al., 2009).

Decision Boundaries in NLP  The decision boundaries of NLP models are commonly studied to understand their linguistic knowledge (e.g., Linzen et al., 2016; Marvin and Linzen, 2018; Futrell et al., 2019; Jeretic et al., 2020). Kaushik et al. (2020) and Gardner et al. (2020) propose to improve NLP models and their evaluation by specifically considering input pairs that differ in some part and in their (true) label. Datta et al.
(2020) propose to quantify the difficulty of an input by the largest eigenvalue of the Fisher information matrix of a task model, finding that it predicts how sensitive classifiers are to word substitutions.

Sensitivity is different from widely studied phenomena of adversarial brittleness (Szegedy et al., 2014; Jia and Liang, 2017): The existence of adversarial examples typically means that natural examples have some neighbors, possibly outside of the input distribution, on which model output changes even though the true label does not. In contrast, high sensitivity means that there are many neighboring inputs within the data distribution on which the true label changes. Sensitivity may be related to the observation that models often rely on spurious statistical patterns, such as simple lexical correlates of the label observed in reading comprehension datasets (e.g., Kaushik and Lipton, 2018; Gururangan et al., 2018); we expect that such artifacts decrease task sensitivity as they make the gold labels correlated with the output of simple lexical classifiers. Similarly, if the premise alone is predictive of the label in an entailment task (Poliak et al., 2018), changing the hypothesis while staying within the task distribution is less likely to flip the label, again decreasing sensitivity.

Inductive Biases in Neural Networks  There is empirical and theoretical evidence that the generalization capabilities of neural networks are in part due to a bias towards ''simple'' functions, with different formal notions of simplicity (e.g., Franco, 2006; Palma et al., 2019; Valle-Perez et al., 2019). A few studies explicitly propose notions similar to low sensitivity as describing simplicity (Franco, 2006; Palma et al., 2019; Novak et al., 2018). Relatedly, empirical work shows that neural networks learn low frequencies in the Fourier spectrum of functions first (Rahaman et al., 2019; Xu et al., 2019; Cao et al., 2019). As low average sensitivity corresponds to concentration of the Fourier spectrum on low frequencies (O'Donnell, 2014, Prop. 3.2), this can be understood as a bias towards low sensitivity. One aspect distinguishing our results from these prior studies is that we measure the sensitivity of realistic functions arising as NLP tasks, and on distributions reflecting the nontrivial statistics of natural language. Measuring sensitivity or Fourier spectra on other machine learning tasks is an interesting problem for future research.

7 Conclusion

We proposed block sensitivity as a complexity measure for functions from sequences to labels, applying the measure to quantify the complexity of sequence classification tasks in NLP. Block sensitivity generalizes well-understood complexity measures from the theory of Boolean functions to the setting of natural language. We showed both theoretically and empirically that low sensitivity characterizes tasks on which simple models without massive pretraining provide reasonable performance, and that, in such tasks, more difficult inputs correspond to those with high sensitivity. Our results show that pretrained contextual embeddings enable models to learn tasks with higher sensitivity, and suggest designing challenging tasks by maximizing sensitivity.
Acknowledgments We thank Judith Degen, Kawin Ethayarajh, Mike Frank, Noah Goodman, and the members of the Stanford NLP group for helpful discussion and feedback. We also thank the anonymous TACL reviewers for their insightful feedback that helped improve the paper. We are also grateful to Yi Liao for providing code and models for u-PMLM. This work was supported by NSF grant #1947307 to RF. References Sanjeev Arora, Yingyu Liang, and Tengyu Ma. tough-to-beat baseline 2017. A simple but In ICLR 2017: for sentence embeddings. International Conference on Learning Repre- sentations 2017. A. Bernasconi. 1996. Sensitivity vs. block sen- sitivity (an average-case study). Information Processing Letters, 59(3):151–157. https:// doi.org/10.1016/0020-0190(96)00105-6 Yuan Cao, Zhiying Fang, Yue Wu, Ding-Xuan Zhou, and Quanquan Gu. 2019. Towards understanding the spectral bias of deep learning. arXiv preprint arXiv:1912.01198. Daniel M. Cer, Mona T. Diab, Eneko I˜nigo Lopez-Gazpio, and Lucia Agirre, Specia. 2017. Semeval-2017 task 1: Semantic textual similarity multilingual and crosslingual the focused evaluation. 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14. In Proceedings of Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2014. One billion word bench- mark for measuring progress in statistical INTERSPEECH, language modeling. pages 2635–2639. In Noam Chomsky. 1956. Three models for the description of language. IEEE Transactions on Information Theory, 2(3):113–124. https:// doi.org/10.1109/TIT.1956.1056813 Ido Dagan, Bill Dolan, Bernardo Magnini, and Dan Roth. 2009. Recognizing textual entail- ment: Rational, evaluation and approaches. Nat- ural Language Engineering, 15(4). https:// doi.org/10.1017/S1351324909990209 Debajyoti Datta, Shashwat Kumar, Laura E. Barnes, and Tom Fletcher. 2020. Geometry matters: Exploring language examples at the decision boundary. CoRR, abs/2010.07212. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT 2019: Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 4171–4186. William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sen- the tential paraphrases. Third International Workshop on Paraphras- ing, IWP@IJCNLP 2005, Jeju Island, Korea, October 2005, 2005. Asian Federation of Nat- ural Language Processing. In Proceedings of 903 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 0 3 1 9 5 7 7 0 7 / / t l a c _ a _ 0 0 4 0 3 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Kawin Ethayarajh. 2018. Unsupervised random walk sentence embeddings: A strong but simple baseline. In Proceedings of The Third Work- shop on Representation Learning for NLP, https://doi.org/10 pages 91–100. .18653/v1/W18-3012 Leonardo Franco. 2006. Generalization ability of boolean functions implemented in feedfor- ward neural networks. Neurocomputing, 70(1): https://doi.org/10.1016 351–361. /j.neucom.2006.01.025 Richard Futrell, Ethan Wilcox, Takashi Morita, Peng Qian, Miguel Ballesteros, and Roger Levy. 2019. Neural language models as psycho- linguistic subjects: Representations of syntactic state. 
In Proceedings of the 2019 Conference of the North American Chapter of the Asso- ciation for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 32–42, Minneapolis, Minnesota. Association for Computational Linguistics. https://doi.org/10.18653 /v1/N19-1004 Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hanna Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang, and Ben Zhou. 2020. Evaluating NLP models via contrast sets. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1307–1323, Online. Association for Computational Linguistics. Jon Gauthier, Jennifer Hu, Ethan Wilcox, Peng Qian, and Roger Levy. 2020. SyntaxGym: An online platform for targeted evaluation of language models. In Proceedings of the Association for Computational Linguistics: System Demonstrations (ACL 2020). Edward Gibson. 1998. Linguistic complexity: Locality of syntactic dependencies. Cognition, 68(1):1–76. https://doi.org/10.1016 /S0010-0277(98)00034-1 Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. 2018. Annotation artifacts in language inference data. In NAACL natural HLT 2018: 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, volume 2, pages 107–112. https://doi.org/10.18653/v1/N18 -2017 Michael Hahn. 2020. Theoretical limitations of self-attention in neural sequence models. Transactions of the Association for Compu- tational Linguistics, 8:156–171. https:// doi.org/10.1162/tacl a 00306 John T. Hale. 2001. A probabilistic Earley parser as a psycholinguistic model. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics Technologies, pages 1–8. https://doi.org/10.3115 /1073336.1073357 Language and Pooya Hatami, Raghav Kulkarni, and Denis Pankratov. 2010. Variations on the sensitivity conjecture. Theory of Computing, 4:1–27. Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780. Bill G. Horne and Don R. Hush. 1994. Bounds on the complexity of recurrent neural network implementations of finite state machines. In Advances in Neural Information Processing Systems, pages 359–366. language model Jeremy Howard and Sebastian Ruder. 2018. fine-tuning for Universal text classification. In ACL 2018: 56th Annual Meeting of the Association for Computa- tional Linguistics, volume 1, pages 328–339. https://doi.org/10.18653/v1/P18 -1031 Jennifer Hu, Sherry Y. Chen, and Roger P. Levy. 2020. A closer look at the performance of neural language models on reflexive anaphor licensing. Proceedings of the Society for Computation in Linguistics, 3(1):382–392. Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings 904 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 0 3 1 9 5 7 7 0 7 / / t l a c _ a _ 0 0 4 0 3 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168–177. Paloma Jeretic, Alex Warstadt, Suvrat Bhooshan, and Adina Williams. 2020. 
Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031. https://doi.org/10.18653/v1/D17-1215

Jeff Kahn, Gil Kalai, and Nathan Linial. 1988. The influence of variables on Boolean functions. In 29th Annual Symposium on Foundations of Computer Science, pages 68–80. https://doi.org/10.1109/SFCS.1988.21923

Divyansh Kaushik, Eduard Hovy, and Zachary Lipton. 2020. Learning the difference that makes a difference with counterfactually-augmented data. In ICLR 2020: Eighth International Conference on Learning Representations.

Divyansh Kaushik and Zachary C. Lipton. 2018. How much reading does reading comprehension require? A critical investigation of popular benchmarks. In EMNLP 2018: 2018 Conference on Empirical Methods in Natural Language Processing, pages 5010–5015. https://doi.org/10.18653/v1/D18-1546

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751. https://doi.org/10.3115/v1/D14-1181

Hector J. Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd schema challenge. In KR'12: Proceedings of the Thirteenth International Conference on Principles of Knowledge Representation and Reasoning, pages 552–561.

Ming Li and Paul Vitányi. 1993. An Introduction to Kolmogorov Complexity and its Applications. Springer.

Yi Liao, Xin Jiang, and Qun Liu. 2020. Probabilistically masked language model capable of autoregressive generation in arbitrary word order. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 263–274. https://doi.org/10.18653/v1/2020.acl-main.24

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535. https://doi.org/10.1162/tacl_a_00115

Rebecca Marvin and Tal Linzen. 2018. Targeted syntactic evaluation of language models. arXiv preprint arXiv:1808.09031. https://doi.org/10.18653/v1/D18-1151

William Merrill. 2019. Sequential neural networks as automata. arXiv preprint arXiv:1906.01615. https://doi.org/10.18653/v1/W19-3901

Marvin Minsky and Seymour A. Papert. 1969. Perceptrons: An Introduction to Computational Geometry. The MIT Press.

Nikita Nangia and Samuel R. Bowman. 2019. Human vs. muppet: A conservative estimate of human performance on the GLUE benchmark. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4566–4575. https://doi.org/10.18653/v1/P19-1449

Noam Nisan. 1991. CREW PRAMs and decision trees. SIAM Journal on Computing, 20(6):999–1007. https://doi.org/10.1137/0220062
Joakim Nivre, Marie-Catherine de Marneffe, Jan Hajic, Filip Ginter, Yoav Goldberg, Christopher D. Manning, Ryan T. McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal Dependencies v1: A multilingual treebank collection. In Tenth International Conference on Language Resources and Evaluation (LREC 2016).

Roman Novak, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. 2018. Sensitivity and generalization in neural networks: An empirical study. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net.

Ryan O'Donnell. 2014. Analysis of Boolean Functions. Cambridge University Press.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In NAACL-HLT 2019: Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 48–53.

Giacomo De Palma, Bobak Toussi Kiani, and Seth Lloyd. 2019. Random deep neural networks are biased towards simple functions. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 1962–1974.

Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL'04), Main Volume, pages 271–278. https://doi.org/10.3115/1218955.1218990

Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124. https://doi.org/10.3115/1219840.1219855

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543. https://doi.org/10.3115/v1/D14-1162

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In NAACL HLT 2018: 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, volume 1, pages 2227–2237. https://doi.org/10.18653/v1/N18-1202

Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal part-of-speech tagset. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), pages 2089–2096.

Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis only baselines in natural language inference. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 180–191. https://doi.org/10.18653/v1/S18-2023

Peng Qi, Timothy Dozat, Yuhao Zhang, and Christopher D. Manning. 2018. Universal dependency parsing from scratch. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 160–170.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, ACL 2020, Online, July 5-10, 2020, pages 101–108. Association for Computational Linguistics.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.

Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred Hamprecht, Yoshua Bengio, and Aaron Courville. 2019. On the spectral bias of neural networks. In ICML 2019: Thirty-sixth International Conference on Machine Learning, pages 5301–5310.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392. https://doi.org/10.18653/v1/D16-1264

Natalia Silveira, Timothy Dozat, Marie-Catherine de Marneffe, Samuel Bowman, Miriam Connor, John Bauer, and Chris Manning. 2014. A gold standard dependency corpus for English. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 2897–2904.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. 2014. Intriguing properties of neural networks. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings.

Guillermo Valle-Perez, Chico Q. Camargo, and Ard Louis. 2019. Deep learning generalizes because the parameter-function map is biased towards simple functions. In ICLR 2019: 7th International Conference on Learning Representations.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019a. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, pages 3266–3280.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In ICLR 2019: 7th International Conference on Learning Representations.

Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2019. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641. https://doi.org/10.1162/tacl_a_00290

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2):165–210.
John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016. Towards universal paraphrastic sentence embeddings. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL HLT 2018: 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, volume 1, pages 1112–1122. https://doi.org/10.18653/v1/N18-1101

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771. https://doi.org/10.18653/v1/2020.emnlp-demos.6

Zhi-Qin John Xu, Yaoyu Zhang, and Yanyang Xiao. 2019. Training behavior of deep neural network in frequency domain. In International Conference on Neural Information Processing, pages 264–274. Springer. https://doi.org/10.1007/978-3-030-36708-4_22

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In NeurIPS 2019: Thirty-third Conference on Neural Information Processing Systems, pages 5753–5763.