A Multi-Level Optimization Framework for End-to-End
Text Augmentation

Sai Ashish Somayajula
University of California San Diego, USA
ssomayaj@ucsd.edu

Linfeng Song
Tencent AI Lab, USA
lfsong@tencent.com

Pengtao Xie∗
University of California San Diego, USA
p1xie@eng.ucsd.edu

Abstract
Text augmentation is an effective technique for alleviating overfitting in NLP tasks. In existing methods, text augmentation and downstream tasks are mostly performed separately. As a result, the augmented texts may not be optimal for training the downstream model. To address this problem, we propose a three-level optimization framework that performs text augmentation and the downstream task end-to-end. The augmentation model is trained in a way tailored to the downstream task. Our framework consists of three learning stages. At the first stage, a text summarization model is trained to perform data augmentation. Each summarization example is associated with a weight to account for its domain difference with the text classification data. At the second stage, we use the model trained at the first stage to perform text augmentation and train a text classification model on the augmented texts. At the third stage, we evaluate the text classification model trained at the second stage and update the weights of the summarization examples by minimizing the validation loss. These three stages are performed end-to-end. We evaluate our method on several text classification datasets, where the results demonstrate the effectiveness of our method. Code is available at https://github.com/Sai-Ashish/End-to-End-Text-Augmentation.
1 Introduction
Data augmentation (Sennrich et al., 2015; Fadaee et al., 2017; Wei and Zou, 2019) is an effective technique for mitigating the deficiency of training data and preventing overfitting. In natural language processing, many data augmentation methods have been proposed, such as back translation (Sennrich et al., 2015), synonym replacement (Wang and Yang, 2015), random insertion (Wei and Zou, 2019), and so on. In existing approaches,

∗Corresponding author
data augmentation and downstream tasks are performed separately: Augmented texts are created first, then they are used to train a downstream model. The downstream task does not influence the augmentation process. As a result, the augmented texts may not be optimal for training the downstream model.
In this paper, we aim to address this problem. We propose an end-to-end learning framework based on multi-level optimization (Feurer et al., 2015), which performs data augmentation and downstream tasks in a unified manner where not only do the augmented texts influence the training of the downstream model, but the performance of the downstream task also affects how data augmentation is performed.

In our framework, we use a text summarization (Gambhir and Gupta, 2017) model to perform data augmentation. Given an original text t with class label c, we feed t into the summarization model to generate a summary s. We set the class label of s to be c. (s, c) is treated as an augmented text-label pair of (t, c). The motivation for using a summarization model for text augmentation is two-fold. First, the major semantics of an original text is preserved in its summary; therefore, it is sensible to assign the class label of the original text to its summary. Second, the summary excludes non-essential details in the original text; as a consequence, the semantic diversity between the summary and the original text is rich, which well serves the purpose of creating diverse augmentations. The summarization model is trained on a summarization dataset {(t_i, s_i)}_{i=1}^{M} where t_i is an original text and s_i is the corresponding summary.

For the downstream task, we assume it is text classification. We assume there is a text classification training set {(x_i^(tr), y_i^(tr))}_{i=1}^{N^(tr)} where x_i^(tr) is an input text and y_i^(tr) is the corresponding class label, and there is a text classification validation set {(x_i^(val), y_i^(val))}_{i=1}^{N^(val)}.
Our framework consists of three learning stages that are performed end-to-end. We train a text summarization model G on the summarization dataset at the first stage. Considering the domain difference between the summarization data and the text classification data, we associate each summarization training pair with a weight a ∈ [0, 1]. A smaller a indicates a larger domain difference between this summarization pair and the text classification data, and this pair should be down-weighted during the training of the summarization model. These weights are tentatively fixed at this stage and will be updated later. At the second stage, we use the trained summarization model to perform text augmentation for the classification dataset and train a text classification model on the augmented and original datasets. At the third stage, we validate the classification model trained at the second stage and update the weights of the summarization training examples by minimizing the validation loss. The three stages are performed end-to-end, where they mutually influence each other. We evaluate our framework on several text classification datasets. Various experiments demonstrate the effectiveness of our method.

The major contributions of this work include:

• We propose a three-level optimization framework to perform text augmentation in an end-to-end manner. Our framework consists of three learning stages that mutually influence each other: 1) training a text summarization model; 2) training a text classification model; 3) updating the weights of the summarization data by minimizing the validation loss of the classification model.

• Experiments on various datasets demonstrate the effectiveness of our framework.

The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 introduces the method. Section 4 gives experimental results. Section 5 concludes the paper.
2 Related Work

2.1 Data Augmentation in NLP

As an effective way of mitigating the deficiency of training data, data augmentation has been broadly studied in NLP (Feng et al., 2021). Sennrich et al. (2015) proposed a back translation method for data augmentation, which improves the BLEU (Papineni et al., 2002a) scores in machine translation (MT). The back translation technique first translates a sentence into another language and then translates it back into the original language to obtain an augmented text.

Fadaee et al. (2017) propose a data augmentation method for low-frequency words. Specifically, the method generates new sentence pairs that contain rare words. Kafle et al. (2017) introduce two data augmentation methods for visual question answering. The first method uses semantic annotations to augment the questions. The second technique generates new questions from images using an LSTM network (Hochreiter and Schmidhuber, 1997). Wang and Yang (2015) propose an augmentation technique that replaces query words with their synonyms. Synonyms are retrieved based on cosine similarities calculated on word embeddings. Kolomiyets et al. (2011) propose to augment data by replacing temporal expression words with their corresponding synonyms. They use the vocabulary from the Latent Words Language Model (LWLM) and from WordNet.

Şahin and Steedman (2019) propose two text augmentation techniques based on dependency trees. The first technique crops sentences by discarding dependency links. The second technique rotates sentences around tree fragments that are pivoted at the root. Chen et al. (2020) propose augmenting texts by interpolating input texts in a hidden space. Wang et al. (2018) propose augmenting sentences by randomly replacing words in input and target sentences with words from the vocabulary. SeqMix (Guo et al., 2020) proposes to create augmentations by softly merging input/target sequences.

EDA (Wei and Zou, 2019) uses four operations to produce data augmentation: synonym replacement, random insertion, random swap, and random deletion. Kobayashi (2018) proposes to replace words stochastically with words predicted by a bi-directional language model. Andreas (2020) proposes a compositional data augmentation approach that constructs a synthetic training example by replacing text fragments in a real example with other fragments appearing in similar contexts. Kumar et al. (2021) apply pretrained Transformer models including GPT-2, BERT, and BART for conditional data augmentation, where the concatenation of class labels and input texts is fed into
these pretrained models to generate augmented texts. Kumar et al. (2021) propose a language model-based data augmentation method. This approach first finetunes a language model on limited training data, then feeds class labels into the finetuned model to generate augmented sentences. Min et al. (2020) explore several syntactically informative augmentation methods by applying syntactic transformations to original sentences and showed that subject/object inversion could increase robustness to inference heuristics.

2.2 Bi-level Optimization

Many NLP applications (Feurer et al., 2015; Baydin et al., 2017; Finn et al., 2017; Liu et al., 2018; Shu et al., 2019; Zheng et al., 2019) are based on bi-level optimization (BLO), such as neural architecture search (Liu et al., 2018), data selection (Shu et al., 2019; Ren et al., 2020; Wang et al., 2020), meta learning (Finn et al., 2017), hyperparameter tuning (Feurer et al., 2015), label correction (Zheng et al., 2019), training data generation (Such et al., 2019), learning rate adaptation (Baydin et al., 2017), and so on. In these BLO-based applications, model parameters are learned by minimizing a training loss in an inner optimization problem, while meta parameters are learned by minimizing a validation loss in an outer optimization problem. In these applications, meta parameters are neural architectures, weights of data examples, hyperparameters, and so on.

3 Method

This section proposes a three-level optimization framework to perform end-to-end text augmentation.
Figure 1: Overview of our framework.

3.1 Overview

We assume the target task is text classification. We train a BERT-based (Devlin et al., 2018) text classifier on a training set D_c^(tr) = {(x_i^(tr), y_i^(tr))}_{i=1}^{N^(tr)} where x_i^(tr) is an input text and y_i^(tr) is the corresponding class label. Meanwhile, we have access to a classification validation set D_c^(val) = {(x_i^(val), y_i^(val))}_{i=1}^{N^(val)}. In many application scenarios, the training data is limited, which incurs a high risk of overfitting. To address this problem, we perform data augmentation of the training data to enlarge the number of training examples. We use a text summarization model to perform data augmentation. Given an original training pair (x_i^(tr), y_i^(tr)), we feed the input text x_i^(tr) into the text summarization model and get a summary s_i. Because s_i preserves the major semantics of x_i^(tr), we can assign the class label y_i^(tr) of x_i^(tr) to s_i. In the end, we obtain an augmented training pair (s_i, y_i^(tr)). This process can be applied to every original training example to create corresponding augmented training examples.

To enable the text summarization model and the text classifier to influence and benefit from each other mutually, we develop a three-level optimization framework to train these two models end-to-end. Our framework consists of three learning stages performed in a unified manner. At the first stage, we train the text summarization model. At the second stage, we use the summarization model trained at the first stage to perform text augmentation and train the classifier on the augmented examples. At the third stage, we evaluate the classifier on a validation set and update the weights of the summarization training examples by minimizing the validation loss. Figure 1 shows an overview of our framework. Next, we describe the three stages in detail.
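Concretely, the augmentation step above can be sketched as follows. This is a minimal illustration, assuming a Hugging Face distilled-BART checkpoint (sshleifer/distilbart-cnn-6-6) as the summarization model; the checkpoint name and generation settings are our assumptions, not prescribed by the paper.

```python
# Minimal sketch of mapping an original pair (x, y) to an augmented pair (s, y).
# The checkpoint name and beam-search settings are illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("sshleifer/distilbart-cnn-6-6")
summarizer = AutoModelForSeq2SeqLM.from_pretrained("sshleifer/distilbart-cnn-6-6")

def augment(x, y):
    """Summarize the input text x and let the summary inherit the label y."""
    inputs = tokenizer(x, truncation=True, max_length=1024, return_tensors="pt")
    summary_ids = summarizer.generate(**inputs, max_length=75, num_beams=4)
    s = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return s, y

augmented_pair = augment("A long movie review goes here ...", 1)
```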
3.2 Stage I

At the first stage, we train the text summarization model. We use BART (Lewis et al., 2019) to perform summarization. BART is a pretrained Transformer (Vaswani et al., 2017) model consisting of an encoder and a decoder. The encoder takes a text as input, and the decoder generates a summary of the text. Let S denote the summarization model. The training data is D_s = {(t_i, s_i)}_{i=1}^{M} where t_i is an input text and s_i is the corresponding summary. Often, the summarization dataset D_s has a domain shift with the classification dataset D_c. For an example in D_s, if its domain difference
with the classification dataset is large, the summarization model trained on this example may not be suitable for performing data augmentation for D_c. To address this problem, we associate each example in D_s with a weight a ∈ [0, 1]. If a is close to 0, it means that the domain difference between this example and the classification data D_c is large, and this example should be down-weighted during the training of the summarization model.

At this stage, we solve the following optimization problem:

S∗(A) = min_S Σ_{i=1}^{M} a_i l(S, t_i, s_i)    (1)

where A = {a_i}_{i=1}^{M} and l(·) is the teacher-forcing loss. The loss of (t_i, s_i) is weighted by the weight a_i of this example. If (t_i, s_i) has a large domain difference with D_c, a_i should be close to 0; then a_i l(S, t_i, s_i) is made close to 0, which effectively excludes (t_i, s_i) from the training process. The optimally trained model S∗ depends on A since S∗ depends on the loss function, and the loss function depends on A. A is tentatively fixed at this stage and will be updated at a later stage. A cannot be updated at this stage; otherwise, a trivial solution would be yielded where all values in A are 0.
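As an illustration of the weighted objective in Eq. (1), the following PyTorch sketch computes a per-example teacher-forcing loss and scales it by the example weights. The sigmoid reparameterization that keeps a in [0, 1] and the idx field carrying example indices are our assumptions, not details from the paper.

```python
# Sketch of the weighted teacher-forcing loss in Eq. (1), assuming a Hugging
# Face-style seq2seq model whose logits align with the label positions.
import torch
import torch.nn.functional as F

def weighted_summarization_loss(summarizer, batch, raw_weights):
    outputs = summarizer(input_ids=batch["input_ids"],
                         attention_mask=batch["attention_mask"],
                         labels=batch["labels"])                 # labels: gold summary ids
    per_token = F.cross_entropy(outputs.logits.transpose(1, 2),  # (B, V, T)
                                batch["labels"], ignore_index=-100,
                                reduction="none")                # (B, T)
    valid = (batch["labels"] != -100).sum(dim=1).clamp(min=1)
    per_example = per_token.sum(dim=1) / valid                   # l(S, t_i, s_i)
    a = torch.sigmoid(raw_weights[batch["idx"]])                 # a_i in [0, 1] (assumed parameterization)
    return (a * per_example).mean()                              # weighted loss of Eq. (1)
```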
3.3 Stage II

At the second stage, we use the summarization model S∗(A) trained at the first stage to perform data augmentation of D_c^(tr) = {(x_i^(tr), y_i^(tr))}_{i=1}^{N^(tr)}. For each x_i^(tr) in D_c^(tr), we feed it into S∗(A) to generate a summary g(x_i^(tr), S∗(A)). In the end, we obtain an augmented dataset G(D_c^(tr), S∗(A)) = {(g(x_i^(tr), S∗(A)), y_i^(tr))}_{i=1}^{N^(tr)}. We train a BERT-based text classifier C on the original data D_c^(tr) and the augmented data G(D_c^(tr), S∗(A)).

At this stage, we solve the following optimization problem:

C∗(S∗(A)) = min_C L(C, D_c^(tr)) + γ L(C, G(D_c^(tr), S∗(A)))    (2)

where L(·) denotes a cross-entropy classification loss and γ is a tradeoff parameter. The first loss term is defined on the original training dataset, and the second loss term is defined on the augmented training dataset. The optimally trained classifier C∗ depends on S∗(A) since C∗ depends on the training loss, which depends on S∗(A).
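A minimal sketch of the Stage II objective in Eq. (2), assuming a classifier module that maps token ids to class logits; the variable names are illustrative.

```python
# Cross-entropy on the original batch plus gamma times cross-entropy on the
# corresponding augmented batch, as in Eq. (2).
import torch.nn.functional as F

def stage2_loss(classifier, orig_batch, aug_batch, gamma=1.0):
    loss_orig = F.cross_entropy(classifier(orig_batch["input_ids"]),
                                orig_batch["labels"])
    loss_aug = F.cross_entropy(classifier(aug_batch["input_ids"]),
                               aug_batch["labels"])   # augmented texts inherit labels
    return loss_orig + gamma * loss_aug
```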
3.4 Stage III

At the third stage, we evaluate the classifier trained at the second stage on the classification validation set D_c^(val) = {(x_i^(val), y_i^(val))}_{i=1}^{N^(val)} and update the weights A by minimizing the validation loss. At this stage, we solve the following optimization problem:

min_A L(C∗(S∗(A)), D_c^(val))    (3)
3.5 A Three-Level Optimization Framework

Putting all pieces together, we have the following three-level optimization framework:

min_A  L(C∗(S∗(A)), D_c^(val))
s.t.   C∗(S∗(A)) = min_C L(C, D_c^(tr)) + γ L(C, G(D_c^(tr), S∗(A)))
       S∗(A) = min_S Σ_{i=1}^{M} a_i l(S, t_i, s_i)    (4)

There are three optimization problems in this framework, each corresponding to a learning stage. From bottom to top, the optimization problems correspond to learning stages I, II, and III, respectively. The first two optimization problems are nested, as constraints of the third optimization problem. These three stages are conducted end-to-end in this unified framework. The solution S∗(A) obtained in the first stage is used to perform text augmentation in the second stage. The classification model trained in the second stage is used to make predictions at the third stage. The importance weights A updated in the third stage change the training loss in the first stage and consequently change the solution S∗(A), which subsequently changes C∗(S∗(A)).
3.6 Optimization Algorithm

In this section, we develop a gradient-based optimization algorithm to solve the problem defined in Eq. (4). Drawing inspiration from Liu et al. (2018), we approximate S∗(A) using a one-step gradient descent update of S:

S∗(A) ≈ S′ = S − η_s ∇_S Σ_{i=1}^{M} a_i l(S, t_i, s_i)    (5)
We plug S∗(A) ≈ S′ into the objective function at the second stage and get an approximate objective. We approximate C∗(S∗(A)) using a one-step gradient descent update of C:

C∗(A) ≈ C′ = C − η_c ∇_C ( L(C, D_c^(tr)) + γ L(C, G(D_c^(tr), S′)) )    (6)

Finally, we plug C∗(A) ≈ C′ into the validation loss and get an approximated objective. Then we update A by gradient descent:

A ← A − η_a ∇_A L(C′, D_c^(val))    (7)

where

∇_A L(C′, D_c^(val)) = (∂S′/∂A)(∂C′/∂S′)(∂L(C′, D_c^(val))/∂C′)
                     = η_s η_c γ ∇²_{A,S} ( Σ_{i=1}^{M} a_i l(S, t_i, s_i) ) ∇²_{S′,C} L(C, G(D_c^(tr), S′)) ∇_{C′} L(C′, D_c^(val))    (8)

Eq. (8) involves an expensive matrix-vector product, whose computational complexity can be reduced by a finite difference approximation. Eq. (8) can be approximated as:

≈ (η_s η_c γ / 2α) ∇²_{A,S} ( Σ_{i=1}^{M} a_i l(S, t_i, s_i) ) [ ∇_{S′} L(C+, G(D_c^(tr), S′)) − ∇_{S′} L(C−, G(D_c^(tr), S′)) ]    (9)

where

α = 0.01 / ||∇_{C′} L(C′, D_c^(val))||_2,    C± = C ± α ∇_{C′} L(C′, D_c^(val))

The matrix-vector multiplication in Eq. (9) can be further approximated by:

(1/α_S^±) { ∇_A Σ_{i=1}^{M} a_i l(S^+_±, t_i, s_i) − ∇_A Σ_{i=1}^{M} a_i l(S^-_±, t_i, s_i) }    (10)

where

α_S^± = 0.01 / ||∇_{S′} L(C±, G(D_c^(tr), S′))||_2,
S^+_± = S ± α_S^+ ∇_{S′} L(C+, G(D_c^(tr), S′)),    S^-_± = S ± α_S^- ∇_{S′} L(C−, G(D_c^(tr), S′))

These update steps iterate until convergence. The overall algorithm is summarized in Algorithm 1.

Algorithm 1 Optimization algorithm
while not converged do
    Update weight parameters S using Eq. (5)
    Update weight parameters C using Eq. (6)
    Update meta parameters A using Eq. (7)
end while
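The key computational trick in Eqs. (9)-(10) is a DARTS-style finite-difference approximation of a mixed second-derivative term times a vector. The sketch below illustrates the idea in PyTorch: perturb the summarization parameters along the vector, take gradients with respect to the weights A, and difference them. It is a generic illustration under these assumptions, not the paper's exact implementation.

```python
# Central-difference approximation of (d^2 loss / dA dS) @ vec, in the spirit of
# Eqs. (9)-(10); loss_fn re-evaluates the weighted summarization loss.
import torch

def mixed_hvp(loss_fn, S_params, A, vec, eps=0.01):
    alpha = eps / (torch.cat([v.flatten() for v in vec]).norm() + 1e-12)

    def grad_A_at(shift):
        with torch.no_grad():
            for p, v in zip(S_params, vec):
                p.add_(shift * v)                    # S <- S + shift * v
        grads = torch.autograd.grad(loss_fn(), A)    # gradient w.r.t. the weights A
        with torch.no_grad():
            for p, v in zip(S_params, vec):
                p.sub_(shift * v)                    # restore S
        return grads

    g_plus, g_minus = grad_A_at(alpha), grad_A_at(-alpha)
    return [(gp - gm) / (2 * alpha) for gp, gm in zip(g_plus, g_minus)]
```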
Dataset          Train   Validation   Test
IMDB             25k     6.7k         6.7k
Yelp Review      10k     10k          38k
SST-2            10k     872          10k
Amazon Review    10k     10k          10k

Table 1: Split statistics of the classification datasets.
4 Experiments

In this section, we report experimental results.
The key takeaways include: 1) our proposed
end-to-end augmentation method performs bet-
ter than baselines that conduct augmentation and
classification separately; 2) our framework is ag-
nostic to the choices of classification models and
can be applied to improve a variety of classifiers;
3) our framework is particularly effective when
the number of training examples is small; 4) our
framework is more effective when the length of
input texts is large.
4.1 Datasets

For the text summarization data, we use the CNN-DailyMail dataset (See et al., 2017). It contains summaries of around 300k news articles. The full CNN-DailyMail training set is used for training the summarization model. We used four text classification datasets: 1) IMDB, containing movie reviews (Maas et al., 2011); 2) Yelp Review, a binary sentiment classification dataset (Zhang et al., 2015); 3) SST-2, the Stanford Sentiment Treebank (Socher et al., 2013); and 4) Amazon Review, a product review dataset from Amazon (McAuley and Leskovec, 2013). The split statistics of these datasets are summarized in Table 1.
4.2 Baseline
We compare our method with the following
baseline methods.
• No-Aug: No data augmentation is applied.

• EDA (Wei and Zou, 2019): Augmented sentences are created by randomly applying the following operations: synonym replacement, random insertion, random swap, and random deletion.

• GECA (Andreas, 2020): A compositional data augmentation approach that constructs a synthetic training example by replacing text fragments in a real example with other fragments appearing in similar contexts.

• LAMBADA (Kumar et al., 2021): A language model based data augmentation method. This approach first finetunes a language model on real training data, then feeds class labels into the finetuned model to generate augmented sentences.

• Sum-Sep: Summarization-based augmentation and text classification are performed separately. We first train a summarization model, use it to perform data augmentation, and then train the classification model on the augmented training data.

• Multi-task learning (MTL): In this baseline, the summarization model and classification model are trained by minimizing a single objective function, which is the weighted sum of the summarization and classification losses. The corresponding formulation is:

  min_A  L(C∗(A), D_c^(val))
  s.t.   C∗(A) = min_{C,S} L(C, D_c^(tr)) + γ L(C, G(D_c^(tr), S)) + λ Σ_{i=1}^{M} a_i l(S, t_i, s_i)
4.3 Hyperparameter Settings

For the text classification model, we use the one in EDA (Wei and Zou, 2019). It contains an input layer, a bi-directional hidden layer with 64 LSTM (Hochreiter and Schmidhuber, 1997) units, a dropout layer with a probability of 0.5, another bi-directional layer with 32 LSTM units, another dropout layer with a probability of 0.5, ReLU activation, a dense layer with 20 hidden units, and a softmax output layer. We set the maximum text length to 150. The loss function is cross-entropy loss. Model parameters are optimized using the Adam (Kingma and Ba, 2014) optimizer, with an epsilon of 10−8. In Adam, β1 and β2 are set to 0.9 and 0.999, respectively. The learning rate is a constant 10−3. The batch size is 8. For the importance weights A of the summarization data examples, we optimize them using an Adam optimizer with a weight decay of 10−3. Epsilon, β1, and β2 are set to 10−8, 0.5, and 0.999, respectively. The learning rate is 3 × 10−4. The tradeoff parameter γ is set to 1 for all experiments, unless otherwise stated.
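For concreteness, a PyTorch rendering of this classifier could look as follows; the embedding dimension and vocabulary handling are not specified in the text and are our assumptions.

```python
# Sketch of the EDA-style classifier described above: two bi-directional LSTM
# layers (64 and 32 units) with dropout 0.5, a 20-unit dense layer, and a
# softmax output (applied inside the cross-entropy loss).
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm1 = nn.LSTM(embed_dim, 64, bidirectional=True, batch_first=True)
        self.drop1 = nn.Dropout(0.5)
        self.lstm2 = nn.LSTM(128, 32, bidirectional=True, batch_first=True)
        self.drop2 = nn.Dropout(0.5)
        self.dense = nn.Linear(64, 20)
        self.out = nn.Linear(20, num_classes)

    def forward(self, token_ids):                    # (B, T) with T <= 150
        h = self.embed(token_ids)
        h, _ = self.lstm1(h)
        h, _ = self.lstm2(self.drop1(h))
        h = self.drop2(h)[:, -1]                     # last time step of the 2nd BiLSTM
        return self.out(torch.relu(self.dense(h)))   # class logits
```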
For the summarization model, we use the distilled BART model (Shleifer and Rush, 2020). It has six encoder layers and six decoder layers. We set the maximum text length to 1024 for the article and 75 for the summary. We use an SGD optimizer with a momentum of 0.9 and a learning rate of 10−3. We use a cosine annealing learning rate scheduler with the minimum learning rate set to 5 × 10−4. We randomly sample a subset of CNN/DailyMail to train the summarization model. We balance the CNN/DailyMail data and the classification data by making the number of sampled CNN/DailyMail examples roughly the same as the number of classification data examples.
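The optimizer settings above can be collected into a small helper; the classifier, summarizer, and weight tensor A are assumed to be the objects from the earlier sketches, and the cosine-annealing horizon (T_max) is not stated in the text, so its value here is an assumption.

```python
# Optimizers and scheduler mirroring the listed hyperparameters.
import torch

def build_optimizers(classifier, summarizer, A, num_steps=1000):
    opt_C = torch.optim.Adam(classifier.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), eps=1e-8)
    opt_A = torch.optim.Adam([A], lr=3e-4, betas=(0.5, 0.999),
                             eps=1e-8, weight_decay=1e-3)
    opt_S = torch.optim.SGD(summarizer.parameters(), lr=1e-3, momentum=0.9)
    sched_S = torch.optim.lr_scheduler.CosineAnnealingLR(opt_S, T_max=num_steps,
                                                         eta_min=5e-4)
    return opt_C, opt_A, opt_S, sched_S
```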
Accuracy is used as the evaluation metric. 每个
experiment runs five times with random initializa-
tions. We report the mean and standard deviation
of the results. The experiments are conducted on
1080Ti GPUs.
4.4 Main Results
Following Wei and Zou (2019), for each classifi-
cation dataset, we randomly sample x percentage
of training data for model training. Tables 2, 3, 4,
and 5 show the average accuracies on the Yelp
Review, IMDB, Amazon Review, and SST-2
datasets, under different percentages.
From these tables, we make the following ob-
servations. First, our method works better than
Sum-Sep. The reason is that our method performs
the summarization-based augmentation and text
classification in an end-to-end framework while
Sum-Sep performs these two tasks separately.
In our end-to-end framework, the summarization
and classification models mutually influence each
其他. The performance of the classification model
guides the training of the summarization model.
If the classification accuracy is not high, it indi-
cates that the augmented data generated by the
summarization model is not useful. When this
Perc.   No-Aug        EDA           GECA          LAMBADA       Sum-Sep       MTL           Ours
5%      63.45±4.21    68.45±2.49    69.72±1.72    69.30±1.38    72.53±0.44    72.94±0.52    74.58±0.37
10%     68.15±3.77    74.46±1.82    74.91±2.07    75.03±1.45    76.73±0.57    76.35±0.51    78.62±0.73
20%     72.43±3.16    75.07±4.22    78.29±0.95    76.25±2.57    81.29±0.09    80.47±0.25    81.66±0.13
50%     81.78±1.76    81.57±2.44    83.61±0.93    82.18±1.82    84.90±0.19    84.29±0.22    85.55±0.15
75%     84.52±1.55    82.70±2.30    85.02±0.94    83.71±0.71    86.18±0.25    86.07±0.28    87.10±0.31
100%    85.42±1.52    84.66±1.36    86.45±0.26    85.08±0.49    87.46±0.10    87.15±0.13    87.79±0.07

Table 2: Classification accuracy (%) on Yelp Review. Perc. denotes percentage.
Perc.   No-Aug        EDA           GECA          LAMBADA       Sum-Sep       MTL           Ours
5%      52.86±3.22    61.42±1.75    61.58±0.94    61.03±1.46    63.74±0.27    63.27±0.41    64.79±0.32
10%     56.76±3.38    64.06±1.92    63.27±2.76    64.95±1.83    68.42±0.11    67.92±0.14    68.75±0.08
20%     59.54±1.68    67.18±3.20    65.36±0.83    67.61±1.94    69.82±0.61    68.47±0.69    72.04±0.52
50%     66.90±1.98    71.67±1.33    72.89±0.58    71.52±1.57    72.86±0.31    72.40±0.19    73.98±0.25
75%     72.25±1.05    74.23±0.72    73.69±0.55    74.02±0.73    74.78±0.25    74.63±0.41    75.78±0.37
100%    73.96±0.85    75.75±0.27    75.38±0.44    76.45±0.03    76.70±0.08    76.11±0.09    77.00±0.05

Table 3: Classification accuracy (%) on IMDB.
Perc.   No-Aug        EDA           GECA          LAMBADA       Sum-Sep       MTL           Ours
5%      63.46±1.51    62.59±2.35    64.82±0.93    63.21±1.07    65.44±0.31    64.81±0.69    67.53±0.42
10%     66.03±1.44    66.20±2.91    68.49±0.82    66.15±1.35    69.76±0.21    69.33±0.26    70.84±0.11
20%     68.04±2.26    68.36±1.05    69.04±1.94    70.96±2.45    72.64±0.58    72.51±0.99    75.28±0.73
50%     76.03±0.75    75.39±1.69    77.02±0.50    76.31±0.37    77.07±0.88    77.05±0.19    78.80±0.62
75%     77.40±1.09    76.30±1.82    78.52±0.52    77.02±0.81    78.78±0.52    78.93±0.37    80.46±0.62
100%    78.36±1.02    78.13±1.38    80.16±0.32    78.94±0.69    81.43±0.12    81.22±0.33    82.01±0.25

Table 4: Classification accuracy (%) on Amazon Review.
Perc.   No-Aug        EDA           GECA          LAMBADA       Sum-Sep       MTL           Ours
5%      57.22±2.07    63.42±1.24    58.44±0.57    59.31±0.99    59.05±0.91    60.51±0.32    63.76±0.85
10%     62.04±1.44    66.86±0.73    64.92±0.49    65.79±0.75    65.15±0.37    64.86±0.66    68.23±0.51
20%     64.45±1.99    68.00±0.49    67.48±0.55    66.12±0.87    67.09±1.27    67.84±0.49    69.61±0.21
50%     71.90±1.51    74.77±0.39    72.40±0.55    73.35±0.64    72.25±0.27    74.02±0.44    75.23±0.26
75%     73.74±0.59    75.80±0.81    75.03±0.33    74.29±0.28    75.34±0.07    74.25±0.10    76.00±0.14
100%    77.75±1.49    77.64±0.30    77.10±0.42    75.41±0.85    75.80±0.21    75.39±0.43    77.92±0.05

Table 5: Classification accuracy (%) on SST-2, γ is set to 0.5 for 5% and 20%.
occurs, the summarization model will adjust its network weights and data weights to generate useful augmentations in the next round of learning. In contrast, in Sum-Sep, such a feedback loop (from classification to summarization) does not exist. Therefore, its performance is inferior.

Second, our method works better than MTL. In MTL, the summarization model and classification model are trained simultaneously by minimizing a single objective function, which is the weighted sum of the summarization loss and classification loss. This incurs a competition between these two tasks: seeking a larger decrease of the loss of one task leads to a smaller decrease of the loss of the other task. Our method avoids task competition by performing these two tasks sequentially and minimizing two different objective functions. The summarization model is trained by minimizing the summarization loss. Then the classification model is trained by minimizing the classification loss. In this way, there is no competition between these two tasks. Though performed in two stages, these two tasks can still mutually influence each other in our end-to-end framework.

Third, our method performs better than No-Aug. This shows that the augmented data generated by our method has good utility for model training.

Fourth, the improvement of our method over baselines is more prominent when the number of training examples is smaller (i.e., when the
Perc.   Ours          EDA           No-Aug
5%      48.36±3.72    40.16±1.94    44.40±5.26
20%     64.24±2.14    51.48±2.44    58.00±4.10
100%    72.36±1.29    57.96±1.21    69.52±2.43

Table 6: Classification accuracy (%) on TREC.
percentage is small). Under the 5% percentage, we observe an improvement of 11.13%, 11.9%, 4.07%, and 6.59% on the Yelp, IMDB, Amazon, and SST-2 datasets, respectively, over No-Aug. When the training dataset size is smaller, the necessity of data augmentation is more significant. As the percentage increases, the impact of data augmentation decreases. For the 100% percentage, we observe an improvement of 2.37%, 3.04%, and 3.65% on Yelp, IMDB, and Amazon, respectively, over No-Aug.

Fifth, our method works better than EDA, GECA, and LAMBADA. Again, the reason is that our method performs data augmentation and classification end-to-end while these baselines perform them separately. Compared to EDA, under the 5% percentage, we observe an accuracy gain of 6.13%, 3.37%, and 4.94% on Yelp, IMDB, and Amazon, respectively. EDA augments each input sentence to produce 8-16 new sentences. Our method achieves a better accuracy (on Yelp, IMDB, and Amazon) or a similar accuracy (on SST-2) compared with EDA, with just one augmentation per input sentence.

EDA uses simple and heuristic rules to generate augmented sentences, which may be noisy and lack semantic meaningfulness. These noisy augmentations may render the classification model trained on them worse. To verify this, we performed experiments on the TREC (Li and Roth, 2002; Hovy et al., 2001) dataset (which is split into a train/validation/test set with 3000, 2000, and 500 examples, respectively). Table 6 shows the results. As can be seen, with EDA as augmentation, the classification performance becomes much worse (compared with No-Aug). In contrast, our method trains a summarization model to generate semantically meaningful augmentations and performs much better than EDA.
Sixth, our method is more effective on long
文本. 例如, our framework outperforms
EDA under all percentages on datasets where
the input texts are relatively long, including Yelp,
IMDB, and Amazon (the average number of words
Original sentence: This review is for the Kindle edition of what is supposedly the Christopher Gill translation of Plato's Symposium. However, it turns out if you download it that it is the same as the Benjamin Jowett translation which is available for free, however here you have to pay upwards of $8 for it. I also checked Penguin Classics web site and they do not indicate that a eBook version of this book is available. So be careful when purchasing the Kindle edition. I had to return my purchase for this Kindle book.
Augmentation: Critically reviewed the Christopher Gill translation of Plato's Symposium. An online version of the book is currently available for free. However, if you download it it is the same as the Benjamin Jowett translation. In contrast, you pay upwards of $8.

Original sentence: My issue is not with anything Tom Robbins wrote, but with narrator Barret Whitener: a white male who opts to read all the dialogue from Japanese characters in a bad, generic ''Asian'' accent. It's unbearable to listen to. Furthermore, Whitener can't even pronounce the simplest of Japanese words, like ''sake.'' How many more years must transpire before an Asian person can have a voice without being mocked?
Augmentation: The book is written by a white male who reads it in a bad, generic ''Asian'' accent. He can't even pronounce the simplest of Japanese words, like ''sake'' Movie is currently being played by Playmakers' weekly series .series

Table 7: Two augments generated by our method for Amazon Review.
per input text in Yelp, IMDB, and Amazon are 133, 234, and 75, respectively). On SST-2, where the texts are relatively short, our method performs marginally better than EDA under certain percentages. In CNN-DailyMail, which is used for training the summarization model, the input texts are also long. Hence, the summarization model trained on CNN-DailyMail can generate good-quality summaries for Yelp, IMDB, and Amazon. The generated summaries preserve the major information in the input texts, discard redundant information, and align well with the original sentences semantically. Table 7 shows some examples. In this case, it is meaningful for the summary to inherit the class label of the original sentence, and the resulting (augmentation, label) pairs are useful for training the classification model. In contrast, in SST-2, the average number of words per input text in the training data is just 10. When a short sentence is fed into the summarization model, the model tends to perform open text generation by treating the input short sentence as a prompt. The generated text may deviate from the original sentence in semantics. Table 8 shows some examples. In this case, assigning the class label of the original sentence to the summary introduces labeling noise. Therefore, the improvement of our method over baselines may not be substantial.
Original sentence: An object lesson in period filmmaking.
Augmentation: An object lesson in period filmmaking is a time-saving technique. The film was shot by the crew of the late 1960s and early 1980s. It's been produced since then, with some scenes being filmed on location . Several years later, it became an important part of this type of cinema.

Original sentence: Mean-spirited lashing.
Augmentation: Critically ill-tempered lashing. ''I feel like I'm going to get out of my own way,'' says one person. An angry person lashed out at him for being unprovoked and hateful. demonstrated anger over the use of alcohol.

Table 8: Two augmentations generated by our method for SST-2.

Dataset          Ours          Aug-Only
IMDB             68.75±0.08    66.69±0.37
Yelp Review      78.62±0.73    77.94±0.91
Amazon Review    70.84±0.11    70.14±0.44
SST-2            68.23±0.51    66.17±0.42

Table 9: Comparison with Aug-Only, on 10% of the classification datasets.
We performed statistical significance tests using double-sided t-tests. In most cases, the p-values of our method against baselines are less than 1e−3. This shows that our method is significantly better than baselines in general.
4.5 Ablation Studies
To evaluate the effectiveness of our proposed
方法, we compare with the following ablation
settings.
• Aug-Only. We train the text classification model on augmented texts only, without using the original classification data, which amounts to solving the following problem:

  min_A  L(C∗(S∗(A)), D_c^(val))
  s.t.   C∗(S∗(A)) = min_C L(C, G(D_c^(tr), S∗(A)))
         S∗(A) = min_S Σ_{i=1}^{M} a_i l(S, t_i, s_i)    (11)
• Ablation on classification models. In this study, we investigate whether our framework is effective for other classifiers, including CNN and RoBERTa (Liu et al., 2019). Following Wei and Zou (2019), the CNN model consists of an input layer, a 1D convolutional layer with 128 filters (kernel size 5), a global 1D max-pooling layer, a ReLU activation function, a dense layer (with 20 hidden units), and a softmax output layer; a sketch is given after this list. For RoBERTa, we use a batch size of 16 and a learning rate of 2 × 10−6 to train the model. The maximum classification text length is set to 128. γ is set to 1 for all datasets. The rest of the hyperparameters remain the same.
• Ablation on γ. In this study, we investi-
gate how the test accuracy is affected by the
hyperparameter γ.
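A PyTorch sketch of the CNN classifier referenced in the first ablation item is given below; the embedding dimension and the exact placement of the ReLU activations are our assumptions.

```python
# CNN text classifier for the ablation: 128 convolutional filters of width 5,
# global 1D max-pooling, a 20-unit dense layer, and a softmax output
# (applied inside the cross-entropy loss).
import torch
import torch.nn as nn

class CNNClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, 128, kernel_size=5)
        self.dense = nn.Linear(128, 20)
        self.out = nn.Linear(20, num_classes)

    def forward(self, token_ids):                        # (B, T)
        h = self.embed(token_ids).transpose(1, 2)        # (B, embed_dim, T)
        h = torch.relu(self.conv(h))
        h = h.max(dim=2).values                          # global 1D max-pooling
        h = torch.relu(self.dense(h))
        return self.out(h)                               # class logits
```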
Dataset   Ours          EDA           No-Aug
IMDB      69.84±0.15    69.33±0.08    65.57±0.52
Yelp      78.95±0.69    77.25±0.56    76.85±0.90
Amazon    73.46±0.22    72.86±0.48    70.62±0.79
SST-2     67.32±0.18    66.74±0.05    62.38±0.44

Table 10: Comparison of our method with EDA and No-Aug, with CNN as the text classifier, trained on 10% of the classification datasets.
Dataset   Ours          EDA           No-Aug
IMDB      78.38±0.19    77.26±3.31    77.33±0.62
Yelp      88.25±0.28    85.30±0.44    87.13±0.14
Amazon    83.85±0.34    81.36±0.23    82.54±0.24
SST-2     76.34±0.34    74.38±2.18    75.76±1.09

Table 11: Comparison of our method with EDA and No-Aug, with CNN as the text classifier, trained on 100% of the classification datasets.
Table 9 compares our method with Aug-Only. As can be seen, our full method outperforms Aug-Only. In Aug-Only, only augmented texts train the classification model, without leveraging the original human-written texts. Because the augmented data is noisier than human-provided data, using augmented data only may lead to noisy classification models.

Tables 10 and 11 compare our method with EDA and No-Aug, with CNN as the text classifier, trained on 10% and 100% of the classification datasets, respectively. Tables 12 and 13 compare our method with EDA and No-Aug, with RoBERTa as the text classifier, trained on 10% and 100% of the classification datasets, respectively. As can be seen, with CNN and RoBERTa as classifiers, our method still outperforms EDA and No-Aug. This shows that our method is agnostic to text classifiers and can be leveraged to improve different text classifiers.

Figure 2 shows how test accuracy changes with γ, where the model is trained on 10% of IMDB. As can be seen, when γ increases from 0 to 0.5,
Dataset   Ours          EDA           No-Aug
IMDB      84.40±0.37    84.35±0.35    83.89±0.64
Yelp      91.12±0.29    90.38±0.39    90.23±0.39
Amazon    90.56±0.37    89.90±0.41    89.60±0.36
SST-2     87.73±0.86    87.10±0.83    87.57±0.74

Table 12: Comparison of our method with EDA and No-Aug, with RoBERTa as the text classifier, trained on 10% of the classification datasets.
Dataset   Ours          EDA           No-Aug
IMDB      89.06±0.07    88.59±0.10    88.50±0.14
Yelp      93.60±0.11    93.12±0.05    92.97±0.37
Amazon    92.71±0.04    92.33±0.06    92.20±0.14
SST-2     91.28±0.06    91.09±0.03    91.05±0.07

Table 13: Comparison of our method with EDA and No-Aug, with RoBERTa as the text classifier, trained on 100% of the classification datasets.
the classification accuracy increases. This is because a larger γ renders the classification model to be trained more by the augmented data. The augmented data provides additional training resources, which can mitigate the lack of original data. However, as γ continues to increase, the accuracy decreases. This is because an excessively large γ places too much emphasis on the augmented data and too little attention on the original data. Original data is less noisy than augmented data and therefore more valuable for training high-quality models. Similar results can be observed in Figure 3, where the model is trained on 10% of Amazon, SST-2, and Yelp.

We also perform an ablation study which replaces the summarization data with paraphrase data. The model S remains the same, which is still distill-BART. The paraphrase data contains 3,900 sentence pairs from the Microsoft Research Paraphrase Corpus (MRPC) (Dolan and Brockett, 2005). These pairs are labeled as paraphrases by humans. To ensure a fair comparison, we randomly sample 3,900 data examples from CNN-DailyMail, denoted by CNNDM-3900. Table 16 shows the results. As can be seen, CNNDM-3900 yields better performance than MRPC. This shows that using the summarization model for sentence augmentation is more effective than using the paraphrase model. The possible reason is that a
Figure 2: How test accuracy changes with γ, where the model is trained on 10% of IMDB.
summarization model discards less important information of the input text while a paraphrasing model preserves most information of the input text; consequently, augmentations generated by the summarization model are less noisy and have a larger semantic diversity relative to the input texts.

4.6 Analysis of Learned Weights of Summarization Data

Table 14 shows some randomly sampled summarization data examples whose weights learned by our framework are close to zero when the classification dataset is SST-2. Due to the space limit, we only show the summaries. As can be seen, these data are primarily about healthcare, energy, law, and politics, which have a large domain discrepancy with SST-2, which is about movie reviews. This shows that our framework is able to successfully identify out-of-domain summarization data and exclude them from training the summarization model.

Table 15 shows some randomly sampled summarization data examples whose weights learned by our framework are close to one when the classification dataset is SST-2. Due to the space limit, we only show the summaries. As can be seen, these summarization data are primarily about movies, recreation, and media, which are close to SST-2 in terms of domain similarity.

When the classification dataset is IMDB, the weights of examples in Table 14 are also close to zero, and the weights of examples in Table 15 are also close to one. This is because IMDB and the examples in Table 15 are about movies while the examples in Table 14 are not.

Given the learned weights, we split the summarization data into two subsets: ONE, which contains examples whose weights are > 0.5, and
Figure 3: How test accuracy changes with γ, where the model is trained on 10% of Amazon, SST-2, and Yelp.
Half of states will expand Medicaid under Obamacare; half
refuse or are on the fence. Low-income citizens and their
advocates say Medicaid expansion necessary. States like Texas,
Florida say a Medicaid expansion is costly and will fail. Politics
at play and most states will eventually expand the program,
political experts say.
The country plans to remove all subsidies in five years. The
price of gasoline goes up four-folds. Iran has trillions of dollars
in natural resources but still struggles, experts say. The changes
aim to dampen domestic demand for fuel and oil, and bolster
overall revenues.
Attorneys for three kidnapped women have ‘‘confidence and
faith’’ in prosecution. Indictment lists use of chains, tape,
vacuum cord against women. Castro also faces 139 counts of
rape, 177 counts of kidnapping, one aggravated murder charge.
A prosecutor’s committee will later consider whether seeking
death penalty is appropriate.
North Korea’s announcement of a planned satellite launch has
provoked alarm. Other countries say it is a way of testing
missile technology. The Japanese defense minister orders the
preparation of missile defenses.
Table 14: Some summarization data examples
whose weights learned by our framework are
close to 0. The classification dataset is SST-2.
Hollywood often misrepresents careers and the workplace in
电影. Work is portrayed as easy or non-existent, and any
outfit is appropriate. This can have a negative effect on younger
generations deciding on careers.
Hollywood bringing back box-office juggernauts Iron Man,
Batman and Spider-Man. 2012 may be a peak year for super-
heroes on film, with much more to come. Writer-director: DC
Comics characters often face more challenges than Marvel.
’‘‘The genre has been popular for decades and is here to stay,’’
Boxofficeguru.com editor says.
Sam Smith wins best new artist, best pop vocal album. Beyonce
now has 20 Grammys, passing Aretha Franklin.
Ted Turner turns 75 years old this month. He founded CNN,
the first 24-hour cable news network, in 1980. In 1990, Turner
hired Wolf Blitzer, host of CNN’s ‘‘The Situation Room’’.
Blitzer reflects on what he learned from his former boss.
Table 15: Some summarization data examples
whose weights learned by our framework are
close to 1. The classification dataset is SST-2.
ZERO, which contains examples whose weights
are ≤ 0.5. For each subset, we use it to train
a summarization model (no reweighting of data
during training), use the summarization model
Data      No-Aug        Summarize     Paraphrase
Yelp      68.15±3.77    75.06±0.62    74.33±0.47
IMDB      56.76±3.38    64.94±0.15    64.13±0.07
Amazon    66.03±1.44    67.36±0.08    66.39±0.14
SST-2     62.04±1.44    67.44±0.27    66.72±0.11

Table 16: Classification accuracy (%) on 10% of different datasets, when using summarization data and paraphrase data to train the augmentation model.
Data      No-Aug        ONE           ZERO
Yelp      68.15±3.77    77.27±0.96    69.92±2.09
IMDB      56.76±3.38    66.94±0.23    57.85±0.41
Amazon    66.03±1.44    69.24±0.15    67.74±0.22
SST-2     62.04±1.44    67.50±0.34    63.69±0.46

Table 17: Classification accuracy (%) on 10% of different datasets, in the study of how summarization data weights affect downstream classification performance.
to generate augmentations, and train the classifi-
cation model on augmented data (together with
real data). Table 17 shows the results. As can be
seen, augmentations generated by the summariza-
tion model trained on ONE improve classification
performance in most cases. In contrast, augmen-
tations generated by the summarization model
trained on ZERO are not helpful.
Using In-domain Data to Train a Summariza-
tion Model. We conduct an experiment that uses
in-domain data to train the summarization model.
Specifically, we replace the CNN-Dailymail
dataset with a movie summarization (MovSum)
dataset crawled from IMDB. The MovSum dataset
contains 135K (movie synopsis, movie summary)
pairs where the summary (shorter) is treated as
a summarization of the synopsis (longer). Mov-
Sum is in the same domain as SST-2 since they
are both about movies. We randomly sample
135K examples from CNN-Dailymail (denoted
by CNNDM-135K) and train the summarization
Summarization Data    Classification Accuracy    Percentage of ones
MovSum                67.72±0.36                 84.5
CNNDM-135K            67.15±0.21                 39.3

Table 18: Results of our proposed method on SST-2 (10%) under different summarization datasets.
model on this subset for a fair comparison. Clas-
sification is conducted on SST-2 (using 10% of the
training examples).
Table 18 shows the results. As can be seen, using
MovSum to train the summarization model leads
to better classification performance on SST-2.
This is because, compared with CNNDM-135K,
MovSum has a higher domain similarity with
SST-2. A summarization model trained using
in-domain data can generate in-domain augmen-
tations, which are more suitable for training the
classification model. In addition, we measure the
percentage of ones in the learned weights of sum-
marization data examples. Under MovSum, this
percentage is larger. This is because more data
examples in MovSum are in the same domain as
SST-2, compared with CNNDM-135K.
4.7 Analysis of Generated Augmentations

We measure the diversity of generated augmentations. Two types of diversity measures are used: I) how different the generated augmentation is from the input text; and II) how different the sentences in a generated augmentation are from each other. We measure type-I diversity in the following way: Given an input text t and an augmentation s generated from t, we calculate the BLEU (Papineni et al., 2002b) score between s and t. We measure type-II diversity in the following way: for each pair of sentences in a generated augmentation, we calculate their BLEU score, then take the average over all pairs. In the BLEU score, the number of grams is set to 4.
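The two measures can be computed, for example, with NLTK's sentence-level BLEU; whitespace tokenization and the smoothing function are simplifications on our part.

```python
# Type-I: BLEU of an augmentation against its input text.
# Type-II: average pairwise BLEU between sentences of one augmentation.
from itertools import combinations
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

SMOOTH = SmoothingFunction().method1
W4 = (0.25, 0.25, 0.25, 0.25)                            # up to 4-grams

def type1_diversity(original_text, augmentation):
    return sentence_bleu([original_text.split()], augmentation.split(),
                         weights=W4, smoothing_function=SMOOTH)

def type2_diversity(augmentation_sentences):
    pairs = list(combinations(augmentation_sentences, 2))
    scores = [sentence_bleu([a.split()], b.split(), weights=W4,
                            smoothing_function=SMOOTH) for a, b in pairs]
    return sum(scores) / max(len(scores), 1)
```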
Table 19 shows the results. As can be seen, the BLEU scores are low for both types of diversity, which indicates that the augmentations generated by our method are diverse.
We also check whether generated augmenta-
tions are abstractive. We randomly sample 300
generated augmentations and ask 5 undergrad-
uates to manually annotate whether they are
abstractive (with score 1) or extractive (score 0).
Table 20 shows the results. The average score
Dataset   Type-I   Type-II
Yelp      0.15     0.07
IMDB      0.22     0.06
Amazon    0.11     0.04
SST-2     0.13     0.06

Table 19: Diversity of generated augmentations.
Dataset   Score
Yelp      0.83
IMDB      0.79
Amazon    0.88
SST-2     0.92

Table 20: Abstractiveness of generated augmentations.
Model                          Rouge-2 F1   Rouge-L F1
GPT-2 (Radford et al., 2019)    8.27         26.58
Sum-Sep                        16.06         34.07
MTL-SST2                       11.06         28.44
Ours-SST2                      12.77         30.27
MTL-IMDB                       10.79         29.20
Ours-IMDB                      13.06         30.72
MTL-Yelp                       12.15         29.43
Ours-Yelp                      12.97         30.62
MTL-Amazon                     11.73         29.16
Ours-Amazon                    13.13         30.71

Table 21: Evaluation of summarization models.
is close to 1, indicating that the augmentations
generated by our method are abstractive.
4.8 Performance of Summarization Models
We report the performance of the summarization
models in Table 21. For our method and MTL,
the reported scores are averages for different per-
centages of classification training data, where the classification model is LSTM. From this table, we make two observations. First, our method outperforms GPT-2 (Radford et al., 2019), a competitive model used for text summarization. This shows that our method can generate meaningful summaries. Second, our method performs worse than Sum-Sep. The reason is that in our method, the summarization model is trained for the sake of generating good augmentations for the classification model, while Sum-Sep is trained solely to maximize the summarization performance. Our
goal is not to generate the best summaries, but rather to generate the best augmentations for classification using the summarization model.

5 Conclusions and Discussion

In this paper, we propose a three-level optimization framework to perform text augmentation and classification end-to-end. Our framework consists of three learning stages performed end-to-end: 1) training a text summarization model; 2) training a text classification model; and 3) updating the weights of summarization examples by minimizing the validation loss of the classification model. Each learning stage corresponds to one level of the optimization problem in the framework. The three levels of optimization problems are nested and solved in a unified way. Our framework enables the augmentation process to be influenced by the performance of the text classification task so that the augmented texts are specifically suitable for training the classification model. Experiments on various datasets demonstrate the effectiveness of our method.

Our framework can be extended to other downstream tasks beyond classification, including but not limited to text-to-image generation, visual question answering, dialog generation, etc. To extend our framework to a downstream task T, we need to change the loss functions in Eq. (2) and Eq. (3) to the loss of task T. For example, to apply our framework to text-to-image generation, given each text-image pair (t, i), we perform augmentation of the input text t using the summarization model trained in Eq. (1) to get an augmented text ˆt. (ˆt, i) would be treated as an augmented data pair. Then we define GAN-based losses (Goodfellow et al., 2014) on each (ˆt, i) and each (t, i) to train a text-to-image generation model. We plan to study such extensions in future work.

Our current framework has the following limitations. First, it incurs additional computation and memory costs due to the usage of a summarization model. Second, currently, our method uses a text summarization model to generate augmentations. Publicly available summarization datasets are limited in size, limiting the augmentation model's quality. We plan to address these limitations in future work. To reduce memory and computation costs, we will perform parameter sharing where the encoder weights of the summarization model and those of the classification model are tied together. To address the second limitation, we plan to leverage data-rich text generation tasks for augmentation, such as using machine translation models to perform back translation.

Acknowledgments

This work is partially supported by a gift fund from CCF and Tencent. We also thank the editor and the anonymous reviewers for their valuable suggestions.
References

Jacob Andreas. 2020. Good-enough compositional data augmentation. https://doi.org/10.18653/v1/2020.acl-main.676

Atilim Gunes Baydin, Robert Cornish, David Martínez-Rubio, Mark Schmidt, and Frank D. Wood. 2017. Online learning rate adaptation with hypergradient descent. CoRR, abs/1703.04782.

Jiaao Chen, Zichao Yang, and Diyi Yang. 2020. MixText: Linguistically-informed interpolation of hidden space for semi-supervised text classification. arXiv preprint arXiv:2004.12239. https://doi.org/10.18653/v1/2020.acl-main.194

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).

Marzieh Fadaee, Arianna Bisazza, and Christof Monz. 2017. Data augmentation for low-resource neural machine translation. arXiv preprint arXiv:1705.00440. https://doi.org/10.18653/v1/P17-2090

Steven Y. Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, and Eduard Hovy. 2021. A survey of data
augmentation approaches for NLP. arXiv pre-
print arXiv:2105.03075. https://doi.org
/10.18653/v1/2021.findings-acl.84
Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Matthias Feurer, Jost Springenberg, and Frank Hutter. 2015. Initializing Bayesian hyperparameter optimization via meta-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29.

Sosuke Kobayashi. 2018. Contextual augmentation: Data augmentation by words with paradigmatic relations. arXiv preprint arXiv:1805.06201. https://doi.org/10.18653/v1/N18-2072
Chelsea Finn, Pieter Abbeel, and Sergey Levine.
2017. Model-agnostic meta-learning for fast
adaptation of deep networks. In Proceedings
of the 34th International Conference on Ma-
chine Learning-Volume 70, pages 1126–1135.
JMLR.org.
Mahak Gambhir and Vishal Gupta. 2017. Recent automatic text summarization techniques: A survey. Artificial Intelligence Review, 47(1):1–66. https://doi.org/10.1007/s10462-016-9475-9

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, 27.

Demi Guo, Yoon Kim, and Alexander M. Rush. 2020. Sequence-level mixed sample data augmentation. arXiv preprint arXiv:2011.09039. https://doi.org/10.18653/v1/2020.emnlp-main.447

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
Eduard Hovy, Laurie Gerber, Ulf Hermjakob, Chin-Yew Lin, and Deepak Ravichandran. 2001. Toward semantics-based answer pinpointing. In Proceedings of the First International Conference on Human Language Technology Research. https://doi.org/10.3115/1072133.1072221

Oleksandr Kolomiyets, Steven Bethard, and Marie-Francine Moens. 2011. Model-portability experiments for textual temporal analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, volume 2, pages 271–276. ACL; East Stroudsburg, PA.

Varun Kumar, Ashutosh Choudhary, and Eunah Cho. 2021. Data augmentation using pre-trained transformer models.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461. https://doi.org/10.18653/v1/2020.acl-main.703

Xin Li and Dan Roth. 2002. Learning question classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics. https://doi.org/10.3115/1072228.1072378
Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2018. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Kushal Kafle, Mohammed Yousefhussien, and Christopher Kanan. 2017. Data augmentation for visual question answering. In Proceedings of the 10th International Conference on Natural Language Generation, pages 198–202. https://doi.org/10.18653/v1/W17-3529

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland,
Oregon, USA. Association for Computational Linguistics.

Julian McAuley and Jure Leskovec. 2013. Hidden factors and hidden topics: Understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems, pages 165–172. https://doi.org/10.1145/2507157.2507163

Junghyun Min, R. Thomas McCoy, Dipanjan Das, Emily Pitler, and Tal Linzen. 2020. Syntactic data augmentation increases robustness to inference heuristics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002a. BLEU: A method for automatic evaluation of machine translation. In ACL. https://doi.org/10.3115/1073083.1073135
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002b. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics. https://doi.org/10.3115/1073083.1073135

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Zhongzheng Ren, Raymond Yeh, and Alexander Schwing. 2020. Not all unlabeled data are equal: Learning to weight data in semi-supervised learning. In Advances in Neural Information Processing Systems, volume 33, pages 21786–21797. Curran Associates, Inc.

Gözde Gül Şahin and Mark Steedman. 2019. Data augmentation via dependency tree morphing for low-resource languages. arXiv preprint arXiv:1903.09460. https://doi.org/10.18653/v1/D18-1545
Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709. https://doi.org/10.18653/v1/P16-1009

Sam Shleifer and Alexander M. Rush. 2020. Pre-trained summarization distillation. arXiv preprint arXiv:2010.13002.

Jun Shu, Qi Xie, Lixuan Yi, Qian Zhao, Sanping Zhou, Zongben Xu, and Deyu Meng. 2019. Meta-weight-net: Learning an explicit mapping for sample weighting. In Advances in Neural Information Processing Systems, pages 1919–1930.
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.

Felipe Petroski Such, Aditya Rawal, Joel Lehman, Kenneth O. Stanley, and Jeff Clune. 2019. Generative teaching networks: Accelerating neural architecture search by learning to generate synthetic training data. CoRR, abs/1912.07768.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

William Yang Wang and Diyi Yang. 2015. That's so annoying!!!: A lexical and frame-semantic embedding based data augmentation approach to automatic categorization of annoying behaviors using #petpeeve tweets. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2557–2563. https://doi.org/10.18653/v1/D15-1306

Xinyi Wang, Hieu Pham, Zihang Dai, and Graham Neubig. 2018. SwitchOut: An efficient data augmentation algorithm for neural machine translation. arXiv preprint arXiv:1808.07512.
https://doi.org/10.18653/v1/D18-1100

Yulin Wang, Jiayi Guo, Shiji Song, and Gao Huang. 2020. Meta-semi: A meta-learning approach for semi-supervised learning. CoRR, abs/2007.02394.

Jason Wei and Kai Zou. 2019. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. arXiv preprint arXiv:1901.11196. https://doi.org/10.18653/v1/D19-1670

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. arXiv:1509.01626 [cs].

Guoqing Zheng, Ahmed Hassan Awadallah, and Susan T. Dumais. 2019. Meta label correction for learning with weak supervision. CoRR, abs/1911.03809.