A Multi-Level Optimization Framework for End-to-End
Text Augmentation
Sai Ashish Somayajula
UC San Diego, USA
ssomayaj@ucsd.edu
Linfeng Song
Tencent AI Lab, USA
lfsong@tencent.com
Pengtao Xie∗
UC San Diego, USA
p1xie@eng.ucsd.edu
Abstract
Text augmentation is an effective technique
in alleviating overfitting in NLP tasks. In ex-
isting methods, text augmentation and down-
stream tasks are mostly performed separately.
As a result, the augmented texts may not be
optimal for training the downstream model. To
address this problem, we propose a three-level
optimization framework to perform text aug-
mentation and the downstream task end-to-
end. The augmentation model is trained in
a way tailored to the downstream task. Our
framework consists of three learning stages.
A text summarization model is trained to perform
data augmentation at the first stage. Each
summarization example is associated with a
weight to account for its domain difference
with the text classification data. At the second
stage, we use the model trained at the first
stage to perform text augmentation and train
a text classification model on the augmented
texts. At the third stage, we evaluate the text
classification model trained at the second stage
and update weights of summarization exam-
ples by minimizing the validation loss. These
three stages are performed end-to-end. We
evaluate our method on several text classifica-
tion datasets where the results demonstrate the
effectiveness of our method. Code is available
at https://github.com/Sai-Ashish/End-to-End-Text-Augmentation.
1 Introduction
Data augmentation (Sennrich et al., 2015; Fadaee
et al., 2017; Wei and Zou, 2019) is an effective
technique for mitigating the deficiency of
training data and preventing overfitting. In natu-
ral language processing, many data augmentation
methods have been proposed, such as back transla-
tion (Sennrich et al., 2015), synonym replacement
(Wang and Yang, 2015), random insertion (Wei
and Zou, 2019), and so on. In existing approaches,
∗Corresponding author
data augmentation and downstream tasks are per-
formed separately: Augmented texts are created
first, then they are used to train a downstream
model. The downstream task does not influence
the augmentation process. As a result, the augmented
texts may not be optimal for training the
downstream model.
In this paper, we aim to address this problem.
We propose an end-to-end learning framework
based on multi-level optimization (Feurer et al.,
2015), which performs data augmentation and
downstream tasks in a unified manner where not
only augmented texts influence the training of the
downstream model, but also the performance of
downstream task affects how data augmentation
is performed.
In our framework, we use a text summarization
(Gambhir and Gupta, 2017) model to perform data
augmentation. Given an original text t with class
label c, we feed t into the summarization model
to generate a summary s. We set the class label
of s to be c. (s, c) is treated as an augmented
text-label pair of (t, c). The motivation of using
a summarization model for text augmentation is
two-fold. First, the major semantics of an original
text is preserved in its summary; therefore, it is
sensible to assign the class label of the original
text to its summary. Second, the summary excludes
non-essential details in the original text;
as a result, the semantic diversity between the
summary and the original text is rich, which well
serves the purpose of creating diverse augmentation.
The summarization model is trained on a
summarization dataset $\{(t_i, s_i)\}_{i=1}^{M}$, where $t_i$ is an
original text and $s_i$ is the corresponding summary.
For the downstream task, we assume it is text
classification. We assume there is a text classification
training set $\{(x_i^{(tr)}, y_i^{(tr)})\}_{i=1}^{N^{(tr)}}$, where $x_i^{(tr)}$
is an input text and $y_i^{(tr)}$ is the corresponding class
label, and there is a text classification validation
set $\{(x_i^{(val)}, y_i^{(val)})\}_{i=1}^{N^{(val)}}$.
Transactions of the Association for Computational Linguistics, vol. 10, pp. 343–358, 2022. https://doi.org/10.1162/tacl_a_00464
Action Editor: Dani Yogatama. Submission batch: 09/2021; Revision batch: 12/2021; Published 4/2022.
© 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Our framework consists of three learning stages
that are performed end-to-end. We train a text sum-
marization model G on the summarization dataset
at the first stage. Considering a domain difference
between the summarization data and text classi-
fication data, we associate each summarization
training pair with a weight a ∈ [0, 1]. A smaller
a indicates a large domain difference between
this summarization pair and the text classifica-
tion data, and this pair should be down-weighted
during the training of the summarization model.
These weights are tentatively fixed at this stage
and will be updated later. At the second stage, we
use the trained summarization model to perform
text augmentation for the classification dataset
and train a text classification model on the augmented
and original datasets. At the third stage, we
validate the classification model trained at the sec-
ond stage and update weights of summarization
training examples by minimizing the validation
loss. The three stages are performed end-to-end
and mutually influence each other. We
evaluate our framework on several text classifi-
cation datasets. Various experiments demonstrate
the effectiveness of our method.
The major contributions of this work include:
• We propose a three-level optimization frame-
work to perform text augmentation in an
end-to-end manner. Our framework con-
sists of three learning stages that mutually
influence each other: 1) training text summa-
rization model; 2) training text classification
model; 3) updating weights of summarization
data by minimizing the validation loss of the
classification model.
• Experiments on various datasets demonstrate
the effectiveness of our framework.
The rest of the paper is organized as follows.
Section 2 reviews related work. Section 3 intro-
duces the method. Section 4 gives experimental
results. Section 5 concludes the paper.
2 Related Work
2.1 Data Augmentation in NLP
As an effective way of mitigating the deficiency
of training data, data augmentation has been
broadly studied in NLP (Feng et al., 2021).
Sennrich et al. (2015) proposed a back translation
method for data augmentation, which improves
the BLEU (Papineni et al., 2002a) scores in machine
translation (MT). The back translation technique
first translates the sentences into another
language and then translates them back to the original
language to augment the original text.
Fadaee et al. (2017) propose a data augmen-
tation method for low-frequency words. Specif-
ically, the method generates new sentence pairs
that contain rare words. Kafle et al. (2017) intro-
duce two data augmentation methods for visual
question answering. The first method uses semantic
annotations to augment the questions.
The second technique generates new questions
from images using an LSTM network (Hochreiter
and Schmidhuber, 1997). Wang and Yang (2015)
propose an augmentation technique that replaces
query words with their synonyms. Synonyms are
retrieved based on cosine similarities calculated
on word embeddings. Kolomiyets et al. (2011)
propose to augment data by replacing the tem-
poral expression words with their corresponding
synonyms. They use the vocabulary from the Latent
Words Language Model (LWLM) and
WordNet.
Şahin and Steedman (2019) propose two text
augmentation techniques based on dependency
trees. The first technique crops the sentences by
discarding dependency links. The second tech-
nique rotates sentences by tree fragments that are
pivoted at the root. Chen et al. (2020) propose
augmenting texts by interpolating input texts in
a hidden space. Wang et al. (2018) propose aug-
menting sentences by randomly replacing words
in input and target sentences with words from the
vocabulary. SeqMix (Guo et al., 2020) proposes
to create augments by softly merging input/target
sequences.
EDA (Wei and Zou, 2019) uses four operations
to produce data augmentation: synonym replace-
ment, random insertion, random swap, and random
deletion. Kobayashi (2018) proposes to replace
words stochastically with words predicted by a
bi-directional language model. Andreas (2020)
proposes a compositional data augmentation ap-
proach that constructs a synthetic training example
by replacing text fragments in a real example with
other fragments appearing in similar contexts.
Kumar et al. (2021) apply pretrained Transformer
models including GPT-2, BERT, and BART for
conditional data augmentation, where the concatenation
of class labels and input texts is fed into
these pretrained models to generate augmented
texts. Kumar et al. (2021) propose a language
model-based data augmentation method. This ap-
proach first finetunes a language model on limited
training data, then feeds class labels into the fine-
tuned model to generate augmented sentences.
Min et al. (2020) explore several syntactically
informative augmentation methods by applying
syntactic transformations to original sentences
and show that subject/object inversion could
increase robustness to inference heuristics.
2.2 Bi-level Optimization
Many NLP applications (Feurer et al., 2015;
Baydin et al., 2017; Finn et al., 2017; Liu et al.,
2018; Shu et al., 2019; Zheng et al., 2019) are
based on bi-level optimization (BLO), such as
neural architecture search (Liu et al., 2018), data
selection (Shu et al., 2019; Ren et al., 2020; Wang
et al., 2020), meta learning (Finn et al., 2017),
hyperparameter tuning (Feurer et al., 2015), la-
bel correction (Zheng et al., 2019), training data
generation (Such et al., 2019), learning rate adaptation
(Baydin et al., 2017), and so forth. In these
BLO-based applications, model parameters are
learned by minimizing a training loss in an in-
ner optimization problem while meta parameters
are learned by minimizing a validation loss in an
outer optimization problem. In these applications,
meta parameters are neural architectures, weights
of data examples, hyperparameters, and so on.
3 Method
This section proposes a three-level optimization framework to perform end-to-end text augmentation.
3.1 Overview
We assume the target task is text classification. We train a BERT-based (Devlin et al., 2018) text classifier on a training set $D_c^{(tr)} = \{(x_i^{(tr)}, y_i^{(tr)})\}_{i=1}^{N^{(tr)}}$, where $x_i^{(tr)}$ is an input text and $y_i^{(tr)}$ is the corresponding class label. Meanwhile, we have access to a classification validation set $D_c^{(val)} = \{(x_i^{(val)}, y_i^{(val)})\}_{i=1}^{N^{(val)}}$. In many application scenarios, the training data is limited, which incurs a high risk of overfitting. To address this problem, we perform data augmentation of the training data to enlarge the number of training examples. We use a text summarization model to perform data augmentation. Given an original training pair $(x_i^{(tr)}, y_i^{(tr)})$, we feed the input text $x_i^{(tr)}$ into the text summarization model and get a summary $s_i$. Because $s_i$ preserves the major semantics of $x_i^{(tr)}$, we can assign the class label $y_i^{(tr)}$ of $x_i^{(tr)}$ to $s_i$. In the end, we obtain an augmented training pair $(s_i, y_i^{(tr)})$. This process can be applied to every original training example to create corresponding augmented training examples.
To enable the text summarization model and the text classifier to influence and benefit from each other mutually, we develop a three-level optimization framework to train these two models end-to-end. Our framework consists of three learning stages performed in a unified manner. At the first stage, we train the text summarization model. At the second stage, we use the summarization model trained at the first stage to perform text augmentation and train the classifier on the augmented examples. At the third stage, we evaluate the classifier on a validation set and update weights of summarization training examples by minimizing the validation loss. Figure 1 shows an overview of our framework. Next, we describe the three stages in detail.
Figure 1: Overview of our framework.
3.2 Stage I
At the first stage, we train the text summariza-
tion model. We use BART (Lewis et al., 2019)
to perform summarization. BART is a pretrained
Transformer (Vaswani et al., 2017) model con-
sisting of an encoder and a decoder. The encoder
takes a text as input, and the decoder generates a
summary of the text. Let $S$ denote the summarization
model. The training data is $D_s = \{(t_i, s_i)\}_{i=1}^{M}$,
where $t_i$ is an input text and $s_i$ is the corresponding
summary. Often, the summarization dataset $D_s$
has a domain shift with the classification dataset
$D_c$. For an example in $D_s$, if its domain difference
with the classification dataset is large, the summarization
model trained on this example may not be
suitable for performing data augmentation for $D_c$. To
address this problem, we associate each example
in $D_s$ with a weight $a \in [0, 1]$. If $a$ is close to 0,
it means that the domain difference between
this example and $D_c$ is large, and this example
should be down-weighted when training the
summarization model.
At this stage, we solve the following optimiza-
tion problem:
$$S^*(A) = \min_S \sum_{i=1}^{M} a_i\, l(S, t_i, s_i) \qquad (1)$$
where $A = \{a_i\}_{i=1}^{M}$ and $l(\cdot)$ is the teacher-forcing loss. The loss of $(t_i, s_i)$ is weighted by the weight $a_i$ of this example. If $(t_i, s_i)$ has a large domain difference with $D_c$, $a_i$ should be close to 0; then $a_i\, l(S, t_i, s_i)$ is close to 0, which effectively excludes $(t_i, s_i)$ from the training process. The optimally trained model $S^*$ depends on $A$ since $S^*$ depends on the loss function, and the loss function depends on $A$. $A$ is tentatively fixed at this stage and will be updated at a later stage. $A$ cannot be updated at this stage; otherwise, a trivial solution is obtained in which all values in $A$ are 0.
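The weighted objective in Eq. (1) can be illustrated with a minimal Python sketch. It assumes the per-example teacher-forcing losses have already been computed; the function name and the numeric values are hypothetical and are not taken from the released code.

```python
from typing import Sequence


def weighted_summarization_loss(per_example_losses: Sequence[float],
                                weights: Sequence[float]) -> float:
    """Return sum_i a_i * l_i, the Stage I objective of Eq. (1)."""
    if len(per_example_losses) != len(weights):
        raise ValueError("need exactly one weight per summarization example")
    if any(not 0.0 <= a <= 1.0 for a in weights):
        raise ValueError("each weight a_i must lie in [0, 1]")
    return sum(a * l for a, l in zip(weights, per_example_losses))


# Hypothetical per-example losses l(S, t_i, s_i) and weights a_i.
losses = [2.0, 1.5, 4.0]
weights = [1.0, 0.5, 0.0]  # the third pair is treated as fully out-of-domain
total = weighted_summarization_loss(losses, weights)
```

With these values the third example contributes nothing to the objective, which mirrors how a weight near 0 effectively removes an out-of-domain pair from training.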
3.3 Stage II
At the second stage, we use the summarization model $S^*(A)$ trained at the first stage to perform data augmentation of $D_c^{(tr)} = \{(x_i^{(tr)}, y_i^{(tr)})\}_{i=1}^{N^{(tr)}}$. For each $x_i^{(tr)}$ in $D_c^{(tr)}$, we feed it into $S^*(A)$ to generate a summary $g(x_i^{(tr)}, S^*(A))$. In the end, we obtain an augmented dataset $G(D_c^{(tr)}, S^*(A)) = \{(g(x_i^{(tr)}, S^*(A)), y_i^{(tr)})\}_{i=1}^{N^{(tr)}}$. We train a BERT-based text classifier $C$ on the original data $D_c^{(tr)}$ and augmented data $G(D_c^{(tr)}, S^*(A))$.
At this stage, we solve the following optimization problem:
$$C^*(S^*(A)) = \min_C \; L(C, D_c^{(tr)}) + \gamma L(C, G(D_c^{(tr)}, S^*(A))) \qquad (2)$$
where $L(\cdot)$ denotes a cross-entropy classification loss and $\gamma$ is a tradeoff parameter. The first loss term is defined on the original training dataset, and the second loss term is defined on the augmented training dataset. The optimally trained classifier $C^*$ depends on $S^*(A)$ since $C^*$ depends on the training loss, which depends on $S^*(A)$.
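The augmentation step and the two-term objective of Eq. (2) can be sketched as follows. This is only an illustration under stated assumptions: `first_sentence` is a hypothetical stand-in for the trained summarization model, `toy_cross_entropy` is a hypothetical fixed predictor rather than a trained classifier, and the loss is averaged over examples (the text does not say whether $L$ is a sum or a mean).

```python
import math
from typing import Callable, List, Tuple

Example = Tuple[str, int]  # (text, class label)


def augment(dataset: List[Example],
            summarize: Callable[[str], str]) -> List[Example]:
    """Build G(D, S): each summary inherits the label of its source text."""
    return [(summarize(text), label) for text, label in dataset]


def stage2_objective(loss_fn: Callable[[List[Example]], float],
                     original: List[Example],
                     augmented: List[Example],
                     gamma: float = 1.0) -> float:
    """L(C, D) + gamma * L(C, G(D, S)), as in Eq. (2)."""
    return loss_fn(original) + gamma * loss_fn(augmented)


def first_sentence(text: str) -> str:
    """Hypothetical summarizer stand-in: keep only the first sentence."""
    return text.split(".")[0].strip() + "."


def toy_cross_entropy(dataset: List[Example]) -> float:
    """Mean cross-entropy of a fixed keyword-based toy predictor."""
    total = 0.0
    for text, label in dataset:
        p_positive = 0.8 if "good" in text else 0.3
        p_label = p_positive if label == 1 else 1.0 - p_positive
        total += -math.log(p_label)
    return total / len(dataset)


train = [("The food was good. Service was slow.", 1),
         ("Terrible plot. The acting was good.", 0)]
augmented = augment(train, first_sentence)
objective = stage2_objective(toy_cross_entropy, train, augmented, gamma=1.0)
```

In a real setup the summarizer would be the Stage I model and `loss_fn` would evaluate the classifier being trained; only the structure of the objective is meant to carry over.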
3.4 Stage III
At the third stage, we evaluate the classifier trained at the second stage on the classification validation set $D_c^{(val)} = \{(x_i^{(val)}, y_i^{(val)})\}_{i=1}^{N^{(val)}}$ and update the weights $A$ by minimizing the validation loss. At this stage, we solve the following optimization problem:
$$\min_A \; L(C^*(S^*(A)), D_c^{(val)}) \qquad (3)$$
3.5 A Three-Level Optimization Framework
Putting all pieces together, we have the following
three-level optimization framework.
$$\begin{aligned}
&\min_A \; L(C^*(S^*(A)), D_c^{(val)}) \\
&\;\text{s.t.} \;\; C^*(S^*(A)) = \min_C \; L(C, D_c^{(tr)}) + \gamma L(C, G(D_c^{(tr)}, S^*(A))) \\
&\;\phantom{\text{s.t.}} \;\; S^*(A) = \min_S \sum_{i=1}^{M} a_i\, l(S, t_i, s_i)
\end{aligned} \qquad (4)$$
There are three optimization problems in this framework, each corresponding to a learning stage. From bottom to top, the optimization problems correspond to learning stages I, II, and III, respectively. The first two optimization problems are nested in the constraints of the third optimization problem. These three stages are conducted end-to-end in this unified framework. The solution $S^*(A)$ obtained in the first stage is used to perform text augmentation in the second stage. The classification model trained in the second stage is used to make predictions at the third stage. The importance weights $A$ updated in the third stage change the training loss in the first stage and consequently change the solution $S^*(A)$, which subsequently changes $C^*(S^*(A))$.
3.6 Optimization Algorithm
In this section, we develop a gradient-based optimization algorithm to solve the problem defined in Eq. (4). Drawing inspiration from Liu et al. (2018), we approximate $S^*(A)$ using a one-step gradient descent update of $S$:
$$S^*(A) \approx S' = S - \eta_s \nabla_S \sum_{i=1}^{M} a_i\, l(S, t_i, s_i) \qquad (5)$$
We plug $S^*(A) \approx S'$ into the objective function at the second stage and get an approximate objective. We approximate $C^*(S^*(A))$ using a one-step gradient descent update of $C$:
$$C^*(A) \approx C' = C - \eta_c \nabla_C \left( L(C, D_c^{(tr)}) + \gamma L(C, G(D_c^{(tr)}, S')) \right) \qquad (6)$$
Algorithm 1 Optimization algorithm
while not converged do
    Update weight parameters S using Eq. (5)
    Update weight parameters C using Eq. (6)
    Update meta parameters A using Eq. (7)
end while
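The alternating loop of Algorithm 1 can be illustrated on a toy scalar problem. Everything below is an assumption made for illustration: the quadratic losses, the constants, and the clipping of $A$ to $[0, 1]$ are hypothetical, and the hypergradient is obtained by central finite differences over $A$ rather than by the closed-form expression derived in this section. It is a sketch of the update order, not the authors' implementation.

```python
from typing import List, Tuple

TARGETS = [1.0, -3.0]   # toy "summaries": the first is in-domain, the second is not
X_TRAIN = 1.0           # toy classification training signal
X_VAL = 1.0             # toy classification validation signal


def stage1_grad(S: float, A: List[float]) -> float:
    """d/dS of sum_i a_i * (S - t_i)^2, the toy weighted Stage I loss."""
    return sum(2.0 * a * (S - t) for a, t in zip(A, TARGETS))


def stage2_grad(C: float, S: float, gamma: float) -> float:
    """d/dC of (C - X_TRAIN)^2 + gamma * (C - S)^2, the toy Stage II loss."""
    return 2.0 * (C - X_TRAIN) + 2.0 * gamma * (C - S)


def unroll(A: List[float], S: float, C: float, gamma: float,
           eta_s: float, eta_c: float) -> Tuple[float, float, float]:
    """One-step updates of S and C, then the validation loss at the new C."""
    S_new = S - eta_s * stage1_grad(S, A)             # analogue of Eq. (5)
    C_new = C - eta_c * stage2_grad(C, S_new, gamma)  # analogue of Eq. (6)
    return (C_new - X_VAL) ** 2, S_new, C_new


def run_toy(steps: int = 50, gamma: float = 1.0, eta_s: float = 0.1,
            eta_c: float = 0.1, eta_a: float = 0.5,
            eps: float = 1e-4) -> Tuple[List[float], float, float]:
    A = [0.5, 0.5]
    S, C = 0.0, 0.0
    for _ in range(steps):
        _, S_new, C_new = unroll(A, S, C, gamma, eta_s, eta_c)
        grad_A = []
        for i in range(len(A)):
            A_plus, A_minus = A.copy(), A.copy()
            A_plus[i] += eps
            A_minus[i] -= eps
            loss_plus = unroll(A_plus, S, C, gamma, eta_s, eta_c)[0]
            loss_minus = unroll(A_minus, S, C, gamma, eta_s, eta_c)[0]
            grad_A.append((loss_plus - loss_minus) / (2.0 * eps))
        # Analogue of Eq. (7); clipping to [0, 1] is an assumption of this sketch.
        A = [min(1.0, max(0.0, a - eta_a * g)) for a, g in zip(A, grad_A)]
        S, C = S_new, C_new
    return A, S, C


final_A, final_S, final_C = run_toy()
```

In this toy setting the weight of the in-domain example grows while the weight of the out-of-domain example shrinks, which is the qualitative behavior the framework is designed to produce.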
Finally, we plug $C^*(A) \approx C'$ into the validation loss and get an approximate objective. Then we update $A$ by gradient descent:
$$A \leftarrow A - \eta_a \nabla_A L(C', D_c^{(val)}) \qquad (7)$$
where
$$\nabla_A L(C', D_c^{(val)}) = \frac{\partial S'}{\partial A} \frac{\partial C'}{\partial S'} \frac{\partial L(C', D_c^{(val)})}{\partial C'} = \eta_s \eta_c \gamma \, \nabla^2_{A,S} \sum_{i=1}^{M} a_i\, l(S, t_i, s_i) \; \nabla^2_{S',C} L(C, G(D_c^{(tr)}, S')) \; \nabla_{C'} L(C', D_c^{(val)}) \qquad (8)$$
Eq. (8) involves an expensive matrix-vector product, whose computational cost can be reduced by a finite difference approximation. Eq. (8) can be approximated as:
$$\approx \frac{\eta_s \eta_c \gamma}{2\alpha} \, \nabla^2_{A,S} \sum_{i=1}^{M} a_i\, l(S, t_i, s_i) \left\{ \nabla_{S'} L(C^+, G(D_c^{(tr)}, S')) - \nabla_{S'} L(C^-, G(D_c^{(tr)}, S')) \right\} \qquad (9)$$
where
$$\alpha = \frac{0.01}{\left\| \nabla_{C'} L(C', D_c^{(val)}) \right\|_2}, \qquad C^{\pm} = C \pm \alpha \nabla_{C'} L(C', D_c^{(val)})$$
The matrix-vector multiplication in Eq. (9) can be further approximated by:
$$\frac{1}{\alpha_S^{\pm}} \left\{ \nabla_A \sum_{i=1}^{M} a_i\, l(S^+_{\pm}, t_i, s_i) - \nabla_A \sum_{i=1}^{M} a_i\, l(S^-_{\pm}, t_i, s_i) \right\} \qquad (10)$$
where
$$\alpha_S^{\pm} = \frac{0.01}{\left\| \nabla_{S'} L(C^{\pm}, G(D_c^{(tr)}, S')) \right\|_2}, \qquad S^+_{\pm} = S \pm \alpha_S^{+} \nabla_{S'} L(C^+, G(D_c^{(tr)}, S')), \qquad S^-_{\pm} = S \pm \alpha_S^{-} \nabla_{S'} L(C^-, G(D_c^{(tr)}, S'))$$
These update steps iterate until convergence. The overall algorithm is summarized in Algorithm 1.
4 Experiments
In this section, we report experimental results. The key takeaways include: 1) our proposed end-to-end augmentation method performs better than baselines that conduct augmentation and classification separately; 2) our framework is agnostic to the choice of classification model and can be applied to improve a variety of classifiers; 3) our framework is particularly effective when the number of training examples is small; 4) our framework is more effective when the input texts are long.
Train Validation Test
Dataset
25k
6.7k
6.7k
IMDB
10k
10k
38k
Yelp Review
10k
10k
SST-2
10k
Amazon Review 10k
872
10k
Table 1: Split statistics of the classification datasets.
4.1 Dataset
For the text summarization data, we use the
CNN-DailyMail dataset (See et al., 2017). It contains
summaries of around 300k news articles.
The full CNN-DailyMail training set is used for
training the summarization model. We used four
text classification datasets: 1) IMDB: a movie
review dataset (Maas et al., 2011); 2) Yelp Review:
a binary sentiment classification dataset
(Zhang et al., 2015); 3) SST-2: Stanford Sentiment
Treebank (Socher et al., 2013); and 4)
Amazon Review: a product review dataset from
Amazon (McAuley and Leskovec, 2013). The
split statistics of these datasets are summarized
in Table 1.
4.2 Baseline
We compare our method with the following
baseline methods.
• No-Aug: No data augmentation is applied.
• EDA (Wei and Zou, 2019): Augmented sen-
tences are created by randomly applying the
following operations: synonym replacement,
random insertion, random swap, and random
deletion.
• GECA (Andreas, 2020): A compositional
data augmentation approach that constructs
a synthetic training example by replacing
text fragments in a real example with other
fragments appearing in similar contexts.
• LAMBADA (Kumar et al., 2021): A
language model based data augmentation
method. This approach first finetunes a language
model on real training data, then
feeds class labels into the finetuned model
to generate augmented sentences.
• Sum-Sep: Summarization-based augmenta-
tion and text classification are performed
separately. We first train a summarization
model, use it to perform data augmentation,
then train the classification model on the
augmented training data.
• Multi-task learning (MTL): In this baseline,
the summarization model and classification
model are trained by minimizing a single
objective function, which is the weighted
sum of the summarization and classification
losses. The corresponding formulation is:
$$\begin{aligned}
&\min_A \; L(C^*(A), D_c^{(val)}) \\
&\;\text{s.t.} \;\; C^*(A) = \min_{C,S} \; L(C, D_c^{(tr)}) + \gamma L(C, G(D_c^{(tr)}, S)) + \lambda \min_S \sum_{i=1}^{M} a_i\, l(S, t_i, s_i)
\end{aligned}$$
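For readers unfamiliar with the EDA baseline listed above, two of its four operations (random swap and random deletion) can be sketched in a few lines. This is a simplified illustration, not the reference EDA code; synonym replacement and random insertion are omitted because they need a thesaurus, and the function names and parameters here are hypothetical.

```python
import random
from typing import List


def random_swap(words: List[str], n_swaps: int, rng: random.Random) -> List[str]:
    """Swap two randomly chosen positions, n_swaps times."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words: List[str], p: float, rng: random.Random) -> List[str]:
    """Drop each word independently with probability p, keeping at least one."""
    if not words:
        return []
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(list(words))]


sentence = "the quick brown fox jumps".split()
swapped = random_swap(sentence, n_swaps=2, rng=random.Random(0))
deleted = random_deletion(sentence, p=0.3, rng=random.Random(0))
```

Both operations act on surface tokens only, which is why the resulting sentences can lose meaning, in contrast to summaries produced by a trained model.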
4.3 Hyperparameter Settings
For the text classification model, we use the one
in EDA (Wei and Zou, 2019). It contains an in-
put layer, a bi-directional hidden layer with 64
LSTM (Hochreiter and Schmidhuber, 1997) units,
a dropout layer with a probability of 0.5, another
bi-directional layer with 32 LSTM units, another
dropout layer with a probability of 0.5, ReLU activation,
a dense layer with 20 hidden units, and
a softmax output layer. We set the maximum text
length to 150. The loss function is cross-entropy
loss. Model parameters are optimized using the
Adam (Kingma and Ba, 2014) optimizer, with an
epsilon of $10^{-8}$. In Adam, $\beta_1$ and $\beta_2$ are set to
0.9 and 0.999, respectively. The learning rate is
a constant $10^{-3}$. The batch size is 8. For
the importance weights $A$ of summarization data
examples, we optimize them using an Adam optimizer
with a weight decay of $10^{-3}$. Epsilon, $\beta_1$,
and $\beta_2$ are set to $10^{-8}$, 0.5, and 0.999, respectively.
The learning rate is $3 \times 10^{-4}$. The tradeoff
parameter $\gamma$ is set to 1 for all experiments, unless
otherwise stated.
For the summarization model, we use the dis-
tilled BART model (Shleifer and Rush, 2020).
It has six encoder layers and six decoder layers.
We set the maximum text length for the article
to 1024 and the summary to 75. We use an SGD
optimizer with a momentum of 0.9 and a learning
rate of $10^{-3}$. We use a cosine annealing learning
rate scheduler with the minimum learning rate
set to $5 \times 10^{-4}$. We randomly sample a subset of
CNN/DailyMail to train the summarization model.
We balance the CNN/DailyMail data and the
classification data by making the number of sam-
pled CNN/DailyMail examples roughly the same
as the classification data examples.
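The cosine annealing schedule mentioned above can be written in closed form. The sketch below uses the standard single-cycle formula with the stated maximum ($10^{-3}$) and minimum ($5 \times 10^{-4}$) learning rates; the number of steps per cycle is not given in the text, so `total_steps` is a hypothetical parameter.

```python
import math


def cosine_annealing_lr(step: int, total_steps: int,
                        lr_max: float = 1e-3, lr_min: float = 5e-4) -> float:
    """Learning rate after `step` updates of a single cosine annealing cycle."""
    progress = min(max(step, 0), total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Hypothetical cycle length of 1000 steps.
schedule = [cosine_annealing_lr(s, total_steps=1000) for s in (0, 500, 1000)]
```

The rate starts at the maximum, passes through the midpoint of the two bounds halfway through the cycle, and ends at the minimum.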
Accuracy is used as the evaluation metric. Each
experiment is run five times with random initialization.
We report the mean and standard deviation
of the results. The experiments are conducted on
1080Ti GPUs.
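The reporting convention (mean ± standard deviation over five runs) can be reproduced as in the sketch below. The accuracies are hypothetical placeholders, and the use of the sample standard deviation is an assumption, since the text does not say whether the sample or population estimator was used.

```python
import statistics

# Hypothetical accuracies (%) from five runs with different random seeds.
run_accuracies = [80.0, 81.0, 79.0, 82.0, 78.0]

mean_acc = statistics.mean(run_accuracies)
std_acc = statistics.stdev(run_accuracies)  # sample standard deviation (assumed)
report = f"{mean_acc:.2f}±{std_acc:.2f}"
```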
4.4 Main Results
Following Wei and Zou (2019), for each classifi-
cation dataset, we randomly sample x percentage
of training data for model training. Tables 2, 3, 4,
and 5 show the average accuracies on the Yelp
Review, IMDB, Amazon Review, and SST-2
datasets, under different percentages.
From these tables, we make the following observations.
First, our method works better than
Sum-Sep. The reason is that our method performs
the summarization-based augmentation and text
classification in an end-to-end framework while
Sum-Sep performs these two tasks separately.
In our end-to-end framework, the summarization
and classification models mutually influence each
other. The performance of the classification model
guides the training of the summarization model.
If the classification accuracy is not high, it indi-
cates that the augmented data generated by the
summarization model is not useful. When this
Perc.  No-Aug      EDA         GECA        LAMBADA     Sum-Sep     MTL         Ours
5%     63.45±4.21  68.45±2.49  69.72±1.72  69.30±1.38  72.53±0.44  72.94±0.52  74.58±0.37
10%    68.15±3.77  74.46±1.82  74.91±2.07  75.03±1.45  76.73±0.57  76.35±0.51  78.62±0.73
20%    72.43±3.16  75.07±4.22  78.29±0.95  76.25±2.57  81.29±0.09  80.47±0.25  81.66±0.13
50%    81.78±1.76  81.57±2.44  83.61±0.93  82.18±1.82  84.90±0.19  84.29±0.22  85.55±0.15
75%    84.52±1.55  82.70±2.30  85.02±0.94  83.71±0.71  86.18±0.25  86.07±0.28  87.10±0.31
100%   85.42±1.52  84.66±1.36  86.45±0.26  85.08±0.49  87.46±0.10  87.15±0.13  87.79±0.07
Table 2: Classification accuracy (%) on Yelp Review. Perc. denotes the percentage of training data used.
Perc.  No-Aug      EDA         GECA        LAMBADA     Sum-Sep     MTL         Ours
5%     52.86±3.22  61.42±1.75  61.58±0.94  61.03±1.46  63.74±0.27  63.27±0.41  64.79±0.32
10%    56.76±3.38  64.06±1.92  63.27±2.76  64.95±1.83  68.42±0.11  67.92±0.14  68.75±0.08
20%    59.54±1.68  67.18±3.20  65.36±0.83  67.61±1.94  69.82±0.61  68.47±0.69  72.04±0.52
50%    66.90±1.98  71.67±1.33  72.89±0.58  71.52±1.57  72.86±0.31  72.40±0.19  73.98±0.25
75%    72.25±1.05  74.23±0.72  73.69±0.55  74.02±0.73  74.78±0.25  74.63±0.41  75.78±0.37
100%   73.96±0.85  75.75±0.27  75.38±0.44  76.45±0.03  76.70±0.08  76.11±0.09  77.00±0.05
Table 3: Classification accuracy (%) on IMDB.
Perc.  No-Aug      EDA         GECA        LAMBADA     Sum-Sep     MTL         Ours
5%     63.46±1.51  62.59±2.35  64.82±0.93  63.21±1.07  65.44±0.31  64.81±0.69  67.53±0.42
10%    66.03±1.44  66.20±2.91  68.49±0.82  66.15±1.35  69.76±0.21  69.33±0.26  70.84±0.11
20%    68.04±2.26  68.36±1.05  69.04±1.94  70.96±2.45  72.64±0.58  72.51±0.99  75.28±0.73
50%    76.03±0.75  75.39±1.69  77.02±0.50  76.31±0.37  77.07±0.88  77.05±0.19  78.80±0.62
75%    77.40±1.09  76.30±1.82  78.52±0.52  77.02±0.81  78.78±0.52  78.93±0.37  80.46±0.62
100%   78.36±1.02  78.13±1.38  80.16±0.32  78.94±0.69  81.43±0.12  81.22±0.33  82.01±0.25
Table 4: Classification accuracy (%) on Amazon Review.
Perc.  No-Aug      EDA         GECA        LAMBADA     Sum-Sep     MTL         Ours
5%     57.22±2.07  63.42±1.24  58.44±0.57  59.31±0.99  59.05±0.91  60.51±0.32  63.76±0.85
10%    62.04±1.44  66.86±0.73  64.92±0.49  65.79±0.75  65.15±0.37  64.86±0.66  68.23±0.51
20%    64.45±1.99  68.00±0.49  67.48±0.55  66.12±0.87  67.09±1.27  67.84±0.49  69.61±0.21
50%    71.90±1.51  74.77±0.39  72.40±0.55  73.35±0.64  72.25±0.27  74.02±0.44  75.23±0.26
75%    73.74±0.59  75.80±0.81  75.03±0.33  74.29±0.28  75.34±0.07  74.25±0.10  76.00±0.14
100%   77.75±1.49  77.64±0.30  77.10±0.42  75.41±0.85  75.80±0.21  75.39±0.43  77.92±0.05
Table 5: Classification accuracy (%) on SST-2. γ is set to 0.5 for 5% and 20%.
occurs, the summarization model will adjust its network weights and data weights to generate useful augmentations in the next round of learning. In contrast, in Sum-Sep, such a feedback loop (from classification to summarization) does not exist. Therefore, its performance is inferior.
Second, our method works better than MTL. In MTL, the summarization model and classification model are trained simultaneously by minimizing a single objective function, which is the weighted sum of the summarization loss and classification loss. This incurs a competition between these two tasks: decreasing the loss of one task further leads to a smaller decrease in the loss of the other task. Our method avoids task competition by performing these two tasks sequentially and minimizing two different objective functions. The summarization model is trained by minimizing the summarization loss. Then the classification model is trained by minimizing the classification loss. In this way, there is no competition between these two tasks. Though performed in two stages, these two tasks can still mutually influence each other in our end-to-end framework.
Third, our method performs better than No-Aug. This shows that the augmented data generated by our method has good utility for model training.
Fourth, the improvement of our method over baselines is more prominent when the number of training examples is smaller (i.e., when the
Perc.  No-Aug      EDA         Ours
5%     44.40±5.26  40.16±1.94  48.36±3.72
20%    58.00±4.10  51.48±2.44  64.24±2.14
100%   69.52±2.43  57.96±1.21  72.36±1.29
Table 6: Classification accuracy (%) on TREC.
percentage is small). Under the 5% percentage,
we observe an improvement of 11.13%, 11.9%,
4.07%, and 6.59% on the Yelp, IMDB, Amazon,
and SST-2 datasets, respectively, over No-Aug.
When the training dataset is smaller, the
need for data augmentation is greater.
As the percentage increases, the impact of data
augmentation decreases. For the 100% percentage,
we observe an improvement of 2.37%, 3.04%, and
3.65% on Yelp, IMDB, and Amazon, respectively,
over No-Aug.
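The improvements quoted above are absolute differences in mean accuracy (percentage points) between table columns. As a small worked check, the sketch below recomputes the Yelp Review gains from the mean accuracies reported in Table 2; the variable and function names are only illustrative.

```python
# Mean accuracies (%) on Yelp Review, copied from Table 2.
yelp_no_aug = {"5%": 63.45, "100%": 85.42}
yelp_ours = {"5%": 74.58, "100%": 87.79}


def absolute_gain(ours: float, baseline: float) -> float:
    """Accuracy gain in percentage points, rounded to two decimals."""
    return round(ours - baseline, 2)


gain_5 = absolute_gain(yelp_ours["5%"], yelp_no_aug["5%"])
gain_100 = absolute_gain(yelp_ours["100%"], yelp_no_aug["100%"])
```

The gain at 5% of the training data is several times the gain at 100%, which is the trend described in the text.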
Fifth, our method works better than EDA,
GECA, and LAMBADA. Again, the reason is
that our method performs data augmentation and
classification end-to-end while these baselines
perform them separately. Compared to EDA, under
the 5% percentage, we observe an accuracy
gain of 6.13%, 3.37%, and 4.94% on Yelp, IMDB,
and Amazon, respectively. EDA augments each
input sentence to produce 8-16 new sentences.
Our method achieves better accuracy than EDA (on Yelp,
IMDB, and Amazon) or similar accuracy (on
SST-2) with just one augmentation per
input sentence.
EDA uses simple heuristic rules to generate
augmented sentences, which may be noisy and lack
semantic meaningfulness. These noisy augmentations
may cause the classification model trained
on them to perform worse. To verify this, we performed
experiments on the TREC (Li and Roth,
2002; Hovy et al., 2001) dataset (which is split
into train/validation/test sets with 3000, 2000, and
500 examples, respectively). Table 6 shows the
results. As can be seen, with EDA as augmentation,
the classification performance becomes much
worse (compared with No-Aug). In contrast, our
method trains a summarization model to generate
semantically meaningful augmentations and
performs much better than EDA.
Sixth, our method is more effective on long
texts. For example, our framework outperforms
EDA under all percentages on datasets where
the input texts are relatively long, including Yelp,
IMDB, and Amazon (the average number of words
Original sentence: This review is for the Kindle edition of what is
supposedly the Christopher Gill translation of Plato's Symposium.
However, it turns out if you download it that it is the same as the
Benjamin Jowett translation which is available for free, whereas
here you have to pay upwards of $8 for it. I also checked Penguin
Classics web site and they do not indicate that a eBook version of
this book is available. So be careful when purchasing the Kindle
edition. I had to return my purchase for this Kindle book.
Augmentation: Critically reviewed the Christopher Gill translation
of Plato's Symposium. An online version of the book is currently
available for free. However, if you download it it is the same as
the Benjamin Jowett translation. In contrast, you pay upwards of $8.

Original sentence: My issue is not with anything Tom Robbins wrote,
but with narrator Barret Whitener: a white male who opts to read all
the dialogue from Japanese characters in a bad, generic "Asian"
accent. It's unbearable to listen to. Furthermore, Whitener can't
even pronounce the simplest of Japanese words, like "sake." How many
more years must transpire before an Asian person can have a voice
without being mocked?
Augmentation: The book is written by a white male who reads it in a
bad, generic "Asian" accent. He can't even pronounce the simplest of
Japanese words, like "sake" Movie is currently being played by
Playmakers' weekly series .series

Table 7: Two augmentations generated by our method for Amazon Review.
per input text in Yelp, IMDB, and Amazon is 133, 234, and 75,
respectively). On SST-2, where the texts are relatively short, our
method performs marginally better than EDA under certain
percentages. In CNN-DailyMail, which is used for training the
summarization model, the input texts are also long. Consequently,
the summarization model trained on CNN-DailyMail can generate
good-quality summaries for Yelp, IMDB, and Amazon. The generated
summaries preserve the major information in the input texts, discard
redundant information, and align well with the original sentences
semantically. Table 7 shows some examples. In this case, it is
meaningful for the summary to inherit the class label of the
original sentence, and the resulting (augmentation, label) pairs are
useful for training the classification model. In contrast, in SST-2,
the average number of words per input text in the training data is
only 10. When a short sentence is fed into the summarization model,
the model tends to perform open text generation by treating the
short input sentence as a prompt. The generated text may deviate
from the original sentence in semantics. Table 8 shows some
examples. In this case, assigning the class label of the original
sentence to the summary introduces labeling noise. Therefore, the
improvement of our method over baselines may not be substantial. We
performed statistical significance tests using two-sided t-tests. In
Original sentence: An object lesson in period filmmaking.
Augmentation: An object lesson in period filmmaking is a time-saving
technique. The film was shot by the crew of the late 1960s and early
1980s. It's been produced since then, with some scenes being filmed
on location . A few years later, it has become an important part of
this type of cinema.

Original sentence: Mean-spirited lashing.
Augmentation: Critically ill-tempered lashing. "I feel like I'm
going to get out of my own way," says one person. An angry person
lashed out at him for being unprovoked and hateful. demonstrated
anger over the use of alcohol.

Table 8: Two augmentations generated by our method for SST-2.

Dataset          Ours          Aug-Only
IMDB             68.75±0.08    66.69±0.37
Yelp Review      78.62±0.73    77.94±0.91
Amazon Review    70.84±0.11    70.14±0.44
SST-2            68.23±0.51    66.17±0.42

Table 9: Comparison with Aug-Only, on 10% of the classification
datasets.
most cases, the p-values of our method against baselines are less
than 1e-3. This shows that our method is, in general, significantly
better than the baselines.
4.5 Ablation Studies
To evaluate the effectiveness of our proposed
method, we compare with the following ablation
settings.
• Aug-Only. We train the text classification model on augmented
texts only, without using the original classification data, which
amounts to solving the following problem:
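$$
\begin{aligned}
\min_{A}\ & L\big(C^{*}(S^{*}(A)),\, D_c^{(\mathrm{val})}\big) \\
\text{s.t.}\ & C^{*}(S^{*}(A)) = \min_{C}\ L\big(C,\, G(D_c^{(\mathrm{tr})}, S^{*}(A))\big) \\
& S^{*}(A) = \min_{S}\ \sum_{i=1}^{M} a_i\, \ell(S, t_i, s_i)
\end{aligned}
\tag{11}
$$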
• Ablation on classification models. In this study, we investigate
whether our framework is effective for other classifiers, including
CNN and RoBERTa (Liu et al., 2019). Following Wei and Zou (2019),
the CNN model consists of an input layer, a 1D convolutional layer
with 128 filters (kernel size 5), a global 1D max-pooling layer, a
ReLU activation function, a dense layer (with 20 hidden units), and
a softmax output layer. For RoBERTa, we use a batch size of 16 and a
learning rate of 2 × 10^-6 to train the model. The maximum
classification text length is set to 128. γ is set to 1 for all
datasets. The rest of the hyperparameters remain the same.
• Ablation on γ. In this study, we investigate how the test accuracy
is affected by the hyperparameter γ.
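To make the innermost objective of Eq. (11) concrete, the weighted summarization loss, in which each example's loss is scaled by its weight a_i, can be sketched as follows. This is a minimal illustration rather than the released implementation: the names `weighted_summarization_loss` and `toy_loss` are hypothetical, and `toy_loss` is only a stand-in for the per-example loss ℓ(S, t_i, s_i) of a real summarization model.

```python
from typing import Callable, Sequence


def weighted_summarization_loss(
    weights: Sequence[float],
    texts: Sequence[str],
    summaries: Sequence[str],
    example_loss: Callable[[str, str], float],
) -> float:
    """Return sum_i a_i * loss(t_i, s_i) over all summarization examples."""
    if not (len(weights) == len(texts) == len(summaries)):
        raise ValueError("weights, texts, and summaries must have equal length")
    return sum(
        a * example_loss(t, s) for a, t, s in zip(weights, texts, summaries)
    )


def toy_loss(text: str, summary: str) -> float:
    """Hypothetical stand-in loss: difference in word counts (assumption)."""
    return float(abs(len(text.split()) - len(summary.split())))


if __name__ == "__main__":
    # The second example has weight 0, so it does not contribute to training.
    loss = weighted_summarization_loss(
        weights=[1.0, 0.0],
        texts=["a b c d", "x y"],
        summaries=["a b", "x"],
        example_loss=toy_loss,
    )
    print(loss)
```

An example whose weight is zero drops out of the sum entirely, which is how out-of-domain summarization data can be excluded from training.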
Dataset    Ours          EDA           No-Aug
IMDB       69.84±0.15    69.33±0.08    65.57±0.52
Yelp       78.95±0.69    77.25±0.56    76.85±0.90
Amazon     73.46±0.22    72.86±0.48    70.62±0.79
SST-2      67.32±0.18    66.74±0.05    62.38±0.44

Table 10: Comparison of our method with EDA and No-Aug, with CNN as
the text classifier, trained on 10% of the classification datasets.
Dataset    Ours          EDA           No-Aug
IMDB       78.38±0.19    77.26±3.31    77.33±0.62
Yelp       88.25±0.28    85.30±0.44    87.13±0.14
Amazon     83.85±0.34    81.36±0.23    82.54±0.24
SST-2      76.34±0.34    74.38±2.18    75.76±1.09

Table 11: Comparison of our method with EDA and No-Aug, with CNN as
the text classifier, trained on 100% of the classification datasets.
Table 9 compares our method with Aug-Only. As can be seen, our full
method outperforms Aug-Only. In Aug-Only, the classification model
is trained on augmented texts only, without leveraging the original
human-written texts. Because the augmented data is noisier than
human-provided data, using augmented data only may lead to noisy
classification models.
Tables 10 and 11 compare our method with EDA and No-Aug, with CNN as
the text classifier, trained on 10% and 100% of the classification
datasets, respectively. Tables 12 and 13 compare our method with EDA
and No-Aug, with RoBERTa as the text classifier, trained on 10% and
100% of the classification datasets, respectively. As can be seen,
with CNN and RoBERTa as classifiers, our method still outperforms
EDA and No-Aug. This shows that our method is agnostic to text
classifiers and can be leveraged to improve different text
classifiers.
Figure 2 shows how test accuracy changes with γ, where the model is
trained on 10% of IMDB. As can be seen, when γ increases from 0 to
0.5,
Dataset    Ours          EDA           No-Aug
IMDB       84.40±0.37    84.35±0.35    83.89±0.64
Yelp       91.12±0.29    90.38±0.39    90.23±0.39
Amazon     90.56±0.37    89.90±0.41    89.60±0.36
SST-2      87.73±0.86    87.10±0.83    87.57±0.74

Table 12: Comparison of our method with EDA and No-Aug, with RoBERTa
as the text classifier, trained on 10% of the classification
datasets.
Dataset    Ours          EDA           No-Aug
IMDB       89.06±0.07    88.59±0.10    88.50±0.14
Yelp       93.60±0.11    93.12±0.05    92.97±0.37
Amazon     92.71±0.04    92.33±0.06    92.20±0.14
SST-2      91.28±0.06    91.09±0.03    91.05±0.07

Table 13: Comparison of our method with EDA and No-Aug, with RoBERTa
as the text classifier, trained on 100% of the classification
datasets.
the classification accuracy increases. This is because a larger γ
causes the classification model to be trained more on the augmented
data. The augmented data provides additional training resources,
which can mitigate the lack of original data. However, as γ
continues to increase, the accuracy decreases. This is because an
excessively large γ places too much emphasis on augmented data and
too little on original data. Original data is less noisy than
augmented data and is therefore more valuable for training
high-quality models. Similar results can be observed in Figure 3,
where the model is trained on 10% of Amazon, SST-2, and Yelp.
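The role of γ can be illustrated with a small sketch. It assumes that the classification objective adds the loss on augmented data, scaled by γ, to the loss on original data; the function names are hypothetical, and the loss values are made-up numbers for illustration, not experimental results.

```python
def combined_classification_loss(
    loss_original: float, loss_augmented: float, gamma: float
) -> float:
    """Loss on original data plus gamma times the loss on augmented data.

    The additive form is an assumption made for this illustration.
    """
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    return loss_original + gamma * loss_augmented


def augmented_share(gamma: float) -> float:
    """Fraction of the total loss weight (1 + gamma) given to augmented data."""
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    return gamma / (1.0 + gamma)


if __name__ == "__main__":
    # Made-up loss values; only the relative weighting matters here.
    for gamma in (0.0, 0.5, 1.0, 5.0):
        total = combined_classification_loss(0.5, 0.8, gamma)
        share = augmented_share(gamma)
        print(f"gamma={gamma}: loss={total:.2f}, augmented share={share:.2f}")
```

With γ = 0 the augmented data is ignored, while for large γ the augmented share approaches 1 and the noisier augmented data dominates training, which is consistent with the rise and subsequent fall in accuracy described above.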
We also perform an ablation study which replaces the summarization
data with paraphrase data. The model S remains the same, which is
still distill-BART. The paraphrase data contains 3,900 sentence
pairs from the Microsoft Research Paraphrase Corpus (MRPC) (Dolan
and Brockett, 2005). These pairs are labeled as paraphrases by
humans. To ensure a fair comparison, we randomly sample 3,900 data
examples from CNN-DailyMail, denoted by CNNDM-3900. Table 16 shows
the results. As can be seen, CNNDM-3900 yields better performance
than MRPC. This shows that using the summarization model for
sentence augmentation is more effective than using the paraphrase
model. The possible reason is that a
Figure 2: How test accuracy changes with γ, where the model is
trained on 10% of IMDB.
summarization model discards less important information of the input
text while a paraphrasing model preserves most information of the
input text; as a result, augmentations generated by the
summarization model are less noisy and semantically more diverse
relative to the input texts.
4.6 Analysis of Learned Weights of
Summarization Data
Table 14 shows some randomly sampled summarization data examples
whose weights learned by our framework are close to zero when the
classification dataset is SST-2. Due to the space limit, we only
show the summaries. As can be seen, these data are primarily about
healthcare, energy, law, and politics, which have a large domain
discrepancy with SST-2, which is about movie reviews. This shows
that our framework is able to successfully identify out-of-domain
summarization data and exclude them from training the summarization
model.
Table 15 shows some randomly sampled summarization data examples
whose weights learned by our framework are close to one when the
classification dataset is SST-2. Due to the space limit, we only
show the summaries. As can be seen, these summarization data are
primarily about movies, recreation, and media, which are close to
SST-2 in terms of domain similarity.
When the classification dataset is IMDB, the weights of examples in
Table 14 are also close to zero, and the weights of examples in
Table 15 are also close to one. This is because IMDB and examples in
Table 15 are about movies while examples in Table 14 are not.
Given the learned weights, we split the summarization data into two
subsets: ONE, which contains examples whose weights are > 0.5, and
Figure 3: How test accuracy changes with γ, where the model is trained on 10% of Amazon, SST-2, and Yelp.
Half of states will expand Medicaid under Obamacare; half refuse or
are on the fence. Low-income citizens and their advocates say
Medicaid expansion necessary. States like Texas, Florida say a
Medicaid expansion is costly and will fail. Politics at play and
most states will eventually expand the program, political experts
say.

The country plans to remove all subsidies in five years. The price
of gasoline goes up four-folds. Iran has trillions of dollars in
natural resources but still struggles, experts say. The changes aim
to dampen domestic demand for fuel and oil, and bolster overall
revenues.

Attorneys for three kidnapped women have "confidence and faith" in
prosecution. Indictment lists use of chains, tape, vacuum cord
against women. Castro also faces 139 counts of rape, 177 counts of
kidnapping, one aggravated murder charge. A prosecutor's committee
will later consider whether seeking death penalty is appropriate.

North Korea's announcement of a planned satellite launch has
provoked alarm. Other countries say it is a way of testing missile
technology. The Japanese defense minister orders the preparation of
missile defenses.

Table 14: Some summarization data examples whose weights learned by
our framework are close to 0. The classification dataset is SST-2.
Hollywood often misrepresents careers and the workplace in movies.
Work is portrayed as easy or non-existent, and any outfit is
appropriate. This can have a negative effect on younger generations
deciding on careers.

Hollywood bringing back box-office juggernauts Iron Man, Batman and
Spider-Man. 2012 may be a peak year for superheroes on film, with
much more to come. Writer-director: CC Comics characters often face
more challenges than Marvel. "The genre has been popular for decades
and is here to stay," Boxofficeguru.com editor says.

Sam Smith wins best new artist, best pop vocal album. Beyonce now
has 20 Grammys, passing Aretha Franklin.

Ted Turner turns 75 years old this month. He founded CNN, the first
24-hour cable news network, in 1980. In 1990, Turner hired Wolf
Blitzer, host of CNN's "The Situation Room". Blitzer reflects on
what he learned from his former boss.

Table 15: Some summarization data examples whose weights learned by
our framework are close to 1. The classification dataset is SST-2.
ZERO, which contains examples whose weights
are ≤ 0.5. For each subset, we use it to train
a summarization model (no reweighting of data
during training), use the summarization model
Data       No-Aug        Summarize     Paraphrase
Yelp       68.15±3.77    75.06±0.62    74.33±0.47
IMDB       56.76±3.38    64.94±0.15    64.13±0.07
Amazon     66.03±1.44    67.36±0.08    66.39±0.14
SST-2      62.04±1.44    67.44±0.27    66.72±0.11

Table 16: Classification accuracy (%) on 10% of different datasets,
when using summarization data and paraphrase data to train the
augmentation model.
Data       No-Aug        ONE           ZERO
Yelp       68.15±3.77    77.27±0.96    69.92±2.09
IMDB       56.76±3.38    66.94±0.23    57.85±0.41
Amazon     66.03±1.44    69.24±0.15    67.74±0.22
SST-2      62.04±1.44    67.50±0.34    63.69±0.46

Table 17: Classification accuracy (%) on 10% of different datasets,
in the study of how summarization data weights affect downstream
classification performance.
to generate augmentations, and train the classification model on
augmented data (together with real data). Table 17 shows the
results. As can be seen, augmentations generated by the
summarization model trained on ONE improve classification
performance in most cases. In contrast, augmentations generated by
the summarization model trained on ZERO bring considerably smaller
gains over No-Aug.
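The split into ONE and ZERO described above can be sketched as follows. This is a minimal illustration: `split_by_weight` is a hypothetical helper name, and the example identifiers and weights are made up.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_by_weight(
    examples: Sequence[T], weights: Sequence[float], threshold: float = 0.5
) -> Tuple[List[T], List[T]]:
    """Split examples into ONE (weight > threshold) and ZERO (weight <= threshold)."""
    if len(examples) != len(weights):
        raise ValueError("examples and weights must have equal length")
    one: List[T] = []
    zero: List[T] = []
    for example, weight in zip(examples, weights):
        if weight > threshold:
            one.append(example)
        else:
            zero.append(example)
    return one, zero


if __name__ == "__main__":
    # Made-up example identifiers and learned weights.
    one, zero = split_by_weight(
        ["ex0", "ex1", "ex2", "ex3"], [0.9, 0.5, 0.1, 0.51]
    )
    print("ONE:", one)
    print("ZERO:", zero)
```

An example whose weight is exactly 0.5 falls into ZERO, matching the "> 0.5" and "≤ 0.5" conditions in the text.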
Using In-domain Data to Train a Summarization Model. We conduct an
experiment that uses in-domain data to train the summarization
model. Specifically, we replace the CNN-DailyMail dataset with a
movie summarization (MovSum) dataset crawled from IMDB. The MovSum
dataset contains 135K (movie synopsis, movie summary) pairs where
the summary (shorter) is treated as a summarization of the synopsis
(longer). MovSum is in the same domain as SST-2 since they are both
about movies. We randomly sample 135K examples from CNN-DailyMail
(denoted by CNNDM-135K) and train the summarization
Summarization Data    Classification Accuracy    Percentage of ones
MovSum                67.72±0.36                 84.5
CNNDM-135K            67.15±0.21                 39.3

Table 18: Results of our proposed method on SST-2 (10%) under
different summarization datasets.
model on this subset for a fair comparison. Classification is
conducted on SST-2 (using 10% of the training examples).
Table 18 shows the results. As can be seen, using MovSum to train
the summarization model leads to better classification performance
on SST-2. This is because, compared with CNNDM-135K, MovSum has a
higher domain similarity with SST-2. A summarization model trained
using in-domain data can generate in-domain augmentations, which are
more suitable for training the classification model. Furthermore, we
measure the percentage of ones in the learned weights of
summarization data examples. Under MovSum, the percentage is larger.
This is because more data examples in MovSum are in the same domain
as SST-2, compared with CNNDM-135K.
4.7 Analysis of Generated Augmentations
We measure the diversity of generated augmentations. Two types of
diversity measures are used: I) how different the generated
augmentation is from the input text; and II) how different the
sentences in a generated augmentation are from each other. We
measure type-I diversity in the following way: given an input text t
and an augmentation s generated from t, calculate the BLEU (Papineni
et al., 2002b) score between s and t. We measure type-II diversity
in the following way: for each pair of sentences in a generated
augmentation, calculate their BLEU score, then take the average over
all pairs. In the BLEU score, the number of grams is set to 4.
Table 19 shows the results. As can be seen, the BLEU scores are low
for both types of diversity, which indicates that the augmentations
generated by our method are diverse.
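The two diversity measures can be sketched as follows. This is a minimal, self-contained illustration and not the evaluation script used for Table 19: it assumes whitespace tokenization, unsmoothed BLEU with uniform weights over 1- to 4-grams, and one fixed candidate/reference order per sentence pair; the function names are hypothetical. Lower scores indicate higher diversity.

```python
import math
from collections import Counter
from itertools import combinations
from typing import List, Sequence


def ngram_counts(tokens: Sequence[str], n: int) -> Counter:
    """Count all n-grams of length n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Unsmoothed BLEU of a candidate against a single reference."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # Without smoothing, any zero precision gives BLEU 0.
        log_precision_sum += math.log(clipped / total)
    # Brevity penalty for candidates no longer than the reference.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - len(ref) / len(cand))
    return brevity_penalty * math.exp(log_precision_sum / max_n)


def type_one_score(input_text: str, augmentation: str) -> float:
    """Type-I: BLEU between an augmentation s and its input text t."""
    return bleu(augmentation, input_text)


def type_two_score(sentences: List[str]) -> float:
    """Type-II: average BLEU over all pairs of sentences in one augmentation."""
    pairs = list(combinations(sentences, 2))
    if not pairs:
        return 0.0
    return sum(bleu(a, b) for a, b in pairs) / len(pairs)
```

For instance, `type_one_score(t, s)` would be computed for every (input text, augmentation) pair and then averaged over a dataset.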
We also check whether generated augmentations are abstractive. We
randomly sample 300 generated augmentations and ask 5 undergraduates
to manually annotate whether they are abstractive (score 1) or
extractive (score 0). Table 20 shows the results. The average score
Dataset    Type-I    Type-II
Yelp       0.15      0.07
IMDB       0.22      0.06
Amazon     0.11      0.04
SST-2      0.13      0.06

Table 19: Diversity of generated augmentations.
Dataset    Score
Yelp       0.83
IMDB       0.79
Amazon     0.88
SST-2      0.92

Table 20: Abstractiveness of generated augmentations.
Model                           Rouge-2 F1    Rouge-L F1
GPT-2 (Radford et al., 2019)    8.27          26.58
Sum-Sep                         16.06         34.07
MTL-SST2                        11.06         28.44
Ours-SST2                       12.77         30.27
MTL-IMDB                        10.79         29.20
Ours-IMDB                       13.06         30.72
MTL-Yelp                        12.15         29.43
Ours-Yelp                       12.97         30.62
MTL-Amazon                      11.73         29.16
Ours-Amazon                     13.13         30.71

Table 21: Evaluation of summarization models.
is close to 1, indicating that the augmentations
generated by our method are abstractive.
4.8 Performance of Summarization Models
We report the performance of the summarization models in Table 21.
For our method and MTL, the reported scores are averages over
different percentages of classification training data, where the
classification model is LSTM. From this table, we make two
observations. First, our method outperforms GPT-2 (Radford et al.,
2019), a competitive model used for text summarization. This shows
that our method can generate meaningful summaries. Second, our
method performs worse than Sum-Sep. The reason is that in our
method, the summarization model is trained for the sake of
generating good augmentations for the classification model while
Sum-Sep is trained solely to maximize the summarization performance.
Our
goal is not to generate the best summaries, but rather to generate
the best augmentations for classification using the summarization
model.
5 Conclusions and Discussion
In this paper, we propose a three-level optimiza-
tion framework to perform text augmentation and
classification end-to-end. Our framework consists
of three learning stages performed end-to-end: 1) training a text
summarization model; 2) training a text classification model; and
3) updating weights of summarization examples by minimizing the
validation loss of the classification model.
Each learning stage corresponds to one level of the
optimization problem in the framework. The three
levels of optimization problems are nested and
solved in a unified way. Our framework enables
the augmentation process to be influenced by the
performance of the text classification task so that
the augmented texts are specifically suitable for
training the classification model. Experiments on
various datasets demonstrate the effectiveness of
our method.
Our framework can be extended to other downstream tasks beyond
classification, including but not limited to text-to-image
generation, visual question answering, dialog generation, etc. To
extend our framework to a downstream task T, we need to change the
loss functions in Eq. (2) and Eq. (3) to the loss of task T. For
example, to apply our framework to text-to-image generation, given
each text-image pair (t, i), we perform augmentation of the input
text t using the summarization model trained in Eq. (1), to get an
augmented text t̂. (t̂, i) would be treated as an augmented data
pair. Then we define GAN-based losses (Goodfellow et al., 2014) on
each (t̂, i) and each (t, i) to train a text-to-image generation
model. We plan to study such extensions in future work.
Our current framework has the following limitations. First, it
incurs additional computation and memory costs due to the usage of a
summarization model. Second, our method currently uses a text
summarization model to generate augmentations. Publicly available
summarization datasets are limited in size, which limits the
augmentation model's quality. We plan to address these limitations
in future work. To reduce memory and computation costs, we will
perform parameter sharing where the encoder weights of the
summarization model and those of the classification model are tied
together. To address the second limitation, we plan to leverage
data-rich text generation tasks for augmentation, such as using
machine translation models to perform back translation.
Acknowledgments
This work is partially supported by a gift fund
from CCF and Tencent. We also thank the editor
and the anonymous reviewers for their valuable
suggestions.
References
Jacob Andreas. 2020. Good-enough compositional data augmentation.
https://doi.org/10.18653/v1/2020.acl-main.676
Atilim Gunes Baydin, Robert Cornish, David Martínez-Rubio, Mark
Schmidt, and Frank D. Wood. 2017. Online learning rate adaptation
with hypergradient descent. CoRR, abs/1703.04782.
Jiaao Chen, Zichao Yang, and Diyi Yang. 2020. Mixtext:
Linguistically-informed interpolation of hidden space for
semi-supervised text classification. arXiv preprint
arXiv:2004.12239.
https://doi.org/10.18653/v1/2020.acl-main.194
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.
2018. BERT: Pre-training of deep bidirectional transformers for
language understanding. arXiv preprint arXiv:1810.04805.
William B. Dolan and Chris Brockett. 2005. Automatically
constructing a corpus of sentential paraphrases. In Proceedings of
the Third International Workshop on Paraphrasing (IWP2005).
Marzieh Fadaee, Arianna Bisazza, and Christof Monz. 2017. Data
augmentation for low-resource neural machine translation. arXiv
preprint arXiv:1705.00440.
https://doi.org/10.18653/v1/P17-2090
Steven Y. Feng, Varun Gangal, Jason Wei, Sarath
Chandar, Soroush Vosoughi, Teruko Mitamura,
and Eduard Hovy. 2021. A survey of data
augmentation approaches for NLP. arXiv preprint arXiv:2105.03075.
https://doi.org/10.18653/v1/2021.findings-acl.84
Matthias Feurer, Jost Springenberg, and Frank Hutter. 2015.
Initializing bayesian hyperparameter optimization via
meta-learning. In Proceedings of the AAAI Conference on Artificial
Intelligence, volume 29.
Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017.
Model-agnostic meta-learning for fast adaptation of deep networks.
In Proceedings of the 34th International Conference on Machine
Learning-Volume 70, pages 1126–1135. JMLR.org.
Mahak Gambhir and Vishal Gupta. 2017. Recent automatic text
summarization techniques: A survey. Artificial Intelligence Review,
47(1):1–66.
https://doi.org/10.1007/s10462-016-9475-9
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David
Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio.
2014. Generative adversarial nets. Advances in Neural Information
Processing Systems, 27.
Demi Guo, Yoon Kim, and Alexander M. Rush. 2020. Sequence-level
mixed sample data augmentation. arXiv preprint arXiv:2011.09039.
https://doi.org/10.18653/v1/2020.emnlp-main.447
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term
memory. Neural Computation, 9(8):1735–1780.
https://doi.org/10.1162/neco.1997.9.8.1735
Eduard Hovy, Laurie Gerber, Ulf Hermjakob, Chin-Yew Lin, and Deepak
Ravichandran. 2001. Toward semantics-based answer pinpointing. In
Proceedings of the First International Conference on Human Language
Technology Research.
https://doi.org/10.3115/1072133.1072221
Kushal Kafle, Mohammed Yousefhussien, and Christopher Kanan. 2017.
Data augmentation for visual question answering. In Proceedings of
the 10th International Conference on Natural Language Generation,
pages 198–202.
https://doi.org/10.18653/v1/W17-3529
Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for
stochastic optimization. arXiv preprint arXiv:1412.6980.
Sosuke Kobayashi. 2018. Contextual augmentation: Data augmentation
by words with paradigmatic relations. arXiv preprint
arXiv:1805.06201.
https://doi.org/10.18653/v1/N18-2072
Oleksandr Kolomiyets, Steven Bethard, and Marie-Francine Moens.
2011. Model-portability experiments for textual temporal analysis.
In Proceedings of the 49th Annual Meeting of the Association for
Computational Linguistics: Human Language Technologies, volume 2,
pages 271–276. ACL; East Stroudsburg, Pennsylvania.
Varun Kumar, Ashutosh Choudhary, and Eunah Cho. 2021. Data
augmentation using pre-trained transformer models.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad,
Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer.
2019. BART: Denoising sequence-to-sequence pre-training for natural
language generation, translation, and comprehension. arXiv preprint
arXiv:1910.13461.
https://doi.org/10.18653/v1/2020.acl-main.703
Xin Li and Dan Roth. 2002. Learning question classifiers. In COLING
2002: The 19th International Conference on Computational
Linguistics.
https://doi.org/10.3115/1072228.1072378
Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2018. Darts:
Differentiable architecture search. arXiv preprint
arXiv:1806.09055.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi
Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin
Stoyanov. 2019. RoBERTa: A robustly optimized bert pretraining
approach. arXiv preprint arXiv:1907.11692.
Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew
Y. Ng, and Christopher Potts. 2011. Learning word vectors for
sentiment analysis. In Proceedings of the 49th Annual Meeting of
the Association for Computational Linguistics: Human Language
Technologies, pages 142–150, Portland,
Oregon, USA. Association for Computational Linguistics.
Julian McAuley and Jure Leskovec. 2013. Hidden factors and hidden
topics: Understanding rating dimensions with review text. In
Proceedings of the 7th ACM Conference on Recommender Systems,
pages 165–172.
https://doi.org/10.1145/2507157.2507163
Junghyun Min, R. Thomas McCoy, Dipanjan Das, Emily Pitler, and Tal
Linzen. 2020. Syntactic data augmentation increases robustness to
inference heuristics.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002a.
BLEU: A method for automatic evaluation of machine translation. In
ACL.
https://doi.org/10.3115/1073083.1073135
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002b.
BLEU: A method for automatic evaluation of machine translation. In
Proceedings of the 40th Annual Meeting of the Association for
Computational Linguistics, pages 311–318. Association for
Computational Linguistics.
https://doi.org/10.3115/1073083.1073135
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei,
and Ilya Sutskever. 2019. Language models are unsupervised
multitask learners. OpenAI blog, 1(8):9.
Zhongzheng Ren, Raymond Yeh, and Alexander Schwing. 2020. Not all
unlabeled data are equal: Learning to weight data in
semi-supervised learning. In Advances in Neural Information
Processing Systems, volume 33, pages 21786–21797. Curran
Associates, Inc.
Gözde Gül Şahin and Mark Steedman. 2019. Data augmentation via
dependency tree morphing for low-resource languages. arXiv preprint
arXiv:1903.09460.
https://doi.org/10.18653/v1/D18-1545
Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to
the point: Summarization with pointer-generator networks. In
Proceedings of the 55th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), pages 1073–1083,
Vancouver, Canada. Association for Computational Linguistics.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Improving
neural machine translation models with monolingual data. arXiv
preprint arXiv:1511.06709.
https://doi.org/10.18653/v1/P16-1009
Sam Shleifer and Alexander M. Rush. 2020. Pre-trained summarization
distillation. arXiv preprint arXiv:2010.13002.
Jun Shu, Qi Xie, Lixuan Yi, Qian Zhao, Sanping Zhou, Zongben Xu,
and Deyu Meng. 2019. Meta-weight-net: Learning an explicit mapping
for sample weighting. In Advances in Neural Information Processing
Systems, pages 1919–1930.
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher
D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep
models for semantic compositionality over a sentiment treebank. In
Proceedings of the 2013 Conference on Empirical Methods in Natural
Language Processing, pages 1631–1642, Seattle, Washington, USA.
Association for Computational Linguistics.
Felipe Petroski Such, Aditya Rawal, Joel Lehman, Kenneth O.
Stanley, and Jeff Clune. 2019. Generative teaching networks:
Accelerating neural architecture search by learning to generate
synthetic training data. CoRR, abs/1912.07768.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion
Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017.
Attention is all you need. In Advances in Neural Information
Processing Systems, pages 5998–6008.
William Yang Wang and Diyi Yang. 2015. That's so annoying!!!: A
lexical and frame-semantic embedding based data augmentation
approach to automatic categorization of annoying behaviors using
#petpeeve tweets. In Proceedings of the 2015 Conference on
Empirical Methods in Natural Language Processing, pages 2557–2563.
https://doi.org/10.18653/v1/D15-1306
Xinyi Wang, Hieu Pham, Zihang Dai, and Graham Neubig. 2018.
Switchout: An efficient data augmentation algorithm for neural
machine translation. arXiv preprint arXiv:1808.07512.
https://doi.org/10.18653/v1/D18-1100
Yulin Wang, Jiayi Guo, Shiji Song, and Gao Huang. 2020. Meta-semi:
A meta-learning approach for semi-supervised learning. CoRR,
abs/2007.02394.
Jason Wei and Kai Zou. 2019. EDA: Easy data augmentation techniques
for boosting performance on text classification tasks. arXiv
preprint arXiv:1901.11196.
https://doi.org/10.18653/v1/D19-1670
Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level
convolutional networks for text classification. arXiv:1509.01626
[cs].
Guoqing Zheng, Ahmed Hassan Awadallah, and Susan T. Dumais. 2019.
Meta label correction for learning with weak supervision. CoRR,
abs/1911.03809.