A Multi-Level Optimization Framework for End-to-End Text Augmentation

Sai Ashish Somayajula
UC San Diego, USA
ssomayaj@ucsd.edu

Linfeng Song
Tencent AI Lab, USA
lfsong@tencent.com

Pengtao Xie∗
UC San Diego, USA
p1xie@eng.ucsd.edu

Abstract

Text augmentation is an effective technique for alleviating overfitting in NLP tasks. In existing methods, text augmentation and downstream tasks are mostly performed separately. As a result, the augmented texts may not be optimal for training the downstream model. To address this problem, we propose a three-level optimization framework to perform text augmentation and the downstream task end-to-end. The augmentation model is trained in a way tailored to the downstream task. Our framework consists of three learning stages. A text summarization model is trained to perform data augmentation at the first stage. Each summarization example is associated with a weight to account for its domain difference with the text classification data. At the second stage, we use the model trained at the first stage to perform text augmentation and train a text classification model on the augmented texts. At the third stage, we evaluate the text classification model trained at the second stage and update the weights of summarization examples by minimizing the validation loss. These three stages are performed end-to-end. We evaluate our method on several text classification datasets, where the results demonstrate the effectiveness of our method. Code is available at https://github.com/Sai-Ashish/End-to-End-Text-Augmentation.

1 Introduction

Data augmentation (Sennrich et al., 2015; Fadaee et al., 2017; Wei and Zou, 2019) is an effective technique for mitigating the deficiency of training data and preventing overfitting. In natural language processing, many data augmentation methods have been proposed, such as back translation (Sennrich et al., 2015), synonym replacement (Wang and Yang, 2015), random insertion (Wei and Zou, 2019), and so on. In existing approaches, data augmentation and downstream tasks are performed separately: augmented texts are created first, then they are used to train a downstream model. The downstream task does not influence the augmentation process. As a result, the augmented texts may not be optimal for training the downstream model.

∗Corresponding author

In this paper, we aim to address this problem. We propose an end-to-end learning framework based on multi-level optimization (Feurer et al., 2015), which performs data augmentation and downstream tasks in a unified manner, where not only do augmented texts influence the training of the downstream model, but the performance of the downstream task also affects how data augmentation is performed.

In our framework, we use a text summarization (Gambhir and Gupta, 2017) model to perform data augmentation. Given an original text $t$ with class label $c$, we feed $t$ into the summarization model to generate a summary $s$. We set the class label of $s$ to be $c$. $(s, c)$ is treated as an augmented text-label pair of $(t, c)$. The motivation for using a summarization model for text augmentation is two-fold. First, the major semantics of an original text is preserved in its summary; therefore, it is sensible to assign the class label of the original text to its summary. Second, the summary excludes non-essential details in the original text; as a result, the semantic diversity between the summary and the original text is rich, which well serves the purpose of creating diverse augmentations. The summarization model is trained on a summarization dataset $\{(t_i, s_i)\}_{i=1}^{M}$, where $t_i$ is an original text and $s_i$ is the corresponding summary. For the downstream task, we assume it is text classification. We assume there is a text classification training set $\{(x_i^{(tr)}, y_i^{(tr)})\}_{i=1}^{N^{(tr)}}$, where $x_i^{(tr)}$ is an input text and $y_i^{(tr)}$ is the corresponding class label, and there is a text classification validation set $\{(x_i^{(val)}, y_i^{(val)})\}_{i=1}^{N^{(val)}}$.
Transactions of the Association for Computational Linguistics, vol. 10, pp. 343–358, 2022. https://doi.org/10.1162/tacl_a_00464
Action Editor: Dani Yogatama. Submission batch: 09/2021; Revision batch: 12/2021; Published 4/2022.
© 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.


Our framework consists of three learning stages that are performed end-to-end. We train a text summarization model on the summarization dataset at the first stage. Considering the domain difference between the summarization data and the text classification data, we associate each summarization training pair with a weight $a \in [0, 1]$. A smaller $a$ indicates a larger domain difference between this summarization pair and the text classification data, so this pair should be down-weighted during the training of the summarization model. These weights are tentatively fixed at this stage and will be updated later. At the second stage, we use the trained summarization model to perform text augmentation for the classification dataset and train a text classification model on the augmented and original datasets. At the third stage, we validate the classification model trained at the second stage and update the weights of summarization training examples by minimizing the validation loss. The three stages are performed end-to-end, where they mutually influence each other. We evaluate our framework on several text classification datasets. Various experiments demonstrate the effectiveness of our method.

The major contributions of this work include:

• We propose a three-level optimization framework to perform text augmentation in an end-to-end manner. Our framework consists of three learning stages that mutually influence each other: 1) training a text summarization model; 2) training a text classification model; 3) updating the weights of summarization data by minimizing the validation loss of the classification model.

• Experiments on various datasets demonstrate the effectiveness of our framework.

The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 introduces the method. Section 4 gives experimental results. Section 5 concludes the paper.

2 Related Work

2.1 Data Augmentation in NLP

As an effective way of mitigating the deficiency of training data, data augmentation has been broadly studied in NLP (Feng et al., 2021). Sennrich et al. (2015) proposed a back translation method for data augmentation, which improves the BLEU (Papineni et al., 2002a) scores in machine translation (MT). The back translation technique first translates the sentences into another language and then translates them back into the original language to augment the original text.

Fadaee et al. (2017) propose a data augmentation method for low-frequency words. Specifically, the method generates new sentence pairs that contain rare words. Kafle et al. (2017) introduce two data augmentation methods for visual question answering. The first method uses semantic annotations to augment the questions. The second technique generates new questions from images using an LSTM network (Hochreiter and Schmidhuber, 1997). Wang and Yang (2015) propose an augmentation technique that replaces query words with their synonyms. Synonyms are retrieved based on cosine similarities calculated on word embeddings. Kolomiyets et al. (2011) propose to augment data by replacing temporal expression words with their corresponding synonyms. They use the vocabulary from the Latent Words Language Model (LWLM) and WordNet.

Şahin and Steedman (2019) propose two text augmentation techniques based on dependency trees. The first technique crops the sentences by discarding dependency links. The second technique rotates sentences by tree fragments that are pivoted at the root. Chen et al. (2020) propose augmenting texts by interpolating input texts in a hidden space. Wang et al. (2018) propose augmenting sentences by randomly replacing words in input and target sentences with words from the vocabulary. SeqMix (Guo et al., 2020) proposes to create augmentations by softly merging input/target sequences.

EDA (Wei and Zou, 2019) uses four operations to produce data augmentation: synonym replacement, random insertion, random swap, and random deletion. Kobayashi (2018) proposes to replace words stochastically with words predicted by a bi-directional language model. Andreas (2020) proposes a compositional data augmentation approach that constructs a synthetic training example by replacing text fragments in a real example with other fragments appearing in similar contexts. Kumar et al. (2021) apply pretrained Transformer models including GPT-2, BERT, and BART for conditional data augmentation, where the concatenation of class labels and input texts is fed into these pretrained models to generate augmented texts. Kumar et al. (2021) propose a language model-based data augmentation method. This approach first finetunes a language model on limited training data, then feeds class labels into the finetuned model to generate augmented sentences. Min et al. (2020) explore several syntactically informative augmentation methods by applying syntactic transformations to original sentences and show that subject/object inversion could increase robustness to inference heuristics.

2.2 Bi-level Optimization

Many NLP applications (Feurer et al., 2015; Baydin et al., 2017; Finn et al., 2017; Liu et al., 2018; Shu et al., 2019; Zheng et al., 2019) are based on bi-level optimization (BLO), such as neural architecture search (Liu et al., 2018), data selection (Shu et al., 2019; Ren et al., 2020; Wang et al., 2020), meta learning (Finn et al., 2017), hyperparameter tuning (Feurer et al., 2015), label correction (Zheng et al., 2019), training data generation (Such et al., 2019), learning rate adaptation (Baydin et al., 2017), and so on. In these BLO-based applications, model parameters are learned by minimizing a training loss in an inner optimization problem, while meta parameters are learned by minimizing a validation loss in an outer optimization problem. In these applications, meta parameters are neural architectures, weights of data examples, hyperparameters, and so on.

3 Method

This section proposes a three-level optimization framework to perform end-to-end text augmentation.

3.1 Overview

We assume the target task is text classification. We train a BERT-based (Devlin et al., 2018) text classifier on a training set $D_c^{(tr)} = \{(x_i^{(tr)}, y_i^{(tr)})\}_{i=1}^{N^{(tr)}}$, where $x_i^{(tr)}$ is an input text and $y_i^{(tr)}$ is the corresponding class label. Meanwhile, we have access to a classification validation set $D_c^{(val)} = \{(x_i^{(val)}, y_i^{(val)})\}_{i=1}^{N^{(val)}}$. In many application scenarios, the training data is limited, which incurs a high risk of overfitting. To address this problem, we perform data augmentation of the training data to enlarge the number of training examples. We use a text summarization model to perform data augmentation. Given an original training pair $(x_i^{(tr)}, y_i^{(tr)})$, we feed the input text $x_i^{(tr)}$ into the text summarization model and get a summary $s_i$. Because $s_i$ preserves the major semantics of $x_i^{(tr)}$, we can assign the class label $y_i^{(tr)}$ of $x_i^{(tr)}$ to $s_i$. In the end, we obtain an augmented training pair $(s_i, y_i^{(tr)})$. This process can be applied to every original training example to create corresponding augmented training examples.

To enable the text summarization model and the text classifier to influence and benefit from each other mutually, we develop a three-level optimization framework to train these two models end-to-end. Our framework consists of three learning stages performed in a unified manner. In the first stage, we train the text summarization model. At the second stage, we use the summarization model trained at the first stage to perform text augmentation and train the classifier on the augmented examples. At the third stage, we evaluate the classifier on a validation set and update the weights of summarization training examples by minimizing the validation loss. Figure 1 shows an overview of our framework. Next, we describe the three stages in detail.

Figure 1: Overview of our framework.

3.2 Stage I

At the first stage, we train the text summarization model. We use BART (Lewis et al., 2019) to perform summarization. BART is a pretrained Transformer (Vaswani et al., 2017) model consisting of an encoder and a decoder. The encoder takes a text as input, and the decoder generates a summary of the text. Let $S$ denote the summarization model. The training data is $D_s = \{(t_i, s_i)\}_{i=1}^{M}$, where $t_i$ is an input text and $s_i$ is the corresponding summary. Often, the summarization dataset $D_s$ has a domain shift with the classification dataset $D_c$. For an example in $D_s$, if its domain difference with the classification dataset is large, the summarization model trained on this example may not be suitable for performing data augmentation for $D_c$. To address this problem, we associate each example in $D_s$ with a weight $a \in [0, 1]$. If $a$ is close to 0, it means that the domain difference between this example and $D_c$ is large, and this example should be down-weighted during the training of the summarization model.

At this stage, we solve the following optimization problem:

$$S^*(A) = \min_{S} \sum_{i=1}^{M} a_i \, l(S, t_i, s_i) \tag{1}$$

where $A = \{a_i\}_{i=1}^{M}$ and $l(\cdot)$ is the teacher-forcing loss. The loss of $(t_i, s_i)$ is weighted by the weight $a_i$ of this example. If $(t_i, s_i)$ has a large domain difference with $D_c$, $a_i$ should be close to 0; then $a_i \, l(S, t_i, s_i)$ is made close to 0, which effectively excludes $(t_i, s_i)$ from the training process. The optimally trained model $S^*$ depends on $A$, since $S^*$ depends on the loss function and the loss function depends on $A$. $A$ is tentatively fixed at this stage and will be updated at a later stage. $A$ cannot be updated at this stage; otherwise, a trivial solution would be yielded where all values in $A$ are 0.
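The weighted objective in Eq. (1) can be illustrated with a minimal Python sketch. The per-example loss values and weights below are hypothetical placeholders, not values from the experiments, and the function name `weighted_summarization_loss` is introduced here only for illustration. The last lines show why the weights cannot be minimized jointly with the model at this stage: setting every weight to 0 drives the objective to 0 regardless of the model.

```python
from typing import Sequence


def weighted_summarization_loss(per_example_losses: Sequence[float],
                                weights: Sequence[float]) -> float:
    """Return sum_i a_i * l_i, the Stage I objective for fixed weights A."""
    if len(per_example_losses) != len(weights):
        raise ValueError("need exactly one weight per summarization example")
    if any(a < 0.0 or a > 1.0 for a in weights):
        raise ValueError("each weight must lie in [0, 1]")
    return sum(a * l for a, l in zip(weights, per_example_losses))


# Hypothetical teacher-forcing losses of three summarization pairs.
losses = [2.0, 0.5, 1.5]

# The second pair is treated as far from the classification domain (a = 0),
# so it contributes nothing to the training objective.
weights = [1.0, 0.0, 0.5]
objective = weighted_summarization_loss(losses, weights)

# If A were also minimized at this stage, the all-zero weighting would win.
trivial_objective = weighted_summarization_loss(losses, [0.0, 0.0, 0.0])

print(objective, trivial_objective)
```

In a real implementation the per-example losses would come from the summarization model's token-level cross-entropy under teacher forcing; only the weighting step is sketched here.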

3.3 Stage II

At the second stage, we use the summarization model $S^*(A)$ trained at the first stage to perform data augmentation of $D_c^{(tr)} = \{(x_i^{(tr)}, y_i^{(tr)})\}_{i=1}^{N^{(tr)}}$. For each $x_i^{(tr)}$ in $D_c^{(tr)}$, we feed it into $S^*(A)$ to generate a summary $g(x_i^{(tr)}, S^*(A))$. In the end, we obtain an augmented dataset $G(D_c^{(tr)}, S^*(A)) = \{(g(x_i^{(tr)}, S^*(A)), y_i^{(tr)})\}_{i=1}^{N^{(tr)}}$. We train a BERT-based text classifier $C$ on the original data $D_c^{(tr)}$ and the augmented data $G(D_c^{(tr)}, S^*(A))$.

At this stage, we solve the following optimization problem:

$$C^*(S^*(A)) = \min_{C} \; L(C, D_c^{(tr)}) + \gamma L(C, G(D_c^{(tr)}, S^*(A))) \tag{2}$$

where $L(\cdot)$ denotes a cross-entropy classification loss and $\gamma$ is a tradeoff parameter. The first loss term is defined on the original training dataset, and the second loss term is defined on the augmented training dataset. The optimally trained classifier $C^*$ depends on $S^*(A)$, since $C^*$ depends on the training loss, which depends on $S^*(A)$.
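The Stage II data flow can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: `first_sentence_summarizer` is a hypothetical stand-in for the trained summarization model (it simply keeps the first sentence), the two training examples are invented, and `combined_loss` only mirrors the form of Eq. (2) for already-computed loss values.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, int]  # (text, class label)


def augment_with_summaries(train_set: List[Example],
                           summarize: Callable[[str], str]) -> List[Example]:
    """Build the augmented set: each summary inherits the label of its source text."""
    return [(summarize(text), label) for text, label in train_set]


def combined_loss(loss_original: float, loss_augmented: float,
                  gamma: float = 1.0) -> float:
    """Loss on the original data plus gamma times the loss on the augmented data."""
    return loss_original + gamma * loss_augmented


def first_sentence_summarizer(text: str) -> str:
    """Hypothetical stand-in for the summarizer: keep only the first sentence."""
    return text.split(". ")[0].rstrip(".") + "."


# Invented toy training examples (1 = positive, 0 = negative).
train_set = [
    ("Great phone. Battery lasts two days.", 1),
    ("Terrible service. We waited an hour.", 0),
]
augmented = augment_with_summaries(train_set, first_sentence_summarizer)

# Hypothetical loss values, combined with gamma = 0.5.
total = combined_loss(0.5, 0.25, gamma=0.5)
print(augmented, total)
```

The classifier is then trained on both `train_set` and `augmented`; how the two losses are batched together is an implementation choice that the formulation leaves open.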

3.4 Stage III

At the third stage, we evaluate the classifier trained at the second stage on the classification validation set $D_c^{(val)} = \{(x_i^{(val)}, y_i^{(val)})\}_{i=1}^{N^{(val)}}$ and update the weights $A$ by minimizing the validation loss. In this stage, we solve the following optimization problem:

$$\min_{A} \; L(C^*(S^*(A)), D_c^{(val)}) \tag{3}$$

3.5 A Three-Level Optimization Framework

Putting all pieces together, we have the following three-level optimization framework.

$$\begin{aligned}
\min_{A} \quad & L(C^*(S^*(A)), D_c^{(val)}) \\
\text{s.t.} \quad & C^*(S^*(A)) = \min_{C} \; L(C, D_c^{(tr)}) + \gamma L(C, G(D_c^{(tr)}, S^*(A))) \\
& S^*(A) = \min_{S} \sum_{i=1}^{M} a_i \, l(S, t_i, s_i)
\end{aligned} \tag{4}$$

There are three optimization problems in this framework, each corresponding to a learning stage. From bottom to top, the optimization problems correspond to learning stages I, II, and III, respectively. The first two optimization problems are nested in the constraints of the third optimization problem. These three stages are conducted end-to-end in this unified framework. The solution $S^*(A)$ obtained in the first stage is used to perform text augmentation in the second stage. The classification model trained in the second stage is used to make predictions at the third stage. The importance weights $A$ updated in the third stage change the training loss in the first stage and consequently change the solution $S^*(A)$, which subsequently changes $C^*(S^*(A))$.
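The chain of dependencies (the weights determine the summarizer, the summarizer determines the classifier, and the classifier determines the validation loss) can be made concrete with a toy scalar analogue. This is only a sketch: the "models" are single numbers, the losses are squared errors chosen so that each stage has a closed-form minimizer, and all numeric values are hypothetical. It is not the training procedure used for BART or the classifier.

```python
from typing import Sequence


def stage1_summarizer(weights: Sequence[float], targets: Sequence[float]) -> float:
    """Minimizer of sum_i a_i * (S - t_i)^2, i.e. the weighted mean of the targets."""
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("at least one weight must be positive")
    return sum(a * t for a, t in zip(weights, targets)) / total


def stage2_classifier(x_train: float, summary: float, gamma: float) -> float:
    """Minimizer of (C - x_train)^2 + gamma * (C - summary)^2."""
    return (x_train + gamma * summary) / (1.0 + gamma)


def stage3_validation_loss(classifier: float, x_val: float) -> float:
    """Squared error of the toy classifier on a single validation point."""
    return (classifier - x_val) ** 2


def validation_loss_for_weights(weights, targets, x_train, x_val, gamma=1.0):
    """Run the three stages in order for a fixed weighting A."""
    s_star = stage1_summarizer(weights, targets)
    c_star = stage2_classifier(x_train, s_star, gamma)
    return stage3_validation_loss(c_star, x_val)


targets = [0.0, 3.0]  # toy "summarization data"
loss_uniform = validation_loss_for_weights([1.0, 1.0], targets, x_train=1.0, x_val=2.0)
loss_reweighted = validation_loss_for_weights([0.0, 1.0], targets, x_train=1.0, x_val=2.0)
print(loss_uniform, loss_reweighted)
```

In this toy setting, down-weighting the first summarization example lowers the validation loss, which is the kind of signal the third stage uses to update the weights.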

3.6 Optimization Algorithm

In this section, we develop a gradient-based optimization algorithm to solve the problem defined in Eq. (4). Drawing inspiration from Liu et al. (2018), we approximate $S^*(A)$ using a one-step gradient descent update of $S$:

$$S^*(A) \approx S' = S - \eta_s \nabla_{S} \sum_{i=1}^{M} a_i \, l(S, t_i, s_i) \tag{5}$$
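A minimal sketch of the one-step approximation in Eq. (5), using a scalar parameter and the toy loss l(S, t) = 0.5 (S - t)^2 as an assumed stand-in for the teacher-forcing loss. Because the update is linear in the weights, the derivative of S' with respect to each a_i is simply -eta_s times the gradient of that example's loss, which is what makes the later gradient with respect to A computable. The function names and numbers are illustrative assumptions.

```python
from typing import List, Sequence


def grad_example_loss(S: float, t: float) -> float:
    """Gradient w.r.t. S of the toy loss 0.5 * (S - t)^2."""
    return S - t


def one_step_update(S: float, A: Sequence[float], targets: Sequence[float],
                    lr: float) -> float:
    """S' = S - lr * grad_S sum_i a_i * l(S, t_i)."""
    grad = sum(a * grad_example_loss(S, t) for a, t in zip(A, targets))
    return S - lr * grad


def d_updated_d_weights(S: float, targets: Sequence[float], lr: float) -> List[float]:
    """Analytic derivative of S' w.r.t. each a_i: -lr * grad_S l(S, t_i)."""
    return [-lr * grad_example_loss(S, t) for t in targets]


S, targets, lr = 1.0, [0.0, 3.0], 0.5
A = [1.0, 0.25]

S_prime = one_step_update(S, A, targets, lr)
jacobian = d_updated_d_weights(S, targets, lr)

# Numerical check of the second entry by perturbing a_2 (exact here, since
# S' is linear in A for a fixed S).
eps = 0.5
S_prime_perturbed = one_step_update(S, [A[0], A[1] + eps], targets, lr)
numeric_derivative = (S_prime_perturbed - S_prime) / eps
print(S_prime, jacobian, numeric_derivative)
```

For a neural summarizer the same structure holds with S as a parameter vector; the derivative with respect to A then becomes a matrix, which is why the matrix-vector products are approximated rather than formed explicitly.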


We plug $S^*(A) \approx S'$ into the objective function at the second stage and get an approximate objective. We approximate $C^*(S^*(A))$ using a one-step gradient descent update of $C$:

$$C^*(S^*(A)) \approx C' = C - \eta_c \nabla_{C} \left( L(C, D_c^{(tr)}) + \gamma L(C, G(D_c^{(tr)}, S')) \right) \tag{6}$$

Finally, we plug $C^*(S^*(A)) \approx C'$ into the validation loss and get an approximated objective. Then we update $A$ by gradient descent:

$$A \leftarrow A - \eta_a \nabla_{A} L(C', D_c^{(val)}) \tag{7}$$

where

$$\begin{aligned}
\nabla_{A} L(C', D_c^{(val)}) &= \frac{\partial S'}{\partial A} \, \frac{\partial C'}{\partial S'} \, \frac{\partial L(C', D_c^{(val)})}{\partial C'} \\
&= \eta_s \eta_c \gamma \, \nabla^2_{A,S} \sum_{i=1}^{M} a_i \, l(S, t_i, s_i) \; \nabla^2_{S',C} L(C, G(D_c^{(tr)}, S')) \; \nabla_{C'} L(C', D_c^{(val)})
\end{aligned} \tag{8}$$

Eq. (8) involves an expensive matrix-vector product, whose computational complexity can be reduced by a finite difference approximation. Eq. (8) can be approximated as:

$$\nabla_{A} L(C', D_c^{(val)}) \approx \frac{\eta_s \eta_c \gamma}{2\alpha} \, \nabla^2_{A,S} \sum_{i=1}^{M} a_i \, l(S, t_i, s_i) \left[ \nabla_{S'} L(C^{+}, G(D_c^{(tr)}, S')) - \nabla_{S'} L(C^{-}, G(D_c^{(tr)}, S')) \right] \tag{9}$$

where

$$\alpha = \frac{0.01}{\left\| \nabla_{C'} L(C', D_c^{(val)}) \right\|_2}, \qquad C^{\pm} = C \pm \alpha \nabla_{C'} L(C', D_c^{(val)})$$

The matrix-vector multiplication in Eq. (9) can be further approximated by:

$$\frac{1}{\alpha_S^{\pm}} \left\{ \nabla_{A} \sum_{i=1}^{M} a_i \, l(S^{+}_{\pm}, t_i, s_i) - \nabla_{A} \sum_{i=1}^{M} a_i \, l(S^{-}_{\pm}, t_i, s_i) \right\} \tag{10}$$

where

$$\alpha_S^{\pm} = \frac{0.01}{\left\| \nabla_{S'} L(C^{\pm}, G(D_c^{(tr)}, S')) \right\|_2},$$

$$S^{\pm}_{+} = S \pm \alpha_S^{+} \nabla_{S'} L(C^{+}, G(D_c^{(tr)}, S')), \qquad S^{\pm}_{-} = S \pm \alpha_S^{-} \nabla_{S'} L(C^{-}, G(D_c^{(tr)}, S'))$$

Here the subscript of $S$ indicates whether the perturbation direction is computed with $C^{+}$ or $C^{-}$, and the superscript indicates the sign of the perturbation.

These update steps iterate until convergence. The overall algorithm is summarized in Algorithm 1.

Algorithm 1 Optimization algorithm
    while not converged do
        Update weight parameters S using Eq. (5)
        Update weight parameters C using Eq. (6)
        Update meta parameters A using Eq. (7)
    end while

4 Experiments

In this section, we report experimental results. The key takeaways include: 1) our proposed end-to-end augmentation method performs better than baselines that conduct augmentation and classification separately; 2) our framework is agnostic to the choices of classification models and can be applied to improve a variety of classifiers; 3) our framework is particularly effective when the number of training examples is small; 4) our framework is more effective when the length of input texts is large.

4.1 Dataset

For the text summarization data, we use the CNN-DailyMail dataset (See et al., 2017). It contains summaries of around 300k news articles. The full CNN-DailyMail training set is used for training the summarization model. We used four text classification datasets: 1) IMDB, containing movie reviews (Maas et al., 2011); 2) Yelp Review, a binary sentiment classification dataset (Zhang et al., 2015); 3) SST-2, the Stanford Sentiment Treebank (Socher et al., 2013); and 4) Amazon Review, a product review dataset from Amazon (McAuley and Leskovec, 2013). The split statistics of these datasets are summarized in Table 1.

Train Validation Test
Dataset
25k
6.7k
6.7k
IMDB
10k
10k
38k
Yelp Review
10k
10k
SST-2
10k
Amazon Review 10k
872
10k

Table 1: Split statistics of the classification datasets.

4.2 Baselines

We compare our method with the following baseline methods.


• No-Aug: No data augmentation is applied.

• EDA (Wei and Zou, 2019): Augmented sentences are created by randomly applying the following operations: synonym replacement, random insertion, random swap, and random deletion.

• GECA (Andreas, 2020): A compositional data augmentation approach that constructs a synthetic training example by replacing text fragments in a real example with other fragments appearing in similar contexts.

• LAMBADA (Kumar et al., 2021): A language model-based data augmentation method. This approach first finetunes a language model on real training data, then feeds class labels into the finetuned model to generate augmented sentences.

• Sum-Sep: Summarization-based augmentation and text classification are performed separately. We first train a summarization model, use it to perform data augmentation, then train the classification model on the augmented training data.

• Multi-task learning (MTL): In this baseline, the summarization model and classification model are trained by minimizing a single objective function, which is the weighted sum of the summarization and classification losses. The corresponding formulation is:

$$\begin{aligned}
\min_{A} \quad & L(C^*(A), D_c^{(val)}) \\
\text{s.t.} \quad & C^*(A) = \min_{C, S} \; L(C, D_c^{(tr)}) + \gamma L(C, G(D_c^{(tr)}, S)) + \lambda \sum_{i=1}^{M} a_i \, l(S, t_i, s_i)
\end{aligned}$$

4.3 Hyperparameter Settings

For the text classification model, we use the one in EDA (Wei and Zou, 2019). It contains an input layer, a bi-directional hidden layer with 64 LSTM (Hochreiter and Schmidhuber, 1997) units, a dropout layer with a probability of 0.5, another bi-directional layer with 32 LSTM units, another dropout layer with a probability of 0.5, ReLU activation, a dense layer with 20 hidden units, and a softmax output layer. We set the maximum text length to 150. The loss function is cross-entropy loss. Model parameters are optimized using the Adam (Kingma and Ba, 2014) optimizer, with an epsilon of $10^{-8}$. In Adam, $\beta_1$ and $\beta_2$ are set to 0.9 and 0.999, respectively. The learning rate is a constant $10^{-3}$. The batch size used is 8. For the importance weights $A$ of summarization data examples, we optimize them using an Adam optimizer, with a weight decay of $10^{-3}$. Epsilon, $\beta_1$, and $\beta_2$ are set to $10^{-8}$, 0.5, and 0.999, respectively. The learning rate is $3 \times 10^{-4}$. The tradeoff parameter $\gamma$ is set to 1 for all experiments, unless otherwise stated.

For the summarization model, we use the distilled BART model (Shleifer and Rush, 2020). It has six encoder layers and six decoder layers. We set the maximum text length for the article to 1024 and for the summary to 75. We use an SGD optimizer with a momentum of 0.9 and a learning rate of $10^{-3}$. We use a cosine annealing learning rate scheduler with the minimum learning rate set to $5 \times 10^{-4}$. We randomly sample a subset of CNN/DailyMail to train the summarization model. We balance the CNN/DailyMail data and the classification data by making the number of sampled CNN/DailyMail examples roughly the same as the number of classification data examples.
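To make the learning rate schedule concrete, the sketch below evaluates the standard cosine annealing formula, lr_t = lr_min + 0.5 (lr_max - lr_min)(1 + cos(pi t / T)), with the stated maximum of 10^-3 and minimum of 5 x 10^-4. The total number of steps T is not given in the text, so the value 100 used here is an assumption for illustration, as is the function name.

```python
import math


def cosine_annealed_lr(step: int, total_steps: int,
                       lr_max: float = 1e-3, lr_min: float = 5e-4) -> float:
    """Cosine annealing from lr_max at step 0 down to lr_min at total_steps."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    step = min(max(step, 0), total_steps)  # clamp to the schedule range
    cosine = math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)


# total_steps = 100 is an assumed horizon, not a value from the experiments.
schedule = [cosine_annealed_lr(t, 100) for t in (0, 50, 100)]
print(schedule)
```

The schedule starts at the maximum rate, passes through the midpoint of the two rates halfway through, and ends at the minimum rate.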

Accuracy is used as the evaluation metric. Each experiment runs five times with random initializations. We report the mean and standard deviation of the results. The experiments are conducted on 1080Ti GPUs.
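The reporting convention (mean and standard deviation over five runs) can be sketched with the Python standard library. The five accuracy values below are hypothetical, and the use of the sample standard deviation (`statistics.stdev`) rather than the population version is an assumption, since the text does not say which one is reported.

```python
import statistics

# Hypothetical accuracies (%) from five runs with different random seeds.
run_accuracies = [74.0, 75.0, 74.5, 75.5, 73.5]

mean_acc = statistics.mean(run_accuracies)
std_acc = statistics.stdev(run_accuracies)  # sample standard deviation (assumed)

print(f"{mean_acc:.2f}±{std_acc:.2f}")
```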

4.4 Main Results

Following Wei and Zou (2019), for each classification dataset, we randomly sample a given percentage of the training data for model training. Tables 2, 3, 4, and 5 show the average accuracies on the Yelp Review, IMDB, Amazon Review, and SST-2 datasets under different percentages.

From these tables, we make the following observations. First, our method works better than Sum-Sep. The reason is that our method performs the summarization-based augmentation and text classification in an end-to-end framework, while Sum-Sep performs these two tasks separately. In our end-to-end framework, the summarization and classification models mutually influence each other. The performance of the classification model guides the training of the summarization model. If the classification accuracy is not high, it indicates that the augmented data generated by the summarization model is not useful. When this occurs, the summarization model will adjust its network weights and data weights to generate useful augmentations in the next round of learning. In contrast, in Sum-Sep, such a feedback loop (from classification to summarization) does not exist. Therefore, its performance is inferior.

Perc.   No-Aug       EDA          GECA         LAMBADA      Sum-Sep      MTL          Ours
5%      63.45±4.21   68.45±2.49   69.72±1.72   69.30±1.38   72.53±0.44   72.94±0.52   74.58±0.37
10%     68.15±3.77   74.46±1.82   74.91±2.07   75.03±1.45   76.73±0.57   76.35±0.51   78.62±0.73
20%     72.43±3.16   75.07±4.22   78.29±0.95   76.25±2.57   81.29±0.09   80.47±0.25   81.66±0.13
50%     81.78±1.76   81.57±2.44   83.61±0.93   82.18±1.82   84.90±0.19   84.29±0.22   85.55±0.15
75%     84.52±1.55   82.70±2.30   85.02±0.94   83.71±0.71   86.18±0.25   86.07±0.28   87.10±0.31
100%    85.42±1.52   84.66±1.36   86.45±0.26   85.08±0.49   87.46±0.10   87.15±0.13   87.79±0.07

Table 2: Classification accuracy (%) on Yelp Review. Perc. denotes percentage.

Perc.   No-Aug       EDA          GECA         LAMBADA      Sum-Sep      MTL          Ours
5%      52.86±3.22   61.42±1.75   61.58±0.94   61.03±1.46   63.74±0.27   63.27±0.41   64.79±0.32
10%     56.76±3.38   64.06±1.92   63.27±2.76   64.95±1.83   68.42±0.11   67.92±0.14   68.75±0.08
20%     59.54±1.68   67.18±3.20   65.36±0.83   67.61±1.94   69.82±0.61   68.47±0.69   72.04±0.52
50%     66.90±1.98   71.67±1.33   72.89±0.58   71.52±1.57   72.86±0.31   72.40±0.19   73.98±0.25
75%     72.25±1.05   74.23±0.72   73.69±0.55   74.02±0.73   74.78±0.25   74.63±0.41   75.78±0.37
100%    73.96±0.85   75.75±0.27   75.38±0.44   76.45±0.03   76.70±0.08   76.11±0.09   77.00±0.05

Table 3: Classification accuracy (%) on IMDB.

Perc.   No-Aug       EDA          GECA         LAMBADA      Sum-Sep      MTL          Ours
5%      63.46±1.51   62.59±2.35   64.82±0.93   63.21±1.07   65.44±0.31   64.81±0.69   67.53±0.42
10%     66.03±1.44   66.20±2.91   68.49±0.82   66.15±1.35   69.76±0.21   69.33±0.26   70.84±0.11
20%     68.04±2.26   68.36±1.05   69.04±1.94   70.96±2.45   72.64±0.58   72.51±0.99   75.28±0.73
50%     76.03±0.75   75.39±1.69   77.02±0.50   76.31±0.37   77.07±0.88   77.05±0.19   78.80±0.62
75%     77.40±1.09   76.30±1.82   78.52±0.52   77.02±0.81   78.78±0.52   78.93±0.37   80.46±0.62
100%    78.36±1.02   78.13±1.38   80.16±0.32   78.94±0.69   81.43±0.12   81.22±0.33   82.01±0.25

Table 4: Classification accuracy (%) on Amazon Review.

Perc.   No-Aug       EDA          GECA         LAMBADA      Sum-Sep      MTL          Ours
5%      57.22±2.07   63.42±1.24   58.44±0.57   59.31±0.99   59.05±0.91   60.51±0.32   63.76±0.85
10%     62.04±1.44   66.86±0.73   64.92±0.49   65.79±0.75   65.15±0.37   64.86±0.66   68.23±0.51
20%     64.45±1.99   68.00±0.49   67.48±0.55   66.12±0.87   67.09±1.27   67.84±0.49   69.61±0.21
50%     71.90±1.51   74.77±0.39   72.40±0.55   73.35±0.64   72.25±0.27   74.02±0.44   75.23±0.26
75%     73.74±0.59   75.80±0.81   75.03±0.33   74.29±0.28   75.34±0.07   74.25±0.10   76.00±0.14
100%    77.75±1.49   77.64±0.30   77.10±0.42   75.41±0.85   75.80±0.21   75.39±0.43   77.92±0.05

Table 5: Classification accuracy (%) on SST-2. γ is set to 0.5 for the 5% and 20% settings.

Second, our method works better than MTL. In MTL, the summarization model and classification model are trained simultaneously by minimizing a single objective function, which is the weighted sum of the summarization loss and classification loss. This incurs a competition between these two tasks: seeking a larger decrease in the loss of one task leads to a smaller decrease in the loss of the other task. Our method avoids task competition by performing these two tasks sequentially and minimizing two different objective functions. The summarization model is trained by minimizing the summarization loss. Then the classification model is trained by minimizing the classification loss. In this way, there is no competition between these two tasks. Though performed in two stages, these two tasks can still mutually influence each other in our end-to-end framework.

Third, our method performs better than No-Aug. This shows that the augmented data generated by our method has good utility for model training.

Fourth, the improvement of our method over baselines is more prominent when the number of training examples is smaller (i.e., when the percentage is small). Under the 5% percentage, we observe absolute accuracy improvements of 11.13%, 11.93%, 4.07%, and 6.54% on the Yelp, IMDB, Amazon, and SST-2 datasets, respectively, over No-Aug. When the training dataset size is smaller, the necessity of data augmentation is more significant. As the percentage increases, the impact of data augmentation decreases. For the 100% percentage, we observe absolute improvements of 2.37%, 3.04%, and 3.65% on Yelp, IMDB, and Amazon, respectively, over No-Aug.

Perc.   No-Aug       EDA          Ours
5%      44.40±5.26   40.16±1.94   48.36±3.72
20%     58.00±4.10   51.48±2.44   64.24±2.14
100%    69.52±2.43   57.96±1.21   72.36±1.29

Table 6: Classification accuracy (%) on TREC.
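The gains over No-Aug quoted in this paragraph are differences between the mean accuracies in Tables 2 to 5. The short sketch below recomputes them from the 5% and 100% rows; the dictionary layout and the helper name `gain_over_no_aug` are illustrative choices, while the numbers are copied from the tables.

```python
# Mean accuracies (%) taken from Tables 2-5, keyed by training percentage.
NO_AUG = {
    "Yelp": {5: 63.45, 100: 85.42},
    "IMDB": {5: 52.86, 100: 73.96},
    "Amazon": {5: 63.46, 100: 78.36},
    "SST-2": {5: 57.22, 100: 77.75},
}
OURS = {
    "Yelp": {5: 74.58, 100: 87.79},
    "IMDB": {5: 64.79, 100: 77.00},
    "Amazon": {5: 67.53, 100: 82.01},
    "SST-2": {5: 63.76, 100: 77.92},
}


def gain_over_no_aug(dataset: str, percentage: int) -> float:
    """Absolute accuracy gain of the proposed method over No-Aug, in points."""
    return round(OURS[dataset][percentage] - NO_AUG[dataset][percentage], 2)


for perc in (5, 100):
    gains = {name: gain_over_no_aug(name, perc) for name in NO_AUG}
    print(f"{perc}% of training data: {gains}")
```

The same subtraction applied to any other column of the tables gives the gain over that baseline.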

Fifth, our method works better than EDA, GECA, and LAMBADA. Again, the reason is that our method performs data augmentation and classification end-to-end, while these baselines perform them separately. Compared to EDA, under the 5% percentage, we observe an accuracy gain of 6.13%, 3.37%, and 4.94% on Yelp, IMDB, and Amazon, respectively. EDA augments each input sentence to produce 8-16 new sentences. Our method achieves better accuracy (on Yelp, IMDB, and Amazon) or similar accuracy (on SST-2) compared with EDA, with just one augmentation per input sentence.

EDA uses simple heuristic rules to generate
augmented sentences, which may be noisy and lack
semantic meaningfulness. These noisy augmentations
may cause the classification model trained
on them to perform worse. To verify this, we
performed experiments on the TREC (Li and Roth,
2002; Hovy et al., 2001) dataset (which is split
into train/validation/test sets with 3000, 2000, and
500 examples, respectively). Table 6 shows the
results. As can be seen, with EDA as augmentation,
the classification performance becomes much
worse (compared with No-Aug). In contrast, our
method trains a summarization model to generate
semantically meaningful augmentations and
performs much better than EDA.
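
To make the TREC comparison concrete, the following minimal Python sketch recomputes the accuracy gaps from the mean accuracies reported in Table 6. The dictionary layout and the helper name `gap` are illustrative choices only; they are not taken from the released code.

```python
# Mean accuracies (%) copied from Table 6 (TREC); standard deviations omitted.
table6 = {
    "5%":   {"Ours": 48.36, "EDA": 40.16, "No-Aug": 44.40},
    "20%":  {"Ours": 64.24, "EDA": 51.48, "No-Aug": 58.00},
    "100%": {"Ours": 72.36, "EDA": 57.96, "No-Aug": 69.52},
}


def gap(row, method, baseline="No-Aug"):
    """Absolute accuracy difference (percentage points) of method over baseline."""
    return round(row[method] - row[baseline], 2)


# One entry per training-set percentage: gaps of Ours and EDA relative to No-Aug.
gaps = {
    perc: {"Ours - No-Aug": gap(row, "Ours"), "EDA - No-Aug": gap(row, "EDA")}
    for perc, row in table6.items()
}

for perc, g in gaps.items():
    print(perc, g)
```

With these numbers, EDA falls below No-Aug at every percentage, while our method stays above it.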

Sixth, our method is more effective on long
texts. For example, our framework outperforms
EDA under all percentages on datasets where
the input texts are relatively long, including Yelp,
IMDB, and Amazon (the average number of words

Original sentence: This review is for the Kindle edition of what is
supposedly the Christopher Gill translation of Plato's Symposium.
However, it turns out if you download it that it is the same as the
Benjamin Jowett translation which is available for free, while
here you have to pay upwards of $8 for it. I also checked Penguin
Classics web site and they do not indicate that a eBook version of
this book is available. So be careful when purchasing the Kindle
edition. I had to return my purchase for this Kindle book.
Augmentation: Critically reviewed the Christopher Gill translation
of Plato's Symposium. An online version of the book is currently
available for free. However, if you download it it is the same as
the Benjamin Jowett translation. In contrast, you pay upwards of $8.

Original sentence: My issue is not with anything Tom Robbins
wrote, but with narrator Barret Whitener: a white male who
opts to read all the dialogue from Japanese characters in a bad,
generic "Asian" accent. It's unbearable to listen to. Also,
Whitener can't even pronounce the simplest of Japanese words,
like "sake." How many more years must transpire before an
Asian person can have a voice without being mocked?
Augmentation: The book is written by a white male who reads it
in a bad, generic "Asian" accent. He can't even pronounce the
simplest of Japanese words, like "sake" Movie is currently being
played by Playmakers' weekly series .series

Table 7: Two augmentations generated by our method
for Amazon Review.

per input text in Yelp, IMDB, and Amazon are
133, 234, and 75, respectively). On SST-2, where
the texts are relatively short, our method performs
marginally better than EDA under certain
percentages. In CNN-DailyMail, which is used
for training the summarization model, the input
texts are also long. As a result, the summarization
model trained on CNN-DailyMail can generate
good-quality summaries for Yelp, IMDB, and
Amazon. The generated summaries preserve the
major information in the input texts, discard
redundant information, and align well with the
original sentences semantically. Table 7 shows some
examples. In this case, it is meaningful for the
summary to inherit the class label of the original
sentence, and the resulting (augmentation,
label) pairs are useful for training the
classification model. In contrast, in SST-2, the average
number of words per input text in the training data
is only 10. When a short sentence is fed into the
summarization model, the model tends to perform
open text generation by treating the short input
sentence as a prompt. The generated text may
deviate from the original sentence in semantics.
Table 8 shows some examples. In this case, assigning
the class label of the original sentence to the
summary introduces label noise. Therefore,
the improvement of our method over baselines
may not be substantial. We performed statistical
significance tests using two-sided t-tests. In

Original sentence: An object lesson in period filmmaking.
Augmentation: An object lesson in period filmmaking is a
time-saving technique. The film was shot by the crew of the late
1960s and early 1980s. It's been produced since then, with some
scenes being filmed on location. A few years later, it has become
an important part of this type of cinema.

Original sentence: Mean-spirited lashing.
Augmentation: Critically ill-tempered lashing. "I feel like I'm
going to get out of my own way," says one person. An angry
person lashed out at him for being unprovoked and hateful.
demonstrated anger over the use of alcohol.

Table 8: Two augmentations generated by our
method for SST-2.

Dataset          Ours          Aug-Only
IMDB             68.75±0.08    66.69±0.37
Yelp Review      78.62±0.73    77.94±0.91
Amazon Review    70.84±0.11    70.14±0.44
SST-2            68.23±0.51    66.17±0.42

Table 9: Comparison with Aug-Only, on 10% of
the classification datasets.

most cases, the p-values of our method against
baselines are less than 1e-3. This shows that
our method is significantly better than baselines
in general.
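
As an illustration of this kind of test, the sketch below computes a two-sided p-value with a t-test using only the Python standard library. The choice of Welch's unequal-variance variant, the function names, and the per-run accuracies are all assumptions made for illustration; the text above only states that two-sided t-tests were used and does not list per-run results.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two samples (assumed variant)."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_c) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=20000):
    """Two-sided p-value: 1 - 2 * integral of the density over [0, |t|] (Simpson's rule)."""
    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)


# Hypothetical per-run accuracies (%), NOT taken from the paper.
ours_runs = [72.1, 72.5, 72.3, 72.4, 72.2]
baseline_runs = [69.0, 69.4, 69.2, 69.3, 69.1]

t_stat, dof = welch_t(ours_runs, baseline_runs)
p_value = two_sided_p(t_stat, dof)
print(f"t = {t_stat:.2f}, df = {dof:.1f}, p < 1e-3: {p_value < 1e-3}")
```

In practice a statistics library would normally be used instead of hand-rolled integration; the standard-library version is shown only to keep the example self-contained.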

4.5 Ablation Studies

To evaluate the effectiveness of our proposed
method, we compare with the following ablation
settings.

• Aug-Only. We train the text classification
model on augmented texts only, without using
the original classification data, which
amounts to solving the following problem:

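$$
\begin{aligned}
\min_{A}\ & L\big(C^{*}(S^{*}(A)),\, D_c^{(\mathrm{val})}\big) \\
\text{s.t.}\ & C^{*}(S^{*}(A)) = \min_{C}\ L\big(C,\, G(D_c^{(\mathrm{tr})}, S^{*}(A))\big) \\
& S^{*}(A) = \min_{S}\ \sum_{i=1}^{M} a_i\, \ell(S, t_i, s_i)
\end{aligned}
\tag{11}
$$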
• Ablation on classification models. In this
study, we investigate whether our framework
is effective for other classifiers, including
CNN and RoBERTa (Liu et al., 2019). Following
Wei and Zou (2019), the CNN model
consists of an input layer, a 1D convolutional
layer with 128 filters (kernel size 5), a
global 1D max-pooling layer, a ReLU activation
function, a dense layer (with 20 hidden
units), and a softmax output layer. For RoBERTa,
we use a batch size of 16 and a learning rate
of 2 × 10^{-6} to train the model. The maximum
classification text length is set to 128.
γ is set to 1 for all datasets. The rest of the
hyperparameters remain the same.

• Ablation on γ. In this study, we investigate
how the test accuracy is affected by the
hyperparameter γ.


Dataset    Ours          EDA           No-Aug
IMDB       69.84±0.15    69.33±0.08    65.57±0.52
Yelp       78.95±0.69    77.25±0.56    76.85±0.90
Amazon     73.46±0.22    72.86±0.48    70.62±0.79
SST-2      67.32±0.18    66.74±0.05    62.38±0.44

Table 10: Comparison of our method with EDA
and No-Aug, with CNN as the text classifier,
trained on 10% of the classification datasets.

Dataset    Ours          EDA           No-Aug
IMDB       78.38±0.19    77.26±3.31    77.33±0.62
Yelp       88.25±0.28    85.30±0.44    87.13±0.14
Amazon     83.85±0.34    81.36±0.23    82.54±0.24
SST-2      76.34±0.34    74.38±2.18    75.76±1.09

Table 11: Comparison of our method with EDA
and No-Aug, with CNN as the text classifier,
trained on 100% of the classification datasets.

Table 9 compares our method with Aug-Only.
As can be seen, our full method outperforms
Aug-Only. In Aug-Only, the classification model
is trained on augmented texts only, without leveraging
the original human-written texts. Because the
augmented data is noisier than human-provided data,
using augmented data only may lead to noisy
classification models.

Tables 10 and 11 compare our method with
EDA and No-Aug, with CNN as the text classifier,
trained on 10% and 100% of the classification
datasets, respectively. Tables 12 and 13 compare
our method with EDA and No-Aug, with
RoBERTa as the text classifier, trained on 10%
and 100% of the classification datasets,
respectively. As can be seen, with CNN and RoBERTa
as classifiers, our method still outperforms EDA
and No-Aug. This shows that our method is
agnostic to text classifiers and can be leveraged to
improve different text classifiers.

Figure 2 shows how test accuracy changes with
γ, where the model is trained on 10% of IMDB.
As can be seen, when γ increases from 0 to 0.5,

Dataset    Ours          EDA           No-Aug
IMDB       84.40±0.37    84.35±0.35    83.89±0.64
Yelp       91.12±0.29    90.38±0.39    90.23±0.39
Amazon     90.56±0.37    89.90±0.41    89.60±0.36
SST-2      87.73±0.86    87.10±0.83    87.57±0.74

Table 12: Comparison of our method with EDA
and No-Aug, with RoBERTa as the text classifier,
trained on 10% of the classification datasets.

Dataset    Ours          EDA           No-Aug
IMDB       89.06±0.07    88.59±0.10    88.50±0.14
Yelp       93.60±0.11    93.12±0.05    92.97±0.37
Amazon     92.71±0.04    92.33±0.06    92.20±0.14
SST-2      91.28±0.06    91.09±0.03    91.05±0.07

Table 13: Comparison of our method with EDA
and No-Aug, with RoBERTa as the text classifier,
trained on 100% of the classification datasets.

the classification accuracy increases. This is
because a larger γ makes the classification model
rely more on the augmented data. The
augmented data provides additional training
resources, which can mitigate the lack of original
data. However, as γ continues to increase, the
accuracy decreases. This is because an excessively
large γ places too much emphasis on augmented
data and too little on original data. Original
data is less noisy than augmented data and
is therefore more valuable for training
high-quality models. Similar results
can be observed in Figure 3, where the model is
trained on 10% of Amazon, SST-2, and Yelp.
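
The role of γ can be sketched as follows, assuming (as the discussion above suggests) that the classifier is trained on the original-data loss plus γ times the augmented-data loss. The function names, the use of mean rather than summed losses, and the per-example loss values are hypothetical and serve only to illustrate the trade-off.

```python
def mean_loss(losses):
    """Average of a non-empty list of per-example losses."""
    return sum(losses) / len(losses)


def classifier_objective(orig_losses, aug_losses, gamma):
    """Original-data loss plus gamma times the augmented-data loss (assumed form)."""
    return mean_loss(orig_losses) + gamma * mean_loss(aug_losses)


def aug_only_objective(aug_losses):
    """Aug-Only ablation: the original-data term is dropped entirely."""
    return mean_loss(aug_losses)


# Hypothetical per-example losses, for illustration only.
orig = [1.0, 3.0]
aug = [2.0, 4.0]

no_aug = classifier_objective(orig, aug, gamma=0.0)    # reduces to the original-data loss
balanced = classifier_objective(orig, aug, gamma=0.5)  # augmented data contributes half weight
print(no_aug, balanced, aug_only_objective(aug))
```

Setting γ = 0 recovers training on original data alone, while a very large γ lets the noisier augmented term dominate the objective, which matches the rise-then-fall pattern described above.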

We also perform an ablation study which replaces
the summarization data with paraphrase
data. The model S remains the same, which is
still distill-BART. The paraphrase data contains
3,900 sentence pairs from the Microsoft Research
Paraphrase Corpus (MRPC) (Dolan and Brockett,
2005). These pairs are labeled as paraphrases by
humans. To ensure a fair comparison, we randomly
sample 3,900 data examples from CNN-DailyMail,
denoted by CNNDM-3900. Table 16
shows the results. As can be seen, CNNDM-3900
yields better performance than MRPC. This shows
that using the summarization model for sentence
augmentation is more effective than using the
paraphrase model. The possible reason is that a

Figure 2: How test accuracy changes with γ, where the
model is trained on 10% of IMDB.

summarization model discards less important
information of the input text while a paraphrasing
model preserves most information of the input
text; as a result, augmentations generated by the
summarization model are less noisy and have a
larger semantic diversity relative to the input texts.

4.6 Analysis of Learned Weights of Summarization Data

Table 14 shows some randomly sampled summarization
data examples whose weights learned by
our framework are close to zero when the
classification dataset is SST-2. Due to the space limit, we
only show the summaries. As can be seen, these
data are primarily about healthcare, energy, law,
and politics, which have a large domain discrepancy
with SST-2, which is about movie reviews.
This shows that our framework is able to
successfully identify out-of-domain summarization data
and exclude them from training the summarization
model.
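
The effect of near-zero weights can be seen in a small sketch of the weighted sum of per-example losses used to train the summarization model (cf. the last line of Eq. (11)). The function name and the loss and weight values below are hypothetical.

```python
def weighted_summarization_loss(example_losses, weights):
    """Sum over examples of a_i * loss_i, with one weight a_i per summarization example."""
    if len(example_losses) != len(weights):
        raise ValueError("need exactly one weight per example")
    return sum(a * loss for a, loss in zip(weights, example_losses))


# Hypothetical per-example summarization losses and learned weights.
losses = [2.0, 3.0, 4.0]
weights = [1.0, 0.0, 0.5]  # the second example is effectively excluded

total = weighted_summarization_loss(losses, weights)
print(total)
```

An example whose weight is close to zero contributes almost nothing to the objective, so it has almost no influence on the trained summarization model.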

Table 15 shows some randomly sampled summarization
data examples whose weights learned
by our framework are close to one when the
classification dataset is SST-2. Due to the space limit,
we only show the summaries. As can be seen, these
summarization data are primarily about movies,
recreation, and media, which are close to SST-2
in terms of domain similarity.

When the classification dataset is IMDB, the
weights of examples in Table 14 are also close
to zero, and the weights of examples in Table 15
are also close to one. This is because IMDB and
the examples in Table 15 are about movies while
the examples in Table 14 are not.

Given the learned weights, we split the
summarization data into two subsets: ONE, which
contains examples whose weights are > 0.5, and

Figure 3: How test accuracy changes with γ, where the model is trained on 10% of Amazon, SST-2, and Yelp.

Half of states will expand Medicaid under Obamacare; half
refuse or are on the fence. Low-income citizens and their
advocates say Medicaid expansion necessary. States like Texas,
Florida say a Medicaid expansion is costly and will fail. Politics
at play and most states will eventually expand the program,
political experts say.

The country plans to remove all subsidies in five years. The
price of gasoline goes up four-folds. Iran has trillions of dollars
in natural resources but still struggles, experts say. The changes
aim to dampen domestic demand for fuel and oil, and bolster
overall revenues.

Attorneys for three kidnapped women have "confidence and
faith" in prosecution. Indictment lists use of chains, tape,
vacuum cord against women. Castro also faces 139 counts of
rape, 177 counts of kidnapping, one aggravated murder charge.
A prosecutor's committee will later consider whether seeking
death penalty is appropriate.

North Korea's announcement of a planned satellite launch has
provoked alarm. Other countries say it is a way of testing
missile technology. The Japanese defense minister orders the
preparation of missile defenses.

Table 14: Some summarization data examples
whose weights learned by our framework are
close to 0. The classification dataset is SST-2.

Hollywood often misrepresents careers and the workplace in
movies. Work is portrayed as easy or non-existent, and any
outfit is appropriate. This can have a negative effect on younger
generations deciding on careers.

Hollywood bringing back box-office juggernauts Iron Man,
Batman and Spider-Man. 2012 may be a peak year for superheroes
on film, with much more to come. Writer-director: DC
Comics characters often face more challenges than Marvel.
"The genre has been popular for decades and is here to stay,"
Boxofficeguru.com editor says.

Sam Smith wins best new artist, best pop vocal album. Beyonce
now has 20 Grammys, passing Aretha Franklin.

Ted Turner turns 75 years old this month. He founded CNN,
the first 24-hour cable news network, in 1980. In 1990, Turner
hired Wolf Blitzer, host of CNN's "The Situation Room".
Blitzer reflects on what he learned from his former boss.

Table 15: Some summarization data examples
whose weights learned by our framework are
close to 1. The classification dataset is SST-2.

ZERO, which contains examples whose weights
are ≤ 0.5. For each subset, we use it to train
a summarization model (no reweighting of data
during training), use the summarization model


Data      No-Aug        Summarize     Paraphrase
Yelp      68.15±3.77    75.06±0.62    74.33±0.47
IMDB      56.76±3.38    64.94±0.15    64.13±0.07
Amazon    66.03±1.44    67.36±0.08    66.39±0.14
SST-2     62.04±1.44    67.44±0.27    66.72±0.11

Table 16: Classification accuracy (%) on 10%
of different datasets, when using summarization
data and paraphrase data to train the augmentation
model.

Data      No-Aug        ONE           ZERO
Yelp      68.15±3.77    77.27±0.96    69.92±2.09
IMDB      56.76±3.38    66.94±0.23    57.85±0.41
Amazon    66.03±1.44    69.24±0.15    67.74±0.22
SST-2     62.04±1.44    67.50±0.34    63.69±0.46

Table 17: Classification accuracy (%) on 10% of
different datasets, in the study of how summarization
data weights affect downstream classification
performance.

to generate augmentations, and train the
classification model on augmented data (together with
real data). Table 17 shows the results. As can be
seen, augmentations generated by the summarization
model trained on ONE clearly improve classification
performance over No-Aug. In contrast, augmentations
generated by the summarization model
trained on ZERO bring only marginal gains.
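
The split into ONE and ZERO described above can be sketched as a simple partition by learned weight. The function name, the toy summaries, and the weight values are illustrative assumptions; only the 0.5 threshold and the strict/non-strict comparison come from the text.

```python
def split_by_weight(examples, weights, threshold=0.5):
    """Partition examples into ONE (weight > threshold) and ZERO (weight <= threshold)."""
    if len(examples) != len(weights):
        raise ValueError("need exactly one weight per example")
    one = [ex for ex, w in zip(examples, weights) if w > threshold]
    zero = [ex for ex, w in zip(examples, weights) if w <= threshold]
    return one, zero


# Hypothetical summaries and learned weights, for illustration only.
summaries = ["movie awards recap", "fuel subsidy reform", "superhero film slate"]
learned_weights = [0.97, 0.02, 0.50]

one_subset, zero_subset = split_by_weight(summaries, learned_weights)
print(one_subset)
print(zero_subset)
```

Note that an example with weight exactly 0.5 falls into ZERO, following the ≤ 0.5 rule stated above.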

Using In-domain Data to Train a Summarization
Model. We conduct an experiment that uses
in-domain data to train the summarization model.
Specifically, we replace the CNN-DailyMail
dataset with a movie summarization (MovSum)
dataset crawled from IMDB. The MovSum dataset
contains 135K (movie synopsis, movie summary)
pairs where the summary (shorter) is treated as
a summarization of the synopsis (longer). MovSum
is in the same domain as SST-2 since they
are both about movies. We randomly sample
135K examples from CNN-DailyMail (denoted
by CNNDM-135K) and train the summarization

Summarization Data    Classification Accuracy    Percentage of ones
MovSum                67.72±0.36                 84.5
CNNDM-135K            67.15±0.21                 39.3

Table 18: Results of our proposed method
on SST-2 (10%) under different summarization
datasets.

model on this subset for a fair comparison.
Classification is conducted on SST-2 (using 10% of the
training examples).

Table 18 shows the results. As can be seen, using
MovSum to train the summarization model leads
to better classification performance on SST-2.
This is because, compared with CNNDM-135K,
MovSum has a higher domain similarity with
SST-2. A summarization model trained using
in-domain data can generate in-domain augmentations,
which are more suitable for training the
classification model. In addition, we measure the
percentage of ones in the learned weights of
summarization data examples. Under MovSum, the
percentage is larger. This is because more data
examples in MovSum are in the same domain as
SST-2, compared with CNNDM-135K.

4.7 Analysis of Generated Augmentations

We measure the diversity of generated augmentations.
Two types of diversity measures are used: I)
how different the generated augmentation is from
the input text; and II) how different the sentences
within a generated augmentation are from each other.
We measure type-I diversity in the following way:
given an input text t and an augmentation s generated
from t, calculate the BLEU (Papineni et al., 2002b) score
between s and t. We measure type-II diversity in
the following way: for each pair of sentences in
a generated augmentation, calculate their BLEU
score, then take the average over all pairs. In the
BLEU score, the maximum n-gram order is set to 4.
Table 19 shows the results. As can be seen, the
BLEU scores are low for both types of diversity,
which indicates that the augmentations generated
by our method are diverse.
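
The two measures can be sketched with a small, unsmoothed sentence-level BLEU written against the Python standard library. This is an illustrative approximation, not the evaluation script used for Table 19: the whitespace tokenization, the lack of smoothing, the direction in which each sentence pair is scored, and all function names are assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU of candidate against a single reference."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or overlap == 0:
            return 0.0  # without smoothing, one empty n-gram order gives BLEU 0
        log_precisions.append(math.log(overlap / total))
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * math.exp(sum(log_precisions) / max_n)


def type1_bleu(input_text, augmentation):
    """Type-I: BLEU between an augmentation and its input text (lower = more diverse)."""
    return bleu(augmentation.lower().split(), input_text.lower().split())


def type2_bleu(augmentation_sentences):
    """Type-II: average BLEU over all sentence pairs in one augmentation."""
    pairs = list(combinations(augmentation_sentences, 2))
    if not pairs:
        return 0.0
    scores = [bleu(a.lower().split(), b.lower().split()) for a, b in pairs]
    return sum(scores) / len(pairs)


same = "the food was great and the service was quick"
print(type1_bleu(same, same))                       # identical texts
print(type1_bleu(same, "a slow and noisy place"))   # little overlap
```

Because BLEU measures n-gram overlap, lower scores indicate augmentations that differ more from the input (type-I) or whose sentences differ more from each other (type-II).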

We also check whether generated augmentations
are abstractive. We randomly sample 300
generated augmentations and ask 5 undergraduates
to manually annotate whether they are
abstractive (score 1) or extractive (score 0).
Table 20 shows the results. The average score

Dataset    Type-I    Type-II
Yelp       0.15      0.07
IMDB       0.22      0.06
Amazon     0.11      0.04
SST-2      0.13      0.06

Table 19: Diversity of generated augmentations.

Dataset    Score
Yelp       0.83
IMDB       0.79
Amazon     0.88
SST-2      0.92

Table 20: Abstractiveness of
generated augmentations.

Model                           Rouge-2 F1    Rouge-L F1
GPT-2 (Radford et al., 2019)    8.27          26.58
Sum-Sep                         16.06         34.07
MTL-SST2                        11.06         28.44
Ours-SST2                       12.77         30.27
MTL-IMDB                        10.79         29.20
Ours-IMDB                       13.06         30.72
MTL-Yelp                        12.15         29.43
Ours-Yelp                       12.97         30.62
MTL-Amazon                      11.73         29.16
Ours-Amazon                     13.13         30.71

Table 21: Evaluation of summarization models.

is close to 1, indicating that the augmentations
generated by our method are abstractive.

4.8 Performance of Summarization Models

We report the performance of the summarization
models in Table 21. For our method and MTL,
the reported scores are averages over different
percentages of classification training data, where the
classification model is LSTM. From this table, we
make two observations. First, our method outperforms
GPT-2 (Radford et al., 2019), a competitive
model used for text summarization. This shows
that our method can generate meaningful summaries.
Second, our method performs worse than
Sum-Sep. The reason is that in our method, the
summarization model is trained for the sake of
generating good augmentations for the classification
model while Sum-Sep is trained solely to
maximize the summarization performance. Our

goal is not to generate the best summaries, but rather
to generate the best augmentations for classification
using the summarization model.

5 Conclusions and Discussion

In this paper, we propose a three-level optimization
framework to perform text augmentation and
classification end-to-end. Our framework consists
of three learning stages performed end-to-end:
1) training a text summarization model; 2) training
a text classification model; and 3) updating
weights of summarization examples by minimizing
the validation loss of the classification model.
Each learning stage corresponds to one level of the
optimization problem in the framework. The three
levels of optimization problems are nested and
solved in a unified way. Our framework enables
the augmentation process to be influenced by the
performance of the text classification task so that
the augmented texts are specifically suitable for
training the classification model. Experiments on
various datasets demonstrate the effectiveness of
our method.

Our framework can be extended to other downstream
tasks beyond classification, including but
not limited to text-to-image generation, visual
question answering, and dialog generation. To
extend our framework to a downstream task T,
we need to change the loss functions in Eq. (2)
and Eq. (3) to the loss of task T. For example,
to apply our framework to text-to-image generation,
given each text-image pair (t, i), we perform
augmentation of the input text t using the
summarization model trained in Eq. (1), to get an
augmented text ˆt. (ˆt, i) would be treated as an
augmented data pair. Then we define GAN-based
losses (Goodfellow et al., 2014) on each (ˆt, i)
and each (t, i) to train a text-to-image generation
model. We plan to study such extensions in
future work.

Our current framework has the following limitations.
First, it incurs additional computation and
memory costs due to the usage of a summarization
model. Second, our method currently uses a text
summarization model to generate augmentations.
Publicly available summarization datasets are limited
in size, which limits the augmentation model's
quality. We plan to address these limitations in
future work. To reduce memory and computation
costs, we will perform parameter sharing,
where the encoder weights of the summarization
model and those of the classification model
are tied together. To address the second limitation,
we plan to leverage data-rich text generation tasks
for augmentation, such as using machine
translation models to perform back translation.

Acknowledgments

This work is partially supported by a gift fund
from CCF and Tencent. We also thank the editor
and the anonymous reviewers for their valuable
suggestions.

References

Jacob Andreas. 2020. Good-enough compositional data augmentation. https://doi.org/10.18653/v1/2020.acl-main.676

Atilim Gunes Baydin, Robert Cornish, David Martínez-Rubio, Mark Schmidt, and Frank D. Wood. 2017. Online learning rate adaptation with hypergradient descent. CoRR, abs/1703.04782.

Jiaao Chen, Zichao Yang, and Diyi Yang. 2020. MixText: Linguistically-informed interpolation of hidden space for semi-supervised text classification. arXiv preprint arXiv:2004.12239. https://doi.org/10.18653/v1/2020.acl-main.194

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).

Marzieh Fadaee, Arianna Bisazza, and Christof Monz. 2017. Data augmentation for low-resource neural machine translation. arXiv preprint arXiv:1705.00440. https://doi.org/10.18653/v1/P17-2090

Steven Y. Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, and Eduard Hovy. 2021. A survey of data
augmentation approaches for NLP. arXiv preprint arXiv:2105.03075. https://doi.org/10.18653/v1/2021.findings-acl.84

Matthias Feurer, Jost Springenberg, and Frank Hutter. 2015. Initializing Bayesian hyperparameter optimization via meta-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29.

Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1126–1135. JMLR.org.

Mahak Gambhir and Vishal Gupta. 2017. Recent automatic text summarization techniques: A survey. Artificial Intelligence Review, 47(1):1–66. https://doi.org/10.1007/s10462-016-9475-9

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. Advances in Neural Information Processing Systems, 27.

Demi Guo, Yoon Kim, and Alexander M. Rush. 2020. Sequence-level mixed sample data augmentation. arXiv preprint arXiv:2011.09039. https://doi.org/10.18653/v1/2020.emnlp-main.447

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735

Eduard Hovy, Laurie Gerber, Ulf Hermjakob, Chin-Yew Lin, and Deepak Ravichandran. 2001. Toward semantics-based answer pinpointing. In Proceedings of the First International Conference on Human Language Technology Research. https://doi.org/10.3115/1072133.1072221

Kushal Kafle, Mohammed Yousefhussien, and Christopher Kanan. 2017. Data augmentation for visual question answering. In Proceedings of the 10th International Conference on Natural Language Generation, pages 198–202. https://doi.org/10.18653/v1/W17-3529

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Sosuke Kobayashi. 2018. Contextual augmentation: Data augmentation by words with paradigmatic relations. arXiv preprint arXiv:1805.06201. https://doi.org/10.18653/v1/N18-2072

Oleksandr Kolomiyets, Steven Bethard, and Marie-Francine Moens. 2011. Model-portability experiments for textual temporal analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, volume 2, pages 271–276. ACL; East Stroudsburg, PA.

Varun Kumar, Ashutosh Choudhary, and Eunah Cho. 2021. Data augmentation using pre-trained transformer models.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461. https://doi.org/10.18653/v1/2020.acl-main.703

Xin Li and Dan Roth. 2002. Learning question classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics. https://doi.org/10.3115/1072228.1072378

Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2018. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland,
Oregón, EE.UU. Asociación de Computación
Lingüística.

Julian McAuley and Jure Leskovec. 2013. Hidden
factors and hidden topics: Understanding rating
dimensions with review text. En procedimientos de
the 7th ACM Conference on Recommender Sys-
tems, pages 165–172. https://doi.org
/10.1145/2507157.2507163

Junghyun Min, R. Thomas McCoy, Dipanjan
El, Emily Pitler, and Tal Linzen. 2020. Syn-
tactic data augmentation increases robustness
to inference heuristics.

Kishore Papineni, Salim Roukos, Todd Ward,
y Wei-Jing Zhu. 2002a. AZUL: A method
for automatic evaluation of machine transla-
ción. In ACL. https://doi.org/10.3115
/1073083.1073135

Kishore Papineni, Salim Roukos, Todd Ward, y
Wei-Jing Zhu. 2002b. AZUL: Un método para
evaluación automática de la traducción automática.
In Proceedings of the 40th Annual Meeting of
la Asociación de Lingüística Computacional,
páginas 311–318. Asociación de Computación
Lingüística. https://doi.org/10.3115
/1073083.1073135

Alec Radford, Jeffrey Wu, niño rewon, David
Luan, Dario Amodei, Ilya Sutskever. 2019.
Language models are unsupervised multitask
learners. OpenAI blog, 1(8):9.

Zhongzheng Ren, Raymond Yeh, y alejandro
Schwing. 2020. Not all unlabeled data are
igual: Learning to weight data in semi-
supervised learning. En avances en neurología
Sistemas de procesamiento de información, volumen 33,
pages 21786–21797. Asociados Curran, Cª.

G¨ozde G¨ul S¸ ahin and Mark Steedman. 2019.
Data augmentation via dependency tree morph-
ing for low-resource languages. arXiv preprint
arXiv:1903.09460. https://doi.org/10
.18653/v1/D18-1545

Abigail See, Peter J. Liu, and Christopher D.
to the point: Summa-
Manning. 2017. Get
rization with pointer-generator networks. En
Proceedings of the 55th Annual Meeting of
la Asociación de Lingüística Computacional
(Volumen 1: Artículos largos), pages 1073–1083,
Asociación de Lingüística Computacional,
vancouver, Canada.

Rico Sennrich, Barry Haddow, and Alexandra
Birch. 2015. Improving neural machine trans-
lation models with monolingual data. arXiv
preprint arXiv:1511.06709. https://doi
.org/10.18653/v1/P16-1009

Sam Shleifer and Alexander M. Rush. 2020.
Pre-trained summarization distillation. arXiv
preprint arXiv:2010.13002.

Jun Shu, Qi Xie, Lixuan Yi, Qian Zhao, Sanping
Zhou, Zongben Xu, and Deyu Meng.
2019. Meta-weight-net: Learning an explicit
mapping for sample weighting. In Advances
in Neural Information Processing Systems,
pages 1919–1930.

Richard Socher, Alex Perelygin, Jean Wu, Jason
Chuang, Christopher D. Manning, Andrew Ng,
and Christopher Potts. 2013. Recursive deep
models for semantic compositionality over a
sentiment treebank. In Proceedings of the 2013
Conference on Empirical Methods in Natural
Language Processing, pages 1631–1642,
Seattle, Washington, USA. Association for
Computational Linguistics.

Felipe Petroski Such, Aditya Rawal, Joel Lehman,
Kenneth O. Stanley, and Jeff Clune. 2019.
Generative teaching networks: Accelerating neural
architecture search by learning to generate
synthetic training data. CoRR, abs/1912.07768.

Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N.
Gomez, Łukasz Kaiser, and Illia Polosukhin.
2017. Attention is all you need. In Advances
in Neural Information Processing Systems,
pages 5998–6008.

William Yang Wang and Diyi Yang. 2015. That’s
so annoying!!!: A lexical and frame-semantic
embedding based data augmentation approach
to automatic categorization of annoying
behaviors using #petpeeve tweets. In
Proceedings of the 2015 Conference on Empirical
Methods in Natural Language Processing,
pages 2557–2563.
https://doi.org/10.18653/v1/D15-1306

Xinyi Wang, Hieu Pham, Zihang Dai, and Graham
Neubig. 2018. SwitchOut: An efficient data
augmentation algorithm for neural machine
translation. arXiv preprint arXiv:1808.07512.
https://doi.org/10.18653/v1/D18-1100

Yulin Wang, Jiayi Guo, Shiji Song, and Gao
Huang. 2020. Meta-semi: A meta-learning
approach for semi-supervised learning. CoRR,
abs/2007.02394.

Jason Wei and Kai Zou. 2019. EDA: Easy data
augmentation techniques for boosting
performance on text classification tasks. arXiv
preprint arXiv:1901.11196.
https://doi.org/10.18653/v1/D19-1670

Xiang Zhang, Junbo Zhao, and Yann LeCun.
2015. Character-level convolutional networks
for text classification. arXiv:1509.01626 [cs].

Guoqing Zheng, Ahmed Hassan Awadallah, and
Susan T. Dumais. 2019. Meta label correction
for learning with weak supervision. CoRR,
abs/1911.03809.
