Generate, Annotate, and Learn: NLP with Synthetic Text
Xuanli He1
Islam Nassar1 Jamie Kiros2 Gholamreza Haffari1 Mohammad Norouzi2
1Monash University, Australia
2Google Research, Brain Team, Canada
{xuanli.he1, gholamreza.haffari}@monash.edu, mnorouzi@google.com
Abstract
This paper studies the use of language mod-
els as a source of synthetic unlabeled text for
NLP. We formulate a general framework called
‘‘generate, annotate, and learn (GAL)’’ to
take advantage of synthetic text within knowl-
edge distillation, self-training, and few-shot
learning applications. To generate high-quality
task-specific text, we either fine-tune LMs on
inputs from the task of interest, or prompt
large LMs with few examples. We use the best
available classifier to annotate synthetic text
with soft pseudo labels for knowledge distilla-
tion and self-training, and use LMs to obtain
hard labels for few-shot learning. We train new
supervised models on the combination of la-
beled and pseudo-labeled data, which results
in significant gains across several applications.
We investigate key components of GAL and
present theoretical and empirical arguments
against the use of class-conditional LMs to
generate synthetic labeled text instead of unla-
beled text. GAL achieves new state-of-the-art
knowledge distillation results for 6-layer trans-
formers on the GLUE leaderboard.
1 Introduction
There is an abundance of unlabeled data in the real
world, but task-specific unlabeled data within the
scope of a given machine learning problem can be
challenging to find. For example, one cannot easily
find in-domain unlabeled text conforming to the
input distribution of a specific Natural Language
Processing (NLP) task from the GLUE benchmark
(Wang et al., 2019c). Some NLP tasks require an
input comprising a pair of sentences with a partic-
ular relationship between them. Moreover, clas-
sification datasets typically represent a tailored
distribution of data and only include a limited
number of class labels. If task-specific unlabeled
data were available, one could adopt self-training
(Yarowsky, 1995) to automatically annotate unla-
beled data with pseudo labels to improve accuracy
and robustness of classifiers (Xie et al., 2020;
Carmon et al., 2019). Furthermore, one can use
knowledge distillation (Hinton et al., 2015) on
fresh task-specific unlabeled data to more effec-
tively compress deep neural networks and ensem-
bles (Bucilu˘a et al., 2006; Chen et al., 2020a).
In the absence of task-specific unlabeled data,
one could retrieve unlabeled examples from a
large and diverse open-domain dataset (Du et al.,
2020). However, such a retrieval-based approach
may not scale to problems with complex input
schemes, for example, sentence pairs with certain
relationships. Recent work (Yang et al., 2020; Kumar
et al., 2020b) has considered the use of Language
Models (LMs) like GPT-2 (Radford et al., 2019)
as a means of data augmentation, showing the
effectiveness of this approach for commonsense
reasoning and classification tasks. Existing ap-
proaches often consider class-conditional genera-
tion, where the synthetic data is produced by con-
ditioning on a specified class label. However, it
is unclear whether class-conditional generation is
best suited for NLP tasks. Moreover, existing
pipelines often make synthetic data generation
complicated as one needs to detect and discard
low-quality synthetic labeled data or optionally
re-label data (Yang et al., 2020; Vu et al., 2021b).
For example, Kumar et al. (2020b) observe that
it is difficult for sentences generated by label-
conditioned GPT-2 to retain the semantics/prag-
matics of the conditioning label, leading to poor
performance on downstream tasks.
We unify and simplify existing work on LMs
as a data source for NLP and develop a general
framework called ‘‘generate, annotate, and learn
(GAL)''. The generality of GAL allows us to use
LM-generated synthetic data within novel appli-
cations such as Knowledge Distillation (KD) and
few-shot learning. GAL builds on recent advances
in text generation (Radford et al., 2019; Gao
et al., 2021) and uses powerful LMs to synthesize
task-specific unlabeled text by fine-tuning or con-
ditioning a large LM on in-distribution examples.
We use state-of-the-art classifiers to annotate gen-
erated text with soft pseudo labels when possible.
We then combine labeled data and pseudo-labeled
data to train more effective supervised models,
resulting in significant gains on a range of NLP
tasks like KD and few-shot learning.
We present a justification for GAL based on
the empirical and vicinal risk minimization frame-
works (Vapnik, 1992; Chapelle et al., 2001). We
also investigate key components of GAL. We find
that even if class-conditional LMs are available
for text generation, it is more effective to discard
the conditioning labels and let the teacher models
produce pseudo labels. This observation is sup-
ported by our theoretical and empirical results.
Accordingly, in contrast to prior work (Yang et al.,
2020; Vu et al., 2021b), we advocate for the use
of simple unconditional LMs for text synthesis.
Further, we avoid any form of data filtering. Not
surprisingly, we find that the diversity of synthetic
text matters. That said, simple unconditional gen-
eration given random seeds provides sufficient
diversity, and crafting diverse LM prompts is
not needed.
In summary:
• We develop GAL, a simple and effective
approach to the use of LMs for task-specific
unlabeled text generation. We show that GAL
can be used effectively for KD, self-training,
and few-shot learning in NLP.
• We present theoretical and empirical investi-
gations for GAL, explaining why it works and
why using class-conditional LMs to generate
synthetic labeled data is not as effective.
• GAL advances KD for NLP and establishes
a new state-of-the-art (SoTA) result for a sin-
gle 6-layer transformer on the GLUE test set.
It further improves prompt-based few-shot
learning, providing an average improvement
of 1.3% on four 4-shot learning NLP tasks,
outperforming GPT-3-6B.
2 Related Work
Data synthesis with large pre-trained language
models is closely related to our work (Kumar
et al., 2020b; Yang et al., 2020; Vu et al., 2021b;
Norouzi et al., 2020). Yang et al. (2020) propose
a complex scheme, including label-conditioned
data generation, data relabeling, data filtering,
and two-stage training, to utilize synthetic data.
In contrast, we show that a simple mixture of the
original data and synthetic unconditionally gener-
ated data can provide sizable gains. Moreover,
we show a broader use of generative models on KD
and few-shot learning. Vu et al. (2021b) take a task
augmentation approach and employ conditional
generation to produce in-domain synthetic data
for an auxiliary natural language inference (NLI) task,
which is then used to initialize the target-task clas-
sifier. However, not all tasks (e.g., grammatical
acceptability judgments) can benefit from the NLI-
style auxiliary task (Wang et al., 2019a). We aim
to directly generate the unlabeled in-domain data
for the target task. Unlike Norouzi et al. (2020),
we do not use instance-based generative models.
More broadly, there has been a recent surge in
data synthesis and augmentation in NLP, includ-
ing rule-based and model-based approaches; see
Feng et al. (2021) for a recent survey. Data synthe-
sis with grammars has been explored in semantic
parsing and natural language understanding (e.g.,
see Wang et al., 2015, 2021; Marzoev et al.,
2020). Existing approaches to data augmentation
for NLP include lexicon replacement, sentence re-
trieval, and round-trip machine translation (Wang
and Yang, 2015; Yu et al., 2018; Kobayashi,
2018; Wu et al., 2019; Lichtarge et al., 2019; Wei
and Zou, 2019; Alberti et al., 2019; Du et al.,
2020; Shen et al., 2020). We, instead, propose
the use of unconditional autoregressive LMs for
data augmentation. This is simple, flexible, and
powerful.
Self-training is one of the oldest approaches
for semi-supervised learning (Scudder, 1965,
Fralick, 1967; Agrawala, 1970; Yarowsky, 1995;
Eisner and Karakos, 2005; Ueffing et al., 2007;
Du et al., 2020). Abney (2004) and Haffari and
Sarkar (2007) have theoretically analyzed self-
training for simple decision lists. Recent theoreti-
cal work analyzes self-training for linear models,
often under the assumption that the data distri-
bution is (near) Gaussian (Carmon et al., 2019;
Raghunathan et al., 2020; Chen et al., 2020b;
Kumar et al., 2020a; Oymak and Gulcu, 2020).
Wei et al. (2021) prove that, under ‘‘expansion’’
and ‘‘class separation’’ assumptions, self-training
can lead to more accurate neural network classi-
fiers. We present a theoretical framing of GAL in
terms of empirical and vicinal risk minimization
(Vapnik, 1992; Chapelle et al., 2001).
Knowledge Distillation (KD) (Bucilu˘a et al.,
2006; Hinton et al., 2015) uses a procedure sim-
ilar to self-training to distill knowledge of an
expressive teacher model into a smaller student
model. In contrast, self-distillation (Furlanello
et al., 2018; Zhang et al., 2019; Mobahi et al.,
2020) uses teacher and student models of equal
size, hoping to iteratively refine class labels. Pre-
vious work uses unlabeled data (Bucilu˘a et al.,
2006) and adversarial training (Wang y cols., 2018)
to improve KD. We demonstrate that synthetic
data generated by unconditional generative mod-
els can improve KD on NLP, outperforming strong
KD baselines, which often add more complexity
and additional hyperparameters (e.g., Sun et al.,
2019a; Jiao et al., 2019; Xu et al., 2020; Rashid
et al., 2021).
3 Generate, Annotate, and Learn (GAL)
Given a labeled dataset L = {(x_i, y_i)}_{i=1}^N, we
first train an unconditional domain-specific gen-
erative model g(x) on L_x = {x_i}_{i=1}^N, and then
use it to synthesize unlabeled data. Such synthetic
unlabeled data is used within self-training and
KD even in the absence of in-domain unlabeled
data. We restrict our attention to basic KD and self-
training methods, even though GAL can be com-
bined with more sophisticated semi-supervised
techniques, too.
The effectiveness of GAL depends on the fi-
delity and diversity of synthetic examples. If we
had access to the oracle generative process, we
would be able to obtain the best KD and SSL
results, as if we had access to real task-specific
unlabeled data. Our preliminary experiments sug-
gest that large language models are particularly
effective within the GAL framework. Hence, as
shown in Figure 1, to build the best domain-
specific language model, we adopt a large lan-
guage model pretrained on lots of open-domain
text, and fine-tune it on a given dataset's inputs,
that is, L_x, ignoring class labels. Both our theory
and ablations confirm that ignoring class labels is
a good idea (cf. Sections 4 and 5). Transferring
the knowledge of large language models is par-
ticularly beneficial when a small input dataset Lx
of text is available (Hernandez et al., 2021).
To improve computational efficiency of GAL,
we do not generate unlabeled data on the fly,
but generate as many unconditional samples as
possible and store them in a synthetic unlabeled
Figure 1: An illustration of GAL for NLP. We use
open-domain data once for self-supervised pretraining
(e.g., BERT) and once for training a large LM (e.g.,
GPT-2). BERT is fine-tuned on labeled data to yield a
classifier for the task of interest. GPT-2 is fine-tuned on
the same data without labels to obtain an unconditional
task-specific LM, which is used to generate lots of
synthetic in-domain unlabeled data for self-training
and KD.
dataset U . We use soft pseudo labels within self-
training and KD, as we empirically found it is
more effective than using hard labels on synthe-
tic data.
3.1 Knowledge Distillation with GAL
KD distills knowledge of an expressive teacher
model into a smaller student model (Hinton et al.,
2015). We pose the following objective function
for KD with labeled and synthetic unlabeled data:

ℓ_kd = λ E_{(x,y)∼L} H(y, f_s(x)) + (1 − λ) E_{x̃∼g(x)} H(h(x̃), f_s(x̃)),   (1)

where h is the teacher model, f_s is the student
model, and g is the large pre-trained language
model (e.g., GPT-2) fine-tuned on the text in the
training data L_x. H(q, p) = −q⊤ log p is the soft-
max cross-entropy loss. Note the use of g(x),
approximating the unknown real data distribu-
tion P(x) in (1). Algorithm 1 summarizes the
GAL-KD process.
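To make the objective in (1) concrete, the following is a minimal
PyTorch-style sketch of the GAL-KD loss, assuming `student` and
`teacher` are callables that return classification logits and that
`lam` plays the role of λ (0.2 in the GLUE KD experiments of
Section 5.1); it is an illustrative sketch rather than the exact
training code used in the paper.

```python
# Minimal sketch of the GAL-KD objective in Eq. (1); the batch format and
# default lambda are assumptions for illustration.
import torch
import torch.nn.functional as F

def gal_kd_loss(student, teacher, labeled_batch, synthetic_batch, lam=0.2):
    x, y = labeled_batch                      # real labeled examples from L
    x_syn = synthetic_batch                   # unlabeled text sampled from g(x)
    # supervised term: cross entropy against ground-truth labels
    sup = F.cross_entropy(student(x), y)
    # distillation term: soft cross entropy against the teacher's distribution
    with torch.no_grad():
        soft_targets = F.softmax(teacher(x_syn), dim=-1)
    log_probs = F.log_softmax(student(x_syn), dim=-1)
    distill = -(soft_targets * log_probs).sum(dim=-1).mean()
    return lam * sup + (1.0 - lam) * distill
```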
3.2 Self-Training with GAL
Self-training encourages knowledge transfer be-
tween a teacher and a student model in such a
way that the student can outperform the teacher.
Algorithm 2 summarizes the GAL self-training
process. Given the labeled dataset L and the
synthetic unlabeled dataset U, an initial model de-
noted f_1 is trained using supervised learning on the
labeled dataset L. Then, at iteration t, one adopts
f_t as the teacher model to annotate the unlabeled
dataset U using pseudo labels. In self-training
GAL, the student model f_{t+1} is trained to opti-
mize a classification loss on the combination of
L and U:

ℓ_{t+1} = λ E_{(x,y)∼L} H(y, f_{t+1}(x)) + (1 − λ) E_{x̃∼g(x)} H(f_t(x̃), f_{t+1}(x̃)),   (2)

where λ = 0.5 unless stated otherwise. Although
many different variants of the basic self-training
algorithm discussed above exist in the literature,
we adopt the simplest variant of self-training and
limit hyperparameter tuning to a bare minimum.

3.3 Domain-Specific Text Generation
We take a pretrained GPT-2 language model
(Radford et al., 2019) and fine-tune it separately
on each dataset of interest after removing class
labels. We find that training from scratch on these
datasets is hopeless, but the larger the pretrained
GPT-2 variant, the better the validation perplex-
ity scores are. For tasks modeling a relationship
between multiple sentences, we concatenate a
separator token [SEP] between consecutive sen-
tences. To alleviate over-fitting on the training
set, we use the best checkpoint evaluated on
the dev set as our generation engine. Once a fine-
tuned GPT-2 model is obtained, we generate new
domain-specific data by using top-k random sam-
pling similar to Radford et al. (2019). We do not
feed any prompt to the LM, but a special [BOS]
token to initiate the generation chain. A genera-
tion episode is terminated when a special [EOS]
token is produced. We generate diverse sentences
by varying the random seed. After collecting
enough synthetic data, we only retain unique sen-
tences. For tasks with α input sentences, we dis-
card generated samples that violate this constraint
(approximately 10% of samples were rejected). Fi-
nally, we obtain task-specific synthetic data up to
40× larger than the original training sets. For some
samples of generated text for GLUE see Tables 11
and 12. We believe using bigger LMs and larger
synthetic datasets will improve our results, but we
are constrained by computer resources.
4 An Empirical Risk Minimization Perspective

In supervised learning, one seeks to learn a map-
ping f that, given an input x, predicts a reasonable
output y. To define the supervised learning prob-
lem formally, one assumes that input-output pairs
are drawn from a joint distribution P, namely,
(x, y) ∼ P(x, y), and a loss function H(y, f(x))
is used to assess the quality of a mapping f. This
loss is used to define a notion of expected risk:

R(f) = E_{P(x,y)} H(y, f(x)).   (3)

In almost all practical applications P(x, y) is
unknown. Hence, a labeled dataset of examples
L = {(x_i, y_i)}_{i=1}^N is used to approximate R(f) as

R̂(f) = (1/N) Σ_{i=1}^N H(y_i, f(x_i)).   (4)
This objective function is known as the empirical
risk, and learning f through minimizing R̂(f) is
known as the empirical risk minimization princi-
ple (Vapnik, 1992). To compensate for the finite
sample size in (4), one typically combines R̂(f)
with a regularizer to improve generalization.

Beyond Empirical Risk Minimization. Empir-
ical risk minimization (4) is motivated as a way
to approximate P(x, y) through a set of Dirac
delta functions on labeled examples:
P_δ(x, y) = Σ_i δ(x = x_i, y = y_i)/N. However, this approx-
imation is far from perfect, hence one uses a
heldout validation set for early stopping and
hyperparameter tuning.
Vicinal risk minimization (Chapelle et al.,
2001) approximates expected risk as E_{P_ν(x,y)} H(y, f(x)),
using a vicinity distribution, for example,
ν(x̃, ỹ | x, y) = N(x̃ − x, σ²) δ(ỹ = y), to
approximate P(x, y) as

P_ν(x, y) = (1/N) Σ_{i=1}^N ν(x̃ = x, ỹ = y | x_i, y_i).   (5)

The goal is to increase the support of each labeled
data point and improve the quality and robustness
of the risk function.
Recent work on mixup regularization (Zhang
et al., 2018) proposes an effective way to con-
struct another vicinity distribution by interpolating
between two data points and their labels. Despite
their simplicity, these smoothing techniques tend
to improve matters.
Generative Models for Risk Minimization.
One can factorize the joint distribution of input-
output pairs as P(x, y) = P(x)P(y | x). Ac-
cordingly, if one is able to learn a reasonable
unconditional generative model of x denoted
g(x), then one can draw a pair (x, y) by first
drawing x ∼ g(x) and then using the current
instance of f_t to draw y ∼ f_t(x). Then, one can
use f_t and g to approximate expected risk as

R_t(f_{t+1}) = E_{x∼g(x)} E_{y∼f_t(x)} H(y, f_{t+1}(x)).   (6)

The quality of this approximation highly depends
on the quality of f_t and g. If f_t is far from an
optimal classifier f* or g(x) is far from P(x), (6)
yields a poor approximation.
The expected risk in (6) smoothens the risk
landscape in complex ways beyond simple Gaus-
sian smoothing and interpolation. This smoothing
is applicable to any continuous, discrete, or struc-
tured domain as long as expressive generative
models of P(x) are available. That said, for al-
most all reasonable loss functions H (e.g., softmax
cross entropy and squared error), (6) is minimized
when f_{t+1} = f_t, which is not ideal, especially
when f_t is far from f*. On the other hand, em-
pirical risk (4) anchors the problem in real la-
beled examples that are provided as ground truth.
GAL self-training aims to combine the benefits
of (4) and (6) via:

R_t(f_{t+1}) = (λ/N) Σ_{i=1}^N H(y_i, f_{t+1}(x_i)) + (1 − λ) E_{x∼g(x)} E_{y∼f_t(x)} H(y, f_{t+1}(x)).   (7)

In this formulation, if f_t represents the minimizer
of empirical risk (4), then f_{t+1} = f_t is the mini-
mizer of (7), too. However, one does not seek the
global minimizer of empirical risk, but rather the
best performance on heldout data. If f_t is obtained
by stochastic gradient descent on any risk func-
tion, but early-stopped according to empirical risk
on a heldout set, then using such f_t in (7) to define
R_t(f_{t+1}) promotes the selection of a mapping f_{t+1}
that minimizes empirical risk while staying close
to the best performing mapping so far (i.e., f_t).
This formulation motivates self-training and GAL
as regularizers in the functional space and explains
why they can conceivably work. Although the ar-
guments are provided here for GAL self-training,
extending them to GAL-KD is straightforward
(omitted due to space constraints).
How About Class-conditional Generative
Models? One can also factorize the joint dis-
tribution P(x, y) as P(y)P(x | y) and accord-
ingly utilize a class-conditional generative model
g(x | y) to derive the following expected risk
formulation:

R(f) = E_{y∼P(y)} E_{x∼g(x|y)} H(y, f(x)).   (8)
In this setting pseudo labeling is not needed as
synthetic data is already labeled. One can show
that the optimal classifier f*_g that minimizes (8)
for the cross-entropy loss is given by

f*_g(y | x) = g(x | y) P(y) / Σ_{y′} g(x | y′) P(y′),   (9)

that is, turning the class-conditional generative
model into a classifier by using the Bayes rule
yields the optimal solution.
Model                   MNLI(m/mm)  CoLA  SST-2  MRPC       STS-B      QQP        QNLI  RTE   Avg
Prior work:
BERT-Theseus            82.4/82.1   47.8  92.2   87.6/83.2  85.6/84.1  71.6/89.3  89.6  66.2  78.6
BERT-PKD                81.5/81.0   −     92.0   85.0/79.9  −          70.7/88.9  89.0  65.5  −
tinyBERT                84.6/83.2   51.1  93.1   87.3/82.6  85.0/83.7  71.6/89.1  90.4  70.0  79.8
MATE-KD                 86.2/85.6   58.6  95.1   91.2/88.1  88.5/88.4  73.0/89.7  92.4  76.6  83.5
Our results:
DistilRoBERTa           83.8/83.4   55.9  93.2   87.4/83.1  87.5/87.5  71.7/89.1  90.6  73.3  81.2
DistilRoBERTa + KD      84.5/84.1   53.0  93.5   88.9/85.1  88.0/87.4  71.9/89.2  91.0  75.0  81.5
DistilRoBERTa + WS      86.2/85.9   52.2  94.0   89.9/86.4  88.7/88.3  71.7/89.2  91.5  76.2  82.1
DistilRoBERTa + RT      86.2/85.6   55.0  94.9   90.1/86.5  89.2/88.9  72.5/89.7  92.1  77.2  82.9
DistilRoBERTa + GAL     86.9/86.4   58.6  95.3   91.6/88.7  89.9/89.5  73.0/89.9  92.7  79.7  84.3

Table 1: GLUE test results for a 6-layer transformer. GAL establishes a new state of the art on KD
for NLP. Baselines: BERT-Theseus (Xu et al., 2020), BERT-PKD (Sun et al., 2019a), tinyBERT
(Jiao et al., 2019), MATE-KD (Rashid et al., 2021), DistilRoBERTa (Sanh et al., 2019), and Distil-
RoBERTa + KD (standard KD), DistilRoBERTa + WS (word substitution), and DistilRoBERTa + RT
(round-trip translation). MNLI-m and MNLI-mm indicate matched and mismatched, respectively.
Given that the accuracy of generative clas-
sifiers on text classification is behind their dis-
criminative counterparts (e.g., Ravuri and Vinyals,
2019), we think substituting (8) into (7) is not
a good idea. Essentially, by substituting (8) into
the classification objective, one is regularizing
f to remain close to f*_g, which is not an ef-
fective strategy if f*_g is not competitive. This
argument corroborates the evidence from our ab-
lation studies and recent work showing that using
class-conditional generative models to augment
supervised learning does not provide big gains
(Ravuri and Vinyals, 2019).
That said, one can still use class-conditional
generative models to synthesize high-fidelity sam-
ples. As long as these samples are treated as un-
labeled examples and annotated using a classifier,
for example, f_t, we believe this is a reasonable
approach falling under GAL. Note that our ar-
gument above only applies to the scenario where
class-conditional generative models are used to
synthesize labeled examples. In other words, GAL
emphasizes prediction of the labels in the course
of the algorithm, rather than having the labels
predefined. If one uses the unlabeled synthetic
examples from class-conditional generative mod-
els, it still aligns with (7), which will be verified in
Section 5.4.
5 Experiments
In this section, we assess the effectiveness of
GAL on KD, self-training, and few-shot learning.
5.1 State-of-the-art Results of Knowledge
Distillation with GAL on GLUE
We use the GLUE benchmark (Wang y cols.,
2019c) for our KD experiments; see Appendix A.1
for benchmark details. Our synthetic unlabeled
dataset U includes 40× as many examples as the
original dataset for each task in GLUE.
It is known that KD on fresh data, unseen during
training, performs better (Bucilu˘a et al., 2006;
Chen et al., 2020a) than KD on original training
data. Hence, we investigate the effectiveness of
KD using generated unlabeled data through GAL.
We use the HuggingFace implementation (Wolf
et al., 2020) for KD experiments and adopt a stan-
dard experimental setup consistent with previous
work (Sun et al., 2019a; Xu et al., 2020). Follow-
ing Rashid et al. (2021), a fine-tuned RoBERTa-
large (24-layer transformer) represents the teacher
and a DistilRoBERTa (6-layer transformer) (Sanh
et al., 2019) is used as the student. We train the
student model on U and L, where U is annotated
by the best RoBERTa-large model, achieving an
average score of 86.5. We then mix L and U at a
ratio of 1:4, which is equivalent to λ = 0.2. This
ratio works best on the dev set.
Table 1 shows the results of individual 6-layer
transformers on the GLUE test set. All of the base-
lines use an identical student architecture. GAL
achieves the best entry on the GLUE leaderboard,
marking a new state of the art for KD on NLP. It
outperforms strong KD baselines such as Distil-
RoBERTa (Sanh et al., 2019), BERT-PKD (Sun
Model                             MNLI      CoLA      SST-2     MRPC      STS-B     QQP       QNLI      RTE       Avg
RoBERTa base                      87.7±0.1  63.6±0.4  94.8±0.1  90.1±0.4  90.8±0.1  91.5±0.1  92.6±0.1  78.8±0.4  86.2
 + GAL (iter 1)                   87.9±0.1  65.1±0.5  95.3±0.1  91.7±0.5  91.4±0.1  91.8±0.1  93.1±0.1  81.4±0.4  87.2
 + GAL (iter 2)                   88.0±0.1  65.2±0.5  95.3±0.1  92.2±0.4  91.5±0.1  91.7±0.1  93.2±0.1  82.4±0.5  87.4
 + GAL (iter 3)                   87.9±0.1  65.5±0.5  95.3±0.1  92.2±0.5  91.7±0.2  91.7±0.1  93.2±0.1  82.0±0.5  87.4
RoBERTa base + self-distillation  88.1±0.1  63.7±0.5  95.2±0.1  90.3±0.4  90.4±0.1  91.5±0.1  93.1±0.1  79.7±0.5  86.5

Table 2: RoBERTa base and GAL self-training results on GLUE dev sets, averaged across 5 indepen-
dent runs (± indicates the error bar, i.e., standard deviation divided by √5).
et al., 2019a), BERT-Theseus (Xu et al., 2020),
tinyBERT (Jiao et al., 2019), and MATE-KD
(Rashid et al., 2021). It also outperforms our own
DistilRoBERTa+KD baseline, which learns from
soft labels produced by an identical RoBERTa-
large ensemble on the original labeled dataset.
While the use of soft labels outperform the vanilla
fine-tuned DistilRoBERTa model, it significantly
underperforms our KD+GAL baseline. We also
compare with two strong data-augmentation base-
lines, round-trip translation (RT) (Yu et al., 2018;
Shleifer, 2019) and word substitutions (WS) (Jiao
et al., 2019; Wei and Zou, 2019). For RT, we gen-
erate 40× unlabeled data using German as the
bridge language (English→German→English). The
translations are generated via the best model in
WMT19 (Ng et al., 2019). We use the codebase
from Jiao et al. (2019) to conduct WS data aug-
mentation. We mirror the KD experimental setup
of GAL for both RT and WS. Although Distil-
RoBERTa+RT and DistilRoBERTa+WS are bet-
ter than vanilla DistilRoBERTa and KD variants,
they still drastically underperform our approach.
5.2 Self-Training with GAL on GLUE
We fine-tune a pretrained RoBERTa model pro-
vided by fairseq (Ott et al., 2019) on each GLUE
task. Fine-tuned RoBERTa serves as the first
teacher model for self-training. Each student
model is initialized with the original pretrained
RoBERTa and fine-tuned with exactly the same
hyperparameters as suggested by fairseq (Ott
et al., 2019). We combine the labeled dataset
L and the synthetic dataset U with a ratio of 1:1,
by oversampling labeled data. This corresponds
to λ = 0.5 in Eq. (7).
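The following self-contained toy sketch mirrors the structure of this
procedure (Algorithm 2 and Eq. (2)): a linear model stands in for
RoBERTa and random tensors stand in for L and U, so only the loop
structure, the soft pseudo labels, and the equal weighting of the two
terms reflect the actual setup; for clarity the two terms are optimized
in alternation rather than in mixed batches.

```python
# Toy sketch of GAL self-training (Section 3.2); all sizes and models here are
# illustrative assumptions, not the RoBERTa/fairseq pipeline used in the paper.
import torch
import torch.nn.functional as F

def train(model, opt, xs, targets, soft, epochs=50):
    for _ in range(epochs):
        opt.zero_grad()
        logits = model(xs)
        if soft:   # soft pseudo labels: cross entropy against a distribution
            loss = -(targets * F.log_softmax(logits, -1)).sum(-1).mean()
        else:      # hard ground-truth labels
            loss = F.cross_entropy(logits, targets)
        loss.backward()
        opt.step()

dim, n_classes = 16, 3
x_l, y_l = torch.randn(64, dim), torch.randint(0, n_classes, (64,))   # labeled L
x_u = torch.randn(256, dim)                                           # synthetic U

teacher = torch.nn.Linear(dim, n_classes)
train(teacher, torch.optim.SGD(teacher.parameters(), lr=0.1), x_l, y_l, soft=False)

for t in range(3):                              # GAL self-training iterations
    with torch.no_grad():                       # annotate U with soft labels
        pseudo = F.softmax(teacher(x_u), -1)
    student = torch.nn.Linear(dim, n_classes)   # fresh student each iteration
    opt = torch.optim.SGD(student.parameters(), lr=0.1)
    train(student, opt, x_l, y_l, soft=False)   # supervised term of Eq. (2)
    train(student, opt, x_u, pseudo, soft=True) # pseudo-labeled term of Eq. (2)
    teacher = student
```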
Table 2 shows that GAL provides an aver-
age improvement of +1.3% over RoBERTa-base.
We see consistent improvements with more GAL
iterations, but performance saturates after three
iterations. We further compare our approach with
a self-distillation (Furlanello et al., 2018) base-
line, in which the teacher and student models
use the same architecture and transfer knowledge
via the original labeled training set. Although
self-distillation provides a slight improvement,
the gains from GAL are more significant.
We delve deeper and combine GAL self-
training with RoBERTa-large and report test re-
sults for both single and ensemble models
in Table 3. We observe consistent gains coming
from GAL on RoBERTa-large. Our results un-
derperform the latest and largest LMs from the
GLUE leaderboard, but we are optimistic that
GAL can be effectively combined with enormous
LMs to provide additional gains.
5.3 Prompt-based Few-shot Experiments
GPT-3 (Brown et al., 2020) has introduced an
optimization-free paradigm for few-shot learning
for NLP. Without updating the parameters, large
LMs can correctly predict the labels of the in-
puts by conditioning on a prompt, which consists
of an instruction, a few labeled instances and a
new unlabeled input. We apply GAL to prompt-
based few-shot learning. Specifically, we present
k labeled examples as a prompt to GPT-J (Wang
and Komatsuzaki, 2021), an open-sourced re-
implementation of GPT-3-6B, and generate m
synthetic examples, followed by the correspond-
ing labels. Note that to mitigate noisy outputs, the
generation of each synthetic example only condi-
tions on the original k labeled examples. Finally,
we concatenate the original k examples and m
synthetic examples, and conduct a (k + m)-shot
learning experiment with GPT-J.
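The sketch below illustrates this prompt-based recipe with k = 4 and
m = 12; the prompt template, the label verbalization, and the parsing
of GPT-J completions are hypothetical stand-ins rather than the exact
prompts used in our experiments.

```python
# Illustrative sketch of prompt-based GAL for few-shot classification.
# generate_with_gptj is a placeholder callable that returns a GPT-J completion
# string for a given prompt; the "Review/Sentiment" template is an assumption.
def build_prompt(examples):
    # examples: list of (text, label) pairs rendered one per line
    return "\n".join(f"Review: {x}\nSentiment: {y}" for x, y in examples) + "\nReview:"

def parse_example(completion):
    # expects "...text...\nSentiment: label"; returns None for malformed outputs
    if "\nSentiment:" not in completion:
        return None
    text, label = completion.split("\nSentiment:", 1)
    return text.strip(), label.strip().split("\n")[0]

def gal_few_shot(labeled_k, test_input, generate_with_gptj, m=12):
    prompt = build_prompt(labeled_k)           # condition only on the k real examples
    synthetic = []
    while len(synthetic) < m:
        parsed = parse_example(generate_with_gptj(prompt))
        if parsed is not None:
            synthetic.append(parsed)
    # (k + m)-shot prediction: condition GPT-J on real plus synthetic examples
    final_prompt = build_prompt(labeled_k + synthetic) + f" {test_input}\nSentiment:"
    return generate_with_gptj(final_prompt)
```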
Brown et al. (2020) studied a total of 51 few-
shot learning tasks. Studying all of these tasks is
prohibitively expensive. Thus, we filter tasks by
following these two steps. First, since generating
m synthetic examples for each test instance is
computationally expensive, we exclude tasks that
Model                  MNLI(m/mm)  CoLA  SST-2  MRPC       STS-B      QQP        QNLI  RTE   Avg

Individual Models (our implementation):
RoBERTa-large          90.1/89.7   63.8  96.1   91.2/88.3  90.9/90.7  72.5/89.6  94.5  85.9  86.5
RoBERTa-large + GAL    90.2/89.8   66.2  96.4   92.0/89.2  90.7/90.5  73.6/89.9  95.0  86.3  87.1

Ensemble Models (our implementation):
RoBERTa-large          91.2/90.5   66.8  96.9   92.8/90.3  91.9/91.6  74.5/90.4  95.5  87.7  87.9
RoBERTa-large + GAL    91.0/90.7   67.9  97.1   93.1/90.8  91.6/91.4  74.5/90.4  95.8  88.2  88.2

State-of-the-art:
RoBERTa-large          90.8/90.2   67.8  96.7   92.3/89.8  92.2/91.9  74.3/90.3  95.4  88.2  88.0
ELECTRA                91.3/90.8   71.7  97.1   93.1/90.7  92.9/92.5  75.6/90.8  95.8  89.8  89.2
T5                     92.2/91.9   71.6  97.5   92.8/90.4  93.1/92.8  75.1/90.6  96.9  92.8  89.8
ERNIE                  91.9/91.4   74.4  97.8   93.9/91.8  93.0/92.6  75.2/90.9  97.3  92.0  90.2
DeBERTa                91.9/91.6   71.5  97.5   94.0/92.0  92.9/92.6  76.2/90.8  99.2  93.2  90.3

Table 3: RoBERTa-large with GAL self-training and SoTA methods evaluated on GLUE test sets. The benefit
of GAL on single models is larger than ensembles. It appears that self-training reduces the variance of models.
Baselines include much larger models: RoBERTa-large (Liu et al., 2019), ELECTRA (Clark et al., 2020),
T5 (Raffel et al., 2020), ERNIE (Sun et al., 2019b), and DeBERTa (He et al., 2020). MNLI-m and MNLI-mm
indicate matched and mismatched, respectively.
Model                             SST-2     PIQA      COPA      BoolQ     Avg
4-shot                            89.8±0.8  76.0±1.4  79.0±1.5  64.3±0.8  77.3
8-shot                            91.3±0.8  76.2±1.2  79.0±1.5  66.2±0.8  78.2
16-shot                           92.7±0.6  77.0±0.9  81.0±1.1  66.8±0.8  79.4
4-shot + synthetic 12-shot (GAL)  91.5±0.7  76.7±1.0  80.0±1.2  65.9±0.8  78.5

Table 4: Few-shot learning results for GPT-J (6B) (Wang and Komatsuzaki, 2021)
on four NLP datasets. Accuracy is reported for these datasets.
have more than 5k test examples. Second, we fil-
ter out tasks on which GPT-3-6B achieves a score
lower than 65% (please refer to Table H.1 in
Brown et al. [2020] for more details). After ap-
plying the filtering steps, we use four datasets:
SST-2 (Wang et al., 2019c), PIQA (Bisk et al.,
2020), COPA, and BoolQ (Wang et al., 2019b) as
the testbed. We notice that in order to generate
valid synthetic data, GPT-J requires seeing at
least 4 labeled examples. Moreover, at most
16 examples of BoolQ can be fed into GPT-J
without truncation. Thus, we set k and m to 4
and 12, respectively. As seen in Table 4, GAL
leads to an average improvement of 1.2% over
4-shot learning, and reduces the gap between
4-shot and 16-shot learning. We noticed that the
quality of some generated examples is low. We
believe the performance of few-shot learning can
be further improved with high-quality instances.
One solution is to generate many synthetic ex-
amples, and select a high-quality subset. Since
each test instance conditions on distinct labeled
instances, one has to generate different synthetic
instances for each test example from GPT-J, which
causes expensive computation. Due to such com-
putational constraints, we leave the investigation
of data selection strategies to future work.
5.4 Ablating Components of GAL on GLUE
We conduct an in-depth study of different com-
ponents of GAL on GLUE datasets. Unless stated
otherwise, we use a RoBERTa-base model with a
combination of the original training data and 40×
synthetic data for each self-training experiment.

GPT-2 Model Size. Radford et al. (2019)
present a few variants of the GPT-2 model includ-
ing GPT-2, GPT-2-medium, GPT-2-large, and
GPT-2-XL. Larger GPT-2 models yield better
perplexity scores and higher generation quality.
We utilize these models except GPT-2-XL within
the GAL framework to study the impact of the
generative model's quality on downstream task
performance. Table 5 shows that regardless of the
GPT-2   SST-2  RTE   MRPC  CoLA
NA      94.8   78.8  90.1  63.6
small   95.5   81.3  90.9  63.9
medium  95.3   81.3  91.3  63.7
large   95.3   81.4  91.7  65.1

Table 5: GAL with various GPT-2 model sizes on
GLUE dev sets. NA indicates a RoBERTa base
model. We bold the best numbers.
Pseudo label  SST-2  RTE   MRPC  CoLA
hard          95.0   80.7  90.8  63.0
soft          95.3   81.4  91.7  65.1

Table 6: GAL with soft vs. hard pseudo labels on
GLUE dev sets. We bold the best numbers.
GPT-2 model sizes, GAL consistently surpasses
the vanilla RoBERTa base. Moreover, SST-2 and
RTE datasets are not sensitive to the capacity of
GPT-2, but higher quality synthetic text improves
the results on MRPC and CoLA datasets. We leave
investigation of GPT-2-XL and even larger LMs
such as GPT-3 (Brown et al., 2020) to future work.
Soft vs. Hard Pseudo Label. We investigate
the use of soft and hard pseudo labels within the
GAL framework. The results in Table 6 suggest
that GAL using soft pseudo labels is more effective
than hard labels on the GLUE benchmark. This
finding is compatible with the intuition that soft
labels enable measuring the functional similarity
of neural networks better (Hinton et al., 2015).
Class-conditional Synthetic Data Generation.
Previous work (Kumar et al., 2020b; Ravuri
and Vinyals, 2019) suggests that it is chal-
lenging to utilize labeled synthetic data from
class-conditional generative models to boost the
accuracy of text and image classifiers. Our theory
in Section 4 points to the potential drawback of
class-conditional synthetic data. We empirically
study this phenomenon, by fine-tuning GPT-2 in a
class-conditional manner. Then we utilize its syn-
thetic examples in two different cases: 1) labeled
synthetic examples and 2) unlabeled synthetic ex-
amples. Table 7 shows that not only do class-
conditional LMs underperform unconditional LMs
in our GAL framework, but also they are much
worse than the baseline, when using the pre-
defined labels. However, if we apply GAL
to these examples, the class-conditional LM is
on par with the unconditional one, which cor-
roborates the importance of the annotation step in
GAL. We provide more analysis in Appendix A.3.
6 Limitations
This work demonstrates that one can leverage
synthetic in-domain data generated by powerful
pre-trained generative models. For simplicity, we
do not employ any filtering avenue to retain di-
verse but high-quality data points. However, pre-
vious work has shown that advanced filtering
approaches can further improve the performance
(Sohn et al., 2020; Du et al., 2020; Yang et al.,
2020). Given that the improvements in self-
training are not sizeable, we believe it is worth
imposing filtering methods on the synthetic data
to mitigate the side effects caused by the noisy
data points.
Although we examine the effectiveness of GAL
on various classification tasks, we still focus on
sentence-level tasks. Because of the superior
performance on sentence-level tasks, there has
been a surge of interest shifting to document-level
tasks, such as document-level machine transla-
tion (Miculicich et al., 2018; Voita et al., 2018;
Maruf and Haffari, 2018), document summariza-
tion (Rush et al., 2015; Nallapati et al., 2016),
and so on. As these tasks suffer from data scar-
city, one can leverage GAL to synthesize more
data points. However, previous work has shown
that GPT-2 has difficulty generating coherent text
requiring long-range dependency (Orbach and
Goldberg, 2020; Guan et al., 2020). Thus, such
a limitation may hinder the application of GAL
to document-level tasks.
Moreover, the label space of the studied tasks
is not as complex as that of structured prediction
tasks, such as machine translation, dialog systems,
question answering, and so on. However, we be-
lieve one can smoothly adapt GAL to these tasks
as well. Let us consider machine translation (MT)
as a canonical structured prediction task. Previous
work has shown that one can use (real) monolin-
gual data, in either the source or the target language,
through data augmentation (Sennrich et al., 2016)
or knowledge distillation (Kim and Rush, 2016)
to improve structured prediction tasks. This
suggests a promising avenue for future research
on using synthetically generated monolingual data
to improve MT for specialized domains where
even monolingual data is scarce.
Generative model            Labeled synthetic data  SST-2  RTE   MRPC  CoLA
None (base)                 −                       94.8   78.8  90.1  63.6
Class-conditional LM        ✓                       92.9   74.4  86.0  58.4
Unconditional LM (GAL)      ✗                       95.3   81.4  91.7  65.1
Class-conditional LM (GAL)  ✗                       95.4   81.0  91.4  65.2

Table 7: Synthetic data from class-conditional LMs underperforms GAL and RoBERTa on
GLUE dev sets.
Moreover, Vu et al. (2021a) suggest that one
can leverage a retrieval-based approach to ob-
tain monolingual sentences from generic data
stores. This retrieved monolingual data is then
employed to improve the translation quality in
a domain adaptation setting. This suggests that
a GAL-based approach to synthetically generate
monolingual text is a promising method to im-
prove MT for specialized domains—an interest-
ing direction for future research.
7 Conclusion
We present Generate, Annotate, and Learn (GAL):
a framework for self-training and knowledge dis-
tillation with generated unlabeled data. We mo-
tivate GAL from an expected risk minimization
perspective and demonstrate both theoretically
and empirically that the use of unconditional gen-
erative models for synthetic data generation is
more effective than class-conditional generative
models previously used in the literature. GAL
leverages advances in large pretrained language
models to help supervised learning and can have
implications for learning from limited labeled
datos. GAL significantly helps improve knowledge
distillation and prompt-based few-shot learning.
Moreover, a concurrent work (Gowal et al.,
2021) has shown that using generated images can
enhance the robustness of image classifiers. We
will explore this direction on NLP tasks in the
future. Finally, we hope that GAL will stimulate
new research on the evaluation and development
of large language models.
Acknowledgments
We would like to thank the anonymous review-
ers and action editor André F. T. Martins for their
comments and suggestions on this work. The com-
putational resources of this work are partly sup-
ported by the Multi-modal Australian ScienceS
Imaging and Visualisation Environment (MAS-
SIVE) (www.massive.org.au). This mate-
rial is partly based on research sponsored by Air
Force Research Laboratory and DARPA under
agreement number FA8750-19-2-0501. The U.S.
Government is authorized to reproduce and dis-
tribute reprints for Governmental purposes not-
withstanding any copyright notation thereon.
References
Steven Abney. 2004. Understanding the
Yarowsky algorithm. Computational Linguis-
tics, 30(3):365–395. https://doi.org/10
.1162/0891201041850876
A. Agrawala. 1970. Learning with a probabilis-
tic teacher. IEEE Transactions on Information
Theory, 16(4):373–379. https://doi.org
/10.1109/TIT.1970.1054472
Chris Alberti, Daniel Andor, Emily Pitler, Jacob
Devlin, and Michael Collins. 2019. Synthetic
QA corpora generation with roundtrip con-
sistency. In Proceedings of the 57th Annual
Meeting of the Association for Computational
Linguistics, pages 6168–6173. https://doi
.org/10.18653/v1/P19-1620
Yonatan Bisk, Rowan Zellers, Jianfeng Gao, and
Yejin Choi. 2020. PIQA: Reasoning about
physical commonsense in natural language. In
Proceedings of the AAAI Conference on Artifi-
cial Intelligence, volume 34, pages 7432–7439.
https://doi.org/10.1609/aaai.v34i05
.6239
Tom B. Brown, Benjamin Mann, Nick Ryder,
Melanie Subbiah, Jared Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam,
Girish Sastry, Amanda Askell, Sandhini
Agarwal, Ariel Herbert-Voss, Gretchen Krueger,
Tom Henighan, Rewon Child, Aditya Ramesh,
Daniel M. Ziegler, Jeffrey Wu, Clemens Winter,
Christopher Hesse, Mark Chen, Eric Sigler,
Mateusz Litwin, Scott Gray, Benjamin
Chess, Jack Clark, Christopher Berner, Sam
McCandlish, Alec Radford, Ilya Sutskever, y
Dario Amodei. 2020. Language models are
few-shot learners. arXiv:2005.14165.
Steven Y. Feng, Varun Gangal, Jason Wei, Sarath
Chandar, Soroush Vosoughi, Teruko Mitamura,
and Eduard Hovy. 2021. A survey of data aug-
mentation approaches for NLP. In Findings of
the Association for Computational Linguistics:
ACL-IJCNLP 2021, pages 968–988.
Cristian Bucilu˘a, Rich Caruana, and Alexandru
Niculescu-Mizil. 2006. Model compression.
Proceedings of the 12th ACM SIGKDD Interna-
tional Conference on Knowledge Discovery and
Data Mining, pages 535–541. https://doi
.org/10.1145/1150402.1150464
Yair Carmon, Aditi Raghunathan, Ludwig
Schmidt, John C. Duchi, and Percy S. Liang.
2019. Unlabeled data improves adversarial
robustness. Advances in Neural Information
Processing Systems, 32.
Olivier Chapelle, Jason Weston, Léon Bottou,
and Vladimir Vapnik. 2001. Vicinal risk min-
imization. Advances in Neural Information
Processing Systems.
Ting Chen, Simon Kornblith, Kevin Swersky,
Mohammad Norouzi, and Geoffrey Hinton.
2020a. Big self-supervised models are strong
semi-supervised learners. NeurIPS.
Yining Chen, Colin Wei, Ananya Kumar, y
Tengyu Ma. 2020b. Self-training avoids us-
ing spurious features under domain shift. In
Advances in Neural Information Processing
Systems 33: Annual Conference on Neural In-
formation Processing Systems 2020, NeurIPS
2020, December 6–12, 2020, virtual.
Kevin Clark, Minh-Thang Luong, Quoc V. Le,
and Christopher D. Manning. 2020. Electra:
Pre-training text encoders as discriminators
rather than generators. International Confer-
ence on Learning Representations.
Jingfei Du, Edouard Grave, Beliz Gunel, Vishrav
Chaudhary, Onur Celebi, Michael Auli, Ves
Stoyanov, and Alexis Conneau. 2020. Self-
training improves pre-training for natural lan-
guage understanding. arXiv:2010.02194.
Jason Eisner and Damianos Karakos. 2005.
Bootstrapping without the boot. In Proceed-
ings of Human Language Technology Confer-
ence and Conference on Empirical Methods in
Natural Language Processing, pages 395–402.
https://doi.org/10.3115/1220575
.1220625
S. Fralick. 1967. Learning to recognize patterns
without a teacher. IEEE Transactions on In-
formation Theory. https://doi.org/10
.1109/TIT.1967.1053952
Tommaso Furlanello, Zachary Lipton, Miguel
Tschannen, Laurent Itti, and Anima Anandkumar.
2018. Born again neural networks. Inter-
national Conference on Machine Learning,
pages 1607–1616.
Leo Gao, Jonathan Tow, Stella Biderman, Sid
Black, Anthony DiPofi, Charles Foster, Laurence
Golding, Jeffrey Hsu, Kyle McDonell, Niklas
Muennighoff, Jason Phang, Laria Reynolds,
Eric Tang, Anish Thite, Ben Wang, Kevin
Wang, and Andy Zou. 2021. A framework for
few-shot language model evaluation.
Sven Gowal, Sylvestre-Alvise Rebuffi, Olivia
Wiles, Florian Stimberg, Dan Andrei Calian,
and Timothy A. Mann. 2021. Improving robust-
ness using generated data. Advances in Neural
Information Processing Systems, 34.
Jian Guan, Fei Huang, Zhihao Zhao, Xiaoyan
Zhu, and Minlie Huang. 2020. A knowledge-
enhanced pretraining model for commonsense
story generation. Transactions of the Associa-
tion for Computational Linguistics, 8:93–108.
https://doi.org/10.1162/tacl a 00302
Gholamreza Haffari and Anoop Sarkar. 2007.
Analysis of semi-supervised learning with the
Yarowsky algorithm. In UAI 2007, Proceed-
ings of the Twenty-Third Conference on Un-
certainty in Artificial Intelligence, Vancouver,
BC, Canada, July 19–22, 2007, pages 159–166.
AUAI Press.
Pengcheng He, Xiaodong Liu, Jianfeng Gao,
and Weizhu Chen. 2020. DeBERTa: Decoding-
enhanced BERT with disentangled attention.
arXiv:2006.03654.
Danny Hernandez, Jared Kaplan, Tom Henighan,
and Sam McCandlish. 2021. Scaling laws for
transfer.
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean.
2015. Distilling the knowledge in a neural
network. arXiv:1503.02531.
Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang,
Xiao Chen, Linlin Li, Fang Wang, and Qun
Liu. 2019. TinyBERT: Distilling BERT for
natural language understanding. arXiv:1909
.10351. https://doi.org/10.18653/v1
/2020.findings-emnlp.372
Yoon Kim and Alexander M. Rush. 2016.
Sequence-level knowledge distillation. In Pro-
ceedings of the 2016 Conference on Empirical
Methods in Natural Language Processing,
pages 1317–1327. https://doi.org/10
.18653/v1/D16-1139
Sosuke Kobayashi. 2018. Contextual augmenta-
tion: Data augmentation by words with paradig-
matic relations. In Proceedings of the 2018
Conference of the North American Chapter
of the Association for Computational Linguis-
tics: Human Language Technologies, Volume 2
(Short Papers), pages 452–457, New Orleans,
Louisiana. Association for Computational
Linguistics. https://doi.org/10.18653
/v1/N18-2072
Ananya Kumar, Tengyu Ma, and Percy Liang.
2020a. Understanding self-training for gradual
domain adaptation. In Proceedings of the 37th
International Conference on Machine Learn-
ing, volume 119 of Proceedings of Machine
Learning Research, pages 5468–5479. PMLR.
Varun Kumar, Ashutosh Choudhary, and Eunah
Cho. 2020b. Data augmentation using pre-
trained transformer models. arXiv:2003.02245.
Jared Lichtarge, Chris Alberti, Shankar Kumar,
Noam Shazeer, Niki Parmar, and Simon Tong.
2019. Corpora generation for grammatical error
correction. arXiv:1904.05780. https://doi
.org/10.18653/v1/N19-1333
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei
Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin
Stoyanov. 2019. RoBERTa: A robustly optimized
BERT pretraining approach. arXiv:1907.11692.
Sameen Maruf and Gholamreza Haffari. 2018.
Document context neural machine translation
with memory networks. In Proceedings of the
56th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long
Papers), pages 1275–1284. https://doi.org
/10.18653/v1/P18-1118
Alana Marzoev, Samuel Madden, METRO. Frans
Kaashoek, Michael J. Cafarella, and Jacob
Andreas. 2020. Unnatural language processing:
Bridging the gap between synthetic and natural
language data. ArXiv, abs/2004.13645.
Lesly Miculicich, Dhananjay Ram, Nikolaos
Pappas, and James Henderson. 2018. Document-
level neural machine translation with hierarchi-
cal attention networks. In Proceedings of the
2018 Conference on Empirical Methods in Nat-
ural Language Processing, pages 2947–2954,
Brussels, Belgium. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/D18-1325
Hossein Mobahi, Mehrdad Farajtabar, and Peter
l. Bartlett. 2020. Self-distillation amplifies
regularization in Hilbert space. In Advances in
Neural Information Processing Systems 33:
Annual Conference on Neural Information Pro-
cessing Systems 2020, NeurIPS 2020, Decem-
ber 6-12, 2020, virtual.
Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre,
and Bing Xiang. 2016. Abstractive text summari-
zation using sequence-to-sequence RNNs and
beyond. In Proceedings of The 20th SIGNLL
Conference on Computational Natural Lan-
guage Learning, pages 280–290. https://
doi.org/10.18653/v1/K16-1028
Nathan Ng, Kyra Yee, Alexei Baevski, Myle
Ott, Michael Auli, and Sergey Edunov. 2019.
Facebook FAIR's WMT19 news translation task
submission. In Proceedings of the Fourth Con-
ference on Machine Translation (Volumen 2:
Shared Task Papers, Day 1), pages 314–319.
Sajad Norouzi, David J. Fleet, and Mohammad
Norouzi. 2020. Exemplar VAEs for exem-
plar based generation and data augmentation.
arXiv:2004.04795.
Eyal Orbach and Yoav Goldberg. 2020.
Facts2Story: Controlling text generation by
key facts. In Proceedings of the 28th Interna-
tional Conference on Computational Linguistics,
pages 2329–2345, Barcelona, Spain (Online).
International Committee on Computational
Linguistics. https://doi.org/10.18653
/v1/2020.coling-main.211
Myle Ott, Sergey Edunov, Alexei Baevski,
Angela Fan, Sam Gross, Nathan Ng, David
Grangier, and Michael Auli. 2019. fairseq: A
fast, extensible toolkit for sequence modeling.
In Proceedings of the 2019 Conference of the
North American Chapter of the Association for
Ligüística computacional (Demonstrations),
pages 48–53. https://doi.org/10.18653
/v1/N19-4009
Samet Oymak and Talha Cihad Gulcu. 2020.
Statistical and algorithmic insights for semi-
supervised learning with self-training. CoRR,
abs/2006.11006.
Alec Radford, Jeffrey Wu, Rewon Child, David
Luan, Dario Amodei, and Ilya Sutskever. 2019.
Language models are unsupervised multitask
learners.
Colin Raffel, Noam Shazeer, Adam Roberts,
Katherine Lee, Sharan Narang, Michael
Matena, Yanqi Zhou, Wei Li, and Peter J. Liu.
2020. Exploring the limits of transfer learning
with a unified text-to-text transformer. Journal
of Machine Learning Research, 21:1–67.
Aditi Raghunathan, Sang Michael Xie, Fanny
Yang, John Duchi, and Percy Liang. 2020.
Understanding and mitigating the tradeoff be-
tween robustness and accuracy. In Proceedings
of the 37th International Conference on Ma-
chine Learning, volume 119 of Proceedings of
Machine Learning Research, pages 7909–7919.
PMLR.
Ahmad Rashid, Vasileios Lioutas, and Mehdi
Rezagholizadeh. 2021. MATE-KD: Masked ad-
versarial text, a companion to knowledge dis-
tillation. arXiv preprint arXiv:2105.05912.
https://doi.org/10.18653/v1/2021
.acl-long.86
Suman Ravuri and Oriol Vinyals. 2019. Classi-
fication accuracy score for conditional genera-
tive models. Advances in Neural Information
Processing Systems, pages 12268–12279.
Alexander M. Rush, Sumit Chopra, and Jason
Weston. 2015. A neural attention model for
abstractive sentence summarization. In Pro-
ceedings of the 2015 Conference on Empirical
Methods in Natural Language Processing,
pages 379–389.
Victor Sanh, Lysandre Debut, Julien Chaumond,
and Thomas Wolf. 2019. DistilBERT, a dis-
tilled version of BERT: Smaller, faster, cheaper
and lighter. ArXiv, abs/1910.01108.
H. Scudder. 1965. Probability of error of some
adaptive pattern-recognition machines. IEEE
Transactions on Information Theory. https://
doi.org/10.1109/TIT.1965.1053799
Rico Sennrich, Barry Haddow, and Alexandra
Birch. 2016. Improving neural machine trans-
lation models with monolingual data. In Pro-
ceedings of the 54th Annual Meeting of the
Association for Computational Linguistics
(Volume 1: Long Papers), pages 86–96, Berlin,
Germany. Association for Computational
Linguistics. https://doi.org/10.18653/v1
/P16-1009
Dinghan Shen, Mingzhi Zheng, Yelong Shen,
Yanru Qu, and Weizhu Chen. 2020. A sim-
ple but tough-to-beat data augmentation ap-
proach for natural language understanding and
generation. arXiv preprint arXiv:2009.13818.
Sam Shleifer. 2019. Low resource text classifi-
cation with ULMFiT and backtranslation. arXiv
preprint arXiv:1903.09244.
Kihyuk Sohn, David Berthelot, Chun-Liang Li,
Zizhao Zhang, Nicholas Carlini, Ekin D. Cubuk,
Alex Kurakin, Han Zhang, and Colin Raffel.
2020. FixMatch: Simplifying semi-supervised
learning with consistency and confidence.
arXiv:2001.07685.
Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing
Liu. 2019a. Patient knowledge distillation for
BERT model compression. In Proceedings of
the 2019 Conference on Empirical Meth-
ods in Natural Language Processing and the
9th International Joint Conference on Natu-
ral Language Processing (EMNLP-IJCNLP),
pages 4314–4323. https://doi.org/10
.18653/v1/D19-1441
Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng,
Xuyi Chen, Han Zhang, Xin Tian, Danxiang
Zhu, Hao Tian, and Hua Wu. 2019b. ERNIE:
Enhanced representation through knowledge
integration. arXiv preprint arXiv:1904.09223.
Nicola Ueffing, Gholamreza Haffari, and Anoop
Sarkar. 2007. Transductive learning for sta-
tistical machine translation. In Proceedings of
the 45th Annual Meeting of the Association
of Computational Linguistics, pages 25–32,
Prague, Czech Republic. Association for Com-
putational Linguistics.
Vladimir Vapnik. 1992. Principles of risk mini-
mization for learning theory. Advances in Neu-
ral Information Processing Systems.
Elena Voita, Pavel Serdyukov, Rico Sennrich,
and Ivan Titov. 2018. Context-aware neural
machine translation learns anaphora resolution.
In Proceedings of the 56th Annual Meeting
of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1264–1274.
https://doi.org/10.18653/v1/P18
-1117
Thuy Vu, Xuanli He, Dinh Phung, and Gholamreza Haffari. 2021a. Generalised unsupervised domain adaptation of neural machine translation with cross-lingual data selection. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3335–3346.
Tu Vu, Minh-Thang Luong, Quoc Le, Grady Simon, and Mohit Iyyer. 2021b. STraTA: Self-training with task augmentation for better few-shot learning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5715–5731. https://doi.org/10.18653/v1/2021.emnlp-main.462
Alex Wang, Jan Hula, Patrick Xia, Raghavendra
Pappagari, R. Thomas McCoy, Roma Patel,
Najoung Kim, Ian Tenney, Yinghui Huang,
Katherin Yu, Shuning Jin, Berlin Chen,
Benjamin Van Durme, Edouard Grave, Ellie
Pavlick, and Samuel R. Bowman. 2019a. Can
you tell me how to get past sesame street?
Sentence-level pretraining beyond language
modeling. In Proceedings of the 57th Annual
Meeting of the Association for Computational
Lingüística, pages 4465–4476. https://doi
.org/10.18653/v1/P19-1439
Alex Wang, Yada Pruksachatkun, Nikita Nangia,
Amanpreet Singh, Julian Michael, Felix Hill,
Omer Levy, and Samuel R. Bowman. 2019b.
SuperGLUE: A stickier benchmark for general-
purpose language understanding systems. arXiv:
1905.00537.
Alex Wang, Amanpreet Singh, Julian Michael,
Felix Hill, Omer Levy, and Samuel R.
Bowman. 2019c. GLUE: A multi-task bench-
mark and analysis platform for natural lan-
guage understanding. International Conference
on Learning Representations. https://doi
.org/10.18653/v1/W18-5446
Bailin Wang, Wenpeng Yin, Xi Victoria Lin, and Caiming Xiong. 2021. Learning to synthesize data for semantic parsing. In Proceedings of
the Meeting of the North-American Chapter
of Association for Computational Linguistics
(NAACL). https://doi.org/10.18653
/v1/2021.naacl-main.220
Ben Wang and Aran Komatsuzaki. 2021.
GPT-J-6B: A 6 billion parameter autoregres-
sive language model. https://github.com
/kingoflolz/mesh-transformer-jax.
William Yang Wang and Diyi Yang. 2015.
That’s so annoying!!!: A lexical and frame-
semantic embedding based data augmenta-
tion approach to automatic categorization of
annoying behaviors using #petpeeve tweets. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing,
pages 2557–2563. https://doi.org/10
.18653/v1/D15-1306
Xiaojie Wang, Rui Zhang, Yu Sun, and Jianzhong
Qi. 2018. KDGAN: Knowledge distillation with
generative adversarial networks. NeurIPS.
Yushi Wang, Jonathan Berant, and Percy Liang. 2015. Building a semantic parser overnight. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1332–1342, Beijing, China. Association for Computational Linguistics. https://doi.org/10.3115/v1/P15-1129
Colin Wei, Kendrick Shen, Yining Chen, and Tengyu Ma. 2021. Theoretical analysis of self-training with deep networks on unlabeled data. In International Conference on Learning Representations.
Jason Wei and Kai Zou. 2019. Eda: Easy data
augmentation techniques for boosting perfor-
mance on text classification tasks. In Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and
the 9th International Joint Conference on Nat-
ural Language Processing (EMNLP-IJCNLP),
pages 6382–6388. https://doi.org/10
.18653/v1/D19-1670
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45.
https://doi.org/10.18653/v1/2020
.emnlp-demos.6
Xing Wu, Shangwen Lv, Liangjun Zang,
Jizhong Han, and Songlin Hu. 2019. Condi-
tional BERT contextual augmentation. International Conference on Computational Science, pages 84–95, Springer. https://doi.org
/10.1007/978-3-030-22747-0_7
Qizhe Xie, Minh-Thang Luong, Eduard Hovy,
and Quoc V. Le. 2020. Self-training with
noisy student
improves imagenet classifica-
ción. 2020 IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR),
pages 10684–10695. https://doi.org
/10.1109/CVPR42600.2020.01070
Canwen Xu, Wangchunshu Zhou, Tao Ge, Furu
Wei, and Ming Zhou. 2020. Bert-of-theseus:
Compressing BERT by progressive module re-
placing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7859–7869.
Yiben Yang, Chaitanya Malaviya,
Jared
Fernandez, Swabha Swayamdipta, Ronan Le
Bras, Ji-Ping Wang, Chandra Bhagavatula,
Yejin Choi, and Doug Downey. 2020. G-daug:
Generative data augmentation for common-
sense reasoning. arXiv:2004.11546. https://
doi.org/10.18653/v1/2020.findings
-emnlp.90
David Yarowsky. 1995. Unsupervised word sense
disambiguation rivaling supervised methods.
33rd Annual Meeting of the Association for
Computational Linguistics, pages 189–196.
https://doi.org/10.3115/981658.981684
Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V. Le. 2018. QANet: Combining local convolution with global self-attention for reading comprehension. ICLR.
Hongyi Zhang, Moustapha Cisse, Yann N.
Dauphin, and David Lopez-Paz. 2018. mixup:
Beyond empirical risk minimization. ICLR.
Linfeng Zhang, Jiebo Song, Anni Gao, Jingwei
Chen, Chenglong Bao, and Kaisheng Ma. 2019.
Be your own teacher: Improve the perfor-
mance of convolutional neural networks via
self distillation. Proceedings of the IEEE/CVF
International Conference on Computer Vision,
pages 3713–3722. https://doi.org/10
.1109/ICCV.2019.00381
A Appendices
A.1 Datasets
The statistics of GLUE are reported in Table 8.
A.2 GPT-2 for Classification
We have conducted additional experiments, where we fine-tune GPT-2 as a classifier. We have considered two variants of the GPT-2 model. The first variant is the original GPT-2 model (GPT-2-original), pre-trained on open-domain text. The second variant was fine-tuned on the inputs of each task separately (GPT-2-finetuned). This model was used to generate task-specific (synthetic) unlabeled data.
Finally, we also consider self-training with GAL on top of GPT-2-original. Specifically, we use the GPT-2-finetuned model to synthesize 40x in-domain unlabeled data. Then we apply self-training to GPT-2-original, where the training data is a combination of the original labeled data and the pseudo-labeled synthetic data. Table 9 suggests that the gains of GAL come from the pseudo-labeled synthetic data, i.e., both the synthetic unlabeled data and the teacher's knowledge. Without the generation of synthetic unlabeled data, the domain-specific knowledge embedded in the GPT-2-finetuned model cannot be utilized. As such, the GPT-2-finetuned model is inferior to the GPT-2-original model. Since RoBERTa-large is superior to the GPT-2 models, RoBERTa-large+GAL also significantly outperforms the GPT-2 counterpart.
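To make the setup above concrete, the following minimal sketch mirrors the generate-annotate-train loop using Hugging Face pipelines. The checkpoint names are generic placeholders rather than the models trained in this work, and a hard label stands in for the soft pseudo labels GAL actually uses.

from transformers import pipeline

# Stand-ins for GPT-2-finetuned (the task-specific generator) and the current
# best classifier (the teacher); these checkpoint names are generic
# placeholders, not the models trained in this paper.
generator = pipeline("text-generation", model="gpt2")
teacher = pipeline("text-classification",
                   model="distilbert-base-uncased-finetuned-sst-2-english")

# 1) Synthesize in-domain unlabeled text (the paper generates roughly 40x the
#    labeled set; eight samples keep the sketch small).
samples = generator("<|endoftext|>", max_new_tokens=48, do_sample=True,
                    top_p=0.95, num_return_sequences=8)
synthetic = [s["generated_text"] for s in samples]

# 2) Annotate the synthetic text with the teacher. GAL uses soft pseudo labels
#    (full class distributions); a hard label keeps this sketch short.
pseudo_labeled = [(text, teacher(text, truncation=True)[0]["label"])
                  for text in synthetic]

# 3) Self-training: retrain the student (here, the GPT-2-original classifier)
#    on the union of the original labeled data and the pseudo-labeled data.
labeled_data = [("a great movie", "POSITIVE")]  # placeholder labeled set
train_set = labeled_data + pseudo_labeled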
A.3 Importance of Pseudo-labels
We have argued and demonstrated that using
class-conditional generative models to generate
Dataset  task                           domain               #train  #dev  #test  #classes
SST-2    sentiment analysis             movie reviews        67k     872   1.8k   2
QQP      paraphrase                     social QA questions  364k    40k   391k   2
QNLI     QA/natural language inference  Wikipedia            105k    5k    5.4k   2
RTE      natural language inference     news, Wikipedia      2.5k    277   3k     2
MNLI     natural language inference     misc.                393k    20k   20k    3
MRPC     paraphrase                     news                 3.7k    408   1.7k   2
CoLA     acceptability                  misc.                8.5k    1043  1k     2
STS-B    sentence similarity            misc.                5.8k    1.5k  1.4k   −

Table 8: Summary of the three sets of tasks used for evaluation of GAL. STS-B is a regression task, so #classes is not applicable.
Model                 MNLI       CoLA  SST-2  MRPC       STS-B      QQP        QNLI  RTE   Avg
GPT-2-original        85.9/85.6  54.8  94.5   86.9/82.2  86.3/85.2  72.5/89.3  91.2  69.8  80.9
GPT-2-finetuned       85.8/85.5  40.9  94.5   87.0/81.0  85.6/84.3  71.4/88.5  91.5  69.0  78.8
GPT-2-original+GAL    86.2/85.8  55.7  94.7   87.9/83.4  86.9/85.9  72.6/89.4  91.9  70.6  81.5
RoBERTa-large         90.1/89.7  63.8  96.1   91.2/88.3  90.9/90.7  72.5/89.6  94.5  85.9  86.5
RoBERTa-large + GAL   90.2/89.8  66.2  96.4   92.0/89.2  90.7/90.5  73.6/89.9  95.0  86.3  87.1

Table 9: GLUE test results of using GPT-2 and RoBERTa-large as classification models.
Label type          Accuracy  F1    Precision  Recall
GPT2                86.0      87.0  88.7       85.5
RoBERTa             90.0      91.4  100.0      84.1
conditioning label  72.0      71.4  66.0       77.8

Table 10: Performance of GPT2 annotation, RoBERTa annotation, and conditioning labels on 100 random examples from the synthetic RTE dataset generated by a class-conditional LM.
labeled synthetic examples is less effective than GAL in Sections 3 and 5. To further verify this argument, we sample 100 instances from the synthetic RTE dataset generated by the label-prompted GPT2, as the class-conditional LM. Then we annotate these examples using a human annotator, the GPT2 classifier, and the RoBERTa classifier. Finally, we compute the Accuracy, F1, Precision, and Recall scores between human labels and GPT2 labels, between human labels and RoBERTa labels, and between human labels and the conditioning labels used by GPT2 when the data was generated. Table 10 shows that the class-conditional LM has difficulty generating sentences that retain the semantics or pragmatics of a specified category, which also corroborates our theoretical analysis in Section 3. On the other hand, discriminative models, such as the GPT2 classifier and the RoBERTa classifier, are able to produce higher quality labels that correlate better with human annotations.
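For concreteness, the agreement scores above can be computed as in the following sketch, which uses scikit-learn and small hypothetical label lists in place of the actual 100 annotated RTE examples.

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical binary RTE labels (1 = entailment, 0 = not entailment) for the
# same sampled synthetic sentences; the actual study uses 100 examples.
human     = [1, 0, 1, 1, 0, 1]   # human annotations (treated as ground truth)
gpt2_pred = [1, 0, 1, 0, 0, 1]   # labels produced by the GPT2 classifier

def agreement(reference, predicted):
    # Score one labelling of the synthetic sentences against the human labels.
    return {"accuracy":  accuracy_score(reference, predicted),
            "f1":        f1_score(reference, predicted),
            "precision": precision_score(reference, predicted),
            "recall":    recall_score(reference, predicted)}

print(agreement(human, gpt2_pred))
# Repeating the call with RoBERTa labels and with the conditioning labels used
# at generation time fills the three rows of Table 10.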
A.4 Generated Unlabeled Examples
Annotated with Pseudo Labels
We provide some synthetic sentences generated
by GAL in Tables 11 y 12.
are more deeply thought through than in most ‘ right-thinking ’ films (positive)
KNN:
1: is far more sophisticated, insightful and thought-provoking than his previous films .
(positive)
2: is more sophisticated than its more obvious and less-than-dazzling counterparts (positive)
3: is about as well-thought as the idea of a bad hair day, (negative)
contains no wit, only labored gags (negative)
KNN:
1: lacks insight, and lacks empathy (negative)
2: has little humor or intelligence (negative)
3: lacks all wit and humanity (negative)
Table 11: SST-2: Two labeled examples, along with 3 nearest neighbors (based on RoBERTa representations) from our synthetic dataset. We include labels for original examples and pseudo-labels for synthetic examples in parentheses.
How is the life of a math student? Could you describe your own experiences? [SEP] Which
level of prepration is enough for the exam jlpt5? (not duplicated)
KNN:
1: What are the best courses for a mechanical engineering student? [SEP] What is the best
course to do after completing a B.Tech in mechanical engineering? (not duplicated)
2: How much marks are needed to get through the GATE with electronics? [SEP] What is the
average score of the Gate EE exam? What are the cut-offs? (not duplicated)
3: What is the best time table for students to prepare for IAS? [SEP] How can one study for
IAS in a best time? (not duplicated)
How does an IQ test work and what is determined from an IQ test? [SEP] How does IQ test
work? (duplicated)
KNN:
1: What is the average IQ of the U.S. population? [SEP] How does an IQ test work? (not
duplicated)
2: Is the Iq test an effective way to measure intelligence? [SEP] How do IQ tests work?
(duplicated)
3: How is an IQ test on a scale from 1 to 100 scored? [SEP] How do you get your IQ tested?
(not duplicated)
Table 12: QQP: Two labeled examples, along with 3 nearest neighbors (based on RoBERTa representations) from our synthetic dataset. We include labels for original examples and pseudo-labels for synthetic examples in parentheses.
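The nearest neighbors shown in Tables 11 and 12 are retrieved using RoBERTa representations. The sketch below illustrates one way to do this; mean-pooled hidden states and cosine similarity are our assumptions, as the paper does not pin down these details, and the example sentences are taken from Table 11.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base")

def embed(sentences):
    # Mean-pool the last hidden states over non-padding tokens.
    batch = tokenizer(sentences, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(1) / mask.sum(1)               # (B, H)

query = ["contains no wit, only labored gags"]                # labeled example
pool = ["lacks insight, and lacks empathy",                   # synthetic pool
        "has little humor or intelligence",
        "lacks all wit and humanity",
        "is about as well-thought as the idea of a bad hair day,"]

scores = torch.nn.functional.cosine_similarity(embed(query), embed(pool))
nearest = scores.topk(k=3).indices.tolist()                   # 3 nearest neighbors
print([pool[i] for i in nearest])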