Generate, Annotate, and Learn: NLP with Synthetic Text

Xuanli He1  Islam Nassar1  Jamie Kiros2  Gholamreza Haffari1  Mohammad Norouzi2
1Monash University, Australia

2Google Research, Brain Team, Canada

{xuanli.he1, gholamreza.haffari}@monash.edu, mnorouzi@google.com

Abstract

This paper studies the use of language mod-
els as a source of synthetic unlabeled text for
NLP. We formulate a general framework called
‘‘generate, annotate, and learn (GAL)’’ to
take advantage of synthetic text within knowl-
edge distillation, self-training, and few-shot
learning applications. To generate high-quality
task-specific text, we either fine-tune LMs on
inputs from the task of interest, or prompt
large LMs with few examples. We use the best
available classifier to annotate synthetic text
with soft pseudo labels for knowledge distilla-
tion and self-training, and use LMs to obtain
hard labels for few-shot learning. We train new
supervised models on the combination of la-
beled and pseudo-labeled data, which results
in significant gains across several applications.
We investigate key components of GAL and
present theoretical and empirical arguments
against the use of class-conditional LMs to
generate synthetic labeled text instead of unla-
beled text. GAL achieves new state-of-the-art
knowledge distillation results for 6-layer trans-
formers on the GLUE leaderboard.

1 Introduction

There is an abundance of unlabeled data in the real world, but task-specific unlabeled data within the scope of a given machine learning problem can be challenging to find. For example, one cannot easily find in-domain unlabeled text conforming to the input distribution of a specific Natural Language Processing (NLP) task from the GLUE benchmark (Wang et al., 2019c). Some NLP tasks require an input comprising a pair of sentences with a particular relationship between them. Moreover, classification datasets typically represent a tailored distribution of data and only include a limited number of class labels. If task-specific unlabeled data were available, one could adopt self-training (Yarowsky, 1995) to automatically annotate unlabeled data with pseudo labels to improve accuracy and robustness of classifiers (Xie et al., 2020; Carmon et al., 2019). Furthermore, one can use
knowledge distillation (Hinton et al., 2015) on fresh task-specific unlabeled data to more effectively compress deep neural networks and ensembles (Buciluǎ et al., 2006; Chen et al., 2020a).

In the absence of task-specific unlabeled data, one could retrieve unlabeled examples from a large and diverse open-domain dataset (Du et al., 2020). However, such a retrieval-based approach may not scale to problems with complex input schemes, for example, sentence pairs with certain relations. Recent work (Yang et al., 2020; Kumar et al., 2020b) has considered the use of Language Models (LMs) like GPT-2 (Radford et al., 2019) as a means of data augmentation, showing the effectiveness of this approach for commonsense reasoning and classification tasks. Existing approaches often consider class-conditional generation, where the synthetic data is produced by conditioning on a specified class label. However, it is unclear whether class-conditional generation is best suited for NLP tasks. Furthermore, existing pipelines often make synthetic data generation complicated, as one needs to detect and discard low-quality synthetic labeled data or optionally re-label data (Yang et al., 2020; Vu et al., 2021b). For example, Kumar et al. (2020b) observe that it is difficult for sentences generated by label-conditioned GPT-2 to retain the semantics/pragmatics of the conditioning label, leading to poor performance on downstream tasks.

We unify and simplify existing work on LMs
as a data source for NLP and develop a general
framework called ‘‘generate, annotate, and learn (GAL)’’. The generality of GAL allows us to use
LM-generated synthetic data within novel appli-
cations such as Knowledge Distillation (KD) et
few-shot learning. GAL builds on recent advances
in text generation (Radford et al., 2019; Gao
et coll., 2021) and uses powerful LMs to synthesize
task-specific unlabeled text by fine-tuning or con-
ditioning a large LM on in-distribution examples.
We use state-of-the-art classifiers to annotate gen-
erated text with soft pseudo labels when possible.

We then combine labeled data and pseudo-labeled
data to train more effective supervised models,
resulting in significant gains on a range of NLP
tasks like KD and few-shot learning.

We present a justification for GAL based on
the empirical and vicinal risk minimization frameworks (Vapnik, 1992; Chapelle et al., 2001). We
also investigate key components of GAL. We find
that even if class-conditional LMs are available
for text generation, it is more effective to discard
the conditioning labels and let the teacher models
produce pseudo labels. This observation is sup-
ported by our theoretical and empirical results.
Accordingly, in contrast to prior work (Yang et al.,
2020; Vu et al., 2021b), we advocate for the use
of simple unconditional LMs for text synthesis.
Further, we avoid any form of data filtering. Not
surprisingly, we find that the diversity of synthetic
text matters. That said, simple unconditional gen-
eration given random seeds provides sufficient
diversity, and crafting diverse LM prompts is
not needed.
In summary:

• We develop GAL, a simple and effective
approach to the use of LMs for task-specific
unlabeled text generation. We show that GAL
can be used effectively for KD, self-training,
and few-shot learning in NLP.

• We present theoretical and empirical investi-
gations for GAL, explaining why it works and
why using class-conditional LMs to generate
synthetic labeled data is not as effective.

• GAL advances KD for NLP and establishes a new state-of-the-art (SoTA) result for a single 6-layer transformer on the GLUE test set. It further improves prompt-based few-shot learning, providing an average improvement of 1.3% on four 4-shot learning NLP tasks, outperforming GPT-3-6B.

2 Related Work

Data synthesis with large pre-trained language
models is closely related to our work (Kumar
et al., 2020b; Yang et al., 2020; Vu et al., 2021b;
Norouzi et al., 2020). Yang et al. (2020) propose
a complex scheme, including label-conditioned
data generation, data relabeling, data filtering,

and two-stage training, to utilize synthetic data.
In contrast, we show that a simple mixture of the original data and synthetic unconditionally generated data can provide sizable gains. Furthermore,
we show a broader use of generative models on KD
and few-shot learning. Vu et al. (2021b) take a task
augmentation approach and employ conditional
generation to produce in-domain synthetic data
for an auxiliary natural language inference (NLI) task,
which is then used to initialize the target-task clas-
sifier. However, not all tasks (e.g., grammatical acceptability judgments) can benefit from the NLI-style auxiliary task (Wang et al., 2019a). We aim
to directly generate the unlabeled in-domain data
for the target task. Unlike Norouzi et al. (2020),
we do not use instance-based generative models.
More broadly, there has been a recent surge in data synthesis and augmentation in NLP, including rule-based and model-based approaches; see Feng et al. (2021) for a recent survey. Data synthesis with grammars has been explored in semantic parsing and natural language understanding (e.g.,
see Wang et al., 2015, 2021; Marzoev et al.,
2020). Existing approaches to data augmentation
for NLP include lexicon replacement, sentence re-
trieval, and round-trip machine translation (Wang
and Yang, 2015; Yu et al., 2018; Kobayashi,
2018; Wu et al., 2019; Lichtarge et al., 2019; Wei
and Zou, 2019; Alberti et al., 2019; Du et al.,
2020; Shen et al., 2020). We, instead, propose the use of unconditional autoregressive LMs for data augmentation. This is simple, flexible, and powerful.

Self-training is one of the oldest approaches
for semi-supervised learning (Scudder, 1965;
Fralick, 1967; Agrawala, 1970; Yarowsky, 1995;
Eisner and Karakos, 2005; Ueffing et al., 2007;
Du et al., 2020). Abney (2004) and Haffari and
Sarkar (2007) have theoretically analyzed self-
training for simple decision lists. Recent theoreti-
cal work analyzes self-training for linear models,
often under the assumption that the data distri-
bution is (nearly) Gaussian (Carmon et al., 2019;
Raghunathan et al., 2020; Chen et al., 2020b;
Kumar et al., 2020a; Oymak and Gulcu, 2020).
Wei et al. (2021) prove that, under ‘‘expansion’’
and ‘‘class separation’’ assumptions, self-training
can lead to more accurate neural network classi-
fiers. We present a theoretical framing of GAL in
terms of empirical and vicinal risk minimization
(Vapnik, 1992; Chapelle et al., 2001).

Knowledge Distillation (KD) (Buciluǎ et al., 2006; Hinton et al., 2015) uses a procedure similar to self-training to distill knowledge of an expressive teacher model into a smaller student model. In contrast, self-distillation (Furlanello et al., 2018; Zhang et al., 2019; Mobahi et al., 2020) uses teacher and student models of equal size, hoping to iteratively refine class labels. Previous work uses unlabeled data (Buciluǎ et al., 2006) and adversarial training (Wang et al., 2018) to improve KD. We demonstrate that synthetic data generated by unconditional generative models can improve KD on NLP, outperforming strong KD baselines, which often add more complexity and additional hyperparameters (e.g., Sun et al., 2019a; Jiao et al., 2019; Xu et al., 2020; Rashid et al., 2021).

3 Generate, Annotate, and Learn (GAL)

Given a labeled dataset L = {(x_i, y_i)}_{i=1}^N, we first train an unconditional domain-specific generative model g(x) on Lx = {x_i}_{i=1}^N, and then use it to synthesize unlabeled data. Such synthetic unlabeled data is used within self-training and KD even in the absence of in-domain unlabeled data. We restrict our attention to basic KD and self-training methods, even though GAL can be combined with more sophisticated semi-supervised techniques, too.

The effectiveness of GAL depends on the fi-
delity and diversity of synthetic examples. If we
had access to the oracle generative process, we
would be able to obtain the best KD and SSL
results, as if we had access to real task-specific
unlabeled data. Our preliminary experiments sug-
gest that large language models are particularly
effective within the GAL framework. Thus, as
shown in Figure 1, to build the best domain-
specific language model, we adopt a large lan-
guage model pretrained on lots of open-domain
text, and fine-tune it on a given dataset’s inputs,
that is, Lx, ignoring class labels. Both our theory
and ablations confirm that ignoring class labels is
a good idea (cf. Sections 4 and 5). Transferring
the knowledge of large language models is par-
ticularly beneficial when a small input dataset Lx
of text is available (Hernandez et al., 2021).

Figure 1: An illustration of GAL for NLP. We use open-domain data once for self-supervised pretraining (e.g., BERT) and once for training a large LM (e.g., GPT-2). BERT is fine-tuned on labeled data to yield a classifier for the task of interest. GPT-2 is fine-tuned on the same data without labels to obtain an unconditional task-specific LM, which is used to generate lots of synthetic in-domain unlabeled data for self-training and KD.

To improve computational efficiency of GAL, we do not generate unlabeled data on the fly, but generate as many unconditional samples as possible and store them in a synthetic unlabeled dataset U. We use soft pseudo labels within self-training and KD, as we empirically found them to be more effective than hard labels on synthetic data.
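
To make the annotation step concrete, the following is a minimal sketch of producing soft pseudo labels offline (our illustration, not the authors' released code); teacher is assumed to be any fine-tuned classifier returning HuggingFace-style outputs with a .logits field, and encode stands for its tokenizer.

import torch
import torch.nn.functional as F

@torch.no_grad()
def annotate_with_soft_labels(teacher, encode, synthetic_texts, batch_size=32):
    # Run the best available classifier over the stored synthetic corpus U and keep
    # the full softmax distribution (soft pseudo labels) rather than the argmax.
    teacher.eval()
    soft_labels = []
    for start in range(0, len(synthetic_texts), batch_size):
        batch = synthetic_texts[start:start + batch_size]
        inputs = encode(batch)               # e.g., tokenizer(batch, return_tensors="pt", padding=True)
        logits = teacher(**inputs).logits    # shape [batch_size, num_classes]
        soft_labels.append(F.softmax(logits, dim=-1).cpu())
    return torch.cat(soft_labels)            # stored alongside U for KD / self-training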

3.1 Knowledge Distillation with GAL

KD distills knowledge of an expressive teacher model into a smaller student model (Hinton et al., 2015). We pose the following objective function for KD with labeled and synthetic unlabeled data:

ℓ_kd = λ E_{(x,y)∼L} H(y, fs(x)) + (1 − λ) E_{x̃∼g(x)} H(h(x̃), fs(x̃)) ,   (1)

where h is the teacher model, fs is the student model, and g is the large pre-trained language model (e.g., GPT-2) fine-tuned on the text of the training data Lx. H(q, p) = −q⊤ log p is the softmax cross entropy loss. Note the use of g(x), approximating the unknown real data distribution P(x) in (1). Algorithm 1 summarizes the GAL-KD process.
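
As a concrete reading of (1), here is a minimal PyTorch-style sketch of the mixed objective (a sketch under our reading of the paper, not the released implementation). The function and variable names are illustrative; the default lam = 0.2 mirrors the 1:4 labeled-to-synthetic mixing ratio reported later in Section 5.1.

import torch
import torch.nn.functional as F

def gal_kd_loss(student_logits_l, labels, student_logits_u, teacher_probs_u, lam=0.2):
    # Labeled term of Eq. (1): H(y, f_s(x)) on examples drawn from L.
    supervised = F.cross_entropy(student_logits_l, labels)
    # Synthetic term of Eq. (1): H(h(x~), f_s(x~)) against the teacher's soft pseudo labels.
    distill = -(teacher_probs_u * F.log_softmax(student_logits_u, dim=-1)).sum(dim=-1).mean()
    return lam * supervised + (1.0 - lam) * distill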

3.2 Self-Training with GAL

Self-training encourages knowledge transfer between a teacher and a student model in such a way that the student can outperform the teacher. Algorithm 2 summarizes the GAL-self-training process. Given the labeled dataset L and the synthetic unlabeled dataset U, an initial model denoted f1 is trained using supervised learning on the labeled dataset L. Then, at iteration t, one adopts
ft as the teacher model to annotate the unlabeled dataset U using pseudo labels. In self-training GAL, the student model ft+1 is trained to optimize a classification loss on the combination of L and U:

ℓ_{t+1} = λ E_{(x,y)∼L} H(y, ft+1(x)) + (1 − λ) E_{x̃∼g(x)} H(ft(x̃), ft+1(x̃)) ,   (2)

where λ = 0.5 unless stated otherwise. Although many different variants of the basic self-training algorithm discussed above exist in the literature, we adopt the simplest variant of self-training and limit hyperparameter tuning to a bare minimum.

3.3 Domain-Specific Text Generation

We take a pretrained GPT-2 language model (Radford et al., 2019) and fine-tune it separately on each dataset of interest after removing class labels. We find that training from scratch on these datasets is hopeless, but the larger the pretrained GPT-2 variant, the better the validation perplexity scores are. For tasks modeling a relationship between multiple sentences, we concatenate consecutive sentences with a separator token [SEP]. To alleviate over-fitting on the training set, we use the best checkpoint evaluated on the dev set as our generation engine. Once a fine-tuned GPT-2 model is obtained, we generate new domain-specific data by using top-k random sampling similar to Radford et al. (2019). We do not feed any prompt to the LM, but a special [BOS] token to initiate the generation chain. A generation episode is terminated when a special [EOS] token is produced. We generate diverse sentences by varying the random seed. After collecting enough synthetic data, we only retain unique sentences. For tasks with α input sentences, we discard generated samples that violate this constraint (around 10% of samples were rejected). Finally, we obtain task-specific synthetic data up to 40× larger than the original training sets. For some samples of generated text for GLUE, see Tables 11 and 12. We believe using bigger LMs and larger synthetic datasets would improve our results further, but we are constrained by computational resources.
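
The generation procedure above can be sketched with the HuggingFace transformers API as follows (an illustrative sketch, not the authors' code). The checkpoint path "gpt2-finetuned", the value of top_k, and the length limit are placeholders; we assume the [BOS], [SEP], and [EOS] special tokens were added to the tokenizer during fine-tuning.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2-finetuned")   # fine-tuned, task-specific checkpoint
lm = GPT2LMHeadModel.from_pretrained("gpt2-finetuned").eval()

def sample_synthetic(num_samples, num_segments, top_k=40, max_len=128, seed=0):
    torch.manual_seed(seed)                              # diversity comes from varying the seed
    bos = tok.convert_tokens_to_ids("[BOS]")
    eos = tok.convert_tokens_to_ids("[EOS]")
    with torch.no_grad():
        outputs = lm.generate(
            torch.full((num_samples, 1), bos, dtype=torch.long),  # no prompt, only [BOS]
            do_sample=True, top_k=top_k, max_length=max_len,
            eos_token_id=eos, pad_token_id=eos,
        )
    texts = {tok.decode(o) for o in outputs}             # keep unique sentences only
    # Discard samples whose number of [SEP]-separated segments does not match the task.
    return [t for t in texts if t.count("[SEP]") == num_segments - 1]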

4 An Empirical Risk Minimization Perspective

In supervised learning, one seeks to learn a mapping f that, given an input x, predicts a reasonable output y. To define the supervised learning problem formally, one assumes that input-output pairs are drawn from a joint distribution P, namely, (x, y) ∼ P(x, y), and a loss function H(y, f(x)) is used to assess the quality of a mapping f. This loss is used to define a notion of expected risk:

R(f) = E_{P(x,y)} H(y, f(x)) .   (3)

In almost all practical applications P(x, y) is unknown. Thus, a labeled dataset of examples L = {(x_i, y_i)}_{i=1}^N is used to approximate R(f) as

R̂(f) = (1/N) Σ_{i=1}^{N} H(y_i, f(x_i)) .   (4)

This objective function is known as empirical risk, and learning f through minimizing R̂(f) is
known as the empirical risk minimization principle (Vapnik, 1992). To compensate for the finite sample size in (4), one typically combines R̂(f) with a regularizer to improve generalization.

Beyond Empirical Risk Minimization. Empirical risk minimization (4) is motivated as a way to approximate P(x, y) through a set of Dirac delta functions on labeled examples: P̂(x, y) = Σ_i δ(x = x_i, y = y_i)/N. However, this approximation is far from perfect, hence one uses a heldout validation set for early stopping and hyperparameter tuning.

Vicinal risk minimization (Chapelle et al., 2001) approximates expected risk as E_{Pν(x,y)} H(y, f(x)), using a vicinity distribution, for example ν(x̃, ỹ | x, y) = N(x̃ − x, σ²) δ(ỹ = y), to approximate P(x, y) as

Pν(x, y) = (1/N) Σ_{i=1}^{N} ν(x̃ = x, ỹ = y | x_i, y_i) .   (5)

The goal is to increase the support of each labeled data point and improve the quality and robustness of the risk function.

Recent work on mixup regularization (Zhang
et coll., 2018) proposes an effective way to con-
struct another vicinity distribution by interpolating
between two data points and their labels. Despite
their simplicity, these smoothing techniques tend
to improve matters.
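
For instance, mixup can be read as a vicinity distribution (our paraphrase of Zhang et al., 2018) that draws

x̃ = μ x_i + (1 − μ) x_j ,   ỹ = μ y_i + (1 − μ) y_j ,   μ ∼ Beta(α, α),

so each synthetic pair interpolates both the inputs and the labels of two training examples.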

Generative Models for Risk Minimization. One can factorize the joint distribution of input-output pairs as P(x, y) = P(x) P(y | x). Accordingly, if one is able to learn a reasonable unconditional generative model of x, denoted g(x), then one can draw a pair (x, y) by first drawing x ∼ g(x) and then using the current instance of ft to draw y ∼ ft(x). Then, one can use ft and g to approximate expected risk as

Rt(ft+1) = E_{x∼g(x)} E_{y∼ft(x)} H(y, ft+1(x)) .   (6)

The quality of this approximation highly depends on the quality of ft and g. If ft is far from an optimal classifier f* or g(x) is far from P(x), (6) yields a poor approximation.

The expected risk in (6) smoothens the risk landscape in complex ways beyond simple Gaussian smoothing and interpolation. This smoothing is applicable to any continuous, discrete, or structured domain as long as expressive generative models of P(x) are available. That said, for almost all reasonable loss functions H (e.g., softmax cross entropy and squared error), (6) is minimized when ft+1 = ft, which is not ideal, especially when ft is far from f*. On the other hand, empirical risk (4) anchors the problem in real labeled examples that are provided as ground truth. GAL-self-training aims to combine the benefits of (4) and (6) via:

Rt(ft+1) = (λ/N) Σ_{i=1}^{N} H(y_i, ft+1(x_i)) + (1 − λ) E_{x∼g(x)} E_{y∼ft(x)} H(y, ft+1(x)) .   (7)

In this formulation, if ft represents the minimizer of empirical risk (4), then ft+1 = ft is the minimizer of (7), too. However, one does not seek the global minimizer of empirical risk, but rather the best performance on heldout data. If ft is obtained by stochastic gradient descent on any risk function, but early-stopped according to empirical risk on a heldout set, then using such ft in (7) to define Rt(ft+1) promotes the selection of a mapping ft+1 that minimizes empirical risk while staying close to the best performing mapping so far (i.e., ft). This formulation motivates self-training and GAL as regularizers in the functional space and explains why they can conceivably work. Although the arguments are provided here for GAL-self-training, extending them to GAL-KD is straightforward (omitted due to space constraints).

How About Class-conditional Generative Models? One can also factorize the joint distribution P(x, y) as P(y) P(x | y) and accordingly utilize a class-conditional generative model g(x | y) to derive the following expected risk formulation:

R(f) = E_{y∼P(y)} E_{x∼g(x|y)} H(y, f(x)) .   (8)

In this setting pseudo labeling is not needed, as synthetic data is already labeled. One can show that the optimal classifier f*_g that minimizes (8) for the cross-entropy loss is given by

f*_g(y | x) = g(x | y) P(y) / Σ_{y′} g(x | y′) P(y′) ,   (9)

that is, turning the class-conditional generative model into a classifier by using the Bayes rule yields the optimal solution.
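
As a sanity check of (9) (a short derivation we add here, not spelled out in the text): for the cross-entropy loss, H(y, f(x)) = − log f(y | x), so (8) becomes R(f) = − ∫ Σ_y g(x | y) P(y) log f(y | x) dx. For each fixed x, maximizing Σ_y g(x | y) P(y) log f(y | x) subject to Σ_y f(y | x) = 1 (e.g., via a Lagrange multiplier) gives f(y | x) ∝ g(x | y) P(y), which normalizes to (9).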


Model                | MNLI (m/mm) | CoLA | SST-2 | MRPC      | STS-B     | QQP       | QNLI | RTE  | Avg
Previous work:
BERT-Theseus         | 82.4/82.1   | 47.8 | 92.2  | 87.6/83.2 | 85.6/84.1 | 71.6/89.3 | 89.6 | 66.2 | 78.6
BERT-PKD             | 81.5/81.0   | −    | 92.0  | 85.0/79.9 | −         | 70.7/88.9 | 89.0 | 65.5 | −
tinyBERT             | 84.6/83.2   | 51.1 | 93.1  | 87.3/82.6 | 85.0/83.7 | 71.6/89.1 | 90.4 | 70.0 | 79.8
MATE-KD              | 86.2/85.6   | 58.6 | 95.1  | 91.2/88.1 | 88.5/88.4 | 73.0/89.7 | 92.4 | 76.6 | 83.5
Our results:
DistilRoBERTa        | 83.8/83.4   | 55.9 | 93.2  | 87.4/83.1 | 87.5/87.5 | 71.7/89.1 | 90.6 | 73.3 | 81.2
DistilRoBERTa + KD   | 84.5/84.1   | 53.0 | 93.5  | 88.9/85.1 | 88.0/87.4 | 71.9/89.2 | 91.0 | 75.0 | 81.5
DistilRoBERTa + WS   | 86.2/85.9   | 52.2 | 94.0  | 89.9/86.4 | 88.7/88.3 | 71.7/89.2 | 91.5 | 76.2 | 82.1
DistilRoBERTa + RT   | 86.2/85.6   | 55.0 | 94.9  | 90.1/86.5 | 89.2/88.9 | 72.5/89.7 | 92.1 | 77.2 | 82.9
DistilRoBERTa + GAL  | 86.9/86.4   | 58.6 | 95.3  | 91.6/88.7 | 89.9/89.5 | 73.0/89.9 | 92.7 | 79.7 | 84.3

Table 1: GLUE test results for a 6-layer transformer. GAL establishes a new state of the art on KD for NLP. Baselines: BERT-Theseus (Xu et al., 2020), BERT-PKD (Sun et al., 2019a), tinyBERT (Jiao et al., 2019), MATE-KD (Rashid et al., 2021), DistilRoBERTa (Sanh et al., 2019), and DistilRoBERTa + KD (standard KD), DistilRoBERTa + WS (word substitution), and DistilRoBERTa + RT (round-trip translation). MNLI-m and MNLI-mm indicate matched and mismatched, respectively.

Provided that the accuracy of generative classifiers on text classification is behind their discriminative counterparts (e.g., Ravuri and Vinyals, 2019), we think substituting (8) into (7) is not a good idea. Essentially, by substituting (8) into the classification objective, one is regularizing f to remain close to f*_g, which is not an effective strategy if f*_g is not competitive. This argument corroborates the evidence from our ablation studies and recent work showing that using class-conditional generative models to augment supervised learning does not provide big gains (Ravuri and Vinyals, 2019).

That said, one can still use class-conditional generative models to synthesize high-fidelity samples. As long as these samples are treated as unlabeled examples and annotated using a classifier, for example ft, we believe this is a reasonable approach falling under GAL. Note that our argument above only applies to the scenario where class-conditional generative models are used to synthesize labeled examples. In other words, GAL emphasizes prediction of the labels in the course of the algorithm, rather than having the labels predefined. If one uses the unlabeled synthetic examples from class-conditional generative models, it still aligns with (7), which will be verified in Section 5.4.

5 Experiments

In this section, we assess the effectiveness of
GAL on KD, self-training, and few-shot learning.

5.1 State-of-the-art Results of Knowledge
Distillation with GAL on GLUE

We use the GLUE benchmark (Wang et al.,
2019c) for our KD experiments; see Appendix A.1
for benchmark details. Our synthetic unlabeled
dataset U includes 40× as many examples as the
original dataset for each task in GLUE.

It is known that KD on fresh data, unseen during training, performs better (Buciluǎ et al., 2006; Chen et al., 2020a) than KD on the original training data. Thus, we investigate the effectiveness of KD using generated unlabeled data through GAL. We use the HuggingFace implementation (Wolf et al., 2020) for KD experiments and adopt a standard experimental setup consistent with previous work (Sun et al., 2019a; Xu et al., 2020). Following Rashid et al. (2021), fine-tuned RoBERTa-large (24-layer transformer) represents the teacher and a DistilRoBERTa (6-layer transformer) (Sanh et al., 2019) is used as the student. We train the student model on U and L, where U is annotated by the best RoBERTa-large model, achieving an average score of 86.5. We then mix L and U at a ratio of 1:4, which is equivalent to λ = 0.2. This ratio works best on the dev set.
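
To see why the 1:4 ratio corresponds to λ = 0.2 (our arithmetic, implied by Eq. (1)): if each batch contains labeled and synthetic examples in a 1:4 proportion and per-example losses are averaged, the labeled term receives an effective weight of 1/(1 + 4) = 0.2 and the synthetic term 4/(1 + 4) = 0.8.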

Tableau 1 shows the results of individual 6-layer
transformers on the GLUE test set. All of the base-
lines use an identical student architecture. GAL
achieves the best entry on the GLUE leaderboard,
marking a new state-of-the-art for KD on NLP. It
outperforms strong KD baselines such as Distil-
RoBERTa (Sanh et al., 2019), BERT-PKD (Sun

Model                            | MNLI      | CoLA      | SST-2     | MRPC      | STS-B     | QQP       | QNLI      | RTE       | Avg
RoBERTa base                     | 87.7 ±0.1 | 63.6 ±0.4 | 94.8 ±0.1 | 90.1 ±0.4 | 90.8 ±0.1 | 91.5 ±0.1 | 92.6 ±0.1 | 78.8 ±0.4 | 86.2
+ GAL (iter 1)                   | 87.9 ±0.1 | 65.1 ±0.5 | 95.3 ±0.1 | 91.7 ±0.5 | 91.4 ±0.1 | 91.8 ±0.1 | 93.1 ±0.1 | 81.4 ±0.4 | 87.2
+ GAL (iter 2)                   | 88.0 ±0.1 | 65.2 ±0.5 | 95.3 ±0.1 | 92.2 ±0.4 | 91.5 ±0.1 | 91.7 ±0.1 | 93.2 ±0.1 | 82.4 ±0.5 | 87.4
+ GAL (iter 3)                   | 87.9 ±0.1 | 65.5 ±0.5 | 95.3 ±0.1 | 92.2 ±0.5 | 91.7 ±0.2 | 91.7 ±0.1 | 93.2 ±0.1 | 82.0 ±0.5 | 87.4
RoBERTa base + self-distillation | 88.1 ±0.1 | 63.7 ±0.5 | 95.2 ±0.1 | 90.3 ±0.4 | 90.4 ±0.1 | 91.5 ±0.1 | 93.1 ±0.1 | 79.7 ±0.5 | 86.5

Table 2: RoBERTa base and GAL self-training results on GLUE dev sets, averaged across 5 independent runs (numbers after ± indicate the error bar, i.e., standard deviation divided by √5).

et al., 2019a), BERT-Theseus (Xu et al., 2020),
tinyBERT (Jiao et al., 2019), and MATE-KD
(Rashid et al., 2021). It also outperforms our own
DistilRoBERTa+KD baseline, which learns from
soft labels produced by an identical RoBERTa-
large ensemble on the original labeled dataset.
While the use of soft labels outperforms the vanilla
fine-tuned DistilRoBERTa model, it significantly
underperforms our KD+GAL baseline. We also
compare with two strong data-augmentation base-
lines, round-trip translation (RT) (Yu et al., 2018;
Shleifer, 2019) and word substitutions (WS) (Jiao
et al., 2019; Wei and Zou, 2019). For RT, we gen-
erate 40× unlabeled data using German as the
bridge language (English→German→English). The
translations are generated via the best model in
WMT19 (Ng et al., 2019). We use the codebase
from Jiao et al. (2019) to conduct WS data aug-
mentation. We mirror the KD experimental setup
of GAL for both RT and WS. Although Distil-
RoBERTa+RT and DistilRoBERTa+WS are bet-
ter than vanilla DistilRoBERTa and KD variants,
they still drastically underperform our approach.

5.2 Self-Training with GAL on GLUE

We fine-tune a pretrained RoBERTa model pro-
vided by fairseq (Ott et al., 2019) on each GLUE
task. Fine-tuned RoBERTa serves as the first
teacher model for self-training. Each student
model is initialized with the original pretrained
RoBERTa and fine-tuned with exactly the same
hyperparameters as suggested by fairseq (Ott
et al., 2019). We combine the labeled dataset
L and the synthetic dataset U with a ratio of 1:1,
by oversampling labeled data. This corresponds
to λ = 0.5 in Eq. (7).
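
The resulting loop, i.e., Algorithm 2 as used in these experiments, can be sketched as follows; train_model and annotate are user-supplied placeholders (not functions from the paper's codebase) that fine-tune a fresh RoBERTa on a weighted data mixture and produce soft pseudo labels, respectively.

def gal_self_training(train_model, annotate, labeled, synthetic_texts,
                      num_iters=3, lam=0.5):
    # train_model(labeled, pseudo=None, lam=...) -> a classifier fine-tuned from the
    # pretrained checkpoint; annotate(model, texts) -> soft pseudo labels for U.
    teacher = train_model(labeled)                        # f_1: supervised baseline
    for _ in range(num_iters):
        pseudo = annotate(teacher, synthetic_texts)       # soft labels from f_t
        teacher = train_model(labeled, pseudo, lam=lam)   # f_{t+1} minimizes Eq. (2), lam = 0.5
    return teacher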

Tableau 2 shows that GAL provides an aver-
age improvement of +1.3% over RoBERTa-base.
We see consistent improvements with more GAL
iterations, but performance saturates after three
iterations. We further compare our approach with

a self-distillation (Furlanello et al., 2018) base-
line, in which the teacher and student models
use the same architecture and transfer knowledge
via the original labeled training set. Although
self-distillation provides a slight improvement,
the gains from GAL are more significant.

We delve deeper and combine GAL self-
training with RoBERTa-large and report test re-
sults for both single model and ensemble model
in Table 3. We observe consistent gains coming
from GAL on RoBERTa-large. Our results un-
derperform the latest and largest LMs from the
GLUE leaderboard, but we are optimistic that
GAL can be effectively combined with enormous
LMs to provide additional gains.

5.3 Prompt-based Few-shot Experiments

GPT3 (Brown et al., 2020) has introduced an
optimization-free paradigm for few-shot learning
for NLP. Without updating the parameters, large
LMs can correctly predict the labels of the in-
puts by conditioning on a prompt, which consists
of an instruction, a few labeled instances and a
new unlabeled input. We apply GAL to prompt-
based few-shot learning. Specifically, we present
k labeled examples as a prompt to GPT-J (Wang
and Komatsuzaki, 2021), an open-sourced re-
implementation of GPT-3-6B, and generate m
synthetic examples, followed by the correspond-
ing labels. Note that to mitigate noisy outputs, the
generation of each synthetic example only condi-
tions on the original k labeled examples. Finally,
we concatenate the original k examples and m
synthetic examples, and conduct a (k + m)-shot
learning experiment with GPT-J.
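
A rough sketch of this procedure with the public GPT-J checkpoint is given below (our illustration, not the authors' code); format_example, the sampling hyperparameters (top_p, max_new_tokens), and the prompt template are placeholders not specified in the paper.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
lm = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.float16)

def augment_prompt(k_shot_examples, format_example, m=12, max_new_tokens=64):
    # The k real shots; every synthetic example conditions only on these.
    prompt = "\n\n".join(format_example(e) for e in k_shot_examples)
    ids = tok(prompt + "\n\n", return_tensors="pt").input_ids
    synthetic = []
    for _ in range(m):
        out = lm.generate(ids, do_sample=True, top_p=0.9,
                          max_new_tokens=max_new_tokens,
                          pad_token_id=tok.eos_token_id)
        synthetic.append(tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True).strip())
    # The (k + m)-shot prompt used for the final GPT-J prediction.
    return prompt + "\n\n" + "\n\n".join(synthetic)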

Brown et al. (2020) studied a total of 51 few-
shot learning tasks. Studying all of these tasks is
prohibitively expensive. Thus, we filter tasks by following these two steps. First, since generating
m synthetic examples for each test instance is
computationally expensive, we exclude tasks that

Model                | MNLI (m/mm) | CoLA | SST-2 | MRPC      | STS-B     | QQP       | QNLI | RTE  | Avg
Individual Models (our implementation):
RoBERTa-large        | 90.1/89.7   | 63.8 | 96.1  | 91.2/88.3 | 90.9/90.7 | 72.5/89.6 | 94.5 | 85.9 | 86.5
RoBERTa-large + GAL  | 90.2/89.8   | 66.2 | 96.4  | 92.0/89.2 | 90.7/90.5 | 73.6/89.9 | 95.0 | 86.3 | 87.1
Ensemble Models (our implementation):
RoBERTa-large        | 91.2/90.5   | 66.8 | 96.9  | 92.8/90.3 | 91.9/91.6 | 74.5/90.4 | 95.5 | 87.7 | 87.9
RoBERTa-large + GAL  | 91.0/90.7   | 67.9 | 97.1  | 93.1/90.8 | 91.6/91.4 | 74.5/90.4 | 95.8 | 88.2 | 88.2
State-of-the-art:
RoBERTa-large        | 90.8/90.2   | 67.8 | 96.7  | 92.3/89.8 | 92.2/91.9 | 74.3/90.3 | 95.4 | 88.2 | 88.0
ELECTRA              | 91.3/90.8   | 71.7 | 97.1  | 93.1/90.7 | 92.9/92.5 | 75.6/90.8 | 95.8 | 89.8 | 89.2
T5                   | 92.2/91.9   | 71.6 | 97.5  | 92.8/90.4 | 93.1/92.8 | 75.1/90.6 | 96.9 | 92.8 | 89.8
ERNIE                | 91.9/91.4   | 74.4 | 97.8  | 93.9/91.8 | 93.0/92.6 | 75.2/90.9 | 97.3 | 92.0 | 90.2
DeBERTa              | 91.9/91.6   | 71.5 | 97.5  | 94.0/92.0 | 92.9/92.6 | 76.2/90.8 | 99.2 | 93.2 | 90.3

Table 3: RoBERTa-large with GAL self-training and SoTA methods evaluated on GLUE test sets. The benefit of GAL on single models is larger than on ensembles. It appears that self-training reduces the variance of models. Baselines include much larger models: RoBERTa-large (Liu et al., 2019), ELECTRA (Clark et al., 2020), T5 (Raffel et al., 2020), ERNIE (Sun et al., 2019b), and DeBERTa (He et al., 2020). MNLI-m and MNLI-mm indicate matched and mismatched, respectively.

Model                            | SST-2     | PIQA      | COPA      | BoolQ     | Avg
4-shot                           | 89.8 ±0.8 | 76.0 ±1.4 | 79.0 ±1.5 | 64.3 ±0.8 | 77.3
8-shot                           | 91.3 ±0.8 | 76.2 ±1.2 | 79.0 ±1.5 | 66.2 ±0.8 | 78.2
16-shot                          | 92.7 ±0.6 | 77.0 ±0.9 | 81.0 ±1.1 | 66.8 ±0.8 | 79.4
4-shot + synthetic 12-shot (GAL) | 91.5 ±0.7 | 76.7 ±1.0 | 80.0 ±1.2 | 65.9 ±0.8 | 78.5

Table 4: Few-shot learning results for GPT-J (6B) (Wang and Komatsuzaki, 2021) on four NLP datasets. Accuracy is reported for these datasets.

have more than 5k test examples. Second, we filter tasks on which GPT-3-6B achieves a score lower than 65% (please refer to Table H.1 in Brown et al. [2020] for more details). After applying the filtering steps, we use four datasets: SST-2 (Wang et al., 2019c), PIQA (Bisk et al., 2020), COPA, and BoolQ (Wang et al., 2019b) as the testbed. We notice that in order to generate valid synthetic data, GPT-J requires seeing at least 4 labeled examples. In addition, at most 16 examples of BoolQ can be fed into GPT-J without truncation. Thus, we set k and m to 4 and 12, respectively. As seen in Table 4, GAL leads to an average improvement of 1.2% over 4-shot learning, and reduces the gap between 4-shot and 16-shot learning. We noticed that the quality of some generated examples is low. We believe the performance of few-shot learning can be further improved with high-quality instances. One solution is to generate many synthetic examples and select a high-quality subset. Since each test instance conditions on distinct labeled instances, one has to generate different synthetic instances for each test example from GPT-J, which causes expensive computation. Due to such computational constraints, we leave the investigation of data selection strategies to future work.

5.4 Ablating Components of GAL on GLUE

We conduct an in-depth study of different com-
ponents of GAL on GLUE datasets. Unless stated
otherwise, we use a RoBERTa-base model with a
combination of the original training data and 40×
synthetic data for each self-training experiment.

GPT-2 Model Size. Radford et al. (2019) present a few variants of the GPT-2 model, including GPT-2, GPT-2-medium, GPT-2-large, and GPT-2-XL. Larger GPT-2 models yield better
perplexity scores and higher generation quality.
We utilize these models except GPT-2-XL within
the GAL framework to study the impact of the
generative model’s quality on downstream task’s
performance. Table 5 shows that regardless of the

GPT-2  | SST-2 | RTE  | MRPC | CoLA
NA     | 94.8  | 78.8 | 90.1 | 63.6
small  | 95.5  | 81.3 | 90.9 | 63.9
medium | 95.3  | 81.3 | 91.3 | 63.7
large  | 95.3  | 81.4 | 91.7 | 65.1

Table 5: GAL with various GPT-2 model sizes on GLUE dev sets. NA indicates a RoBERTa base model.

Pseudo label | SST-2 | RTE  | MRPC | CoLA
hard         | 95.0  | 80.7 | 90.8 | 63.0
soft         | 95.3  | 81.4 | 91.7 | 65.1

Table 6: GAL with soft vs. hard pseudo labels on GLUE dev sets.

GPT-2 model sizes, GAL consistently surpasses
the vanilla RoBERTa base. Moreover, SST-2 and
RTE datasets are not sensitive to the capacity of
GPT-2, but higher quality synthetic text improves
the results on MRPC and CoLA datasets. We leave
investigation of GPT-2-XL and even larger LMs
such as GPT-3 (Brown et al., 2020) to future work.

Soft vs. Hard Pseudo Label. We investigate
the use of soft and hard pseudo labels within the
GAL framework. The results in Table 6 suggest
that GAL using soft pseudo labels is more effective
than hard labels on the GLUE benchmark. This
finding is compatible with the intuition that soft
labels enable measuring the functional similarity
of neural networks better (Hinton et al., 2015).
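
The difference between the two kinds of pseudo labels, for a single teacher prediction, is illustrated by the toy example below (ours, not from the paper).

import torch
import torch.nn.functional as F

teacher_logits = torch.tensor([[1.2, 0.3, -0.8]])      # one synthetic example, 3 classes
soft = F.softmax(teacher_logits, dim=-1)               # ~[[0.65, 0.26, 0.09]]: full distribution
hard = F.one_hot(soft.argmax(dim=-1), num_classes=3)   # tensor([[1, 0, 0]]): argmax only
# GAL trains the student against `soft`, preserving the teacher's uncertainty over classes.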

Class-conditional Synthetic Data Generation. Previous work (Kumar et al., 2020b; Ravuri and Vinyals, 2019) suggests that it is challenging to utilize labeled synthetic data from class-conditional generative models to boost the accuracy of text and image classifiers. Our theory in Section 4 points to the potential drawback of class-conditional synthetic data. We empirically study this phenomenon by fine-tuning GPT-2 in a class-conditional manner. Then we utilize its synthetic examples in two different cases: 1) labeled synthetic examples and 2) unlabeled synthetic examples. Table 7 shows that not only do class-conditional LMs underperform unconditional LMs in our GAL framework, but they are also much worse than the baseline when using the predefined labels. Nevertheless, if we apply GAL to these examples, the class-conditional LM is on par with the unconditional one, which corroborates the importance of the annotation step in GAL. We provide more analysis in Appendix A.3.

6 Limitations

This work demonstrates that one can leverage
synthetic in-domain data generated by powerful
pre-trained generative models. For simplicity, we do not employ any filtering avenue to retain diverse but high-quality data points. However, pre-
vious work has shown that advanced filtering
approaches can further improve the performance
(Sohn et al., 2020; Du et al., 2020; Yang et al.,
2020). Given that the improvements in the self-
training are not sizeable, we believe it is worth
imposing filtering methods on the synthetic data
to mitigate the side effects caused by the noisy
data points.

Although we examine the effectiveness of GAL
on various classification tasks, we still focus on
the sentence-level tasks. Because of the superior
performance on sentence-level tasks, there has
been a shift of interest toward document-level
tasks, such as document-level machine transla-
tion (Miculicich et al., 2018; Voita et al., 2018;
Maruf and Haffari, 2018), document summariza-
tion (Rush et al., 2015; Nallapati et al., 2016),
and so forth. As these tasks suffer from data scar-
city, one can leverage GAL to synthesize more
data points. However, previous work has shown
that GPT-2 has difficulty generating coherent text
requiring long-range dependency (Orbach and
Goldberg, 2020; Guan et al., 2020). Thus, such
a limitation may hinder the application of GAL
to document-level tasks.

Furthermore, the label space of the studied tasks
is not as complex as the structured prediction
tasks, such as machine translation, dialog system,
question answering, and so forth. However, we believe one can smoothly adapt GAL to these tasks as well. Let us consider machine translation (MT)
as a canonical structured prediction task. Prior
work has shown that one can use (real) monolin-
gual data, in either source or the target language,
through data augmentation (Sennrich et al., 2016)
or knowledge distillation (Kim and Rush, 2016)
to improve structured prediction tasks. This suggests a promising avenue for future research on using synthetically generated monolingual data
to improve MT for specialized domains where
even monolingual data is scarce.

Generative model            | Labeled synthetic data | SST-2 | RTE  | MRPC | CoLA
None (baseline)             | −                      | 94.8  | 78.8 | 90.1 | 63.6
Class-conditional LM        | yes                    | 92.9  | 74.4 | 86.0 | 58.4
Unconditional LM (GAL)      | no                     | 95.3  | 81.4 | 91.7 | 65.1
Class-conditional LM (GAL)  | no                     | 95.4  | 81.0 | 91.4 | 65.2

Table 7: Synthetic data from class-conditional LMs underperforms GAL and RoBERTa on GLUE dev sets.

Furthermore, Vu et al. (2021a) suggest that one
can leverage a retrieval-based approach to ob-
tain monolingual sentences from the generic data
stores. This retrieved monolingual data is then
employed to improve the translation quality in
a domain adaptation setting. This suggests that
a GAL-based approach to synthetically generate
monolingual text is a promising method to im-
prove MT for specialized domains—an interest-
ing direction for future research.

7 Conclusion

We present Generate, Annotate, and Learn (GAL):
a framework for self-training and knowledge dis-
tillation with generated unlabeled data. We mo-
tivate GAL from an expected risk minimization
perspective and demonstrate both theoretically
and empirically that the use of unconditional gen-
erative models for synthetic data generation is
more effective than class-conditional generative
models previously used in the literature. GAL
leverages advances in large pretrained language
models to help supervised learning and can have
implications for learning from limited labeled
data. GAL significantly helps improve knowledge
distillation and prompt-based few-shot learning.
Furthermore, concurrent work (Gowal et al., 2021) has shown that using generated images can enhance the robustness of image classifiers. We will explore this direction on NLP tasks in the future. Finally, we hope that GAL will stimulate
new research on the evaluation and development
of large language models.

Acknowledgments

We would like to thank the anonymous review-
ers and action editor André F.T. Martins for their
comments and suggestions on this work. The com-
putational resources of this work are partly sup-
ported by the Multi-modal Australian ScienceS

Imaging and Visualisation Environment (MAS-
SIVE) (www.massive.org.au). This mate-
rial is partly based on research sponsored by Air
Force Research Laboratory and DARPA under
agreement number FA8750-19-2-0501. The U.S.
Government is authorized to reproduce and dis-
tribute reprints for Governmental purposes not-
withstanding any copyright notation thereon.

References

Steven Abney. 2004. Understanding the Yarowsky algorithm. Computational Linguistics, 30(3):365–395. https://doi.org/10.1162/0891201041850876

UN. Agrawala. 1970. Learning with a probabilis-
tic teacher. IEEE Transactions on Information
Theory, 16(4):373–379. https://doi.org
/10.1109/TIT.1970.1054472

Chris Alberti, Daniel Andor, Emily Pitler, Jacob
Devlin, and Michael Collins. 2019. Synthetic
QA corpora generation with roundtrip con-
sistency. In Proceedings of the 57th Annual
Meeting of the Association for Computational
Linguistics, pages 6168–6173. https://est ce que je
.org/10.18653/v1/P19-1620

Yonatan Bisk, Rowan Zellers, Jianfeng Gao, and Yejin Choi. 2020. PIQA: Reasoning about physical commonsense in natural language. In
Proceedings of the AAAI Conference on Artifi-
cial Intelligence, volume 34, pages 7432–7439.
https://doi.org/10.1609/aaai.v34i05
.6239

Tom B. Brown, Benjamin Mann, Nick Ryder,
Melanie Subbiah,
Jared Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam,
Girish Sastry, Amanda Askell, Sandhini
Agarwal, Ariel Herbert-Voss, Gretchen Krueger,
Tom Henighan, Rewon Child, Aditya Ramesh,
Daniel M. Ziegler, Jeffrey Wu, Clemens Winter,

Christopher Hesse, Mark Chen, Eric Sigler,
Mateusz Litwin, Scott Gray, Benjamin
Chess, Jack Clark, Christopher Berner, Sam
McCandlish, Alec Radford, Ilya Sutskever, et
Dario Amodei. 2020. Language models are
few-shot learners. arXiv:2005.14165.

Steven Y. Feng, Varun Gangal, Jason Wei, Sarath
Chandar, Soroush Vosoughi, Teruko Mitamura,
and Eduard Hovy. 2021. A survey of data aug-
mentation approaches for NLP. In Findings of
the Association for Computational Linguistics:
ACL-IJCNLP 2021, pages 968–988.

Cristian Buciluǎ, Rich Caruana, and Alexandru
Niculescu-Mizil. 2006. Model compression.
Proceedings of the 12th ACM SIGKDD Interna-
tional Conference on Knowledge Discovery and
Data Mining, pages 535–541. https://est ce que je
.org/10.1145/1150402.1150464

Yair Carmon, Aditi Raghunathan, Ludwig Schmidt, John C. Duchi, and Percy S. Liang.
2019. Unlabeled data improves adversarial
robustness. Advances in Neural Information
Processing Systems, 32.

Olivier Chapelle, Jason Weston, L´eon Bottou,
and Vladimir Vapnik. 2001. Vicinal risk min-
imization. Advances in Neural Information
Processing Systems.

Ting Chen, Simon Kornblith, Kevin Swersky,
Mohammad Norouzi, and Geoffrey Hinton.
2020un. Big self-supervised models are strong
semi-supervised learners. NeurIPS.

Yining Chen, Colin Wei, Ananya Kumar, et
Tengyu Ma. 2020b. Self-training avoids us-
ing spurious features under domain shift. Dans
Advances in Neural Information Processing
Systems 33: Annual Conference on Neural In-
formation Processing Systems 2020, NeurIPS
2020, Décembre 6-12, 2020, virtual.

Kevin Clark, Minh-Thang Luong, Quoc V. Le,
and Christopher D. Manning. 2020. Electra:
Pre-training text encoders as discriminators
rather than generators. International Confer-
ence on Learning Representations.

Jingfei Du, Edouard Grave, Beliz Gunel, Vishrav
Chaudhary, Onur Celebi, Michael Auli, Ves
Stoyanov, and Alexis Conneau. 2020. Self-
training improves pre-training for natural lan-
guage understanding. arXiv:2010.02194.

Jason Eisner and Damianos Karakos. 2005.
Bootstrapping without the boot. In Proceed-
ings of Human Language Technology Confer-
ence and Conference on Empirical Methods in
Natural Language Processing, pages 395–402.
https://doi.org/10.3115/1220575
.1220625

S. Fralick. 1967. Learning to recognize patterns
without a teacher. IEEE Transactions on In-
formation Theory. https://est ce que je.org/10
.1109/TIT.1967.1053952

Tommaso Furlanello, Zachary Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. 2018. Born again neural networks. International Conference on Machine Learning, pages 1607–1616.

Leo Gao, Jonathan Tow, Stella Biderman, Sid
Black, Anthony DiPofi, Charles Foster, Laurence
Golding, Jeffrey Hsu, Kyle McDonell, Niklas
Muennighoff, Jason Phang, Laria Reynolds,
Eric Tang, Anish Thite, Ben Wang, Kevin
Wang, and Andy Zou. 2021. A framework for
few-shot language model evaluation.

Sven Gowal, Sylvestre-Alvise Rebuffi, Olivia
Wiles, Florian Stimberg, Dan Andrei Calian,
and Timothy A. Mann. 2021. Improving robust-
ness using generated data. Advances in Neural
Information Processing Systems, 34.

Jian Guan, Fei Huang, Zhihao Zhao, Xiaoyan
Zhu, and Minlie Huang. 2020. A knowledge-
enhanced pretraining model for commonsense
story generation. Transactions of the Associa-
tion for Computational Linguistics, 8:93–108.
https://doi.org/10.1162/tacl a 00302

Gholamreza Haffari and Anoop Sarkar. 2007.
Analysis of semi-supervised learning with the
yarowsky algorithm. In UAI 2007, Proceed-
ings of the Twenty-Third Conference on Un-
certainty in Artificial Intelligence, Vancouver,
BC, Canada, Juillet 19-22, 2007, pages 159–166.
AUAI Press.

Pengcheng He, Xiaodong Liu, Jianfeng Gao,
and Weizhu Chen. 2020. Deberta: Decoding-
enhanced BERT with disentangled attention.
arXiv:2006.03654.

Danny Hernandez, Jared Kaplan, Tom Henighan,
and Sam McCandlish. 2021. Scaling laws for
transfer.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean.
2015. Distilling the knowledge in a neural
réseau. arXiv:1503.02531.

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2019. TinyBERT: Distilling BERT for natural language understanding. arXiv:1909.10351. https://doi.org/10.18653/v1/2020.findings-emnlp.372

Yoon Kim and Alexander M. Rush. 2016.
Sequence-level knowledge distillation. En Pro-
ceedings of the 2016 Conference on Empirical
Methods in Natural Language Processing,
pages 1317–1327. https://est ce que je.org/10
.18653/v1/D16-1139

Sosuke Kobayashi. 2018. Contextual augmenta-
tion: Data augmentation by words with paradig-
matic relations. In Proceedings of the 2018
Conference of the North American Chapter
of the Association for Computational Linguis-
tics: Human Language Technologies, Volume 2
(Short Papers), pages 452–457, New Orleans,
Louisiana. Association for Computational
Linguistics. https://doi.org/10.18653
/v1/N18-2072

Ananya Kumar, Tengyu Ma, and Percy Liang.
2020un. Understanding self-training for gradual
domain adaptation. In Proceedings of the 37th
International Conference on Machine Learn-
ing, volume 119 of Proceedings of Machine
Learning Research, pages 5468–5479. PMLR.

Varun Kumar, Ashutosh Choudhary, and Eunah
Cho. 2020b. Data augmentation using pre-
trained transformer models. arXiv:2003.02245.

Jared Lichtarge, Chris Alberti, Shankar Kumar,
Noam Shazeer, Niki Parmar, and Simon Tong.
2019. Corpora generation for grammatical error
correction. arXiv:1904.05780. https://est ce que je
.org/10.18653/v1/N19-1333

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei
Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin
Stoyanov. 2019. Roberta: A robustly optimized
bert pretraining approach. arXiv:1907.11692.

Sameen Maruf and Gholamreza Haffari. 2018.
Document context neural machine translation
with memory networks. In Proceedings of the
56th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long

Papers), pages 1275–1284. https://doi.org
/10.18653/v1/P18-1118

Alana Marzoev, Samuel Madden, M.. Frans
Kaashoek, Michael J. Cafarella, and Jacob
Andreas. 2020. Unnatural language processing:
Bridging the gap between synthetic and natural
language data. ArXiv, abs/2004.13645.

Lesly Miculicich, Dhananjay Ram, Nikolaos
Pappas, and James Henderson. 2018. Document-
level neural machine translation with hierarchi-
cal attention networks. In Proceedings of the
2018 Conference on Empirical Methods in Nat-
ural Language Processing, pages 2947–2954,
Brussels, Belgium. Association for Computa-
tional Linguistics. https://est ce que je.org/10
.18653/v1/D18-1325

Hossein Mobahi, Mehrdad Farajtabar, and Peter
L. Bartlett. 2020. Self-distillation amplifies
regularization in hilbert space. In Advances in
Neural Information Processing Systems 33:
Annual Conference on Neural Information Pro-
cessing Systems 2020, NeurIPS 2020, Decem-
ber 6-12, 2020, virtual.

Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre,
Bing Xiang. 2016. Abstractive text summari-
zation using sequence-to-sequence RNNs and
au-delà. In Proceedings of The 20th SIGNLL
Conference on Computational Natural Lan-
guage Learning, pages 280–290. https://
doi.org/10.18653/v1/K16-1028

Nathan Ng, Kyra Yee, Alexei Baevski, Myle
Ott, Michael Auli, and Sergey Edunov. 2019.
Facebook fair’s wmt19 news translation task
submission. In Proceedings of the Fourth Con-
ference on Machine Translation (Volume 2:
Shared Task Papers, Day 1), pages 314–319.

Sajad Norouzi, David J. Fleet, and Mohammad
Norouzi. 2020. Exemplar vaes for exem-
plar based generation and data augmentation.
arXiv:2004.04795.

Eyal Orbach and Yoav Goldberg. 2020. Facts2Story: Controlling text generation by key facts. In Proceedings of the 28th International Conference on Computational Linguistics, pages 2329–2345, Barcelona, Spain (Online). International Committee on Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-main.211

Myle Ott, Sergey Edunov, Alexei Baevski,
Angela Fan, Sam Gross, Nathan Ng, David

Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53. https://doi.org/10.18653/v1/N19-4009

Samet Oymak and Talha Cihad Gulcu. 2020.
Statistical and algorithmic insights for semi-
supervised learning with self-training. CoRR,
abs/2006.11006.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. Technical report, OpenAI.

Colin Raffel, Noam Shazeer, Adam Roberts,
Katherine Lee, Sharan Narang, Michael
Matena, Yanqi Zhou, Wei Li, and Peter J. Liu.
2020. Exploring the limits of transfer learning
with a unified text-to-text transformer. Journal
of Machine Learning Research, 21:1–67.

Aditi Raghunathan, Sang Michael Xie, Fanny
Yang, John Duchi, and Percy Liang. 2020.
Understanding and mitigating the tradeoff be-
tween robustness and accuracy. In Proceedings
of the 37th International Conference on Ma-
chine Learning, volume 119 of Proceedings of
Machine Learning Research, pages 7909–7919.
PMLR.

Ahmad Rashid, Vasileios Lioutas, and Mehdi
Rezagholizadeh. 2021. Mate-kd: Masked ad-
versarial text, a companion to knowledge dis-
tillation. arXiv preprint arXiv:2105.05912.
https://doi.org/10.18653/v1/2021
.acl-long.86

Suman Ravuri and Oriol Vinyals. 2019. Classi-
fication accuracy score for conditional genera-
tive models. Advances in Neural Information
Processing Systems, pages 12268–12279.

Alexander M. Rush, Sumit Chopra, and Jason
Weston. 2015. A neural attention model for
abstractive sentence summarization. En Pro-
ceedings of the 2015 Conference on Empirical
Methods in Natural Language Processing,
pages 379–389.

Victor Sanh, Lysandre Debut, Julien Chaumond,
and Thomas Wolf. 2019. DistilBERT, a dis-
tilled version of BERT: Smaller, faster, cheaper
and lighter. ArXiv, abs/1910.01108.

H. Scudder. 1965. Probability of error of some
adaptive pattern-recognition machines. IEEE
Transactions on Information Theory. https://
doi.org/10.1109/TIT.1965.1053799

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany. Association for Computational Linguistics. https://doi.org/10.18653/v1/P16-1009

Dinghan Shen, Mingzhi Zheng, Yelong Shen,
Yanru Qu, and Weizhu Chen. 2020. A sim-
ple but tough-to-beat data augmentation ap-
proach for natural language understanding and
generation. arXiv preprint arXiv:2009.13818.

Sam Shleifer. 2019. Low resource text classifi-
cation with ulmfit and backtranslation. arXiv
preprint arXiv:1903.09244.

Kihyuk Sohn, David Berthelot, Chun-Liang Li,
Zizhao Zhang, Nicholas Carlini, Ekin D. Cubuk,
Alex Kurakin, Han Zhang, and Colin Raffel.
2020. Fixmatch: Simplifying semi-supervised
learning with consistency and confidence.
arXiv:2001.07685.

Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing
Liu. 2019un. Patient knowledge distillation for
BERT model compression. Proceedings of
le 2019 Conference on Empirical Meth-
ods in Natural Language Processing and the
9th International Joint Conference on Natu-
ral Language Processing (EMNLP-IJCNLP),
pages 4314–4323. https://est ce que je.org/10
.18653/v1/D19-1441

Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019b. ERNIE: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223.

Nicola Ueffing, Gholamreza Haffari, and Anoop
Sarkar. 2007. Transductive learning for sta-
tistical machine translation. In Proceedings of
the 45th Annual Meeting of the Association

of Computational Linguistics, pages 25–32,
Prague, Czech Republic. Association for Com-
putational Linguistics.

Vladimir Vapnik. 1992. Principles of risk mini-
mization for learning theory. Advances in Neu-
ral Information Processing Systems.

Elena Voita, Pavel Serdyukov, Rico Sennrich,
and Ivan Titov. 2018. Context-aware neural
machine translation learns anaphora resolution.
In Proceedings of the 56th Annual Meeting
of the Association for Computational Linguis-
tics (Volume 1: Long Papers), pages 1264–1274.
https://doi.org/10.18653/v1/P18
-1117

Thuy Vu, Xuanli He, Dinh Phung, and
Gholamreza Haffari. 2021a. Generalised unsu-
pervised domain adaptation of neural machine
translation with cross-lingual data selection. In
Proceedings of the 2021 Conference on Empir-
ical Methods in Natural Language Processing,
pages 3335–3346.

Tu Vu, Minh-Thang Luong, Quoc Le, Grady
Simon, and Mohit Iyyer. 2021b. STraTA: Self-
training with task augmentation for better few-
shot learning. In Proceedings of the 2021
Conference on Empirical Methods in Natu-
ral Language Processing, pages 5715–5731.
https://doi.org/10.18653/v1/2021
.emnlp-main.462

Alex Wang, Jan Hula, Patrick Xia, Raghavendra
Pappagari, R. Thomas McCoy, Roma Patel,
Najoung Kim, Ian Tenney, Yinghui Huang,
Katherin Yu, Shuning Jin, Berlin Chen,
Benjamin Van Durme, Edouard Grave, Ellie
Pavlick, and Samuel R. Bowman. 2019a. Can
you tell me how to get past sesame street?
Sentence-level pretraining beyond language
modeling. In Proceedings of the 57th Annual
Meeting of the Association for Computational
Linguistics, pages 4465–4476. https://doi
.org/10.18653/v1/P19-1439

Alex Wang, Yada Pruksachatkun, Nikita Nangia,
Amanpreet Singh, Julian Michael, Felix Hill,
Omer Levy, and Samuel R. Bowman. 2019b.
SuperGLUE: A stickier benchmark for general-
purpose language understanding systems. arXiv:
1905.00537.

Alex Wang, Amanpreet Singh, Julian Michael,
Felix Hill, Omer Levy, and Samuel R.

Bowman. 2019c. GLUE: A multi-task bench-
mark and analysis platform for natural lan-
guage understanding. International Conference
on Learning Representations. https://doi
.org/10.18653/v1/W18-5446

Bailin Wang, Wenpeng Yin, Xi Victoria Lin, and
Caiming Xiong. 2021. Learning to synthesize
data for semantic parsing. In Proceedings of
the Meeting of the North-American Chapter
of Association for Computational Linguistics
(NAACL). https://doi.org/10.18653
/v1/2021.naacl-main.220

Ben Wang and Aran Komatsuzaki. 2021.
GPT-J-6B: UN 6 billion parameter autoregres-
sive language model. https://github.com
/kingoflolz/mesh-transformer-jax.

William Yang Wang and Diyi Yang. 2015.
That’s so annoying!!!: A lexical and frame-
semantic embedding based data augmenta-
tion approach to automatic categorization of
annoying behaviors using #petpeeve tweets.
Proceedings of the 2015 Conference on Empir-
ical Methods in Natural Language Processing,
pages 2557–2563. https://doi.org/10
.18653/v1/D15-1306

Xiaojie Wang, Rui Zhang, Yu Sun, and Jianzhong
Qi. 2018. Kdgan: Knowledge distillation with
generative adversarial networks. NeurIPS.

Yushi Wang, Jonathan Berant, and Percy Liang.
2015. Building a semantic parser overnight.
In Proceedings of the 53rd Annual Meet-
ing of the Association for Computational
Linguistics and the 7th International Joint
Conference on Natural Language Processing
(Volume 1: Long Papers), pages 1332–1342,
Beijing, China. Association for Computational
Linguistics. https://doi.org/10.3115
/v1/P15-1129

Colin Wei, Kendrick Shen, Yining Chen, and
Tengyu Ma. 2021. Theoretical analysis of
self-training with deep networks on unlabeled
data. In International Conference on Learning
Representations.

Jason Wei and Kai Zou. 2019. Eda: Easy data
augmentation techniques for boosting perfor-
mance on text classification tasks. In Pro-
ceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and
the 9th International Joint Conference on Nat-
ural Language Processing (EMNLP-IJCNLP),
pages 6382–6388. https://doi.org/10
.18653/v1/D19-1670

Thomas Wolf, Lysandre Debut, Victor Sanh,
Julien Chaumond, Clement Delangue, Antoine
Moi, Pierric Cistac, Tim Rault, Remi Louf,
Morgan Funtowicz, Joe Davison, Sam Shleifer,
Patrick von Platen, Clara Ma, Yacine Jernite,
Julien Plu, Canwen Xu, Teven Le Scao,
Sylvain Gugger, Mariama Drame, Quentin
Lhoest, and Alexander Rush. 2020. Transform-
ers: State-of-the-art natural language process-
ing. Proceedings of the 2020 Conference on
Empirical Methods in Natural Language Pro-
cessing: System Demonstrations, pages 38–45.
https://doi.org/10.18653/v1/2020
.emnlp-demos.6

Xing Wu, Shangwen Lv, Liangjun Zang,
Jizhong Han, and Songlin Hu. 2019. Condi-
tional BERT contextual augmentation. Interna-
tional Conference on Computational Science,
pages 84–95, Springer. https://doi.org
/10.1007/978-3-030-22747-0_7

Qizhe Xie, Minh-Thang Luong, Eduard Hovy,
and Quoc V. Le. 2020. Self-training with
noisy student improves imagenet classifica-
tion. 2020 IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR),
pages 10684–10695. https://doi.org
/10.1109/CVPR42600.2020.01070

Canwen Xu, Wangchunshu Zhou, Tao Ge, Furu
Wei, and Ming Zhou. 2020. Bert-of-theseus:
Compressing BERT by progressive module re-
placing. Proceedings of the 2020 Conference
on Empirical Methods in Natural Language
Processing (EMNLP), pages 7859–7869.

Yiben Yang, Chaitanya Malaviya, Jared
Fernandez, Swabha Swayamdipta, Ronan Le
Bras, Ji-Ping Wang, Chandra Bhagavatula,
Yejin Choi, and Doug Downey. 2020. G-daug:
Generative data augmentation for common-
sense reasoning. arXiv:2004.11546. https://
doi.org/10.18653/v1/2020.findings
-emnlp.90

David Yarowsky. 1995. Unsupervised word sense
disambiguation rivaling supervised methods.
33rd Annual Meeting of the Association for
Computational Linguistics, pages 189–196.
https://doi.org/10.3115/981658.981684

Adams Wei Yu, David Dohan, Minh-Thang
Luong, Rui Zhao, Kai Chen, Mohammad
Norouzi, and Quoc V. Le. 2018. QANet: Com-
bining local convolution with global self-
attention for reading comprehension. ICLR.

Hongyi Zhang, Moustapha Cisse, Yann N.
Dauphin, and David Lopez-Paz. 2018. mixup:
Beyond empirical risk minimization. ICLR.

Linfeng Zhang, Jiebo Song, Anni Gao, Jingwei
Chen, Chenglong Bao, and Kaisheng Ma. 2019.
Be your own teacher: Improve the perfor-
mance of convolutional neural networks via
self distillation. Proceedings of the IEEE/CVF
International Conference on Computer Vision,
pages 3713–3722. https://doi.org/10
.1109/ICCV.2019.00381

A Appendices

A.1 Datasets

The statistics of GLUE are reported in Table 8.

A.2 GPT-2 for Classification

We have conducted additional experiments in which we fine-tune GPT-2 as a
classifier. We consider two variants of the GPT-2 model. The first variant is
the original GPT-2 model (GPT2-original), pre-trained on open-domain text. The
second variant is the GPT-2 model that was fine-tuned on the inputs of each
task separately (GPT-2-finetuned); this model was used to generate
task-specific (synthetic) unlabeled data. Finally, we also consider
self-training with GAL on top of GPT2-original. Specifically, we use the
GPT-2-finetuned model to synthesize 40x in-domain unlabeled data, and then
apply self-training to GPT-2-original on the combination of the original
labeled data and the pseudo-labeled synthetic data. Table 9 suggests that the
gains of GAL come from the pseudo-labeled synthetic data, i.e., both the
synthetic unlabeled data and the teacher's knowledge. Without the generation
of synthetic unlabeled data, the domain-specific knowledge embedded in the
GPT-2-finetuned model cannot be utilized; as such, the GPT-2-finetuned model
is inferior to the GPT2-original model. Since RoBERTa-large is superior to the
GPT-2 models, RoBERTa-large+GAL also significantly outperforms its GPT-2
counterpart.
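
The procedure above can be summarized in a short sketch. This is a minimal
illustration rather than our exact training code: the model directory name,
the sampling hyper-parameters, and the teacher callable are hypothetical
placeholders, and the final supervised fine-tuning of the student classifier
is left as standard training.

# Minimal sketch of the self-training setup described above.
# "gpt2-finetuned-task" is a hypothetical directory holding the task-specific
# GPT-2 LM; `teacher` is any callable mapping an input text to a pseudo label.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer


def synthesize_unlabeled(model_dir="gpt2-finetuned-task", num_samples=1000,
                         max_len=64, device="cpu"):
    """Sample synthetic in-domain inputs from the fine-tuned GPT-2 LM."""
    tok = GPT2Tokenizer.from_pretrained(model_dir)
    lm = GPT2LMHeadModel.from_pretrained(model_dir).to(device).eval()
    bos = torch.tensor([[tok.bos_token_id]], device=device)
    texts = []
    with torch.no_grad():
        for _ in range(num_samples):
            out = lm.generate(bos, do_sample=True, top_p=0.95,
                              max_length=max_len,
                              pad_token_id=tok.eos_token_id)
            texts.append(tok.decode(out[0], skip_special_tokens=True))
    return texts


def pseudo_label(teacher, synthetic_texts):
    """Annotate synthetic text with the teacher classifier's predictions."""
    return [(x, teacher(x)) for x in synthetic_texts]

# The student (here, the GPT-2-original classifier) is then fine-tuned on the
# union of the labeled training set and these pseudo-labeled synthetic
# examples; setting num_samples to roughly 40x the labeled set size matches
# the data regime described above.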

A.3 Importance of Pseudo-labels

We have argued and demonstrated in Section 3 and Section 5 that using
class-conditional generative models to generate labeled synthetic examples is
less effective than GAL.


Dataset  task                           domain               #train  #dev  #test  #classes
SST-2    sentiment analysis             movie reviews        67k     872   1.8k   2
QQP      paraphrase                     social QA questions  364k    40k   391k   2
QNLI     QA/natural language inference  Wikipedia            105k    5k    5.4k   2
RTE      natural language inference     news, Wikipedia      2.5k    277   3k     2
MNLI     natural language inference     misc.                393k    20k   20k    3
MRPC     paraphrase                     news                 3.7k    408   1.7k   2
CoLA     acceptability                  misc.                8.5k    1043  1k     2
STS-B    sentence similarity            misc.                5.8k    1.5k  1.4k   -

Table 8: Summary of the GLUE tasks used for the evaluation of GAL. STS-B is a regression task, thus
#classes is not applicable.

Model                MNLI       CoLA  SST-2  MRPC       STS-B      QQP        QNLI  RTE   Avg
GPT-2-original       85.9/85.6  54.8  94.5   86.9/82.2  86.3/85.2  72.5/89.3  91.2  69.8  80.9
GPT-2-finetuned      85.8/85.5  40.9  94.5   87.0/81.0  85.6/84.3  71.4/88.5  91.5  69.0  78.8
GPT-2-original+GAL   86.2/85.8  55.7  94.7   87.9/83.4  86.9/85.9  72.6/89.4  91.9  70.6  81.5
RoBERTa-large        90.1/89.7  63.8  96.1   91.2/88.3  90.9/90.7  72.5/89.6  94.5  85.9  86.5
RoBERTa-large + GAL  90.2/89.8  66.2  96.4   92.0/89.2  90.7/90.5  73.6/89.9  95.0  86.3  87.1

Table 9: GLUE test results of using GPT-2 and RoBERTa-large as classification models.

Label type          Accuracy  F1    Precision  Recall
GPT2                86.0      87.0  88.7       85.5
RoBERTa             90.0      91.4  100.0      84.1
conditioning label  72.0      71.4  66.0       77.8

Table 10: Performance of GPT2 annotation, RoBERTa annotation, and conditioning labels on
100 random examples from the synthetic RTE dataset generated by a class-conditional LM.

To further verify this argument, we sample 100 instances from the synthetic
RTE dataset generated by the label-prompted GPT2, i.e., the class-conditional
LM. We then annotate these examples using a human annotator, the GPT2
classifier, and the RoBERTa classifier. Finally, we compute the Accuracy, F1,
Precision, and Recall scores between the human labels and the GPT2 labels,
between the human labels and the RoBERTa labels, and between the human labels
and the conditioning labels used by GPT2 when the data was generated. Table 10
shows that the class-conditional LM has difficulty generating sentences that
retain the semantics or pragmatics of the specified category, which
corroborates our theoretical analysis in Section 3. On the other hand,
discriminative models such as the GPT2 and RoBERTa classifiers produce
higher-quality labels that correlate better with the human annotations.
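
The agreement scores in Table 10 can be computed with standard metrics. The
sketch below shows this computation, assuming the labels are stored as plain
Python lists; the variable names and the positive class name ("entailment")
are illustrative placeholders.

# Sketch of the label-agreement computation behind Table 10.
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)


def agreement(human_labels, predicted_labels, pos_label="entailment"):
    """Accuracy/F1/Precision/Recall of one label source vs. the human labels."""
    return {
        "accuracy": accuracy_score(human_labels, predicted_labels),
        "f1": f1_score(human_labels, predicted_labels, pos_label=pos_label),
        "precision": precision_score(human_labels, predicted_labels,
                                     pos_label=pos_label),
        "recall": recall_score(human_labels, predicted_labels,
                               pos_label=pos_label),
    }

# Example: agreement(human_labels, gpt2_labels),
# agreement(human_labels, roberta_labels), and
# agreement(human_labels, conditioning_labels) give the three rows of Table 10.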

A.4 Generated Unlabeled Examples
Annotated with Pseudo Labels

We provide some synthetic sentences generated
by GAL in Tables 11 and 12.
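
The nearest neighbors shown in Tables 11 and 12 are retrieved based on RoBERTa
representations, as noted in the table captions. The sketch below shows one
plausible way to perform such retrieval; the choice of checkpoint
(roberta-base), mean pooling, and cosine similarity are our assumptions rather
than the exact procedure.

# Sketch of nearest-neighbor retrieval over RoBERTa sentence representations.
import torch
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
encoder = RobertaModel.from_pretrained("roberta-base").eval()


def embed(sentences):
    """Mean-pooled, L2-normalized RoBERTa representations."""
    with torch.no_grad():
        batch = tokenizer(sentences, padding=True, truncation=True,
                          return_tensors="pt")
        hidden = encoder(**batch).last_hidden_state           # (B, T, H)
        mask = batch["attention_mask"].unsqueeze(-1).float()  # (B, T, 1)
        pooled = (hidden * mask).sum(1) / mask.sum(1)         # mean over real tokens
        return torch.nn.functional.normalize(pooled, dim=-1)


def nearest_neighbors(query, synthetic_sentences, k=3):
    """Return the k synthetic sentences most similar to the query."""
    sims = embed([query]) @ embed(synthetic_sentences).T      # cosine similarity
    return [synthetic_sentences[i] for i in sims[0].topk(k).indices.tolist()]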


are more deeply thought through than in most ‘ right-thinking ’ films (positive)

KNN:
1: is far more sophisticated, insightful and thought-provoking than his previous films .
(positive)
2: is more sophisticated than its more obvious and less-than-dazzling counterparts (positive)
3: is about as well-thought as the idea of a bad hair day, (negative)

contains no wit, only labored gags (negative)

KNN:
1: lacks insight, and lacks empathy (negative)
2: has little humor or intelligence (negative)
3: lacks all wit and humanity (negative)

Table 11: SST-2: Two labeled examples, along with 3 nearest neighbors (based on RoBERTa
representations) from our synthetic dataset. We include labels for original examples and
pseudo-labels for synthetic examples in parentheses.

How is the life of a math student? Could you describe your own experiences? [SEP] Which
level of prepration is enough for the exam jlpt5? (not duplicated)

KNN:
1: What are the best courses for a mechanical engineering student? [SEP] What is the best
course to do after completing a B.Tech in mechanical engineering? (not duplicated)
2: How much marks are needed to get through the GATE with electronics? [SEP] What is the
average score of the Gate EE exam? What are the cut-offs? (not duplicated)
3: What is the best time table for students to prepare for IAS? [SEP] How can one study for
IAS in a best time? (not duplicated)

How does an IQ test work and what is determined from an IQ test? [SEP] How does IQ test
work? (duplicated)

KNN:
1: What is the average IQ of the U.S. population? [SEP] How does an IQ test work? (not
duplicated)
2: Is the Iq test an effective way to measure intelligence? [SEP] How do IQ tests work?
(duplicated)
3: How is an IQ test on a scale from 1 to 100 scored? [SEP] How do you get your IQ tested?
(not duplicated)

Table 12: QQP: Two labeled examples, along with 3 nearest neighbors (based on RoBERTa
representations) from our synthetic dataset. We include labels for original examples and
pseudo-labels for synthetic examples in parentheses.
