Generate, Annotate, and Learn: NLP with Synthetic Text
Xuanli He1
Islam Nassar1 Jamie Kiros2 Gholamreza Haffari1 Mohammad Norouzi2
1Monash University, Australia
2Google Research, Brain Team, Canada
{xuanli.he1, gholamreza.haffari}@monash.edu, mnorouzi@google.com
Abstract
This paper studies the use of language mod-
els as a source of synthetic unlabeled text for
NLP. We formulate a general framework called
‘‘generate, annotate, and learn (GAL)’’ to
take advantage of synthetic text within knowl-
edge distillation, self-training, and few-shot
learning applications. To generate high-quality
task-specific text, we either fine-tune LMs on
inputs from the task of interest, or prompt
large LMs with few examples. We use the best
available classifier to annotate synthetic text
with soft pseudo labels for knowledge distilla-
tion and self-training, and use LMs to obtain
hard labels for few-shot learning. We train new
supervised models on the combination of la-
beled and pseudo-labeled data, which results
in significant gains across several applications.
We investigate key components of GAL and
present theoretical and empirical arguments
against the use of class-conditional LMs to
generate synthetic labeled text instead of unla-
beled text. GAL achieves new state-of-the-art
knowledge distillation results for 6-layer trans-
formers on the GLUE leaderboard.
1 Introduction
There is an abundance of unlabeled data in the real
world, but task-specific unlabeled data within the
scope of a given machine learning problem can be
challenging to find. For example, one cannot easily
find in-domain unlabeled text conforming to the
input distribution of a specific Natural Language
Processing (NLP) task from the GLUE benchmark
(Wang et al., 2019c). Some NLP tasks require an
input comprising a pair of sentences with a partic-
ular relationship between them. Moreover, clas-
sification datasets typically represent a tailored
distribution of data and only include a limited
number of class labels. If task-specific unlabeled
data were available, one could adopt self-training
(Yarowsky, 1995) to automatically annotate unla-
beled data with pseudo labels to improve accuracy
and robustness of classifiers (Xie et al., 2020;
Carmon et al., 2019). Moreover, one can use
knowledge distillation (Hinton et al., 2015) on
fresh task-specific unlabeled data to more effec-
tively compress deep neural networks and ensem-
bles (Bucilu˘a et al., 2006; Chen et al., 2020a).
In the absence of task-specific unlabeled data,
one could retrieve unlabeled examples from a
large and diverse open-domain dataset (Du et al.,
2020). However, such a retrieval-based approach
may not scale to problems with complex input
schemes, for example, sentence pairs with certain
relationships. Recent work (Yang et al., 2020; Kumar
et al., 2020b) has considered the use of Language
Models (LMs) like GPT-2 (Radford et al., 2019)
as a means of data augmentation, showing the
effectiveness of this approach for commonsense
reasoning and classification tasks. Existing ap-
proaches often consider class-conditional genera-
tion, where the synthetic data is produced by con-
ditioning on a specified class label. However, it
is unclear whether class-conditional generation is
best suited for NLP tasks. Moreover, existing
pipelines often make synthetic data generation
complicated, as one needs to detect and discard
low-quality synthetic labeled data or optionally
re-label data (Yang et al., 2020; Vu et al., 2021b).
For example, Kumar et al. (2020b) observe that
it is difficult for sentences generated by label-
conditioned GPT-2 to retain the semantics/prag-
matics of the conditioning label, leading to poor
performance on downstream tasks.
We unify and simplify existing work on LMs
as a data source for NLP and develop a general
framework called ‘‘generate, annotate, and learn
(GAL)’’. The generality of GAL allows us to use
LM-generated synthetic data within novel appli-
cations such as Knowledge Distillation (KD) and
few-shot learning. GAL builds on recent advances
in text generation (Radford et al., 2019; Gao
et al., 2021) and uses powerful LMs to synthesize
task-specific unlabeled text by fine-tuning or con-
ditioning a large LM on in-distribution examples.
We use state-of-the-art classifiers to annotate gen-
erated text with soft pseudo labels when possible.
We then combine labeled data and pseudo-labeled
data to train more effective supervised models,
resulting in significant gains on a range of NLP
tasks like KD and few-shot learning.
We present a justification for GAL based on
the empirical and vicinal risk minimization frame-
works (Vapnik, 1992; Chapelle et al., 2001). We
also investigate key components of GAL. We find
that even if class-conditional LMs are available
for text generation, it is more effective to discard
the conditioning labels and let the teacher models
produce pseudo labels. This observation is sup-
ported by our theoretical and empirical results.
Hence, in contrast to prior work (Yang et al.,
2020; Vu et al., 2021b), we advocate for the use
of simple unconditional LMs for text synthesis.
Further, we avoid any form of data filtering. Not
surprisingly, we find that the diversity of synthetic
text matters. That said, simple unconditional gen-
eration given random seeds provides sufficient
diversity, and crafting diverse LM prompts is
not needed.
In summary:
• We develop GAL, a simple and effective
approach to the use of LMs for task-specific
unlabeled text generation. We show that GAL
can be used effectively for KD, self-training,
and few-shot learning in NLP.
• We present theoretical and empirical investi-
gations for GAL, explaining why it works and
why using class-conditional LMs to generate
synthetic labeled data is not as effective.
• GAL advances KD for NLP and establishes
a new state-of-the-art (SoTA) result for a sin-
gle 6-layer transformer on the GLUE test set.
It further improves prompt-based few-shot
learning, providing an average improvement
of 1.3% on four 4-shot learning NLP tasks,
outperforming GPT-3-6B.
2 Related Work
Data synthesis with large pre-trained language
models is closely related to our work (Kumar
et al., 2020b; Yang et al., 2020; Vu et al., 2021b;
Norouzi et al., 2020). Yang et al. (2020) propose
a complex scheme, including label-conditioned
data generation, data relabeling, data filtering,
and two-stage training, to utilize synthetic data.
In contrast, we show that a simple mixture of the
original data and synthetic unconditionally gener-
ated data can provide sizable gains. Moreover,
we show a broader use of generative models on KD
and few-shot learning. Vu et al. (2021b) take a task
augmentation approach and employ conditional
generation to produce in-domain synthetic data
for an auxiliary natural language inference (NLI) task,
which is then used to initialize the target-task clas-
sifier. However, not all tasks (e.g., grammatical
acceptability judgments) can benefit from the NLI-
style auxiliary task (Wang et al., 2019a). We aim
to directly generate the unlabeled in-domain data
for the target task. Unlike Norouzi et al. (2020),
we do not use instance-based generative models.
More broadly, there has been a recent surge in
data synthesis and augmentation in NLP, includ-
ing rule-based and model-based approaches; see
Feng et al. (2021) for a recent survey. Data synthe-
sis with grammars has been explored in semantic
parsing and natural language understanding (e.g.,
see Wang et al., 2015, 2021; Marzoev et al.,
2020). Existing approaches to data augmentation
for NLP include lexicon replacement, sentence re-
trieval, and round-trip machine translation (Wang
and Yang, 2015; Yu et al., 2018; Kobayashi,
2018; Wu et al., 2019; Lichtarge et al., 2019; Wei
and Zou, 2019; Alberti et al., 2019; Du et al.,
2020; Shen et al., 2020). We, instead, propose
the use of unconditional autoregressive LMs for
data augmentation. This is simple, flexible, and
powerful.
Self-training is one of the oldest approaches
for semi-supervised learning (Scudder, 1965;
Fralick, 1967; Agrawala, 1970; Yarowsky, 1995;
Eisner and Karakos, 2005; Ueffing et al., 2007;
Du et al., 2020). Abney (2004) and Haffari and
Sarkar (2007) have theoretically analyzed self-
training for simple decision lists. Recent theoreti-
cal work analyzes self-training for linear models,
often under the assumption that the data distri-
bution is (nearly) Gaussian (Carmon et al., 2019;
Raghunathan et al., 2020; Chen et al., 2020b;
Kumar et al., 2020a; Oymak and Gulcu, 2020).
Wei et al. (2021) prove that, under ‘‘expansion’’
and ‘‘class separation’’ assumptions, self-training
can lead to more accurate neural network classi-
fiers. We present a theoretical framing of GAL in
terms of empirical and vicinal risk minimization
(Vapnik, 1992; Chapelle et al., 2001).
Knowledge Distillation (KD) (Bucilu˘a et al.,
2006; Hinton et al., 2015) uses a procedure sim-
ilar to self-training to distill knowledge of an
expressive teacher model into a smaller student
model. In contrast, self-distillation (Furlanello
et al., 2018; Zhang et al., 2019; Mobahi et al.,
2020) uses teacher and student models of equal
size, hoping to iteratively refine class labels. Pre-
vious work uses unlabeled data (Bucilu˘a et al.,
2006) and adversarial training (Wang et al., 2018)
to improve KD. We demonstrate that synthetic
data generated by unconditional generative mod-
els can improve KD on NLP, outperforming strong
KD baselines, which often add more complexity
and additional hyperparameters (e.g., Sun et al.,
2019a; Jiao et al., 2019; Xu et al., 2020; Rashid
et al., 2021).
3 Generate, Annotate, and Learn (GAL)
Given a labeled dataset L = {(x_i, y_i)}_{i=1}^N, we
first train an unconditional domain-specific gen-
erative model g(x) on Lx = {x_i}_{i=1}^N, and then
use it to synthesize unlabeled data. Such synthetic
unlabeled data is used within self-training and
KD even in the absence of in-domain unlabeled
data. We restrict our attention to basic KD and self-
training methods, even though GAL can be com-
bined with more sophisticated semi-supervised
techniques, too.
The effectiveness of GAL depends on the fi-
delity and diversity of synthetic examples. If we
had access to the oracle generative process, we
would be able to obtain the best KD and SSL
results, as if we had access to real task-specific
unlabeled data. Our preliminary experiments sug-
gest that large language models are particularly
effective within the GAL framework. Thus, as
shown in Figure 1, to build the best domain-
specific language model, we adopt a large lan-
guage model pretrained on lots of open-domain
text, and fine-tune it on a given dataset’s inputs,
that is, Lx, ignoring class labels. Both our theory
and ablations confirm that ignoring class labels is
a good idea (cf. Sections 4 and 5). Transferring
the knowledge of large language models is par-
ticularly beneficial when a small input dataset Lx
of text is available (Hernandez et al., 2021).
To improve computational efficiency of GAL,
we do not generate unlabeled data on the fly,
but generate as many unconditional samples as
possible and store them in a synthetic unlabeled
dataset U. We use soft pseudo labels within self-
training and KD, as we empirically found it is
more effective than using hard labels on synthe-
tic data.

Figure 1: An illustration of GAL for NLP. We use
open-domain data once for self-supervised pretraining
(e.g., BERT) and once for training a large LM (e.g.,
GPT-2). BERT is fine-tuned on labeled data to yield a
classifier for the task of interest. GPT-2 is fine-tuned on
the same data without labels to obtain an unconditional
task-specific LM, which is used to generate lots of
synthetic in-domain unlabeled data for self-training
and KD.
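As a concrete illustration of the annotation step in the pipeline of Figure 1, the following is a minimal sketch (not the authors' released code) of labeling a stored synthetic pool with soft pseudo labels offline; the checkpoint path "./roberta-teacher" is a placeholder for a classifier already fine-tuned on the labeled set L.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder path: any classifier fine-tuned on the labeled data L.
tok = AutoTokenizer.from_pretrained("./roberta-teacher")
teacher = AutoModelForSequenceClassification.from_pretrained("./roberta-teacher").eval()

@torch.no_grad()
def annotate(synthetic_texts, batch_size=32):
    """Annotate the stored synthetic pool U with soft pseudo labels, once, offline."""
    soft_labels = []
    for i in range(0, len(synthetic_texts), batch_size):
        batch = tok(synthetic_texts[i:i + batch_size], padding=True,
                    truncation=True, return_tensors="pt")
        probs = F.softmax(teacher(**batch).logits, dim=-1)  # soft pseudo labels
        soft_labels.append(probs)
    return torch.cat(soft_labels)   # one row of class probabilities per example
```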
3.1 Knowledge Distillation with GAL
KD distills knowledge of an expressive teacher
model into a smaller student model (Hinton et al.,
2015). We pose the following objective function
for KD with labeled and synthetic unlabeled data:
ℓ_kd = λ E_{(x,y)∼L} H(y, f_s(x)) + (1 − λ) E_{x̃∼g(x)} H(h(x̃), f_s(x̃)),        (1)

where h is the teacher model, f_s is the student
model, and g is the large pre-trained language
model (e.g., GPT-2) fine-tuned on the text in the
training data Lx. H(q, p) = −q^⊤ log p is the soft-
max cross entropy loss. Note the use of g(x),
approximating the unknown real data distribu-
tion P(x) in (1). Algorithm 1 summarizes the
GAL-KD process.
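A minimal PyTorch sketch of the objective in (1), assuming teacher and student logits have already been computed for a labeled batch and a synthetic batch; the function and variable names are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def gal_kd_loss(student_logits_l, labels, student_logits_u, teacher_logits_u, lam=0.5):
    """Eq. (1): weighted sum of supervised cross entropy on labeled data and
    soft cross entropy against teacher predictions h(x~) on synthetic data."""
    sup = F.cross_entropy(student_logits_l, labels)
    teacher_probs = F.softmax(teacher_logits_u, dim=-1)          # soft pseudo labels
    distill = -(teacher_probs * F.log_softmax(student_logits_u, dim=-1)).sum(-1).mean()
    return lam * sup + (1.0 - lam) * distill
```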
3.2 Self-Training with GAL
Self-training encourages knowledge transfer be-
tween a teacher and a student model in such a
way that the student can outperform the teacher.
Algorithm 2 summarizes the GAL self-training
process. Given the labeled dataset L and the
synthetic unlabeled dataset U, an initial model de-
noted f1 is trained using supervised learning on the
labeled dataset L. Then, at iteration t, one adopts
ft as the teacher model to annotate the unlabeled
dataset U using pseudo labels. In self-training
GAL, the student model ft+1 is trained to opti-
mize a classification loss on the combination of
L and U:

ℓ_{t+1} = λ E_{(x,y)∼L} H(y, f_{t+1}(x)) + (1 − λ) E_{x̃∼g(x)} H(f_t(x̃), f_{t+1}(x̃)),        (2)

where λ = 0.5 unless stated otherwise. Although
many different variants of the basic self-training
algorithm discussed above exist in the literature,
we adopt the simplest variant of self-training and
limit hyperparameter tuning to a bare minimum.
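The sketch below mirrors Eq. (2) and Algorithm 2 at a high level; `train_fn` and `predict_proba_fn` are hypothetical placeholders standing in for RoBERTa fine-tuning and inference, and the per-example weights implement the λ mixing of L and U.

```python
def gal_self_training(labeled, synthetic_unlabeled, train_fn, predict_proba_fn,
                      num_iters=3, lam=0.5):
    """Sketch of GAL self-training. `train_fn(xs, ys, weights)` fits a fresh
    classifier (hard labels for L, soft label vectors for U are both assumed to
    be accepted); `predict_proba_fn(model, xs)` returns class probabilities."""
    xs_l, ys_l = labeled
    model = train_fn(xs_l, ys_l, weights=[1.0] * len(xs_l))        # f_1 trained on L only
    for _ in range(num_iters):
        soft_labels = predict_proba_fn(model, synthetic_unlabeled)  # f_t annotates U
        xs = list(xs_l) + list(synthetic_unlabeled)
        ys = list(ys_l) + list(soft_labels)
        ws = [lam / len(xs_l)] * len(xs_l) + \
             [(1 - lam) / len(synthetic_unlabeled)] * len(synthetic_unlabeled)
        model = train_fn(xs, ys, weights=ws)                        # f_{t+1} minimizes Eq. (2)
    return model
```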
3.3 Domain-Specific Text Generation
We take a pretrained GPT-2 language model
(Radford et al., 2019) and fine-tune it separately
on each dataset of interest after removing class
labels. We find that training from scratch on these
datasets is hopeless, but the larger the pretrained
GPT-2 variant, the better the validation perplex-
ity scores are. For tasks modeling a relationship
between multiple sentences, we concatenate a
separator token [SEP] between consecutive sen-
tences. To alleviate over-fitting on the train-
ing set, we use the best checkpoint evaluated on
the dev set as our generation engine. Once a fine-
tuned GPT-2 model is obtained, we generate new
domain-specific data by using top-k random sam-
pling similar to Radford et al. (2019). We do not
feed any prompt to the LM, but a special [BOS]
token to initiate the generation chain. A genera-
tion episode is terminated when a special [EOS]
token is produced. We generate diverse sentences
by varying the random seed. After collecting
enough synthetic data, we only retain unique sen-
tences. For tasks with α input sentences, we dis-
card generated samples that violate this constraint
(approximately 10% of samples were rejected). Fi-
nally, we obtain task-specific synthetic data up to
40× larger than the original training sets. For some
samples of generated text for GLUE see Tables 11
and 12. We believe using bigger LMs and larger
synthetic datasets will improve our results, but we
are constrained by computer resources.
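A sketch of the sampling loop described above using a HuggingFace GPT-2 checkpoint; it assumes [BOS], [SEP], and [EOS] were added as special tokens during fine-tuning, and "./gpt2-task-lm" is a placeholder path rather than a released model.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("./gpt2-task-lm")        # fine-tuned, placeholder
lm = GPT2LMHeadModel.from_pretrained("./gpt2-task-lm").eval()
bos_id = tok.convert_tokens_to_ids("[BOS]")
eos_id = tok.convert_tokens_to_ids("[EOS]")

def sample_synthetic(num_samples, num_fields=2, top_k=40, max_len=128, seed=0):
    torch.manual_seed(seed)                                   # vary the seed for diversity
    pool = set()
    while len(pool) < num_samples:
        out = lm.generate(torch.tensor([[bos_id]]), do_sample=True, top_k=top_k,
                          max_length=max_len, eos_token_id=eos_id, pad_token_id=eos_id)
        text = tok.decode(out[0], skip_special_tokens=False)
        text = text.replace("[BOS]", "").split("[EOS]")[0].strip()
        # keep unique samples with the expected number of [SEP]-separated inputs
        if text and text.count("[SEP]") == num_fields - 1:
            pool.add(text)
    return list(pool)
```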
4 An Empirical Risk Minimization Perspective
In supervised learning, one seeks to learn a map-
ping f that, given an input x, predicts a reasonable
output y. To define the supervised learning prob-
lem formally, one assumes that input-output pairs
are drawn from a joint distribution P, namely,
(x, y) ∼ P(x, y), and a loss function H(y, f(x))
is used to assess the quality of a mapping f. The
loss is used to define a notion of expected risk:

R(f) = E_{P(x,y)} H(y, f(x)).        (3)

In almost all practical applications P(x, y) is
unknown. Hence, a labeled dataset of examples
L = {(x_i, y_i)}_{i=1}^N is used to approximate R(f) as

R̂(f) = (1/N) Σ_{i=1}^N H(y_i, f(x_i)).        (4)

This objective function is known as empirical
risk, and learning f through minimizing R̂(f) is
known as the empirical risk minimization princi-
ple (Vapnik, 1992). To compensate for the finite
sample size in (4), one typically combines R̂(f)
with a regularizer to improve generalization.
Beyond Empirical Risk Minimization. Empir-
ical risk minimization (4) is motivated as a way
to approximate P(x, y) through a set of Dirac
delta functions on labeled examples: P_δ(x, y) =
Σ_i δ(x = x_i, y = y_i)/N. However, this approx-
imation is far from perfect, hence one uses a
heldout validation set for early stopping and
hyperparameter tuning.
Vicinal risk minimization (Chapelle et al.,
2001) approximates expected risk as E_{P_ν(x,y)} H
(y, f(x)), using a vicinity distribution, for exam-
ple, ν(x̃, ỹ | x, y) = N(x̃ − x, σ²) δ(ỹ = y), to
approximate P(x, y) as

P_ν(x, y) = (1/N) Σ_{i=1}^N ν(x̃ = x, ỹ = y | x_i, y_i).        (5)
The goal is to increase the support of each labeled
data point and improve the quality and robustness
of the risk function.
Recent work on mixup regularization (Zhang
et al., 2018) proposes an effective way to con-
struct another vicinity distribution by interpolating
between two data points and their labels. Despite
their simplicity, these smoothing techniques tend
to improve matters.
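For intuition, one concrete instance of such a vicinity distribution is mixup itself; the sketch below interpolates a pair of continuous feature vectors and their one-hot labels (for text this is typically applied to embeddings rather than raw tokens), and is an illustration rather than part of GAL.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    """One draw from the mixup vicinity distribution (Zhang et al., 2018):
    interpolate a pair of examples and their one-hot labels."""
    lam = np.random.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```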
Generative Models for Risk Minimization.
One can factorize the joint distribution of input-
output pairs as P(x, y) = P(x) P(y | x). Ac-
cordingly, if one is able to learn a reasonable
unconditional generative model of x denoted
g(x), then one can draw a pair (x, y) by first
drawing x ∼ g(x) and then using the current
instance of ft to draw y ∼ ft(x). Then, one can
use ft and g to approximate expected risk as

R_t(f_{t+1}) = E_{x∼g(x)} E_{y∼f_t(x)} H(y, f_{t+1}(x)).        (6)

The quality of this approximation highly depends
on the quality of ft and g. If ft is far from an
optimal classifier f* or g(x) is far from P(x), (6)
yields a poor approximation.
The expected risk in (6) smoothens the risk
landscape in complex ways beyond simple Gaus-
sian smoothing and interpolation. This smoothing
is applicable to any continuous, discrete, or struc-
tured domain as long as expressive generative
models of P(x) are available. That said, for al-
most all reasonable loss functions H (e.g., softmax
cross entropy and squared error), (6) is minimized
when ft+1 = ft, which is not ideal, especially
when ft is far from f*. On the other hand, em-
pirical risk (4) anchors the problem in real la-
beled examples that are provided as ground truth.
GAL self-training aims to combine the benefits
of (4) and (6) via:

R_t(f_{t+1}) = (λ/N) Σ_{i=1}^N H(y_i, f_{t+1}(x_i)) + (1 − λ) E_{x∼g(x)} E_{y∼f_t(x)} H(y, f_{t+1}(x)).        (7)
In this formulation, if ft represents the minimizer
of empirical risk (4), then ft+1 = ft is the mini-
mizer of (7), too. However, one does not seek the
global minimizer of empirical risk, but rather the
best performance on heldout data. If ft is obtained
by stochastic gradient descent on any risk func-
tion, but early-stopped according to empirical risk
on a heldout set, then using such ft in (7) to define
R_t(f_{t+1}) promotes the selection of a mapping ft+1
that minimizes empirical risk while staying close
to the best performing mapping so far (i.e., ft).
This formulation motivates self-training and GAL
as regularizers in the functional space and explains
why they can conceivably work. Although the ar-
guments are provided here for GAL-self-training,
extending them to GAL-KD is straightforward
(omitted due to the space constraints).
How About Class-conditional Generative
Models? One can also factorize the joint dis-
tribution P(x, y) as P(y) P(x | y) and accord-
ingly utilize a class-conditional generative model
g(x | y) to derive the following expected risk
formulation:

R(f) = E_{y∼P(y)} E_{x∼g(x|y)} H(y, f_{t+1}(x)).        (8)
In this setting pseudo labeling is not needed as
synthetic data is already labeled. One can show
that the optimal classifier f*_g that minimizes (8)
for the cross-entropy loss is given by

f*_g(y | x) = g(x | y) P(y) / Σ_{y′} g(x | y′) P(y′),        (9)

that is, turning the class-conditional generative
model into a classifier by using the Bayes rule
yields the optimal solution.
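A small numerical sketch of (9): given per-class log-likelihoods from a class-conditional generative model and a class prior, the induced classifier is simply a softmax over their sums. The numbers are made up purely for illustration.

```python
import numpy as np

def bayes_classifier_from_conditional_lm(log_g_x_given_y, log_p_y):
    """Eq. (9): turn generative scores log g(x|y) into class probabilities
    f*_g(y|x) proportional to g(x|y) P(y), via a softmax over log scores."""
    scores = np.asarray(log_g_x_given_y) + np.asarray(log_p_y)
    scores -= scores.max()                      # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

# Toy example: two classes, a class-conditional LM assigning x a higher
# likelihood under class 1, and a uniform prior.
print(bayes_classifier_from_conditional_lm([-12.0, -10.5], np.log([0.5, 0.5])))
```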
Model                  | MNLI(m/mm) | CoLA | SST-2 | MRPC      | STS-B     | QQP       | QNLI | RTE  | Avg
Previous work:
BERT-Theseus           | 82.4/82.1  | 47.8 | 92.2  | 87.6/83.2 | 85.6/84.1 | 71.6/89.3 | 89.6 | 66.2 | 78.6
BERT-PKD               | 81.5/81.0  | -    | 92.0  | -         | 85.0/79.9 | 70.7/88.9 | 89.0 | 65.5 | -
tinyBERT               | 84.6/83.2  | 51.1 | 93.1  | 87.3/82.6 | 85.0/83.7 | 71.6/89.1 | 90.4 | 70.0 | 79.8
MATE-KD                | 86.2/85.6  | 58.6 | 95.1  | 91.2/88.1 | 88.5/88.4 | 73.0/89.7 | 92.4 | 76.6 | 83.5
Our results:
DistilRoBERTa          | 83.8/83.4  | 55.9 | 93.2  | 87.4/83.1 | 87.5/87.5 | 71.7/89.1 | 90.6 | 73.3 | 81.2
DistilRoBERTa + KD     | 84.5/84.1  | 53.0 | 93.5  | 88.9/85.1 | 88.0/87.4 | 71.9/89.2 | 91.0 | 75.0 | 81.5
DistilRoBERTa + WS     | 86.2/85.9  | 52.2 | 94.0  | 89.9/86.4 | 88.7/88.3 | 71.7/89.2 | 91.5 | 76.2 | 82.1
DistilRoBERTa + RT     | 86.2/85.6  | 55.0 | 94.9  | 90.1/86.5 | 89.2/88.9 | 72.5/89.7 | 92.1 | 77.2 | 82.9
DistilRoBERTa + GAL    | 86.9/86.4  | 58.6 | 95.3  | 91.6/88.7 | 89.9/89.5 | 73.0/89.9 | 92.7 | 79.7 | 84.3

Table 1: GLUE test results for a 6-layer transformer. GAL establishes a new state of the art on KD
for NLP. Baselines: BERT-Theseus (Xu et al., 2020), BERT-PKD (Sun et al., 2019a), tinyBERT
(Jiao et al., 2019), MATE-KD (Rashid et al., 2021), DistilRoBERTa (Sanh et al., 2019), and Distil-
RoBERTa + KD (standard KD), DistilRoBERTa + WS (word substitution), and DistilRoBERTa + RT
(round-trip translation). MNLI-m and MNLI-mm indicate matched and mismatched, respectively.
Provided that the accuracy of generative clas-
sifiers on text classification is behind their dis-
criminative counterparts (e.g., Ravuri and Vinyals,
2019), we think substituting (8) into (7) is not
a good idea. In essence, by substituting (8) into
the classification objective, one is regularizing
f to remain close to f*_g, which is not an ef-
fective strategy if f*_g is not competitive. This
argument corroborates the evidence from our ab-
lation studies and recent work showing that using
class-conditional generative models to augment
supervised learning does not provide big gains
(Ravuri and Vinyals, 2019).
That said, one can still use class-conditional
generative models to synthesize high-fidelity sam-
ples. As long as these samples are treated as un-
labeled examples and annotated using a classifier,
for example, ft, we believe this is a reasonable
approach falling under GAL. Note that our ar-
gument above only applies to the scenario that
class-conditional generative models are used to
synthesize labeled examples. In other words, GAL
emphasizes prediction of the labels in the course
of the algorithm, rather than having the labels
predefined. If one uses the unlabeled synthetic
examples from class-conditional generative mod-
els, it still aligns with (7), which will be verified in
Section 5.4.
5 Experiments
In this section, we assess the effectiveness of
GAL on KD, self-training, and few-shot learning.
5.1 State-of-the-art Results of Knowledge
Distillation with GAL on GLUE
We use the GLUE benchmark (Wang et al.,
2019c) for our KD experiments; see Appendix A.1
for benchmark details. Our synthetic unlabeled
dataset U includes 40× as many examples as the
original dataset for each task in GLUE.
It is known that KD on fresh data, unseen during
训练, performs better (Bucilu˘a et al., 2006;
Chen et al., 2020a) than KD on original training
data. Thus, we investigate the effectiveness of
KD using generated unlabeled data through GAL.
We use the HuggingFace implementation (Wolf
et al., 2020) for KD experiments and adopt a stan-
dard experimental setup consistent with previous
work (Sun et al., 2019a; Xu et al., 2020). Follow-
ing Rashid et al. (2021), a fine-tuned RoBERTa-
large (24-layer transformer) represents the teacher
and a DistilRoBERTa (6-layer transformer) (Sanh
et al., 2019) is used as the student. We train the
student model on U and L, where U is annotated
by the best RoBERTa-large model, achieving an
average score of 86.5. We then mix L and U at a
ratio of 1:4, which is equivalent to λ = 0.2. This
ratio works best on the dev set.
Table 1 shows the results of individual 6-layer
transformers on the GLUE test set. All of the base-
lines use an identical student architecture. GAL
achieves the best entry on the GLUE leaderboard,
marking a new state-of-the-art for KD on NLP. It
outperforms strong KD baselines such as Distil-
RoBERTa (Sanh et al., 2019), BERT-PKD (Sun
Model                            | MNLI     | CoLA     | SST-2    | MRPC     | STS-B    | QQP      | QNLI     | RTE      | Avg
RoBERTa base                     | 87.7±0.1 | 63.6±0.4 | 94.8±0.1 | 90.1±0.4 | 90.8±0.1 | 91.5±0.1 | 92.6±0.1 | 78.8±0.4 | 86.2
 + GAL (iter 1)                  | 87.9±0.1 | 65.1±0.5 | 95.3±0.1 | 91.7±0.5 | 91.4±0.1 | 91.8±0.1 | 93.1±0.1 | 81.4±0.4 | 87.2
 + GAL (iter 2)                  | 88.0±0.1 | 65.2±0.5 | 95.3±0.1 | 92.2±0.4 | 91.5±0.1 | 91.7±0.1 | 93.2±0.1 | 82.4±0.5 | 87.4
 + GAL (iter 3)                  | 87.9±0.1 | 65.5±0.5 | 95.3±0.1 | 92.2±0.5 | 91.7±0.2 | 91.7±0.1 | 93.2±0.1 | 82.0±0.5 | 87.4
RoBERTa base + self-distillation | 88.1±0.1 | 63.7±0.5 | 95.2±0.1 | 90.3±0.4 | 90.4±0.1 | 91.5±0.1 | 93.1±0.1 | 79.7±0.5 | 86.5

Table 2: RoBERTa base and GAL self-training results on GLUE dev sets, averaged across 5 indepen-
dent runs (the ± numbers indicate the error bar, i.e., standard deviation divided by √5).
et al., 2019a), BERT-Theseus (Xu et al., 2020),
tinyBERT (Jiao et al., 2019), and MATE-KD
(Rashid et al., 2021). It also outperforms our own
DistilRoBERTa+KD baseline, which learns from
soft labels produced by an identical RoBERTa-
large ensemble on the original labeled dataset.
While the use of soft labels outperforms the vanilla
fine-tuned DistilRoBERTa model, it significantly
underperforms our KD+GAL baseline. We also
compare with two strong data-augmentation base-
lines, round-trip translation (RT) (Yu et al., 2018;
Shleifer, 2019) and word substitutions (WS) (Jiao
et al., 2019; Wei and Zou, 2019). For RT, we gen-
erate 40× unlabeled data using German as the
bridge language (English→German→English). The
translations are generated via the best model in
WMT19 (Ng et al., 2019). We use the codebase
from Jiao et al. (2019) to conduct WS data aug-
mentation. We mirror the KD experimental setup
of GAL for both RT and WS. Although Distil-
RoBERTa+RT and DistilRoBERTa+WS are bet-
ter than vanilla DistilRoBERTa and KD variants,
they still drastically underperform our approach.
5.2 Self-Training with GAL on GLUE
We fine-tune a pretrained RoBERTa model pro-
vided by fairseq (Ott et al., 2019) on each GLUE
task. Fine-tuned RoBERTa serves as the first
teacher model for self-training. Each student
model is initialized with the original pretrained
RoBERTa and fine-tuned with exactly the same
hyperparameters as suggested by fairseq (Ott
et al., 2019). We combine the labeled dataset
L and the synthetic dataset U with a ratio of 1:1,
by oversampling labeled data. This corresponds
to λ = 0.5 in Eq. (7).
Table 2 shows that GAL provides an aver-
age improvement of +1.3% over RoBERTa-base.
We see consistent improvements with more GAL
iterations, but performance saturates after three
iterations. We further compare our approach with
a self-distillation (Furlanello et al., 2018) base-
line, in which the teacher and student models
use the same architecture and transfer knowledge
via the original labeled training set. Although
self-distillation provides a slight improvement,
the gains from GAL are more significant.
We delve deeper and combine GAL self-
training with RoBERTa-large and report test re-
sults for both single model and ensemble model
in Table 3. We observe consistent gains coming
from GAL on RoBERTa-large. Our results un-
derperform the latest and largest LMs from the
GLUE leaderboard, but we are optimistic that
GAL can be effectively combined with enormous
LMs to provide additional gains.
5.3 Prompt-based Few-shot Experiments
GPT3 (Brown et al., 2020) has introduced an
optimization-free paradigm for few-shot learning
for NLP. Without updating the parameters, large
LMs can correctly predict the labels of the in-
puts by conditioning on a prompt, which consists
of an instruction, a few labeled instances and a
new unlabeled input. We apply GAL to prompt-
based few-shot learning. Specifically, we present
k labeled examples as a prompt to GPT-J (Wang
and Komatsuzaki, 2021), an open-sourced re-
implementation of GPT-3-6B, and generate m
synthetic examples, followed by the correspond-
ing labels. Note that to mitigate noisy outputs, the
generation of each synthetic example only condi-
tions on the original k labeled examples. Finally,
we concatenate the original k examples and m
synthetic examples, and conduct a (k + 米)-shot
learning experiment with GPT-J.
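A hedged sketch of this procedure with the open-source GPT-J checkpoint; the prompt template, parsing, and parameter values are illustrative assumptions rather than a faithful reproduction of the paper's exact prompts.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
lm = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

def format_example(text, label=None):
    # Hypothetical SST-2-style template; the paper's exact prompts may differ.
    return f"Review: {text}\nSentiment: {label if label is not None else ''}"

def augment_prompt(k_shot, m=12, top_k=40, max_new_tokens=60):
    """Condition only on the original k labeled examples, sample m synthetic
    labeled examples, and return the concatenated (k+m)-shot prompt."""
    base = "\n\n".join(format_example(t, l) for t, l in k_shot)
    synthetic = []
    while len(synthetic) < m:
        ids = tok(base + "\n\nReview:", return_tensors="pt").input_ids
        out = lm.generate(ids, do_sample=True, top_k=top_k,
                          max_new_tokens=max_new_tokens,
                          pad_token_id=tok.eos_token_id)
        completion = tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
        example = ("Review:" + completion).split("\n\n")[0].strip()
        if "Sentiment:" in example:            # keep only well-formed generations
            synthetic.append(example)
    return base + "\n\n" + "\n\n".join(synthetic)
```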
Brown et al. (2020) studied a total of 51 few-
shot learning tasks. Studying all of these tasks is
prohibitively expensive. Thus, we filter tasks by
following these two steps. First, since generating
m synthetic examples for each test instance is
computationally expensive, we exclude tasks that
Model                 | MNLI(m/mm) | CoLA | SST-2 | MRPC      | STS-B     | QQP       | QNLI | RTE  | Avg
Individual Models (our implementation):
RoBERTa-large         | 90.1/89.7  | 63.8 | 96.1  | 91.2/88.3 | 90.9/90.7 | 72.5/89.6 | 94.5 | 85.9 | 86.5
RoBERTa-large + GAL   | 90.2/89.8  | 66.2 | 96.4  | 92.0/89.2 | 90.7/90.5 | 73.6/89.9 | 95.0 | 86.3 | 87.1
Ensemble Models (our implementation):
RoBERTa-large         | 91.2/90.5  | 66.8 | 96.9  | 92.8/90.3 | 91.9/91.6 | 74.5/90.4 | 95.5 | 87.7 | 87.9
RoBERTa-large + GAL   | 91.0/90.7  | 67.9 | 97.1  | 93.1/90.8 | 91.6/91.4 | 74.5/90.4 | 95.8 | 88.2 | 88.2
State-of-the-art:
RoBERTa-large         | 90.8/90.2  | 67.8 | 96.7  | 92.3/89.8 | 92.2/91.9 | 74.3/90.3 | 95.4 | 88.2 | 88.0
ELECTRA               | 91.3/90.8  | 71.7 | 97.1  | 93.1/90.7 | 92.9/92.5 | 75.6/90.8 | 95.8 | 89.8 | 89.2
T5                    | 92.2/91.9  | 71.6 | 97.5  | 92.8/90.4 | 93.1/92.8 | 75.1/90.6 | 96.9 | 92.8 | 89.8
ERNIE                 | 91.9/91.4  | 74.4 | 97.8  | 93.9/91.8 | 93.0/92.6 | 75.2/90.9 | 97.3 | 92.0 | 90.2
DeBERTa               | 91.9/91.6  | 71.5 | 97.5  | 94.0/92.0 | 92.9/92.6 | 76.2/90.8 | 99.2 | 93.2 | 90.3

Table 3: RoBERTa-large with GAL self-training and SoTA methods evaluated on GLUE test sets. The benefit
of GAL on single models is larger than for ensembles. It appears that self-training reduces the variance of models.
Baselines include much larger models: RoBERTa-large (Liu et al., 2019), ELECTRA (Clark et al., 2020),
T5 (Raffel et al., 2020), ERNIE (Sun et al., 2019b), and DeBERTa (He et al., 2020). MNLI-m and MNLI-mm
indicate matched and mismatched, respectively.
Model                            | SST-2    | PIQA     | COPA     | BoolQ    | Avg
4-shot                           | 89.8±0.8 | 76.0±1.4 | 79.0±1.5 | 64.3±0.8 | 77.3
8-shot                           | 91.3±0.8 | 76.2±1.2 | 79.0±1.5 | 66.2±0.8 | 78.2
16-shot                          | 92.7±0.6 | 77.0±0.9 | 81.0±1.1 | 66.8±0.8 | 79.4
4-shot + synthetic 12-shot (GAL) | 91.5±0.7 | 76.7±1.0 | 80.0±1.2 | 65.9±0.8 | 78.5

Table 4: Few-shot learning results for GPT-J (6B) (Wang and Komatsuzaki, 2021)
on four NLP datasets. Accuracy is reported for these datasets.
have more than 5k test examples. 第二, we fil-
ter tasks on which GPT-3-6B achieves a score
lower than 65% (please refer to Table H.1 in
Brown et al. [2020] for more details). After ap-
plying the filtering steps, we use four datasets:
SST-2 (Wang et al., 2019c), PIQA (Bisk et al.,
2020), COPA, and BoolQ (Wang et al., 2019b) as
the testbed. We notice that in order to generate
valid synthetic data, GPT-J needs to see at
least 4 labeled examples. Moreover, at most
16 examples of BoolQ can be fed into GPT-J
without truncation. Thus, we set k and m to 4
and 12, respectively. As seen in Table 4, GAL
leads to an average improvement of 1.2% over
4-shot learning, and reduces the gap between
4-shot and 16-shot learning. We noticed that the
quality of some generated examples is low. We
believe the performance of few-shot learning can
be further improved with high-quality instances.
One solution is to generate many synthetic ex-
amples, and select a high-quality subset. Since
each test instance conditions on distinct labeled
instances, one has to generate different synthetic
instances for each test example from GPT-J, which
causes expensive computation. Due to such com-
putational constraints, we leave the investigation
of data selection strategies to future work.
5.4 Ablating Components of GAL on GLUE
We conduct an in-depth study of different com-
ponents of GAL on GLUE datasets. Unless stated
otherwise, we use a RoBERTa-base model with a
combination of the original training data and 40×
synthetic data for each self-training experiment.
GPT-2 Model Size. Radford et al.
(2019)
present a few variants of the GPT-2 model includ-
ing GPT-2, GPT-2-medium, GPT-2-large, and
GPT-2-XL. Larger GPT-2 models yield better
perplexity scores and higher generation quality.
We utilize these models except GPT-2-XL within
the GAL framework to study the impact of the
generative model’s quality on downstream task’s
performance. Table 5 shows that regardless of the
GPT-2  | SST-2 | RTE  | MRPC | CoLA
NA     | 94.8  | 78.8 | 90.1 | 63.6
small  | 95.5  | 81.3 | 90.9 | 63.9
medium | 95.3  | 81.3 | 91.3 | 63.7
large  | 95.3  | 81.4 | 91.7 | 65.1

Table 5: GAL with various GPT-2 model sizes on
GLUE dev sets. NA indicates a RoBERTa base
model. We bold the best numbers.
Pseudo label | SST-2 | RTE  | MRPC | CoLA
hard         | 95.0  | 80.7 | 90.8 | 63.0
soft         | 95.3  | 81.4 | 91.7 | 65.1

Table 6: GAL with soft vs. hard pseudo labels on
GLUE dev sets. We bold the best numbers.
GPT-2 model sizes, GAL consistently surpasses
the vanilla RoBERTa base. Moreover, SST-2 and
RTE datasets are not sensitive to the capacity of
GPT-2, but higher quality synthetic text improves
the results on MRPC and CoLA datasets. We leave
investigation of GPT-2-XL and even larger LMs
such as GPT-3 (Brown et al., 2020) to future work.
Soft vs. Hard Pseudo Label. We investigate
the use of soft and hard pseudo labels within the
GAL framework. The results in Table 6 suggest
that GAL using soft pseudo labels is more effective
than hard labels on the GLUE benchmark. 这
finding is compatible with the intuition that soft
labels enable measuring the functional similarity
of neural networks better (Hinton et al., 2015).
Class-conditional Synthetic Data Generation.
Previous work (Kumar et al., 2020b; Ravuri
and Vinyals, 2019) suggests that it is chal-
lenging to utilize labeled synthetic data from
class-conditional generative models to boost the
accuracy of text and image classifiers. Our theory
in Section 4 points to the potential drawback of
class-conditional synthetic data. We empirically
study this phenomenon, by fine-tuning GPT-2 in a
class-conditional manner. Then we utilize its syn-
thetic examples in two different cases: 1) labeled
synthetic examples and 2) unlabeled synthetic ex-
amples. Table 7 shows that not only do class-
conditional LMs underperform unconditional LMs
in our GAL framework, but also they are much
worse than the baseline, when using the pre-
defined labels. Nonetheless, if we apply GAL
to these examples, the class-conditional LM is
on par with the unconditional one, which cor-
roborates the importance of the annotation step in
GAL. We provide more analysis in Appendix A.3.
6 Limitations
This work demonstrates that one can leverage
synthetic in-domain data generated by powerful
pre-trained generative models. For simplicity, 我们
do not employ any filtering avenue to retain di-
verse but high-quality data points. However, pre-
vious work has shown that advanced filtering
approaches can further improve the performance
(Sohn et al., 2020; Du et al., 2020; Yang et al.,
2020). Given that the improvements in the self-
training are not sizeable, we believe it is worth
imposing filtering methods on the synthetic data
to mitigate the side effects caused by the noisy
data points.
Although we examine the effectiveness of GAL
on various classification tasks, we still focus on
the sentence-level tasks. Because of the superior
performance on sentence-level tasks, there has
been a surge of interest shifting to document-level
任务, such as document-level machine transla-
tion (Miculicich et al., 2018; Voita et al., 2018;
Maruf and Haffari, 2018), document summariza-
tion (Rush et al., 2015; Nallapati et al., 2016),
and so on. As these tasks suffer from data scar-
city, one can leverage GAL to synthesize more
data points. However, previous work has shown
that GPT-2 has difficulty generating coherent text
requiring long-range dependency (Orbach and
Goldberg, 2020; Guan et al., 2020). Thus, such
a limitation may hinder the application of GAL
to document-level tasks.
Furthermore, the label space of the studied tasks
is not as complex as the structured prediction
tasks, such as machine translation, dialog systems,
question answering, and so on. However, we be-
lieve one can smoothly adapt GAL to these tasks
as well. Let us consider machine translation (MT)
as a canonical structured prediction task. Prior
work has shown that one can use (real) monolin-
gual data, in either source or the target language,
through data augmentation (Sennrich et al., 2016)
or knowledge distillation (Kim and Rush, 2016)
to improve the structured prediction tasks. 这
suggests a promising avenue for future research
on using synthetically generated monolingual data
to improve MT for specialized domains where
even monolingual data is scarce.
Generative model           | Labeled synthetic data | SST-2 | RTE  | MRPC | CoLA
None (baseline)            | -                      | 94.8  | 78.8 | 90.1 | 63.6
Class-conditional LM       | ✓                      | 92.9  | 74.4 | 86.0 | 58.4
Unconditional LM (GAL)     | ✗                      | 95.3  | 81.4 | 91.7 | 65.1
Class-conditional LM (GAL) | ✗                      | 95.4  | 81.0 | 91.4 | 65.2

Table 7: Synthetic data from class-conditional LMs underperforms GAL and RoBERTa on
GLUE dev sets.
Furthermore, Vu et al. (2021a) suggest that one
can leverage a retrieval-based approach to ob-
tain monolingual sentences from the generic data
stores. This retrieved monolingual data is then
employed to improve the translation quality in
a domain adaptation setting. This suggests that
a GAL-based approach to synthetically generate
monolingual text is a promising method to im-
prove MT for specialized domains—an interest-
ing direction for future research.
7 Conclusion
We present Generate, Annotate, and Learn (GAL):
a framework for self-training and knowledge dis-
tillation with generated unlabeled data. We mo-
tivate GAL from an expected risk minimization
perspective and demonstrate both theoretically
and empirically that the use of unconditional gen-
erative models for synthetic data generation is
more effective than class-conditional generative
models previously used in the literature. GAL
leverages advances in large pretrained language
models to help supervised learning and can have
implications for learning from limited labeled
数据. GAL significantly helps improve knowledge
distillation and prompt-based few-shot learning.
Moreover, concurrent work (Gowal et al.,
2021) has shown that using generated images can
enhance the robustness of image classifiers. We
will explore this direction on NLP tasks in the
future. Finally, we hope that GAL will stimulate
new research on the evaluation and development
of large language models.
Acknowledgments
We would like to thank the anonymous review-
ers and action editor Andr´e F.T. Martins for their
comments and suggestions on this work. Com-
putational resources of this work are partly sup-
ported by the Multi-modal Australian ScienceS
Imaging and Visualisation Environment (MAS-
SIVE) (www.massive.org.au). This mate-
rial is partly based on research sponsored by Air
Force Research Laboratory and DARPA under
agreement number FA8750-19-2-0501. The U.S.
Government is authorized to reproduce and dis-
tribute reprints for Governmental purposes not-
withstanding any copyright notation thereon.
References
Steven Abney. 2004. Understanding the
Yarowsky algorithm. Computational Linguis-
tics, 30(3):365–395. https://doi.org/10
.1162/0891201041850876
A. Agrawala. 1970. Learning with a probabilis-
tic teacher. IEEE Transactions on Information
Theory, 16(4):373–379. https://doi.org
/10.1109/TIT.1970.1054472
Chris Alberti, Daniel Andor, Emily Pitler, 雅各布
Devlin, and Michael Collins. 2019. Synthetic
QA corpora generation with roundtrip con-
sistency. In Proceedings of the 57th Annual
Meeting of the Association for Computational
Linguistics, pages 6168–6173. https://doi
.org/10.18653/v1/P19-1620
Yonatan Bisk, Rowan Zellers, Jianfeng Gao, and
Yejin Choi. 2020. PIQA: Reasoning about
physical commonsense in natural language. 在
Proceedings of the AAAI Conference on Artifi-
cial Intelligence, volume 34, pages 7432–7439.
https://doi.org/10.1609/aaai.v34i05
.6239
Tom B. Brown, Benjamin Mann, Nick Ryder,
Melanie Subbiah, Jared Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam,
Girish Sastry, Amanda Askell, Sandhini
Agarwal, Ariel Herbert-Voss, Gretchen Krueger,
Tom Henighan, Rewon Child, Aditya Ramesh,
Daniel M. Ziegler, Jeffrey Wu, Clemens Winter,
Christopher Hesse, Mark Chen, Eric Sigler,
Mateusz Litwin, Scott Gray, Benjamin
Chess, Jack Clark, Christopher Berner, Sam
McCandlish, Alec Radford, Ilya Sutskever, and
Dario Amodei. 2020. Language models are
few-shot learners. arXiv:2005.14165.
Steven Y. 冯, Varun Gangal, Jason Wei, Sarath
Chandar, Soroush Vosoughi, Teruko Mitamura,
and Eduard Hovy. 2021. A survey of data aug-
mentation approaches for NLP. In Findings of
the Association for Computational Linguistics:
ACL-IJCNLP 2021, pages 968–988.
Cristian Bucilu˘a, Rich Caruana, and Alexandru
Niculescu-Mizil. 2006. Model compression.
Proceedings of the 12th ACM SIGKDD Interna-
tional Conference on Knowledge Discovery and
Data Mining, pages 535–541. https://doi
.org/10.1145/1150402.1150464
Yair Carmon, Aditi Raghunathan, 路德维希
施密特, John C. Duchi, and Percy S. 梁.
2019. Unlabeled data improves adversarial
robustness. Advances in Neural Information
Processing Systems, 32.
Olivier Chapelle, Jason Weston, L´eon Bottou,
and Vladimir Vapnik. 2001. Vicinal risk min-
imization. Advances in Neural Information
Processing Systems.
Ting Chen, Simon Kornblith, Kevin Swersky,
Mohammad Norouzi, and Geoffrey Hinton.
2020a. Big self-supervised models are strong
semi-supervised learners. NeurIPS.
Yining Chen, Colin Wei, Ananya Kumar, 和
Tengyu Ma. 2020b. Self-training avoids us-
ing spurious features under domain shift. In
Advances in Neural Information Processing
Systems 33: Annual Conference on Neural In-
formation Processing Systems 2020, NeurIPS
2020, December 6-12, 2020, virtual.
Kevin Clark, Minh-Thang Luong, Quoc V. Le,
and Christopher D. 曼宁. 2020. Electra:
Pre-training text encoders as discriminators
rather than generators. International Confer-
ence on Learning Representations.
Jingfei Du, Edouard Grave, Beliz Gunel, Vishrav
Chaudhary, Onur Celebi, Michael Auli, Ves
Stoyanov, and Alexis Conneau. 2020. Self-
training improves pre-training for natural lan-
guage understanding. arXiv:2010.02194.
Jason Eisner and Damianos Karakos. 2005.
Bootstrapping without the boot. In Proceed-
ings of Human Language Technology Confer-
ence and Conference on Empirical Methods in
自然语言处理, pages 395–402.
https://doi.org/10.3115/1220575
.1220625
S. Fralick. 1967. Learning to recognize patterns
without a teacher. IEEE Transactions on In-
formation Theory. https://doi.org/10
.1109/TIT.1967.1053952
Tommaso Furlanello, Zachary Lipton, Michael
Tschannen, Laurent Itti, and Anima Anandkumar.
2018. Born again neural networks. Inter-
national Conference on Machine Learning,
pages 1607–1616.
Leo Gao, Jonathan Tow, Stella Biderman, Sid
黑色的, Anthony DiPofi, Charles Foster, Laurence
Golding, Jeffrey Hsu, Kyle McDonell, Niklas
Muennighoff, Jason Phang, Laria Reynolds,
Eric Tang, Anish Thite, Ben Wang, Kevin
王, and Andy Zou. 2021. A framework for
few-shot language model evaluation.
Sven Gowal, Sylvestre-Alvise Rebuffi, Olivia
Wiles, Florian Stimberg, Dan Andrei Calian,
and Timothy A. Mann. 2021. Improving robust-
ness using generated data. Advances in Neural
Information Processing Systems, 34.
Jian Guan, Fei Huang, Zhihao Zhao, Xiaoyan
Zhu, and Minlie Huang. 2020. A knowledge-
enhanced pretraining model for commonsense
story generation. Transactions of the Associa-
tion for Computational Linguistics, 8:93–108.
https://doi.org/10.1162/tacl 00302
Gholamreza Haffari and Anoop Sarkar. 2007.
Analysis of semi-supervised learning with the
yarowsky algorithm. In UAI 2007, Proceed-
ings of the Twenty-Third Conference on Un-
certainty in Artificial Intelligence, Vancouver,
BC, Canada, July 19-22, 2007, pages 159–166.
AUAI Press.
Pengcheng He, Xiaodong Liu, Jianfeng Gao,
and Weizhu Chen. 2020. Deberta: Decoding-
enhanced BERT with disentangled attention.
arXiv:2006.03654.
Danny Hernandez, Jared Kaplan, Tom Henighan,
and Sam McCandlish. 2021. Scaling laws for
transfer.
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean.
2015. Distilling the knowledge in a neural
网络. arXiv:1503.02531.
Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang,
Xiao Chen, Linlin Li, Fang Wang, and Qun
Liu. 2019. TinyBERT: Distilling BERT for
natural language understanding. arXiv:1909
.10351. https://doi.org/10.18653/v1
/2020.findings-emnlp.372
Yoon Kim and Alexander M. Rush. 2016.
Sequence-level knowledge distillation. In Pro-
ceedings of the 2016 Conference on Empirical
Methods in Natural Language Processing,
pages 1317–1327. https://doi.org/10
.18653/v1/D16-1139
Sosuke Kobayashi. 2018. Contextual augmenta-
的: Data augmentation by words with paradig-
matic relations. In Proceedings of the 2018
Conference of the North American Chapter
of the Association for Computational Linguis-
tics: Human Language Technologies, Volume 2
(Short Papers), pages 452–457, New Orleans,
Louisiana. Association for Computational
Linguistics. https://doi.org/10.18653
/v1/N18-2072
Ananya Kumar, Tengyu Ma, and Percy Liang.
2020a. Understanding self-training for gradual
domain adaptation. In Proceedings of the 37th
International Conference on Machine Learn-
ing, volume 119 of Proceedings of Machine
Learning Research, pages 5468–5479. PMLR.
Varun Kumar, Ashutosh Choudhary, and Eunah
Cho. 2020b. Data augmentation using pre-
trained transformer models. arXiv:2003.02245.
Jared Lichtarge, Chris Alberti, Shankar Kumar,
Noam Shazeer, Niki Parmar, and Simon Tong.
2019. Corpora generation for grammatical error
correction. arXiv:1904.05780. https://doi
.org/10.18653/v1/N19-1333
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei
Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin
Stoyanov. 2019. Roberta: A robustly optimized
bert pretraining approach. arXiv:1907.11692.
Sameen Maruf and Gholamreza Haffari. 2018.
Document context neural machine translation
with memory networks. In Proceedings of the
56th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long
Papers), pages 1275–1284. https://doi.org
/10.18653/v1/P18-1118
Alana Marzoev, Samuel Madden, M. Frans
Kaashoek, Michael J. Cafarella, and Jacob
Andreas. 2020. Unnatural language processing:
Bridging the gap between synthetic and natural
language data. ArXiv, abs/2004.13645.
Lesly Miculicich, Dhananjay Ram, Nikolaos
Pappas, and James Henderson. 2018. Document-
level neural machine translation with hierarchi-
cal attention networks. In Proceedings of the
2018 Conference on Empirical Methods in Nat-
ural Language Processing, pages 2947–2954,
Brussels, Belgium. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/D18-1325
Hossein Mobahi, Mehrdad Farajtabar, and Peter
L. Bartlett. 2020. Self-distillation amplifies
regularization in hilbert space. In Advances in
Neural Information Processing Systems 33:
Annual Conference on Neural Information Pro-
cessing Systems 2020, NeurIPS 2020, Decem-
ber 6-12, 2020, virtual.
Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre,
and Bing Xiang. 2016. Abstractive text summari-
zation using sequence-to-sequence RNNs and
beyond. In Proceedings of The 20th SIGNLL
Conference on Computational Natural Lan-
guage Learning, pages 280–290. https://
doi.org/10.18653/v1/K16-1028
Nathan Ng, Kyra Yee, Alexei Baevski, Myle
Ott, Michael Auli, and Sergey Edunov. 2019.
Facebook fair’s wmt19 news translation task
submission. In Proceedings of the Fourth Con-
ference on Machine Translation (Volume 2:
Shared Task Papers, Day 1), pages 314–319.
Sajad Norouzi, David J. Fleet, and Mohammad
Norouzi. 2020. Exemplar VAEs for exem-
plar based generation and data augmentation.
arXiv:2004.04795.
Eyal Orbach and Yoav Goldberg. 2020.
Facts2Story: Controlling text generation by
key facts. In Proceedings of the 28th Interna-
tional Conference on Computational Linguistics,
pages 2329–2345, Barcelona, Spain (Online).
International Committee on Computational
Linguistics. https://doi.org/10.18653
/v1/2020.coling-main.211
Myle Ott, Sergey Edunov, Alexei Baevski,
Angela Fan, Sam Gross, Nathan Ng, David
Grangier, and Michael Auli. 2019. fairseq: A
fast, extensible toolkit for sequence modeling.
In Proceedings of the 2019 Conference of the
North American Chapter of the Association for
计算语言学 (Demonstrations),
pages 48–53. https://doi.org/10.18653
/v1/N19-4009
Samet Oymak and Talha Cihad Gulcu. 2020.
Statistical and algorithmic insights for semi-
supervised learning with self-training. CoRR,
abs/2006.11006.
Alec Radford, Jeffrey Wu, Rewon Child, David
Luan, Dario Amodei, and Ilya Sutskever. 2019.
Language models are unsupervised multitask
learners. https://d4mucfpksywv.cloudfront.net
/better-language-models/language_models_are
_unsupervised_multitask_learners.pdf
Colin Raffel, Noam Shazeer, Adam Roberts,
Katherine Lee, Sharan Narang, 迈克尔
Matena, Yanqi Zhou, Wei Li, and Peter J. 刘.
2020. Exploring the limits of transfer learning
with a unified text-to-text transformer. 杂志
of Machine Learning Research, 21:1–67.
Aditi Raghunathan, Sang Michael Xie, Fanny
Yang, John Duchi, and Percy Liang. 2020.
Understanding and mitigating the tradeoff be-
tween robustness and accuracy. In Proceedings
of the 37th International Conference on Ma-
chine Learning, volume 119 of Proceedings of
Machine Learning Research, pages 7909–7919.
PMLR.
Ahmad Rashid, Vasileios Lioutas, and Mehdi
Rezagholizadeh. 2021. Mate-kd: Masked ad-
versarial text, a companion to knowledge dis-
tillation. arXiv preprint arXiv:2105.05912.
https://doi.org/10.18653/v1/2021
.acl-long.86
Suman Ravuri and Oriol Vinyals. 2019. Classi-
fication accuracy score for conditional genera-
tive models. Advances in Neural Information
Processing Systems, pages 12268–12279.
Alexander M. Rush, Sumit Chopra, and Jason
Weston. 2015. A neural attention model for
abstractive sentence summarization. In Pro-
ceedings of the 2015 Conference on Empirical
Methods in Natural Language Processing,
pages 379–389.
Victor Sanh, Lysandre Debut, Julien Chaumond,
and Thomas Wolf. 2019. DistilBERT, a dis-
tilled version of BERT: Smaller, faster, cheaper
and lighter. ArXiv, abs/1910.01108.
H. Scudder. 1965. Probability of error of some
adaptive pattern-recognition machines. IEEE
Transactions on Information Theory. https://
doi.org/10.1109/TIT.1965.1053799
Rico Sennrich, Barry Haddow, and Alexandra
Birch. 2016. Improving neural machine trans-
lation models with monolingual data. In Pro-
ceedings of the 54th Annual Meeting of the
Association for Computational Linguistics
(Volume 1: Long Papers), pages 86–96, Berlin,
Germany. Association for Computational
Linguistics. https://doi.org/10.18653/v1
/P16-1009
Dinghan Shen, Mingzhi Zheng, Yelong Shen,
Yanru Qu, and Weizhu Chen. 2020. A sim-
ple but tough-to-beat data augmentation ap-
proach for natural language understanding and
generation. arXiv preprint arXiv:2009.13818.
Sam Shleifer. 2019. Low resource text classifi-
cation with ulmfit and backtranslation. arXiv
preprint arXiv:1903.09244.
Kihyuk Sohn, David Berthelot, Chun-Liang Li,
Zizhao Zhang, Nicholas Carlini, Ekin D. Cubuk,
Alex Kurakin, Han Zhang, and Colin Raffel.
2020. Fixmatch: Simplifying semi-supervised
learning with consistency and confidence.
arXiv:2001.07685.
Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing
Liu. 2019a. Patient knowledge distillation for
BERT model compression. In Proceedings of
the 2019 Conference on Empirical Meth-
ods in Natural Language Processing and the
9th International Joint Conference on Natu-
ral Language Processing (EMNLP-IJCNLP),
pages 4314–4323. https://doi.org/10
.18653/v1/D19-1441
Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng,
Xuyi Chen, Han Zhang, Xin Tian, Danxiang
Zhu, Hao Tian, and Hua Wu. 2019b. Ernie:
Enhanced representation through knowledge
integration. arXiv preprint arXiv:1904.09223.
Nicola Ueffing, Gholamreza Haffari, and Anoop
Sarkar. 2007. Transductive learning for sta-
tistical machine translation. In Proceedings of
the 45th Annual Meeting of the Association
of Computational Linguistics, pages 25–32,
Prague, Czech Republic. Association for Com-
putational Linguistics.
Vladimir Vapnik. 1992. Principles of risk mini-
mization for learning theory. Advances in Neu-
ral Information Processing Systems.
Elena Voita, Pavel Serdyukov, Rico Sennrich,
and Ivan Titov. 2018. Context-aware neural
machine translation learns anaphora resolution.
In Proceedings of the 56th Annual Meeting
of the Association for Computational Linguis-
tics (Volume 1: Long Papers), pages 1264–1274.
https://doi.org/10.18653/v1/P18
-1117
Thuy Vu, Xuanli He, Dinh Phung, and
Gholamreza Haffari. 2021a. Generalised unsu-
pervised domain adaptation of neural machine
translation with cross-lingual data selection. In
Proceedings of the 2021 Conference on Empir-
ical Methods in Natural Language Processing,
pages 3335–3346.
Tu Vu, Minh-Thang Luong, Quoc Le, Grady
Simon, and Mohit Iyyer. 2021b. STraTA: Self-
training with task augmentation for better few-
shot learning. In Proceedings of the 2021
Conference on Empirical Methods in Natu-
ral Language Processing, pages 5715–5731.
https://doi.org/10.18653/v1/2021
.emnlp-main.462
Alex Wang, Jan Hula, Patrick Xia, Raghavendra
Pappagari, R. Thomas McCoy, Roma Patel,
Najoung Kim, Ian Tenney, Yinghui Huang,
Katherin Yu, Shuning Jin, Berlin Chen,
Benjamin Van Durme, Edouard Grave, Ellie
Pavlick, and Samuel R. Bowman. 2019a. Can
you tell me how to get past sesame street?
Sentence-level pretraining beyond language
modeling. In Proceedings of the 57th Annual
Meeting of the Association for Computational
Linguistics, pages 4465–4476. https://doi
.org/10.18653/v1/P19-1439
Alex Wang, Yada Pruksachatkun, Nikita Nangia,
Amanpreet Singh, Julian Michael, Felix Hill,
Omer Levy, and Samuel R. Bowman. 2019b.
SuperGLUE: A stickier benchmark for general-
purpose language understanding systems. arXiv:
1905.00537.
Alex Wang, Amanpreet Singh, Julian Michael,
Felix Hill, Omer Levy, and Samuel R.
Bowman. 2019c. GLUE: A multi-task bench-
mark and analysis platform for natural lan-
guage understanding. International Conference
on Learning Representations. https://doi
.org/10.18653/v1/W18-5446
Bailin Wang, Wenpeng Yin, Xi Victoria Lin, 和
Caiming Xiong. 2021. Learning to synthesize
data for semantic parsing. In Proceedings of
the Meeting of the North-American Chapter
of Association for Computational Linguistics
(NAACL). https://doi.org/10.18653
/v1/2021.naacl-main.220
Ben Wang and Aran Komatsuzaki. 2021.
GPT-J-6B: A 6 billion parameter autoregres-
sive language model. https://github.com
/kingoflolz/mesh-transformer-jax.
William Yang Wang and Diyi Yang. 2015.
That’s so annoying!!!: A lexical and frame-
semantic embedding based data augmenta-
tion approach to automatic categorization of
annoying behaviors using #petpeeve tweets.
In Proceedings of the 2015 Conference on Empir-
ical Methods in Natural Language Processing,
pages 2557–2563. https://doi.org/10
.18653/v1/D15-1306
Xiaojie Wang, Rui Zhang, Yu Sun, and Jianzhong
Qi. 2018. Kdgan: Knowledge distillation with
generative adversarial networks. NeurIPS.
Yushi Wang, Jonathan Berant, and Percy Liang.
2015. Building a semantic parser overnight.
In Proceedings of the 53rd Annual Meet-
ing of the Association for Computational
Linguistics and the 7th International Joint
Conference on Natural Language Processing
(Volume 1: Long Papers), pages 1332–1342,
Beijing, China. Association for Computational
Linguistics. https://doi.org/10.3115
/v1/P15-1129
Colin Wei, Kendrick Shen, Yining Chen, 和
Tengyu Ma. 2021. Theoretical analysis of
self-training with deep networks on unlabeled
数据. In International Conference on Learning
Representations.
Jason Wei and Kai Zou. 2019. EDA: Easy data
augmentation techniques for boosting perfor-
mance on text classification tasks. In Pro-
ceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and
the 9th International Joint Conference on Nat-
ural Language Processing (EMNLP-IJCNLP),
pages 6382–6388. https://doi.org/10
.18653/v1/D19-1670
Thomas Wolf, Lysandre Debut, Victor Sanh,
Julien Chaumond, Clement Delangue, Anthony
Moi, Pierric Cistac, Tim Rault, Remi Louf,
Morgan Funtowicz, Joe Davison, Sam Shleifer,
Patrick von Platen, Clara Ma, Yacine Jernite,
Julien Plu, Canwen Xu, Teven Le Scao,
Sylvain Gugger, Mariama Drame, Quentin
Lhoest, and Alexander Rush. 2020. Transform-
呃: State-of-the-art natural language process-
ing. In Proceedings of the 2020 Conference on
Empirical Methods in Natural Language Pro-
cessing: System Demonstrations, pages 38–45.
https://doi.org/10.18653/v1/2020
.emnlp-demos.6
Xing Wu, Shangwen Lv, Liangjun Zang,
Jizhong Han, and Songlin Hu. 2019. Condi-
tional BERT contextual augmentation. Interna-
tional Conference on Computational Science,
pages 84–95, Springer. https://doi.org
/10.1007/978-3-030-22747-0_7
Qizhe Xie, Minh-Thang Luong, Eduard Hovy,
and Quoc V. Le. 2020. Self-training with
noisy student improves ImageNet classifica-
tion. 2020 IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR),
Vision and Pattern Recognition (CVPR),
pages 10684–10695. https://doi.org
/10.1109/CVPR42600.2020.01070
Canwen Xu, Wangchunshu Zhou, Tao Ge, Furu
Wei, and Ming Zhou. 2020. Bert-of-theseus:
Compressing BERT by progressive module re-
placing. In Proceedings of the 2020 Conference
on Empirical Methods in Natural Language
加工 (EMNLP), pages 7859–7869.
Yiben Yang, Chaitanya Malaviya, Jared
Fernandez, Swabha Swayamdipta, Ronan Le
Bras, Ji-Ping Wang, Chandra Bhagavatula,
Yejin Choi, and Doug Downey. 2020. G-daug:
Generative data augmentation for common-
sense reasoning. arXiv:2004.11546. https://
doi.org/10.18653/v1/2020.findings
-emnlp.90
David Yarowsky. 1995. Unsupervised word sense
disambiguation rivaling supervised methods.
33rd Annual Meeting of the Association for
Computational Linguistics, pages 189–196.
https://doi.org/10.3115/981658.981684
Adams Wei Yu, David Dohan, Minh-Thang
Luong, Rui Zhao, Kai Chen, Mohammad
Norouzi, and Quoc V. Le. 2018. QANet: Com-
bining local convolution with global self-
attention for reading comprehension. ICLR.
Hongyi Zhang, Moustapha Cisse, Yann N.
Dauphin, and David Lopez-Paz. 2018. mixup:
Beyond empirical risk minimization. ICLR.
Linfeng Zhang, Jiebo Song, Anni Gao, Jingwei
陈, Chenglong Bao, and Kaisheng Ma. 2019.
Be your own teacher: Improve the perfor-
mance of convolutional neural networks via
self distillation. Proceedings of the IEEE/CVF
International Conference on Computer Vision,
pages 3713–3722. https://doi.org/10
.1109/ICCV.2019.00381
A Appendices
A.1 Datasets
The statistics of GLUE are reported in Table 8.
A.2 GPT-2 for Classification
We have conducted additional experiments, where
we fine-tune GPT-2 as a classifier. We have
considered two variants of the GPT-2 model.
The first variant is the original GPT-2 model
(GPT2-original) pre-trained on open-domain text.
The second variant is the GPT-2 model that
was fine-tuned on the inputs of each task sepa-
rately (GPT-2-finetuned). This model was used to
generate task-specific (synthetic) unlabeled data.
Finally, we also consider self-training with GAL
on top of GPT2-original. Specifically, we use
the GPT-2-finetuned model to synthesize 40x
in-domain unlabeled data. Then we apply self-
training to GPT-2-original, where the data is
a combination of the original labeled data and
pseudo-labeled synthetic data. Table 9 suggests
that the gains of GAL come from the pseudo-
labeled synthetic data, i.e., both the synthetic
unlabeled data and the teacher's knowledge.
Without the generation of synthetic unlabeled
data, the domain-specific knowledge embedded
in the GPT-2-finetuned model cannot be utilized.
As such, the GPT-2-finetuned model is inferior
to the GPT2-original model. Since RoBERTa-large
is superior to the GPT-2 models, RoBERTa-large+GAL
also significantly outperforms its GPT-2 counterpart.
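The pipeline above can be summarized in a short sketch. The snippet below is a minimal illustration of the pseudo-labeling and data-mixing step, not the code used in our experiments; the checkpoint name and the toy sentences are assumptions for illustration, and in practice the teacher is a classifier already fine-tuned on the target task.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed teacher: a sequence classifier fine-tuned on the target task.
teacher_name = "roberta-large"
tokenizer = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForSequenceClassification.from_pretrained(
    teacher_name, num_labels=2).eval()

# Original labeled data (hard labels) and synthetic unlabeled text
# produced by a task-finetuned GPT-2 (toy examples for illustration).
labeled = [("a gripping, beautifully shot film", 1),
           ("a tedious, unfocused mess", 0)]
synthetic = ["an uneven but ultimately rewarding drama",
             "the jokes never land and the pacing drags"]

# Annotate the synthetic text with soft pseudo labels from the teacher.
with torch.no_grad():
    batch = tokenizer(synthetic, padding=True, truncation=True,
                      return_tensors="pt")
    soft_labels = torch.softmax(teacher(**batch).logits, dim=-1)

# Mix hard-labeled real data with soft-pseudo-labeled synthetic data;
# a student model is then trained on this combined set.
mixed = [(x, torch.nn.functional.one_hot(torch.tensor(y), 2).float())
         for x, y in labeled]
mixed += list(zip(synthetic, soft_labels))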
Dataset   task                            domain                #train   #dev   #test   #classes
SST-2     sentiment analysis              movie reviews           67k     872    1.8k      2
QQP       paraphrase                      social QA questions    364k     40k    391k      2
QNLI      QA/natural language inference   Wikipedia              105k      5k    5.4k      2
RTE       natural language inference      news, Wikipedia        2.5k     277      3k      2
MNLI      natural language inference      misc.                  393k     20k     20k      3
MRPC      paraphrase                      news                   3.7k     408    1.7k      2
CoLA      acceptability                   misc.                  8.5k    1043      1k      2
STS-B     sentence similarity             misc.                  5.8k    1.5k    1.4k      -

Table 8: Summary of the three sets of tasks used for evaluation of GAL. STS-B is a regression
task, so #classes is not applicable.
Model                 MNLI        CoLA   SST-2   MRPC        STS-B       QQP         QNLI   RTE    Avg
GPT-2-original        85.9/85.6   54.8   94.5    86.9/82.2   86.3/85.2   72.5/89.3   91.2   69.8   80.9
GPT-2-finetuned       85.8/85.5   40.9   94.5    87.0/81.0   85.6/84.3   71.4/88.5   91.5   69.0   78.8
GPT-2-original+GAL    86.2/85.8   55.7   94.7    87.9/83.4   86.9/85.9   72.6/89.4   91.9   70.6   81.5
RoBERTa-large         90.1/89.7   63.8   96.1    91.2/88.3   90.9/90.7   72.5/89.6   94.5   85.9   86.5
RoBERTa-large + GAL   90.2/89.8   66.2   96.4    92.0/89.2   90.7/90.5   73.6/89.9   95.0   86.3   87.1

Table 9: GLUE test results of using GPT-2 and RoBERTa-large as classification models.
Label type           Accuracy   F1     Precision   Recall
GPT2                 86.0       87.0   88.7        85.5
RoBERTa              90.0       91.4   100.0       84.1
conditioning label   72.0       71.4   66.0        77.8

Table 10: Performance of GPT2 annotation, RoBERTa annotation, and conditioning labels on
100 random examples from the synthetic RTE dataset generated by a class-conditional LM.
A.3 Importance of Pseudo-labels
We have argued and demonstrated that using
class-conditional generative models to generate
labeled synthetic examples is less effective than
GAL in Section 3 and Section 5. To further
verify this argument, we sample 100 instances
from the synthetic RTE dataset generated by the
label-prompted GPT2, as the class-conditional
LM. Then we annotate these examples using a
human annotator, GPT2 classifier, and RoBERTa
classifier. Finally, we compute the Accuracy, F1,
Precision, and Recall scores between human la-
bels and GPT2 labels, between human labels and
RoBERTa labels, and between human labels
and conditioned labels used by GPT2 when the
data was generated. 桌子 10 shows that class-
conditional LM has difficulty generating sen-
tences retaining the semantics or pragmatics of
a specified category, which also corroborates our
theoretical analysis in Section 3. On the other
手, discriminative models, such as GPT2 clas-
sifier and RoBERTa classifier, are able to produce
higher quality labels that correlate better with
human annotations.
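For reference, the agreement scores in Table 10 can be computed with standard metrics. The snippet below is a small illustration using scikit-learn, with dummy label vectors standing in for the actual 100 annotations.

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Dummy stand-ins for the human labels and the three candidate labelings
# of the sampled synthetic RTE examples (1 = entailment).
human = [1, 0, 1, 1, 0, 0, 1, 0]
candidates = {
    "GPT2":               [1, 0, 1, 0, 0, 0, 1, 0],
    "RoBERTa":            [1, 0, 1, 1, 0, 0, 1, 1],
    "conditioning label": [0, 0, 1, 1, 1, 0, 0, 0],
}

# Score each labeling source against the human annotations.
for name, pred in candidates.items():
    acc = accuracy_score(human, pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        human, pred, average="binary")
    print(f"{name:20s} acc={acc:.2f} f1={f1:.2f} "
          f"precision={prec:.2f} recall={rec:.2f}")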
A.4 Generated Unlabeled Examples
Annotated with Pseudo Labels
We provide some synthetic sentences generated
by GAL in Tables 11 和 12.
are more deeply thought through than in most ‘ right-thinking ’ films (positive)
KNN:
1: is far more sophisticated, insightful and thought-provoking than his previous films .
(positive)
2: is more sophisticated than its more obvious and less-than-dazzling counterparts (positive)
3: is about as well-thought as the idea of a bad hair day, (negative)
contains no wit, only labored gags (negative)
KNN:
1: lacks insight, and lacks empathy (negative)
2: has little humor or intelligence (negative)
3: lacks all wit and humanity (negative)
Table 11: SST-2: Two labeled examples, along with 3 nearest neighbors (based on RoBERTa
representations) from our synthetic dataset. We include labels for original examples and
pseudo-labels for synthetic examples in parentheses.
How is the life of a math student? Could you describe your own experiences? [SEP] Which
level of prepration is enough for the exam jlpt5? (not duplicated)
KNN:
1: What are the best courses for a mechanical engineering student? [SEP] What is the best
course to do after completing a B.Tech in mechanical engineering? (not duplicated)
2: How much marks are needed to get through the GATE with electronics? [SEP] What is the
average score of the Gate EE exam? What are the cut-offs? (not duplicated)
3: What is the best time table for students to prepare for IAS? [SEP] How can one study for
IAS in a best time? (not duplicated)
How does an IQ test work and what is determined from an IQ test? [SEP] How does IQ test
works? (duplicated)
KNN:
1: What is the average IQ of the U.S. population? [SEP] How does an IQ test work? (not
duplicated)
2: Is the Iq test an effective way to measure intelligence? [SEP] How do IQ tests work?
(duplicated)
3: How is an IQ test on a scale from 1 to 100 scored? [SEP] How do you get your IQ tested?
(not duplicated)
Table 12: QQP: Two labeled examples, along with 3 nearest neighbors (based on RoBERTa
representations) from our synthetic dataset. We include labels for original examples and
pseudo-labels for synthetic examples in parentheses.