True Few-Shot Learning with Prompts—A Real-World Perspective

Timo Schick and Hinrich Schütze
Center for Information and Language Processing (CIS), LMU Munich, Germany

schickt@cis.lmu.de, inquiries@cislmu.org

Abstract

Prompt-based approaches excel at few-shot
learning. However, Perez et al. (2021) re-
cently cast doubt on their performance as
they had difficulty getting good results in a
‘‘true’’ few-shot setting in which prompts and
hyperparameters cannot be tuned on a dev
set. In view of this, we conduct an extensive
study of PET, a method that combines textual
instructions with example-based finetuning.
We show that, if correctly configured, PET
performs strongly in true few-shot settings
without a dev set. Crucial for this strong perfor-
mance is a number of design choices, including
PET’s ability to intelligently handle multiple
prompts. We put our findings to a real-world
test by running PET on RAFT, a benchmark of
tasks taken from realistic NLP applications for
which no labeled dev or test sets are available.
PET achieves a new state of the art on RAFT
and performs close to non-expert humans for
7 out of 11 tasks. These results demonstrate
that prompt-based learners can successfully be
applied in true few-shot settings and underpin
our belief that learning from instructions will
play an important role on the path towards
human-like few-shot learning capabilities.

1 Introduction

With pretrained language models (LMs) getting
ever larger (Radford et al., 2019; Raffel et al.,
2020; Brown et al., 2020; Fedus et al., 2021),
instruction-based learning is a powerful method
for few-shot text classification (e.g., Schick and
Schütze, 2020; Jiang et al., 2020; Schick and
Schütze, 2021; Brown et al., 2020; Wei et al.,
2022; Sanh et al., 2022). The key idea is to give
an LM access to descriptive names for all possi-
ble outputs and to short prompts explaining the
task to be solved. In settings where at most a few
dozen examples are available, this simple idea
leads to substantial improvements over other
approaches (Schick and Schütze, 2020, 2021; Gao
et al., 2021; Tam et al., 2021).

However, recent work has questioned the strong
few-shot performance of instruction-based approaches,
arguing that they are evaluated in
scenarios that are not true few-shot settings (Perez
et al., 2021; Logan IV et al., 2021), mainly for two
reasons. First, some approaches (e.g., Xie et al.,
2019; Zhang et al., 2020; Chen et al., 2020; Tam
et al., 2021) make use of large development sets
to optimize hyperparameters. Second, it is argued
that manually designed instructions require man-
ual tuning on development sets to achieve strong
performance (Perez et al., 2021; Logan IV et al.,
2021). Indeed, performance can vary greatly—and
in mostly unpredictable ways—across different
instructions (Jiang et al., 2020; Schick and Schütze,
2020); this issue even persists after finetuning on
hundreds of instructions (Sanh et al., 2022). More
generally, the need for human involvement is seen
as a serious drawback of manually designed in-
structions (Shin et al., 2020; Lester et al., 2021).
Thus, several recent studies have abandoned them
in favor of automatically generated prompts (Shin
et al., 2020; Gao et al., 2021; Hambardzumyan
et al., 2021; Li and Liang, 2021; Lester et al.,
2021).

Contrary to this trend, we argue that when
correctly configured, prompt-based approaches
achieve strong performance even in true few-shot
settings and that there is no problem with using
manually designed instructions. Quite the oppo-
site: Such instructions are often easy to specify
if one is familiar with the task to be solved, they
provide an intuitive interface to convey task-
specific knowledge, and, if properly used, they can
considerably improve model performance in few-
shot settings.

To provide empirical support for these claims,
we revisit PET (Schick and Schütze, 2020), a
method for combining instructions with example-
based finetuning, and thoroughly examine its per-
formance with human-made instructions in true
few-shot settings. We simulate a real-world sce-
nario by proceeding in two steps: First, we conduct

Transactions of the Association for Computational Linguistics, vol. 10, pp. 716–731, 2022. https://doi.org/10.1162/tacl_a_00485
Action Editor: Alexander Rush. Submission batch: 12/2021; Revision batch: 3/2022; Published 6/2022.
© 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

2 Related Work

As a precursor to instruction-based learning, some
studies have investigated ways of informing clas-
sifiers about the meaning of different output classes
both for text (Chang et al., 2008; Veeranna et al.,
2016; Zhou et al., 2018) and image classification
(Norouzi et al., 2014; Romera-Paredes and Torr,
2015); providing instructions in the form of short
prompts was first proposed by Radford et al.
(2019). This idea has since been applied to solve a
wide range of NLP tasks without any task-specific
training data (Puri and Catanzaro, 2019; Opitz,
2019; Davison et al., 2019; Schick et al., 2021;
Schick and Schütze, 2021; Wei et al., 2022; Sanh
et al., 2022). While most approaches rephrase
tasks as a language modeling problem, some use
prompts to reformulate them as different tasks for
which large amounts of training data are avail-
able (Levy et al., 2017; McCann et al., 2018;
Yin et al., 2019; Sun et al., 2021; Sainz et al.,
2021). Instruction-based learning has also been
used in few-shot settings; popular variants include
in-context learning, where the model’s parameters
are fixed and examples are provided as additional
context (Brown et al., 2020; Lu et al., 2021;
Kumar and Talukdar, 2021; Zhao et al., 2021; Min
et al., 2021), finetuning the entire model (Schick
and Schütze, 2020, 2021; Gao et al., 2021; Tam
et al., 2021), and prompt tuning, where only the
instruction itself is optimized (Shin et al., 2020;
Hambardzumyan et al., 2021; Li and Liang, 2021;
Lester et al., 2021).

Several works investigating the limitations of
instruction-based few-shot approaches find that
current LMs are mostly unable to understand com-
plex instructions that go beyond short prompts or
simple questions (Efrat and Levy, 2020; Weller
et al., 2020; Webson and Pavlick, 2021) and that
they are highly sensitive to the exact wording
of the instructions provided (Jiang et al., 2020;
Schick and Schütze, 2020; Chu et al., 2021; Elazar
et al., 2021). In a similar vein, Perez et al. (2021)
and Logan IV et al. (2021) argue that prior work
overestimates few-shot performance as manual
prompt tuning is required to achieve good per-
formance. Accordingly, some studies attempt to
obtain either prompts (Shin et al., 2020; Gao
et al., 2021; Li and Liang, 2021; Lester et al.,
2021) or meaningful names for output classes
(Schick et al., 2020; Gao et al., 2021) without
human involvement.

Figure 1: PET achieves near-human performance for 7
out of 11 tasks of the RAFT benchmark (Alex et al.,
2021), for which labeled dev and test sets are not avail-
able. This demonstrates the potential of prompt-based
learners for few-shot learning in ‘‘true’’ real-world
settings, i.e., without any tuning of instructions or
hyperparameters on a dev set.

an extensive study of PET using three academic
datasets to analyze its ability to perform true
few-shot learning in a controlled environment and
derive best practices for the choice of instructions
and hyperparameters. We then put our findings
to the test and evaluate PET on a large variety of
real-world tasks from the RAFT benchmark (Alex
et al., 2021), for which no labeled dev or test sets
are available, enforcing a true few-shot setting
(Perez et al., 2021). On average, PET clearly out-
performs all baselines on this dataset and comes
surprisingly close to non-expert human perfor-
mance (see Figure 1). This demonstrates that
instruction-based learning can successfully be ap-
plied to real-world tasks in true few-shot settings.

Our main contributions are as follows:

• We investigate the performance of PET for
various models, tasks, and training set sizes,
its ability to cope with different instructions,
and its robustness to hyperparameter choices
in true few-shot settings.

• We show how PET can be used when no
unlabeled data is available and propose a
method for efficient classification in scenar-
ios with many different classes, addressing
two frequent real-world scenarios.

• We apply PET to RAFT (Alex et al., 2021), a
benchmark of real-world tasks. PET obtains a
new state of the art and achieves near-human
performance for 7 out of 11 tasks in a true
few-shot setting.

Figure 2: Different choices of patterns and corresponding verbalizers for classifying movie reviews as positive
(+) or negative (−). The input is first converted into a cloze question using the pattern; classification is done by
computing the output whose verbalization is, according to the MLM, the most likely substitute for the mask.

Finally, many benchmarks have been proposed
for comparing few-shot approaches in a standard-
ized way (e.g., Mishra et al., 2021; Bragg et al.,
2021; Xu et al., 2021; Ye et al., 2021; Alex et al.,
2021). As our focus is on the real-world appli-
cability of few-shot methods, we evaluate PET on
the RAFT benchmark (Alex et al., 2021), which
measures performance in applied settings.

3 Pattern-Exploiting Training

We briefly review pattern-exploiting training
(PET) (Schick and Schütze, 2020, 2021), the method
we use for instruction-based text classification.
At its core, PET combines textual instructions
with regular finetuning using labeled
examples. To that end, users must specify one
or more patterns that convert an input example
x into a cloze question so that it can readily be
processed by a masked language model (MLM)
(Devlin et al., 2019).1 These patterns can take on
very different forms; some examples are shown in
Figure 2. In addition, users must inform the model
about the meaning of all output classes; this is
done with a verbalizer that assigns a natural lan-
guage expression to each output y (see Figure 2,
right). We refer to the combination of a pattern
and verbalizer as a pattern-verbalizer pair (PVP).
Given a single PVP, let p(y | x) be the prob-
ability that an MLM assigns to y’s verbalization
in the cloze question obtained by applying the
pattern to x, normalized over all y. The MLM is
finetuned on labeled examples (x, y) by minimiz-
ing the cross-entropy loss between p(y | x) and a
distribution that assigns a probability of 1.0 to y.
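
The definition of p(y | x) above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the scores in `mask_logits` are made-up numbers standing in for what a real MLM would output at the mask position, and the function names are ours rather than part of the PET library. Applying a softmax to the raw scores of the verbalizer tokens only is equivalent to renormalizing the MLM's probabilities over all outputs y.

```python
import math

def pvp_probabilities(mask_logits, verbalizer):
    """Normalize the MLM's mask-position scores over the verbalized labels only.

    mask_logits: dict mapping a token to its raw score at the [MASK] position
                 (hypothetical values here, not real model output).
    verbalizer:  dict mapping each label y to the single token that expresses it.
    """
    scores = {label: mask_logits[token] for label, token in verbalizer.items()}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {label: math.exp(score - top) for label, score in scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}

def cross_entropy(probs, gold_label):
    """Loss against a target distribution that puts probability 1.0 on gold_label."""
    return -math.log(probs[gold_label])

# Toy cloze question: "Best pizza ever! It was [MASK]." (scores are invented)
mask_logits = {"great": 2.0, "terrible": 0.0, "pizza": 5.0}
verbalizer = {"+": "great", "-": "terrible"}

probs = pvp_probabilities(mask_logits, verbalizer)
loss = cross_entropy(probs, "+")
print(probs, loss)
```

Tokens that are not verbalizations of any label (such as "pizza" in the toy example) do not influence p(y | x), no matter how likely the MLM considers them.
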
If a user specifies multiple PVPs, individual
models are trained for each pair. Similar to knowl-
edge distillation (Hinton et al., 2015), they are

1We use the term prompt to refer to a short sequence
of tokens that typically contains some form of instruction;
pattern is used to denote the function that adds a prompt to
an input.

then used to annotate unlabeled examples for train-
ing a final classifier with a regular sequence clas-
sification head (Devlin et al., 2019). We use the
weighted variant of PET without auxiliary language
modeling; see Schick and Schütze (2020)
for details.
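
The distillation step can be sketched as follows. This is a simplified illustration, not the exact procedure of the PET library: we assume that each PVP-specific model already returns a class distribution for an unlabeled example, that one weight per model is given, and that the soft label is a weighted average of these distributions. How the weights are obtained and how scores are combined in detail is described by Schick and Schütze (2020).

```python
def soft_label(per_model_probs, weights):
    """Weighted average of the class distributions predicted by several models.

    per_model_probs: list of dicts, one per PVP-specific model, mapping label -> probability.
    weights:         one non-negative weight per model (assumed to be given).
    """
    total_weight = sum(weights)
    labels = per_model_probs[0].keys()
    return {
        label: sum(w * p[label] for w, p in zip(weights, per_model_probs)) / total_weight
        for label in labels
    }

# Three hypothetical models disagree about one unlabeled movie review.
predictions = [
    {"+": 0.9, "-": 0.1},
    {"+": 0.6, "-": 0.4},
    {"+": 0.2, "-": 0.8},
]
weights = [0.9, 0.8, 0.5]  # illustrative values only

target = soft_label(predictions, weights)
print(target)  # soft target used to train the final classifier
```

Because the final classifier is trained on such soft targets rather than on the output of a single model, a poorly performing PVP with a low weight has limited influence on the result.
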

4 True Few-Shot Learning with PET

After describing our experimental setup, we con-
duct experiments on academic datasets to answer
6 important research questions (Q1–Q6) on the
extent to which true few-shot learning is possible
with PET. The purpose of our experiments is also
to establish best practices for real-world scenarios
and experiments on RAFT (Alex et al., 2021).

Tasks and Datasets While they were heavily
used in prior work (e.g., Brown et al., 2020; Schick
and Schütze, 2021; Logan IV et al., 2021;
Webson and Pavlick, 2021), we decide against
tasks and datasets from GLUE (Wang et al., 2018)
and SuperGLUE (Wang et al., 2019) as they are
different from what we expect to see in real-world
applications. Instead, we experiment with AG’s
News, Yelp Reviews Full Star, and Yahoo Ques-
tions (Zhang et al., 2015) as these datasets
represent classification tasks in three different do-
mains that resemble real-world settings. We create
a broad variety of instructions for each task to be
able to experiment with a large number of differ-
ent patterns.

We consider settings with n = 10 and n = 100
training examples. For each n, we generate five
different training sets per task by randomly sam-
pling examples from the original training set while
ensuring that the number of examples is about the
same for each possible output class. In addition, for
both n = 10 and n = 100, we sample 1,000 unla-
beled examples from the original training set. We
repeat all of our experiments for all five training
sets and, by default, report average performance.
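
The sampling procedure described above can be sketched as follows. This is a minimal illustration under our own assumptions: the function name, the representation of examples as (text, label) pairs, and the use of the set index as random seed are hypothetical and not taken from the released code.

```python
import random
from collections import defaultdict

def sample_balanced(examples, n, seed):
    """Draw n (text, label) pairs so that label counts differ by at most one."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    labels = sorted(by_label)
    rng.shuffle(labels)  # randomize which classes receive the remainder
    base, extra = divmod(n, len(labels))
    sample = []
    for i, label in enumerate(labels):
        k = base + (1 if i < extra else 0)
        sample.extend(rng.sample(by_label[label], k))
    rng.shuffle(sample)
    return sample

# Toy pool with four classes, standing in for an original training set.
pool = [(f"text {i}", i % 4) for i in range(400)]

# Five different training sets with n = 10, one per seed.
train_sets = [sample_balanced(pool, n=10, seed=s) for s in range(5)]
print([len(t) for t in train_sets])
```

With n = 10 and four classes, each class contributes two or three examples, which matches the requirement that the number of examples is "about the same" per class.
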

PVPs We manually write a total of 23 patterns
per task, all of which can be categorized into one
of the following groups:2

• NULL: Following Logan IV et al. (2021), these

patterns simply insert a mask token.

• PUNC: Similar to some patterns of Schick
and Schütze (2020), these patterns only add
punctuation characters and a mask token.

• PROMPTS: Patterns in this group add short
prompts—typically consisting of no more
than three words—to the input, similar to
Radford et al. (2019) and Schick and Schütze
(2020).

• Q&A: These patterns rephrase the task as a

question q and append

Question: q Answer: [MASK].

to the input, similar to Brown et al. (2020)
and Schick et al. (2021).

For all patterns, we use a single verbalizer, adopted
from Schick and Schütze (2020). There is often
a single natural choice for the verbalizer (e.g.,
the category names for AG’s News / Yahoo
Questions), so finding many good verbalizers is
challenging.
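
To make the four pattern groups more tangible, the following sketch shows one possible pattern per group together with a verbalizer. Only the Q&A template is taken from the text above; the concrete strings for the other groups, the question, and the verbalizer tokens are hypothetical examples of ours (the actual PVPs are in the repository referenced in footnote 2).

```python
MASK = "[MASK]"

def null_pattern(x):
    """NULL: only a mask token is inserted."""
    return f"{x} {MASK}"

def punc_pattern(x):
    """PUNC: punctuation characters and a mask token (hypothetical form)."""
    return f"{MASK}: {x}"

def prompt_pattern(x):
    """PROMPT: a short prompt of at most a few words (hypothetical wording)."""
    return f"{x} Category: {MASK}"

def qa_pattern(x, question):
    """Q&A: the task is rephrased as a question q and appended to the input."""
    return f"{x} Question: {question} Answer: {MASK}."

# Illustrative verbalizer for a four-class news task: label id -> single token.
verbalizer = {0: "World", 1: "Sports", 2: "Business", 3: "Tech"}

text = "Stocks rallied after the central bank kept rates unchanged."
for cloze in (
    null_pattern(text),
    punc_pattern(text),
    prompt_pattern(text),
    qa_pattern(text, "What is the topic of this article?"),
):
    print(cloze)
```

Each pattern is simply a function from an input to a cloze question with exactly one mask, so all of them can be combined with the same verbalizer.
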

Hyperparameters We consider a setting similar
to that of Schick and Schütze (2020) and
Schick and Schütze (2021) and, unless otherwise
specified, use the default settings of the PET
library.3 As our experiments require training hun-
dreds of models, we make a few changes to reduce
environmental impact (Strubell et al., 2019) and
computational cost: We use the base variant of
RoBERTa (Liu et al., 2019) as underlying LM
both for individual models and the final classifier,
we train only one model per PVP, and we reduce
the training steps for all individual models and the
final classifier to 100 and 1,000, respectively.

Monitoring Finetuning LMs on small datasets
is unstable (Devlin et al., 2019; Dodge et al., 2020)
and sometimes results in poor performance. We

2The full set of PVPs can be found at https://github.com/timoschick/pet/tree/master/true-fsl.

3See https://github.com/timoschick/pet.

aim to detect failed finetuning without a labeled
test set using the following two checks:

• TRAIN SET UNDERFITTING: We check for train-
ing runs that result in less than 50% accuracy
on the training set. As finetuning on up to
100 examples typically leads to perfect pre-
dictions on the training set, this is a clear
indicator of failed finetuning.

• CONSTANT PREDICTIONS: We check for training
runs that result in the same class being pre-
dicted for all inputs, both on the training set
and the unlabeled set. Again, this is a clear
indicator of failed finetuning.

Whenever one of these two events occurs, we
restart training using a different seed.
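
The two checks and the restart rule can be sketched as follows. This is an illustration under our own assumptions: `train_fn` is a hypothetical callable that finetunes a model with a given seed and returns its predictions on the training set and the unlabeled set, and `max_restarts` is a safety cap that we add here and that is not mentioned in the text.

```python
def train_set_underfitting(train_preds, train_labels, threshold=0.5):
    """True if training-set accuracy is below the threshold (50% in the text)."""
    correct = sum(p == y for p, y in zip(train_preds, train_labels))
    return correct / len(train_labels) < threshold

def constant_predictions(train_preds, unlabeled_preds):
    """True if one single class is predicted for every training and unlabeled input."""
    return len(set(train_preds) | set(unlabeled_preds)) == 1

def finetune_with_restarts(train_fn, train_labels, max_restarts=5):
    """Retry finetuning with a new seed until both checks pass."""
    for seed in range(max_restarts):
        train_preds, unlabeled_preds = train_fn(seed)
        failed = (train_set_underfitting(train_preds, train_labels)
                  or constant_predictions(train_preds, unlabeled_preds))
        if not failed:
            return seed, train_preds, unlabeled_preds
    raise RuntimeError("finetuning failed for every seed tried")

# Stand-in for real finetuning: the first seed collapses to a single class.
def fake_train(seed):
    if seed == 0:
        return [1, 1, 1, 1], [1, 1, 1]
    return [0, 1, 0, 1], [0, 1, 1]

labels = [0, 1, 0, 1]
seed_used, _, _ = finetune_with_restarts(fake_train, labels)
print(seed_used)
```

Neither check needs labeled dev or test data: the first uses only the training labels, the second only model predictions.
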

Q1: How can we find the best pattern—or do
we even need to?

Slightly different patterns can have very dif-
ferent performance (Jiang et al., 2020; Schick
and Schütze, 2020; Schick et al., 2021;
Webson and Pavlick, 2021; Sanh et al., 2022,
inter alia) and popular model selection criteria
cannot reliably identify the best-performing pat-
terns in few-shot settings (Perez et al., 2021). We
thus investigate to what extent PET can eliminate
the need to find the best instruction even in
extreme settings where there are dozens of
candidates to choose from.

Setup Using our default setup, we train individ-
ual models for each PVP and a final PET model;
we also train models with iPET, an iterative variant
of PET introduced by Schick and Schütze (2020),
using 3 iterations.

Results Performance of individual models for
each pattern and of the distilled models obtained
using PET and iPET is shown in Figure 3. Interest-
ingly, sorting all pattern groups by their average
performance gives the exact same order for each
task and training set size: NULL patterns clearly
perform worst, followed by PUNC and PROMPT;
Q&A gives the best average results. Contrary to
findings of Logan IV et al. (2021), this shows
that LMs can benefit considerably from manu-
ally written instructions even if combined with
finetuning.

Crucially, PET’s performance is much higher
than average performance of individual patterns;
further, it consistently outperforms even the best


Figure 3: Performance of individual patterns, PET and iPET on all tasks considered. Accuracy is shown on the
y-axis; the x-axis shows individual pattern ids where color is used to distinguish the different pattern categories
(NULL, PUNC, PROMPT, Q&A). Small bullets (•) correspond to individual training sets, large bullets (•) correspond
to average performance. Average performance across all patterns is shown as a dashed gray line. Q&A and PROMPT
perform better than NULL and PUNC; PET consistently outperforms even the best individual pattern.

pattern, verifying that PET indeed removes the
need to find the best pattern. While iPET gives
clear improvements for n = 10, it performs worse
than PET for n = 100. The reason for this may
be that we use a much smaller set of unlabeled
examples than prior work (Schick and Schütze,
2020, 2021).

Q2: Does performance of different patterns
transfer across models?

While our results for Q1 show a consistent order
of pattern groups for different training set sizes
and tasks, an important question for real-world
applications is whether the same finding also
holds for different model sizes and entirely dif-
ferent models.

Setup We consider BERT (Devlin et al., 2019),
RoBERTa (Liu et al., 2019), and ALBERT (Lan
et al., 2020) as underlying LMs;4 we experiment
with the base and large variants. For each model
and size, we repeat the same experiment as for Q1.

Results Figure 4 shows the performance of each
pattern group (i.e., average performance of all in-
dividual patterns in this group) and PET; scores are
normalized so that the best-performing approach
for each task, training set size, and model gets a
score of 1.0. With few exceptions, our findings
from Q1 regarding the relative performance of
pattern groups and PET (NULL < PUNC < PROMPT < Q&A < PET) also hold for different models and sizes. The performance of individual patterns strongly correlates between different models and sizes (Spearman’s ρ ≥ 0.7 except in one case).

Figure 4: Relative performance of individual pattern groups and PET for different models and sizes. Scores are normalized so that the best performance for each task, number of examples, model, and size is 1.0. Our findings from Q1 (NULL < PUNC < PROMPT < Q&A < PET) also hold for other models and sizes.

4For Yahoo, we do not consider BERT as it uses a vocabulary that does not assign a single token to each verbalization.

Q3: Does PET still work if some PVPs are not well understood?

Q1 and Q2 show that PET performs even better than the best PVP for a large set of high-quality PVPs. But perhaps the performance is much worse if the LM fails to understand many patterns and verbalizers, for example, because they are in a style different from the model’s pretraining data? For real-world scenarios, we want to know how such “bad” PVPs affect the performance of PET.

Setup It is difficult to obtain large quantities of bad instructions as they might occur in real-world scenarios. As a proxy, we resort to noise patterns that add random tokens to the input, serving as a lower bound for truly bad patterns. In concrete terms, we add up to three randomly sampled tokens before and after the input.5 We also create noise verbalizers by assigning uniformly selected tokens to each output class. Using this process, we obtain 20 different intentionally bad PVPs per task. For each task, we start with 3 randomly selected, high-quality patterns from our original set of manually designed instructions, add noise PVPs one by one, and investigate the effect on performance.

5If there are multiple input texts, we shuffle their order and additionally add 0–3 tokens in between them.

Results The effect of adding noise PVPs is shown in Figure 5. Performance hardly changes even if more than half of the PVPs are noise PVPs, demonstrating that PET is robust to “bad” instructions. Figure 5 also shows that performance is substantially worse when using only noise PVPs.

Figure 5: Performance of PET with three randomly selected patterns when adding noise PVPs; the x-axis shows the number of noise PVPs added. Performance hardly changes when adding noise PVPs, showing that PET is very robust to “bad” instructions. We also show performance of using only noise PVPs with PET (NP+P) and their average performance without PET (NP).

Q4: How many patterns are required for good performance?

Orthogonal to Q3, what is the minimum number of high-quality prompts required for good performance? This is important because we want to minimize the amount of time spent creating PVPs in a practical setting.

Setup We generate, per task, 10 random permutations of the 23 patterns. For each permutation and training set, we use the same setup as in Q1 to compute the average performance obtained with PET when using only the first i, 1 ≤ i ≤ 5, patterns.

Results Average performance of PET trained with the first i patterns is shown in Figure 6, relative to the performance of PET trained with all 23 patterns. For all tasks and training set sizes, as few as four patterns are already sufficient to achieve performance close to that of PET trained with all 23 patterns. Surprisingly, PET’s performance is much higher than the average performance of a model trained on individual patterns even with i = 1. This indicates that the process of knowledge distillation using unlabeled data is also beneficial when using only a single instruction.

Figure 6: Relative performance of PET with only a subset of patterns compared to that achieved using all 23 manually designed patterns. The x-axis shows the number of patterns used. As few as 4 patterns are sufficient to almost match the performance of a model trained on all patterns.

Q5: Are other hyperparameters important?

For true few-shot settings, we want the same set of hyperparameter values to perform well across different tasks; this enables us to adopt these values for new tasks without tuning on task-specific validation sets. We investigate how the hyperparameters learning rate, training steps, and batch size affect performance.

Setup Based on previous work, we consider learning rates from 10⁻⁴ to 10⁻⁶, training steps from 10 to 1,000, and batch sizes from 1 to 32. Learning rate and batch size are changed for the individual models and the final classifier simultaneously; the number of training steps is varied only for individual models. We modify each hyperparameter independently, keeping all other parameters at their default value (i.e., a learning rate of 10⁻⁵, 100 steps and a batch size of 4).

Results Results are shown in Figure 7. For training steps and batch size, performance is relatively stable across a wide range of different values, with more steps and larger batch sizes typically leading to slightly better performance (especially for n = 100). Learning rate clearly has the strongest impact on performance, but values of 10⁻⁵ and 5 · 10⁻⁵ consistently give the best results across tasks; these are also the values typically used for finetuning in prior work (Devlin et al., 2019; Liu et al., 2019).

Figure 7: Performance of PET (solid) and avg. performance of individual models (dotted) for different learning rates (LR), training steps (Steps), and batch sizes. Except for LR, performance is stable across a wide range of values. For readability, the legend is only shown in the top left.

Q6: Do we really need unlabeled data?

In contrast to individual PVPs, PET needs unlabeled data, which is not available in some real-world settings. Building on earlier work (Anaby-Tavor et al., 2020; Papanikolaou and Pierleoni, 2020; Yang et al., 2020; Mohapatra et al., 2020; Kumar et al., 2020; Schick and Schütze, 2021), we investigate whether synthetic examples can replace unlabeled data.

Setup We use GPT2-XL (Radford et al., 2019) to generate synthetic unlabeled data: We provide one or two random training examples without labels as in-context examples (Brown et al., 2020) and let the model generate an additional example. For two inputs x1 and x2, the input given to the model is

Example 1: x1 ↵ Example 2: x2 ↵ Example 3:

where ↵ denotes two consecutive line breaks. If an input consists of two texts, we simply concatenate them using the sequence +++ as a separator. We generate 10,000 examples for n = 10 and 30,000 examples for n = 100 using top-p sampling (Holtzman et al., 2020) with p = 0.9. For each input, we stop the generation process as soon as the model generates two consecutive line breaks. We discard all examples for which the model does not generate two consecutive line breaks within 128 tokens; for datasets with text pairs, we also discard examples where the model fails to generate the sequence separator (+++).

As the datasets obtained with this method may be highly imbalanced regarding the distribution of (unknown) labels, we also experiment with a balanced variant: We use the ensemble of models trained on individual PVPs to assign labels to each example and only keep so many examples per label that the resulting dataset, which is used for training the final classifier, is balanced.

Results Figure 8 shows the performance of individual patterns as well as PET and iPET with real and synthetic unlabeled data. Except for iPET on Yahoo Questions with n = 10, the accuracy of synthetic data is within one point of real data, with our balanced version performing slightly better. For n = 10, using synthetic data even improves accuracy in some cases. This shows that in the absence of unlabeled examples, synthetic data obtained from generative language models can serve as a drop-in replacement without substantially degrading performance.

Figure 8: Minimum, average, and maximum accuracy (in percent) of individual patterns compared to regular PET and iPET as well as PET and iPET with synthetic data (+SD). Accuracy with synthetic data is very similar to that obtained with real data.

5 PET for Real-World Tasks

We use our insights from §4 to apply PET to the RAFT benchmark, a collection of 11 diverse real-world tasks whose automated solution has inherent value to someone (Alex et al., 2021). These tasks are challenging for few-shot approaches: they require domain expertise, understanding of detailed instructions, processing of long inputs, and handling a large number of output classes.

Tasks and Datasets The RAFT benchmark includes 11 tasks from different domains: ADE, B77, NIS, OSE, Over, SOT, SRI, TAI, ToS, TEH, and TC; for a detailed overview see Alex et al. (2021). Each task comes with 50 labeled training examples; in accordance with the RAFT rules, we additionally make use of the unlabeled data (ranging from 150 to 5,000 examples) for PET’s distillation step. In the case of RAFT, the unlabeled set is the same as the test set. So unlike in §4, our final classifier is directly trained on (unlabeled) test examples.

PVPs Based on Q1 and Q2, we only employ Q&A prompts. To obtain the question q, we make minimal changes to the original instructions of Alex et al. (2021); we rephrase all binary classification tasks as yes/no questions. For example, we rephrase the instruction “Label the sentence based on whether it is related to an adverse drug effect (ADE).” as “Is this sentence related to an adverse drug effect (ADE)?” Following our results from Q4, we specify 4 PVPs per task. For binary classification, we use two different patterns that either include or omit the full task specification of Alex et al. (2021) and combine them with both a yes/no verbalizer and a true/false verbalizer.6

Hyperparameters Based on the results of Q5, we mostly keep hyperparameter defaults from §4. However, we make the following changes:

• We replace RoBERTa (base) with ALBERT (xxlarge, v2). While being much slower to train, ALBERT was shown to outperform RoBERTa both in regular and few-shot settings (Lan et al., 2020; Schick and Schütze, 2021; Logan IV et al., 2021).

• Since 1,000 steps cover only 4,000 examples at batch size 4, we finetune the distilled model for 2,000 steps for tasks with more than 4,000 unlabeled examples. This makes sure all examples are seen at least once.

• Following Schick and Schütze (2020) and Schick and Schütze (2021) we train three individual models per PVP. This improves robustness as performance can vary greatly between individual finetuning runs.

Handling Many Labels The B77 dataset consists of banking customer service queries, each annotated with one of 77 possible intents.
That large a number of outputs leads to several issues for PET: First, it is impossible to specify a meaningful verbalizer that maps each intent to a single token. We initially experimented with PET’s multi-mask version (Schick and Schütze, 2021), but it was too inefficient for experimentation. We instead proceed as follows. We rephrase the task as binary classification, where for each pair of query x and intent y, the task is to decide whether y is the correct intent for x. For each original training example (x, y), we generate one example (x, y, True) and four examples (x, y′, False) with randomly sampled, incorrect intents y′. As this increases the amount of data fivefold, we finetune each individual model for 500 instead of 100 steps.

This approach still is not particularly efficient: Reframing the task as binary classification means that for each input, 77 forward passes are required to find the correct intent. We thus train the final model as a regular classifier with 77 different output classes; for training this classifier on input x, we set the target probability of output y proportional to the probability of True being the correct output for (x, y) according to our ensemble of binary classifiers.

6For a full list of all task specifications, see https://github.com/oughtinc/raft-baselines. The full set of PVPs can be found at https://github.com/timoschick/pet/tree/master/true-fsl.

Finally, another issue is that with 50 labeled examples, at least 27 labels are not covered in the training set; this may bias a model to never predict these labels. To alleviate this issue, we train two generations of models using iPET. For training the second generation, we obtain training data covering all possible labels similar to Schick and Schütze (2020): For each label, we pick the two examples from the unlabeled data for which this label is most likely according to the first generation.
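
The binary reformulation and the construction of soft targets for the final 77-way classifier can be sketched as follows. This is a minimal illustration: the function names, the intent names, and the probabilities are hypothetical, and the binary ensemble is replaced by a plain dictionary of made-up scores.

```python
import random

def to_binary_examples(x, y, all_intents, rng, num_negatives=4):
    """Turn (x, y) into one True example and several False examples
    with randomly sampled, incorrect intents."""
    candidates = [intent for intent in all_intents if intent != y]
    negatives = rng.sample(candidates, num_negatives)
    return [(x, y, True)] + [(x, neg, False) for neg in negatives]

def soft_targets(p_true_by_intent):
    """Target distribution over intents, proportional to the probability of
    True that the binary ensemble assigns to each (x, intent) pair."""
    total = sum(p_true_by_intent.values())
    return {intent: p / total for intent, p in p_true_by_intent.items()}

rng = random.Random(0)
intents = [f"intent_{i}" for i in range(77)]  # placeholder intent names

binary = to_binary_examples("Where is my card?", "intent_3", intents, rng)
print(len(binary))  # one positive plus four negatives

# Made-up ensemble outputs for three of the intents, for illustration only.
targets = soft_targets({"intent_3": 0.6, "intent_7": 0.3, "intent_9": 0.3})
print(targets)
```

The binary models are only needed to produce these targets; at test time, a single forward pass of the final classifier replaces the 77 passes that the binary formulation would require.
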
The nature of RAFT makes it hard to measure the impact of any of these choices. While we could conduct experiments similar to those in §4, none of the datasets considered there has a structure similar to B77; as our modifications affect only one of 11 tasks, we leave further analysis for future work.

Monitoring We checked for TRAIN SET UNDERFITTING and CONSTANT PREDICTIONS (§4) to detect finetuning issues. Unlike in §4, on RAFT we encountered some issues that could not be resolved simply by retraining with a different seed:

• We observed TRAIN SET UNDERFITTING for the final classifier on B77. This may be due to the classification head for 77 classes introducing many new parameters; training the final model for 5,000 instead of 2,000 steps fixed this issue.

• We observed CONSTANT PREDICTIONS for the ToS training set. Doubling the number of training steps resolved this problem.

• Finally, we also observed CONSTANT PREDICTIONS on the unlabeled data of SRI. Upon manually inspecting the training set, we observed that all but one out of 50 examples have the same label. As all models already classified the training set perfectly, we left the setup for our SRI submission unchanged.

Results For all 11 tasks, Table 1 shows results of PET and baselines.7 As can be seen, PET performs better than all other approaches on average, achieving near-human performance for 7 out of 11 tasks. Note, however, that non-expert humans perform worse than the majority baseline on SRI, so results on this task should be taken with a grain of salt. PET also clearly outperforms a GPT-3 model (Brown et al., 2020) by almost 7 points, despite the latter being larger by several orders of magnitude. While PET is particularly successful on ADE, B77, and OSE (where it outperforms GPT-3 by 13.6, 29.4, and 21.5 points, respectively), it performs comparatively poorly on datasets in the law (Over, ToS) and social media (TEH, TC) domains. Our approach for handling many labels performs surprisingly well on B77 without any tuning of its parameters. Due to the nature of RAFT, we cannot perform further analysis or ablation studies.

7 All results are taken directly from the leaderboard at https://huggingface.co/spaces/ought/raft-leaderboard.

Method    ADE   B77   NIS   OSE   Over  SOT   SRI   TAI   ToS   TEH   TC    Avg
GPT-2     60.0  12.1  56.1  24.5  49.8  38.0  49.2  61.2  49.8  31.1  72.3  45.8
GPT-Neo   45.2  14.9  40.8  34.3  68.1  40.6  49.3  60.5  56.5  55.4  63.6  48.1
AdaBoost  54.3  02.3  62.6  47.5  83.8  45.5  50.6  55.6  56.0  44.3  62.5  51.4
snlt      60.3  24.8  58.5  30.2  83.1  33.6  49.2  62.6  54.0  44.9  79.1  52.8
GPT-3     68.6  29.9  67.9  43.1  93.7  76.9  51.6  65.6  57.4  52.6  82.1  62.7
SetFit    72.6  53.8  87.2  52.1  90.7  68.2  49.3  62.8  62.0  53.2  83.7  66.9
PET       82.2  59.3  85.7  64.6  90.8  81.6  49.3  63.8  57.6  48.3  82.4  69.6
Human     83.0  60.7  85.7  64.6  91.7  90.8  46.8  60.9  62.7  72.2  89.7  73.5

Table 1: Performance (macro F1 multiplied by 100) of baselines and PET on the RAFT benchmark (Alex et al., 2021). Best model performance is shown in bold, best overall performance (including human annotators) is underlined. The final column shows average performance across all 11 tasks.
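The averages and per-task gaps discussed in the Results paragraph can be recomputed from the PET and GPT-3 rows of Table 1. The short sketch below does this; the scores are copied from the table, while the names `TASKS`, `PET`, `GPT3`, `average`, and `gaps` are illustrative choices for this example only.

```python
# Task order follows the columns of Table 1 (macro F1 multiplied by 100).
TASKS = ["ADE", "B77", "NIS", "OSE", "Over", "SOT", "SRI", "TAI", "ToS", "TEH", "TC"]
PET = [82.2, 59.3, 85.7, 64.6, 90.8, 81.6, 49.3, 63.8, 57.6, 48.3, 82.4]
GPT3 = [68.6, 29.9, 67.9, 43.1, 93.7, 76.9, 51.6, 65.6, 57.4, 52.6, 82.1]


def average(scores):
    """Unweighted mean over all tasks, rounded to one decimal as in the table."""
    return round(sum(scores) / len(scores), 1)


def gaps(scores_a, scores_b):
    """Per-task difference scores_a - scores_b, rounded to one decimal."""
    return {task: round(a - b, 1) for task, a, b in zip(TASKS, scores_a, scores_b)}


pet_vs_gpt3 = gaps(PET, GPT3)
```

Here `average(PET)` and `average(GPT3)` reproduce the final column of Table 1, and `pet_vs_gpt3` holds the per-task differences, with negative values marking tasks where GPT-3 scores higher than PET.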
6 Discussion

Our experimental results in §4 and §5 show that strong performance in few-shot settings is clearly possible without manual prompt tuning or hyperparameter optimization on large dev sets; in other words, PET can successfully be applied in true few-shot settings. While we believe that it should be an important goal of future work to make LMs more robust to different instructions, even with current models it is relatively easy to successfully apply PET when following a few simple principles, such as rephrasing the task in a Q&A format, using simple vocabulary and single-token verbalizers where possible, and specifying at least a handful of different patterns. In light of these findings, we also hope that future work will not view human involvement in prompt design as a drawback of instruction-based approaches, but rather as an exciting possibility to communicate with models in ways other than exclusively through examples.

Our study has limitations. First, a major obstacle to using PET in real-world applications is that we do not know a priori how well it performs for a given task; we therefore believe an important next step is to investigate methods for estimating performance without access to large test sets, for example through model calibration (Desai and Durrett, 2020; Jiang et al., 2021; Zhao et al., 2021), in real-world settings. In addition, we did not fully explore the capabilities of PET; for example, we did not investigate domain-adaptive pretraining (Gururangan et al., 2020) and auxiliary language modeling (Chronopoulou et al., 2019), both of which were shown to be helpful by Schick and Schütze (2020). We also did not quantify the impact of our decisions regarding B77 and the effectiveness of monitoring (§4) and only considered English models and datasets. Finally, we did not examine PET’s performance beyond aggregate scores.
While this is not feasible on RAFT due to its nature, performing such analysis either with other datasets or with methods such as the ones proposed by Ribeiro et al. (2020) would be relevant future work to understand real-world capabilities of instruction-based approaches more comprehensively.

7 Conclusion

In light of recent work casting doubt on the performance of prompt-based approaches in true few-shot settings (Perez et al., 2021), we have conducted an extensive study of PET. In a controlled environment, we found that manually designed instructions outperform null prompts, with Q&A-style prompts performing best (Q1, Q2). Across different tasks, models and training set sizes, PET consistently outperforms even the best individual prompt (Q1, Q2). We have also shown that PET is robust to uninformative prompts and to different choices of hyperparameters (Q3, Q5), that as few as four prompts are sufficient to reach good performance (Q4), and that synthetic examples can be used to replace unlabeled data (Q6). Based on these insights, we applied PET to a benchmark of real-world tasks, where it achieves near-human performance for 7 out of 11 tasks without any tuning on a dev set, demonstrating the potential of instruction-based approaches in true few-shot settings.

Acknowledgments

This work was funded by the European Research Council (ERC #740516). We thank the anonymous reviewers and the action editor for their helpful comments.

References

Neel Alex, Eli Lifland, Lewis Tunstall, Abhishek Thakur, Pegah Maham, C. Jess Riedel, Emmie Hine, Carolyn Ashurst, Paul Sedille, Alexis Carlier, Michael Noetel, and Andreas Stuhlmüller. 2021. RAFT: A real-world few-shot text classification benchmark.
In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).

Ateret Anaby-Tavor, Boaz Carmeli, Esther Goldbraich, Amir Kantor, George Kour, Segev Shlomov, Naama Tepper, and Naama Zwerdling. 2020. Do not have enough data? Deep learning to the rescue! Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):7383–7390. https://doi.org/10.1609/aaai.v34i05.6233

Jonathan Bragg, Arman Cohan, Kyle Lo, and Iz Beltagy. 2021. FLEX: Unifying evaluation for few-shot NLP. In Thirty-Fifth Conference on Neural Information Processing Systems.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

Ming-Wei Chang, Lev Ratinov, Dan Roth, and Vivek Srikumar. 2008. Importance of semantic representation: Dataless classification. In Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 2, AAAI’08, pages 830–835. AAAI Press.

Jiaao Chen, Zichao Yang, and Diyi Yang. 2020. MixText: Linguistically-informed interpolation of hidden space for semi-supervised text classification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2147–2157, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.194

Alexandra Chronopoulou, Christos Baziotis, and Alexandros Potamianos. 2019. An embarrassingly simple approach for transfer learning from pretrained language models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2089–2095, Minneapolis, Minnesota. Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1213

Zewei Chu, Karl Stratos, and Kevin Gimpel. 2021. Unsupervised label refinement improves dataless text classification. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4165–4178, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.findings-acl.365

Joe Davison, Joshua Feldman, and Alexander Rush. 2019. Commonsense knowledge mining from pretrained models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1173–1178, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1109

Shrey Desai and Greg Durrett. 2020. Calibration of pre-trained transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 295–302, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.21

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, and Noah Smith. 2020. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. Computing Research Repository, arXiv:2002.06305.

Avia Efrat and Omer Levy. 2020. The turking test: Can language models understand instructions? Computing Research Repository, arXiv:2010.11982.

Yanai Elazar, Nora Kassner, Shauli Ravfogel, Abhilasha Ravichander, Eduard Hovy, Hinrich Schütze, and Yoav Goldberg. 2021. Measuring and improving consistency in pretrained language models. Transactions of the Association for Computational Linguistics, 9:1012–1031. https://doi.org/10.1162/tacl_a_00410

William Fedus, Barret Zoph, and Noam Shazeer. 2021. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Computing Research Repository, arXiv:2101.03961.

Tianyu Gao, Adam Fisch, and Danqi Chen. 2021. Making pre-trained language models better few-shot learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3816–3830, Online. Association for Computational Linguistics.

Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don’t stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.740

Karen Hambardzumyan, Hrant Khachatrian, and Jonathan May. 2021. WARP: Word-level Adversarial ReProgramming. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4921–4933, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-long.381

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. Computing Research Repository, arXiv:1503.02531.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In International Conference on Learning Representations.

Zhengbao Jiang, Jun Araki, Haibo Ding, and Graham Neubig. 2021. How can we know when language models know? On the calibration of language models for question answering. Transactions of the Association for Computational Linguistics, 9:962–977. https://doi.org/10.1162/tacl_a_00407

Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438. https://doi.org/10.1162/tacl_a_00324

Sawan Kumar and Partha Talukdar. 2021. Reordering examples helps during priming-based few-shot learning. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4507–4518, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.findings-acl.395

Varun Kumar, Ashutosh Choudhary, and Eunah Cho. 2020. Data augmentation using pre-trained transformer models. In Proceedings of the 2nd Workshop on Life-long Learning for Spoken Language Systems, pages 18–26, Suzhou, China. Association for Computational Linguistics.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.243

Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 333–342, Vancouver, Canada. Association for Computational Linguistics. https://doi.org/10.18653/v1/K17-1034

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, Online. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. Computing Research Repository, arXiv:1907.11692.

Robert L. Logan IV, Ivana Balažević, Eric Wallace, Fabio Petroni, Sameer Singh, and Sebastian Riedel. 2021. Cutting down on prompts and parameters: Simple few-shot learning with language models. Computing Research Repository, arXiv:2106.13353.

Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2021. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. Computing Research Repository, arXiv:2104.08786.

Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2018. The natural language decathlon: Multitask learning as question answering. Computing Research Repository, arXiv:1806.08730.

Sewon Min, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2021. Noisy channel language model prompting for few-shot text classification. Computing Research Repository, arXiv:2108.04106.

Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. 2021. Cross-task generalization via natural language crowdsourcing instructions. Computing Research Repository, arXiv:2104.08773.

Biswesh Mohapatra, Gaurav Pandey, Danish Contractor, and Sachindra Joshi. 2020. Simulated chats for task-oriented dialog: Learning to generate conversations from instructions. Computing Research Repository, arXiv:2010.10216. https://doi.org/10.18653/v1/2021.findings-emnlp.103

Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, Andrea Frome, Greg S. Corrado, and Jeffrey Dean. 2014. Zero-shot learning by convex combination of semantic embeddings.

Juri Opitz. 2019. Argumentative relation classification as plausibility ranking. In Preliminary Proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019): Long Papers, pages 193–202, Erlangen, Germany. German Society for Computational Linguistics & Language Technology.

Yannis Papanikolaou and Andrea Pierleoni. 2020. DARE: Data augmented relation extraction with GPT-2. Computing Research Repository, arXiv:2004.13845.

Ethan Perez, Douwe Kiela, and Kyunghyun Cho. 2021. True few-shot learning with language models. In Thirty-Fifth Conference on Neural Information Processing Systems.

Raul Puri and Bryan Catanzaro. 2019. Zero-shot text classification with generative language models. Computing Research Repository, arXiv:1912.10165.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. Technical report, Open AI.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.442

Bernardino Romera-Paredes and Philip Torr. 2015. An embarrassingly simple approach to zero-shot learning. In International Conference on Machine Learning, pages 2152–2161.

Oscar Sainz, Oier Lopez de Lacalle, Gorka Labaka, Ander Barrena, and Eneko Agirre. 2021. Label verbalization and entailment for effective zero and few-shot relation extraction. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1199–1212, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.92

Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, M. Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Teven Le Scao, Stella Biderman, Leo Gao, Thomas Wolf, and Alexander M. Rush. 2022. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations.

Timo Schick, Helmut Schmid, and Hinrich Schütze. 2020. Automatically identifying words that can serve as labels for few-shot text classification. In Proceedings of the 28th International Conference on Computational Linguistics, pages 5569–5578, Barcelona, Spain (Online). International Committee on Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-main.488

Timo Schick and Hinrich Schütze. 2021. It’s not just size that matters: Small language models are also few-shot learners. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2339–2352, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.naacl-main.185

Timo Schick and Hinrich Schütze. 2020. Exploiting cloze questions for few shot text classification and natural language inference. Computing Research Repository, arXiv:2001.07676. https://doi.org/10.18653/v1/2021.eacl-main.20

Timo Schick and Hinrich Schütze. 2021. Generating datasets with pretrained language models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.555

Timo Schick, Sahana Udupa, and Hinrich Schütze. 2021. Self-diagnosis and self-debiasing: A proposal for reducing corpus-based bias in NLP. Transactions of the Association for Computational Linguistics. https://doi.org/10.1162/tacl_a_00434

Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4222–4235, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.346

Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645–3650, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1355

Yi Sun, Yu Zheng, Chao Hao, and Hangping Qiu. 2021. NSP-BERT: A prompt-based zero-shot learner through an original pre-training task–next sentence prediction. Computing Research Repository, arXiv:2109.03564.

Derek Tam, Rakesh R. Menon, Mohit Bansal, Shashank Srivastava, and Colin Raffel. 2021. Improving and simplifying pattern exploiting training. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4980–4991, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.407

Sappadla Prateek Veeranna, Jinseok Nam, Eneldo Loza Mencía, and Johannes Fürnkranz. 2016. Using semantic similarity for multi-label zero-shot classification of text documents. In Proceeding of European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. Bruges, Belgium: Elsevier, pages 423–428.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. Superglue: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-5446

Albert Webson and Ellie Pavlick. 2021. Do prompt-based models really understand the meaning of their prompts? Computing Research Repository, arXiv:2109.01247.

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2022. Finetuned language models are zero-shot learners. In International Conference on Learning Representations.

Orion Weller, Nicholas Lourie, Matt Gardner, and Matthew E. Peters. 2020. Learning from task descriptions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1361–1375, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.105

Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V. Le. 2019. Unsupervised data augmentation for consistency training. Computing Research Repository, arXiv:1904.12848.

Liang Xu, Xiaojing Lu, Chenyang Yuan, Xuanwei Zhang, Huilin Xu, Hu Yuan, Guoao Wei, Xiang Pan, Xin Tian, Libo Qin, and Hu Hai. 2021. FewCLUE: A Chinese few-shot learning evaluation benchmark. Computing Research Repository, arXiv:2107.07498.

Yiben Yang, Chaitanya Malaviya, Jared Fernandez, Swabha Swayamdipta, Ronan Le Bras, Ji-Ping Wang, Chandra Bhagavatula, Yejin Choi, and Doug Downey. 2020. Generative data augmentation for commonsense reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1008–1025, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.90

Qinyuan Ye, Bill Yuchen Lin, and Xiang Ren. 2021. CrossFit: A few-shot learning challenge for cross-task generalization in NLP. Computing Research Repository, arXiv:2104.08835.

Wenpeng Yin, Jamaal Hay, and Dan Roth. 2019. Benchmarking zero-shot text classification: Datasets, evaluation and entailment approach. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3914–3923, Hong Kong, China. Association for Computational Linguistics.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. 2020. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 11328–11339, Virtual. PMLR.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 649–657. Curran Associates, Inc.

Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 12697–12706. PMLR.

Ben Zhou, Daniel Khashabi, Chen-Tse Tsai, and Dan Roth. 2018. Zero-shot open entity typing as type-compatible grounding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2065–2076, Brussels, Belgium. Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-1231