An End-to-End Contrastive Self-Supervised Learning Framework for
Language Understanding
Hongchao Fang, Pengtao Xie∗
University of California San Diego, USA
p1xie@eng.ucsd.edu
Astratto
Self-supervised learning (SSL) methods such
as Word2vec, BERT, and GPT have shown
great effectiveness in language understanding.
Contrastive learning, as a recent SSL approach,
has attracted increasing attention in NLP. Contro-
trastive learning learns data representations by
predicting whether two augmented data in-
stances are generated from the same original
data example. Previous contrastive learning
methods perform data augmentation and con-
trastive learning separately. Di conseguenza, IL
augmented data may not be optimal for con-
trastive learning. To address this problem, we
propose a four-level optimization framework
that performs data augmentation and contras-
tive learning end-to-end, to enable the aug-
mented data to be tailored to the contrastive
learning task. This framework consists of four
learning stages, including training machine
translation models for sentence augmentation,
pretraining a text encoder using contrastive
apprendimento, finetuning a text classification model,
and updating weights of translation data by
minimizing the validation loss of the classifi-
cation model, which are performed in a unified
modo. Experiments on datasets in the GLUE
benchmark (Wang et al., 2018UN) and on da-
tasets used in Gururangan et al. (2020) dem-
onstrate the effectiveness of our method.
1
introduzione
Self-supervised learning (Bengio et al., 2000;
Mikolov et al., 2013; Devlin et al., 2019; Radford
et al., 2018; Lewis et al., 2020), which learns
data representations by solving prediction tasks
defined on input data without leveraging human-
provided labels, has achieved broad success in
PNL. Many NLP-specific self-supervised learn-
ing (SSL) methods have been proposed, ad esempio
neural language models (Bengio et al., 2000),
Word2vec (Mikolov et al., 2013), BERT (Devlin
et al., 2019), GPT (Radford et al., 2018), BART
∗Corresponding author.
(Lewis et al., 2020), and so forth, with various
SSL tasks defined. Per esempio, in Word2vec and
BERT, the SSL task is predicting the identities
of masked tokens based on their contexts. In neu-
ral language models including GPT, the SSL task
is language modeling: Given a history of tokens,
predict the next token.
Recentemente, contrastive self-supervised learning
(He et al., 2020; Chen et al., 2020) has been
borrowed from vision domains into NLP and has
shown promising success in predicting seman-
tic textual similarity (Gao et al., 2021), machine
translation (Pan et al., 2021), relation extraction
(Su et al., 2021), and so on. The key idea of con-
trastive self-supervised learning (CSSL) È: Create
augments of original examples, then learn repre-
sentations by predicting whether two augments
are from the same original data example. In ex-
isting CSSL approaches, data augmentation and
contrastive learning are performed separately. As
a result, augmented data may not be optimal for
contrastive learning. Per esempio, considering a
back-translation (Sennrich et al., 2016)–based
augmentation method, if the translation model is
trained using news corpora, it is not suitable for
augmenting data for movie review data.
in questo documento, we aim to address this issue. Noi
propose a four-level optimization framework that
performs data augmentation and contrastive learn-
ing end-to-end in a unified way, to allow the data
augmentation models to be guided by the con-
trastive learning task and make augmented data
suitable for performing contrastive learning. Noi
assume the end task is text classification. Nostro
framework consists of four learning stages. Al
first stage, we train four translation models to per-
form sentence augmentation based on back trans-
lation. To account for the fact that translation data
used at this stage and text classification data used
in later stages have a domain discrepancy, we
perform reweighting of translation data;
these
weights are tentatively fixed at this stage and will
1324
Operazioni dell'Associazione per la Linguistica Computazionale, vol. 10, pag. 1324–1340, 2022. https://doi.org/10.1162/tacl a 00521
Redattore di azioni: Dipanjan Das. Lotto di invio: 9/2021; Lotto di revisione: 5/2022; Pubblicato 11/2022.
C(cid:3) 2022 Associazione per la Linguistica Computazionale. Distribuito sotto CC-BY 4.0 licenza.
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
5
2
1
2
0
6
2
1
6
1
/
/
T
l
UN
C
_
UN
_
0
0
5
2
1
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
be updated at a later stage. At the second stage,
we pretrain a text encoder using contrastive learn-
ing on augmented sentences created by the trans-
lation models. At the third stage, we finetune a
text classification model, using the text encoder
pretrained at the second stage as regularization.
At the fourth stage, we measure the performance
of the text classifier on a validation set and up-
date weights of translation data by maximizing
the validation performance. Each level of opti-
mization problem in our framework corresponds
to a learning stage. These stages are performed
end-to-end. Experiments on datasets in the GLUE
benchmark (Wang et al., 2018UN) and on datasets
used in Gururangan et al. (2020) demonstrate the
effectiveness of our method.
The major contributions of this paper include:
• We propose
UN
four-level optimization
framework to perform contrastive learning
(CL) and data augmentation end-to-end. Nostro
framework enables the training of augmen-
tation models to be guided by the CL task
and makes augmented data suitable for CL.
• We demonstrate the effectiveness of our
method on datasets in the GLUE benchmark
(Wang et al., 2018UN) and on datasets used in
Gururangan et al. (2020).
2 Related Works
2.1 Contrastive Learning in NLP
Recentemente, contrastive learning has received in-
creasing attention in NLP. Gao et al. (2021)
proposed a simple contrastive learning–based
sentence embedding method. In this method, IL
same input sentence is fed into a pretrained
RoBERTa (Liu et al., 2019) model twice by ap-
plying different dropout masks and the result-
ing two embeddings are labeled as being similar.
Embeddings of different sentences are labeled as
dissimilar. Pan et al. (2021) proposed a contras-
tive learning method for many-to-many multilin-
gual neural machine translation, where contrastive
learning is leveraged to close the gap among
representations of different languages. Su et al.
(2021) developed a contrastive learning method
for biomedical relation extraction, where linguis-
tic knowledge is leveraged for data augmentation.
Wang et al. (2021) proposed to construct seman-
tically negative examples to perform contrastive
apprendimento, for the sake of improving the robust-
ness against semantical adversarial attacks. Pan
et al. (2022) proposed to perform contrastive
learning on adversarial examples generated by
perturbing word embeddings, in order to learn
noise-invariant representations.
2.2 Contrastive Self-Supervised Learning in
Non-NLP Domains
Contrastive self-supervised learning has been
broadly studied recently in other domains be-
sides NLP. Henaff (2020) proposed a contrastive
predictive coding method for data-efficient clas-
sificazione. In this method, autoregressive models
are leveraged to predict the future in a latent
spazio. Khosla et al. (2020) proposed a supervised
contrastive learning method. Data examples hav-
ing the same class label are made close to each
other in the latent space while examples with
different class labels are separated farther apart.
Laskin et al. (2020) proposed a method to learn
contrastive unsupervised representations for rein-
forcement learning. In Klein and Nabi (2020), UN
contrastive self-supervised learning approach is
proposed for commonsense reasoning.
2.3 Bi-level Optimization
framework is a multi-level optimization
Nostro
framework, which is an extension of bi-level op-
timization (BLO). BLO (Dempe, 2002) has been
applied for many applications in NLP, ad esempio
neural architecture search (Liu et al., 2018), hy-
perparameter tuning (Feurer et al., 2015), dati
reweighting (Shu et al., 2019; Ren et al., 2020;
Wang et al., 2020), label denoising (Zheng et al.,
2021), learning rate adjustment (Baydin et al.,
2018), meta learning (Finn et al., 2017), data gen-
eration (Such et al., 2020), and so forth. In these
BLO-based methods, meta parameters (neural ar-
chitectures, hyperparameters, importance weights
of training data examples, eccetera.) are learned by
minimizing a validation loss and weight parame-
ters are optimized by minimizing a training loss.
3 Method
In this section, we present our proposed end-to-end
contrastive learning framework.
3.1 Overview
We use back-translation (Sennrich et al., 2016)
to perform data augmentation of sentences. Then
1325
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
5
2
1
2
0
6
2
1
6
1
/
/
T
l
UN
C
_
UN
_
0
0
5
2
1
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
Figura 1: Illustration of our framework.
on augmented sentences, contrastive learning is
performed. Two augmented sentences are labeled
as similar if they originate from the same original
sentence. Two augmented sentences are labeled as
dissimilar if they originate from different original
sentences. Contrastive losses (Hadsell et al., 2006)
are defined on these similar and dissimilar pairs.
A text encoder is pretrained by minimizing the
contrastive losses.
We assume the end task is text classification.
Our framework consists of the following learn-
ing stages, which are performed end-to end. A
the first stage, we train four translation models
to perform data augmentation using back trans-
lation. Each translation pair in the training set is
associated with a weight. At the second stage, SU
augmented sentences created by the translation
models, we perform contrastive learning to train a
text encoder. At the third stage, using the encoder
trained at the second stage as regularization, we
finetune a text classification model. At the fourth
stage, the classification model trained at the third
stage is evaluated on a validation classification
dataset and the weights of translation pairs at the
first stage are updated by minimizing the valida-
tion loss. The four stages are performed in a four-
level optimization framework. Figura 1 illustrates
our framework. Prossimo, we describe the four stages
in detail.
3.2 Stage I: Training Machine Translation
Model for Sentence Augmentation
Figura 2 shows the workflow of data augmenta-
zione. For an input sentence x, we augment it us-
ing back-translation (Sennrich et al., 2016). In our
esperimenti, the language of the classification
data is English. We use an English-to-German
machine translation (MT) model to translate x
to y. Then we use a German-to-English MT
model to translate y to x(cid:4). Then x(cid:4) is regarded as
Figura 2: The workflow of data augmentation based on
back translation.
an augmented sentence of x. Allo stesso modo, we use an
English-to-Chinese MT model and a Chinese-to-
English MT model to obtain another augmented
sentence x(cid:4)(cid:4). We use German and Chinese as the
two auxiliary languages because 1) both of them
are resource-rich languages that have abundant
translation data for model training; 2) they are
sufficiently different from each other to achieve
higher diversity in augmented examples.
io
To perform data augmentation based on back
translation, we train four machine translation (MT)
models: English-to-German, German-to-English,
English-to-Chinese, and Chinese-to-English. Let
Weg, Wge, Wec, and Wce denote these four MT
models, and let Deg = {D(eg)
}N (eg)
i=1 , Dge =
io
{D(ge)
i=1 , Dec = {D(ec)
}N (ge)
}N (ec)
i=1 , and Dce =
io
{D(ce)
}N (ce)
denote the corresponding datasets
i=1
io
used to train these four models. Some trans-
lation data have a large domain discrepancy
with the text classification data and should be
excluded from training the translation models.
Otherwise, the translation models trained using
such out-of-domain translation data may not be
able to generate meaningful augmentations for
the text classification data due to the domain dis-
crepancy. To identify and remove out-of-domain
, and d(ce)
translation data, for d(eg)
, D(ec)
,
io
io
, UN(ec)
, UN(ge)
we associate them with weights a(eg)
,
io
io
and a(ce)
, which are in [0, 1]. If the weight is close
io
A 0, it means that the corresponding example
has a large domain discrepancy with classification
texts and should be excluded from training the
MT models. These weights are used to reweight
the training losses. At this stage, we solve the
following optimization problems:
, D(ge)
io
io
io
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
5
2
1
2
0
6
2
1
6
1
/
/
T
l
UN
C
_
UN
_
0
0
5
2
1
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
W ∗
eg(Aeg) = argmin
Weg
W ∗
ge(Age) = argmin
Wge
W ∗
ec(Aec) = argmin
Wec
W ∗
ce(Ace) = argmin
Wce
(cid:2)N (eg)
i=1
(cid:2)N (ge)
i=1
(cid:2)N (ec)
i=1
(cid:2)N (ce)
i=1
UN(eg)
io
(cid:2)mt(D(eg)
io
; Weg),
UN(ge)
io
(cid:2)mt(D(ge)
io
; Wge),
UN(ec)
io
(cid:2)mt(D(ec)
io
; Wec),
UN(ce)
io
(cid:2)mt(D(ce)
io
; Wce),
(1)
1326
io
io
io
io
where Aeg = {UN(eg)
i=1 , Age = {UN(ge)
}N (ge)
}N (eg)
i=1 ,
io
i=1 , and Ace = {UN(ce)
Aec = {UN(ec)
}N (ce)
}N (ec)
i=1 .
; Weg) is an MT loss defined on d(eg)
(cid:2)mt(D(eg)
.
io
io
If a(eg)
is close to 0 (indicating d(eg)
has large
io
domain discrepancy with classification texts), Questo
loss is made close to 0, effectively excluding d(eg)
from training the MT model. These data weights
are tentatively fixed at this stage and will be up-
dated later. They cannot be updated at this stage
by minimizing the training losses. Otherwise,
trivial solutions will be yielded where all these
weights are zero. Note that W ∗
eg depends on Aeg
(cid:2)mt(D(eg)
since W ∗
; Weg)
eg depends on
io
which is a function of Aeg.
N (eg)
i=1 a(eg)
(cid:3)
io
io
eg(Aeg)
Given an original English sentence x, we
into W ∗
to get a translated
feed it
then fed into
German sentence, che è
W ∗
ge(Age) to get a translated English sentence
X(cid:4). Nel frattempo, we feed x into W ∗
ec(Aec) A
get a translated Chinese sentence, che è
then fed into W ∗
ce(Ace) to get another trans-
lated English sentence x(cid:4)(cid:4). X(cid:4) and x(cid:4)(cid:4) are two
augmented sentences of x. Since translation
models are trained using in-domain translation
examples that have large domain similarity
with classification data, augmented examples
generated by translation models are likely to
be in the same domain as classification data as
BENE; contrastive learning performed on these in-
domain augmented examples is likely to produce
latent representations that are suitable for repre-
senting classification data.
3.3 Stage II: Contrastive Learning
At the second stage, we perform CSSL pretrain-
ing. Given two augmented sentences, if they orig-
inate from the same original sentence, they are
labeled as a positive pair; if they are from differ-
ent sentences, they are labeled as a negative pair.
Let x(cid:4) and x(cid:4)(cid:4) denote two sentences augmented
from the same original sentence x. Let {yi}K
i=1
denote K augmented sentences derived from orig-
inal sentences different from x. Let U denote a
text encoder such as BERT and f (T; U ) denote
the embedding of a sentence t extracted by U .
Let s(R, T; U ) = exp(sim(F (R; U ), F (T; U ))/τ )
where sim(·, ·) denotes cosine similarity and τ
is a temperature parameter. Let A denote {Aeg,
Age, Aec, Ace} and W ∗(UN) denote {W ∗
eg(Aeg),
ec(Aec), W ∗
W ∗
ce(Ace)}. We define the
ge(Age), W ∗
following contrastive loss (Hadsell et al., 2006)
on x:
(cid:2)C(X; U, W ∗(UN)) = − log
S(X(cid:4), X(cid:4)(cid:4); U )
(cid:2)
S(X(cid:4), X(cid:4)(cid:4); U ) +
.
K
i=1 s(X(cid:4), yi; U )
(2)
Note that x(cid:4), X(cid:4)(cid:4), E {yi}K
by W ∗(UN). A
following optimization problem:
Questo
i=1 are generated
IL
solve
stage, we
U ∗(W ∗(UN)) = argminU
(cid:2)
(cid:2)C(X; U, W ∗(UN)).
(3)
X
3.4 Stage III: Finetuning Text Classifier
At the third stage, we finetune a text classification
model where the text encoder is regularized by
the encoder trained at the second stage. Let V
and H denote the text encoder and classification
head in the classification model. Let D(tr)
cls and
D(val)
denote the training and validation sets of a
cls
classification dataset. The third stage amounts to
solving the following problem:
V ∗(U ∗(W ∗(UN))), H ∗ =
Lcls(D(tr)
argmin
V,H
cls ; V, H) + λ(cid:5)V − U ∗(W ∗(UN))(cid:5)2
2,
(4)
where Lcls(·) is classification loss. (cid:5) · (cid:5)2
2 is an
L2 regularizer that encourages the text encoder V
to be close to the encoder U ∗(W ∗(UN)) pretrained
at the second stage. λ is a tradeoff parameter.
3.5 Stage IV: Update Weights of
Translation Data
At the fourth stage, the classification model fine-
tuned at the third stage is evaluated on the vali-
dation set and the weights of machine translation
(MT) examples are updated by minimizing the
validation loss:
minA Lcls(D(val)
cls
; V ∗(U ∗(W ∗(UN))), H ∗).
(5)
3.6 Four-Level Optimization Framework
Putting these pieces together, we have the
following four-level optimization framework.
(Stage IV:)
minA Lcls(D(val)
cls ; V ∗(U ∗(W ∗(UN))), H ∗)
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
5
2
1
2
0
6
2
1
6
1
/
/
T
l
UN
C
_
UN
_
0
0
5
2
1
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
1327
s.t. (Stage III:)
V ∗(U ∗(W ∗(UN))), H ∗ =
Lcls(D(tr)
argmin
V,H
cls ; V, H) + λ(cid:5)V − U ∗(W ∗(UN))(cid:5)2
2
(Stage II:)
U ∗(W ∗(UN)) = argminU
(Stage I:)
(cid:2)
X
(cid:2)C(X; U, W ∗(UN))
be developed for the memory/computation cost
reduced framework in Section 3.7. Let ∇2
Y,X
F (X, Y ) denote ∂f (X,Y )
∂X∂Y . Following Liu et al.
(2018), at
the first stage (where the training
data are translation examples), we approximate
W ∗
ce(Ace)
using one-step gradient descent updates of Weg,
Wge, Wec, and Wce:
ec(Aec), and W ∗
ge(Age), W ∗
eg(Aeg), W ∗
UN(eg)
io
(cid:2)mt(D(eg)
io
; Weg)
W ∗
eg(Aeg) ≈ W (cid:4)
UN(ge)
io
(cid:2)mt(D(ge)
io
; Wge)
Weg − ηw∇Weg
eg =
(cid:2)N (eg)
i=1
W ∗
eg(Aeg) = argmin
Weg
W ∗
ge(Age) = argmin
Wge
W ∗
ec(Aec) = argmin
Wec
W ∗
ce(Ace) = argmin
Wce
(cid:2)N (eg)
i=1
(cid:2)N (ge)
i=1
(cid:2)N (ec)
i=1
(cid:2)N (ce)
i=1
UN(ec)
io
(cid:2)mt(D(ec)
io
; Wec)
UN(ce)
io
(cid:2)mt(D(ce)
io
; Wce)
(6)
3.7 Reducing Memory and
Computation Cost
In the proposed formulation in Eq. (6), there are
four translation models and two text encoders.
Storing these models and performing computa-
tion on them will incur a lot of memory and com-
putation costs. In this section, we discuss how to
reduce such costs, via parameter sharing (Sachan
and Neubig, 2018). For the two encoders—one
pretrained during contrastive learning and the
other finetuned during text classification, we can
let them share the same weight parameters. As
come, the third stage becomes:
H ∗(U ∗(W ∗(UN))) =
argmin
H
Lcls(D(tr)
cls ; U ∗(W ∗(UN)), H),
(7)
and the fourth stage becomes:
minA Lcls(D(val)
cls ; U ∗(W ∗(UN)), H ∗(U ∗(W ∗(UN)))).
(8)
W ∗
ge(Age) ≈ W (cid:4)
Wge − ηw∇Wge
ge =
(cid:2)N (ge)
i=1
W ∗
ec(Aec) ≈ W (cid:4)
Wec − ηw∇Wec
ec =
(cid:2)N (ec)
i=1
W ∗
ce(Ace) ≈ W (cid:4)
Wce − ηw∇Wce
ce =
(cid:2)N (ce)
i=1
UN(eg)
io
(cid:2)mt(D(eg)
io
; Weg),
(9)
UN(ge)
io
(cid:2)mt(D(ge)
io
; Wge),
(10)
UN(ec)
io
(cid:2)mt(D(ec)
io
; Wec),
(11)
UN(ce)
io
(cid:2)mt(D(ce)
io
; Wce),
(12)
ce)
ec, W (cid:4)
eg, W (cid:4)
ge, W (cid:4)
At the second stage, we use these approximate
models (including W (cid:4)
A
generate augmented sentences and define con-
trastive losses on these augmented sentences.
Let W (cid:4) denote {W (cid:4)
}. The sum-
mation of all contrastive losses can be writ-
(cid:3)
X (cid:2)C(X; U, W (cid:4)). Then we approximate
ten as
U ∗(W ∗(UN)) using one-step gradient descent up-
date of U with respect to
X (cid:2)C(X; U, W (cid:4)):
ge, W (cid:4)
eg, W (cid:4)
ec, W (cid:4)
(cid:3)
ce
U ∗(W ∗(UN)) ≈ U (cid:4) = U − ηu∇U
(cid:2)
X
(cid:2)C(X; U, W (cid:4)).
(13)
For the four translation models, they involve four
encoders and four decoders, for three languages.
For the same language, we let its encoders and
decoders in the four translation models share
the same parameters. By doing this, the eight
encoders/decoders are reduced to three encoders/
decoders.
At the third stage (where the training data are
classification examples), we plug the approxi-
mation U ∗(W ∗(UN)) ≈ U (cid:4) into Eq. (4) and get
an approximate objective. Then we approximate
V ∗(U ∗(W ∗(UN))) and H ∗ using one-step gradient
descent update of V and H with respect to the
approximated objective:
3.8 Optimization Algorithm
We develop an optimization algorithm to solve
the problem in Eq. (6). A similar algorithm can
V ∗(U ∗(W ∗(UN))) ≈ V (cid:4) =
V − ηv∇V (Lcls(D(tr)
cls ; V, H) + λ(cid:5)V − U (cid:4)(cid:5)2
2),
(14)
1328
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
5
2
1
2
0
6
2
1
6
1
/
/
T
l
UN
C
_
UN
_
0
0
5
2
1
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
Figura 3: Dependency between variables and gradients.
Algorithm 1: Optimization algorithm.
while not converged do
1. Update MT models using Eq. (9) A
Eq. (12)
2. Update text encoder U in contrastive
SSL using Eq. (13)
3. Update text encoder V and
classification head H in the
classification model using Eq. (14) E
Eq. (15)
4. Update machine translation example
weights A using Eq. (16)
end
H ∗ ≈ H (cid:4) = H − ηh∇H Lcls(D(tr)
cls ; V, H). (15)
At the fourth stage (where the validation data are
classification examples), we plug the approxima-
tions V ∗(U ∗(W ∗(UN))) ≈ V (cid:4) and H ∗ ≈ H (cid:4) into
Eq. (5) and get an approximated validation loss,
then update A by performing one-step gradient
descent with respect to the approximated valida-
tion loss.
A ← A − ηa∇ALcls(D(val)
cls
; V (cid:4), H (cid:4)).
(16)
ec, W (cid:4)
ge, W (cid:4)
eg, W (cid:4)
ec, W (cid:4)
ge, W (cid:4)
eg, W (cid:4)
After A is updated, W (cid:4)
ce in
Eqs. (9–12), which are functions of A, need to
be updated as well. Further, U (cid:4) in Eq. (13), Quale
is a function of W (cid:4)
ce, needs to be
updated as well. V (cid:4) in Eq. (14), which is a function
of U (cid:4), needs to be updated. After V (cid:4) is updated,
A in Eq. (16), which is a function of V (cid:4), needs to
be updated again. A further updated A will ren-
der all other variables to be updated again. Questo
update process iterates until convergence. In each
iteration, we update each variable with one-step
gradient descent, then move to updating the next
variable. The iterative algorithm is summarized in
Algorithm 1. Blue arrows in Figure 3 show the
dependency between variables.
Prossimo, we discuss how to calculate the gradient
; V (cid:4), H (cid:4)) in Eq. (16). According to
∇ALcls(D(val)
cls
the chain rule, we have:
∇ALcls(D(val)
cls
∂V (cid:4)
∂U (cid:4)
∂W (cid:4)
∂W (cid:4)
∂A
∂U (cid:4)
; V (cid:4), H (cid:4)) =
∇V (cid:4)Lcls(D(val)
cls
; V (cid:4), H (cid:4)).
(17)
From Eq. (14), we have:
∂V (cid:4)
∂U (cid:4) = −ηvλ∇U (cid:4),V (cid:5)V − U (cid:4)(cid:5)2
2,
(18)
From Eq. (13), we have:
∂U (cid:4)
∂W (cid:4) = −ηu∇W (cid:4),U
(cid:2)
X
(cid:2)C(X; U, W (cid:4)),
(19)
From Eqs. (9–12), we have:
∂W (cid:4)
eg
∂Aeg
= −ηw∇2
Aeg,Weg
N (eg)(cid:2)
i=1
UN(eg)
io
(cid:2)mt(D(eg)
io
; Weg),
(20)
∂W (cid:4)
ge
∂Age
∂W (cid:4)
ec
∂Aec
∂W (cid:4)
ce
∂Ace
= −ηw∇2
Age,Wge
= −ηw∇2
Aec,Wec
= −ηw∇2
Ace,Wce
N (ge)(cid:2)
i=1
N (ec)(cid:2)
i=1
N (ce)(cid:2)
i=1
UN(ge)
io
(cid:2)mt(D(ge)
io
; Wge),
(21)
UN(ec)
io
(cid:2)mt(D(ec)
io
; Wec),
(22)
UN(ce)
io
(cid:2)mt(D(ce)
io
; Wce).
(23)
Green arrows in Figure 3 show the dependency
during gradient calculation. The gradients in
Eqs. (9–15) and Eqs. (18–23) can be automati-
cally calculated using auto differentiation (per esempio.,
autograd in PyTorch). The gradient in Eq. (17)
needs to be manually implemented. When ap-
proximating optimal solutions at Stage I-III and
updating A at Stage IV, we calculate stochastic
1329
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
5
2
1
2
0
6
2
1
6
1
/
/
T
l
UN
C
_
UN
_
0
0
5
2
1
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
CoLA RTE
2490
8551
1043
277
3000
1063
QNLI
104743
5463
5463
STS-B MRPC WNLI
3668
5749
408
1500
1725
1379
635
71
146
SST-2 MNLI (m/mm)
67349
872
1821
392702
9815/9832
9796/9847
QQP
363871
40432
390965
AX
–
–
1104
Train
Dev
Test
Tavolo 1: Split statistics of GLUE datasets.
gradients on mini-batches instead of full gradients
on the entire dataset.
4 Experiments
In this section, we evaluate our framework on
eleven English understanding tasks in the GLUE
(Wang et al., 2018B) benchmark and eight text
datasets used in Gururangan et al. (2020)
4.1 Tasks and Datasets
For text classification, we use two collections of
datasets. The first collection is from the Gen-
eral Language Understanding Evaluation (GLUE)
benchmark, which has 11 compiti,
including 2
single-sentence tasks including CoLA (Warstadt
et al., 2019) and SST-2 (Socher et al., 2013), 3
similarity and paraphrase tasks including MRPC
(Dolan and Brockett, 2005), QQP,1 and STS-B
(Cer et al., 2017), E 5 inference tasks including
MNLI (Williams et al., 2018), QNLI (Rajpurkar
et al., 2016), RTE (Dagan et al., 2005), and WNLI
(Levesque et al., 2012). Tavolo 1 shows the split
statistics of GLUE datasets. The second collection
is from Gururangan et al. (2020), including CHEM-
PROT (Kringelum et al., 2016), RCT (Dernoncourt
and Lee, 2017), ACL-ARC (Jurgens et al.,
2018), SCIERC (Luan et al., 2018), HYPERPAR-
TISAN (Kiesel et al., 2019), AGNEWS (Zhang et al.,
2015), HELPFULNESS (McAuley et al., 2015), E
IMDB (Maas et al., 2011). In our method, we
split the original training set into a new training
set and a validation set, with a ratio of 1:1. IL
new training set is used as D(tr)
cls and the valida-
tion set is used as D(val)
.
cls
For machine translation, we use 3K English-
Chinese and 3K English-German language pairs
randomly sampled from WMT17.2 For contras-
tive learning, it is performed on all input texts
(excluding labels) of training datasets in the 11
GLUE tasks.
1https://www.quora.com/q/quoradata/First
-Quora-Dataset-Release-Question-Pairs.
2http://www.statmt.org/wmt17/translation
-task.html.
4.2 Experimental Settings
For translation models, we use those experimented
in Britz et al. (2017), which are encoder-decoder
models with attention. The encoder and decoder
are both 4-layer bi-directional LSTM networks
with a hidden size of 512. The attentions are ad-
ditive, with a dimension of 512. For classifiers,
BERT is used for GLUE and RoBERTa is used for
datasets in Gururangan et al. (2020). The gumbel-
softmax trick (Jang et al., 2017; Maddison
et al., 2017) is leveraged to deal with the non-
differentiability of words.
We use MoCo (He et al., 2020) to implement
the contrastive learning method. Text encoders
are initialized using pretrained BERT (Devlin
et al., 2019) or pretrained RoBERTa (Liu et al.,
2019). In MoCo, the size of the queue (che è
the hyperparameter K in Section 3.3) was set to
96606. The coefficient of MoCo momentum of
updating the key encoder was set to 0.999. IL
temperature parameter (which is the hyperparam-
eter τ in Section 3.3) in the contrastive loss was
set to 0.07. A multi-layer perceptron head was
used. For MoCo training, a stochastic gradient
descent solver with momentum was used. Mini-
batch size was set to 16. Initial learning rate was
set to 4 · 10−5. Learning rate was adjusted using
cosine scheduling. Weight decay was used with a
coefficient of 1 · 10−5.
For classification on GLUE tasks, the classifi-
cation head is set to a linear layer. The maximum
sequence length was set to 128. The tradeoff pa-
rameter λ in Eq. (4) is set to 0.1. Minibatch size
was set to 16. The learning rate was set to 3 · 10−5
for CoLA, MNLI, STS-B; 2·10−5 for RTE, QNLI,
MRPC, SST-2, WNLI; E 1 · 10−5 for QQP. IL
number of training epochs was set to 100.
Hyperparameter Tuning Details For most hy-
perparameters in MoCo and LSTM, we use the
default values given in He et al. (2020) and Britz
et al. (2017). The tradeoff parameter λ is tuned
In {0.01, 0.05, 0.1, 0.5, 1} on the development
set. For each configuration of λ, we run our
method on D(tr)
cls (one half of the training set) E
1330
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
5
2
1
2
0
6
2
1
6
1
/
/
T
l
UN
C
_
UN
_
0
0
5
2
1
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
D(val)
(the other half of the training set). Then we
cls
measure the performance of the trained model on
the development set. The λ value yielding the
best performance is selected. We tuned the hy-
perparameters of baselines extensively, dove il
tuning time for each baseline is roughly the same
as that for our method.
4.3 Baselines
We compare our methods with the following
baselines. Let Ours-SPS denote the proposed
(6) which performs soft
framework in Eq.
parameter-sharing (SPS) between V and U via
regularization, and let Ours-HPS denote the
framework in Section 3.7 which performs hard
parameter-sharing (HPS) where V and U are the
same.
• Vanilla RoBERTa (Liu et al., 2019). IL
Transformer-based encoder is initialized with
pretrained RoBERTa. A text classification
model is formed by stacking the pretrained
encoder and a classification head, with an
architecture that is the same as that in Liu
et al. (2019). The classification head is a
feedforward layer, where the nonlinear ac-
tivation function is tanh. Learned encoding
of the special token [CLS] is fed into the
classification head to predict the class label.
Then we finetune the classification model on
a classification dataset.
• Vanilla BERT (Devlin et al., 2019). Questo
approach is similar to vanilla RoBERTa. IL
only difference is that the Transformer-based
encoder is initialized by pretrained BERT
(Devlin et al., 2019) instead of RoBERTa.
• TAPT: Task Adaptive
Pretraining
(Gururangan et al., 2020). In this approach,
given a target dataset Dt,
the pretrained
BERT or RoBERTa on external data is fur-
ther pretrained on the input sentences in Dt
by predicting masked tokens.
• SimCSE (Gao et al., 2021). In this approach,
the same input sentence is fed into a pre-
trained RoBERTa encoder twice by applying
different dropout masks, to get two differ-
ent embeddings. These two embeddings are
labeled as being ‘‘similar’’. Embeddings of
different sentences are labeled as being ‘‘dis-
similar’’. Contrastive learning is performed
on these ‘‘similar’’ and ‘‘dissimilar’’ pairs.
• CSSL-Separate. In this approach, data aug-
mentation, contrastive learning, and text
classification are performed separately. Noi
first train machine translation models and
use them to perform sentence augmentation.
Then on augmented sentences, we perform
contrastive learning. Finalmente, using the text
encoder pretrained by contrastive learning
as initialization, we finetune the classifi-
cation model. When performing contrastive
apprendimento, the text encoder is initialized using
pretrained RoBERTa or BERT.
• CSSL-MTL. This approach is similar to
CSSL-Separate, except that the CSSL task
and classification task are performed jointly
in a multi-task learning (MTL) framework,
by minimizing the weighted sum of their
losses. The weight is 0.01 for CSSL loss and
È 1 for classification loss.
4.4 Results
4.4.1 Results in BERT-Based Experiments
In BERT-based experiments, the text encoder is
initialized using BERT, before contrastive learn-
ing is performed. Tables 2 E 3 show the results
on GLUE test sets, in BERT-based experiments.
Our methods including Ours-SPS and Ours-HPS
outperform all baselines on average scores. Out of
IL 11 compiti (MNLI-m and MNLI-mm are treated
as two separate tasks), Ours-SPS outperforms all
baselines on 8 compiti; Ours-HPS outperforms all
baselines on 7 compiti. These results demonstrate the
effectiveness of our end-to-end frameworks. Via
parameter sharing, Ours-HPS has much smaller
memory and computation costs than Ours-SPS,
with a small sacrifice of classification perfor-
mance. The inference costs of our methods are
similar to those of baselines. During inference,
only V and H are needed, which are the same as
baseline models. Figura 4 shows the accuracy
curve of Ours-HPS on the RTE validation set
(D(val)
) under different runs. As can be seen,
cls
our algorithm converges well.
We present
the following analysis. Primo,
the reason that our methods outperform CSSL-
Separate and CSSL-MTL is that in our meth-
ods, data augmentation and contrastive learning
are performed end-to-end where the training of
translation models (used for data augmentation)
1331
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
5
2
1
2
0
6
2
1
6
1
/
/
T
l
UN
C
_
UN
_
0
0
5
2
1
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
Train Time Memory Train Data Parameters
(millions)
345
345
690
713
713
713
356
(millions)
1.019
1.019
1.019
1.025
1.025
1.025
1.025
(hours)
6.3
13.5
16.4
16.1
16.9
28.2
17.4
(GB)
11.7
11.8
12.1
12.0
12.2
20.5
12.7
CoLA
(Matthew)
60.5
61.3
59.5
59.4
59.7
63.0
62.4
BERT
TAPT
SimCSE
CSSL-Separate
CSSL-MTL
Ours-SPS
Ours-HPS
SST-2 RTE QNLI MRPC
(Acc./F1)
(Acc.)
85.4/89.3
94.9
85.9/89.5
94.4
85.9/89.8
94.3
85.8/89.6
94.5
86.0/89.6
94.7
86.1/89.9
95.8
86.3/89.9
95.3
(Acc.)
92.7
92.4
92.9
92.8
92.5
93.2
92.5
(Acc.)
70.1
70.3
71.2
71.4
71.2
72.5
72.4
Tavolo 2: Results on GLUE test sets, using BERT for model initialization. The results are obtained
from the GLUE evaluation server. The best results are bolded. The second best results are bolded and
underlined. Models evaluated on AX are trained on the training dataset of MNLI. Matthew denotes
Matthew correlation. Acc. denotes accuracy.
MNLI-m/mm
(Precisione)
QQP
(Accuracy/F1)
STS-B (Pearson/
Spearman)
WNLI
(Precisione)
AX
(Matthew)
Average
BERT
TAPT
SimCSE
CSSL-Separate
CSSL-MTL
Ours-SPS
Ours-HPS
86.7/85.9
85.7/84.4
87.1/86.4
87.3/86.6
87.4/86.8
86.7/86.2
86.8/86.2
89.3/72.1
89.6/71.9
90.5/72.5
90.6/72.7
90.9/72.9
90.0/72.9
89.8/72.8
87.6/86.5
88.1/87.0
87.8/86.9
87.4/86.6
87.3/86.6
88.2/87.3
88.3/87.3
65.1
65.8
65.8
65.5
65.4
66.9
66.1
39.6
39.3
39.6
39.6
39.6
40.3
40.2
80.5
80.6
80.8
80.8
80.8
81.7
81.4
Tavolo 3: Continuation of Table 2. Pearson, Spearman, and Matthew denote the corresponding
correlations.
unique properties of classification data. In cont-
trast, in Ours-HPS, the classification model and
the CSSL-pretrained text encoder are required to
be exactly the same, which might be too restric-
tive. D'altra parte, it is worth noting that
the classification performance gap between Ours-
HPS and CSSL-pretrained is not very large while
the memory and computation cost of Ours-HPS
are much smaller.
Third, overall, CSSL-MTL works better than
CSSL-Separate. Out of the 11 compiti, CSSL-MTL
outperforms CSSL-Separate on 6 tasks and is on
par with CSSL-Separate on 1 task. The reason
is that in CSSL-MTL, contrastive learning and
classification are performed jointly, which en-
ables these two tasks to mutually benefit from
each other. In contrasto, in CSSL-Separate, con-
trastive learning and classification are performed
separately. While contrastive learning influences
classificazione, classification does not provide any
feedback to contrastive learning. Fourth, CSSL-
Separate and SimCSE are in general on par with
each other. The only difference between these
two methods is that CSSL-Separate uses back-
translation for data augmentation while SimCSE
uses dropout masks. This shows that these two
Figura 4: Accuracy on RTE validation set (D(val)
cls ).
is guided by the contrastive learning performance
and the augmented sentences are encouraged to
be suitable for performing the contrastive learning
task. In contrasto, in CSSL-Separate and CSSL-
MTL, data augmentation and contrastive learn-
ing are performed separately. Consequently, IL
augmented data may not be optimal for perform-
ing contrastive learning. Secondo, the reason that
Ours-SPS outperforms Ours-HPS is that in Ours-
SPS, while the classification model is regular-
ized by the CSSL-pretrained text encoder, Essi
are not exactly the same. This gives the classi-
fication model some flexibility in capturing the
1332
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
5
2
1
2
0
6
2
1
6
1
/
/
T
l
UN
C
_
UN
_
0
0
5
2
1
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
CoLA
(Matthew Corr.)
SST-2
(Precisione)
RTE
(Precisione)
QNLI
(Precisione)
MRPC
(Accuracy/F1)
BERT (from Lan et al.)
BERT (our run)
TAPT
SimCSE
CSSL-Separate
CSSL-MTL
No-CL
No-FT
No-BT
Domain-Reweight
Fix-Weight-Separate
Fix-Weight-MTL
DANN
CAN
Ours-Transformer-MT
Ours-SPS
Ours-HPS
60.6
62.1
61.2
62.0
62.3
62.7
62.3
62.4
62.8
62.7
62.7
62.9
62.4
62.3
63.2
63.6
63.3
93.2
93.1
93.1
93.5
93.7
93.7
93.2
93.6
93.7
93.8
93.8
93.8
93.7
93.6
93.9
94.0
93.9
70.4
74.0
74.0
73.5
72.0
72.6
74.0
72.3
74.0
72.5
72.8
73.2
72.2
72.3
74.2
74.8
74.4
92.3
92.1
92.0
92.2
92.5
92.3
92.2
92.4
92.5
92.6
92.6
92.4
92.4
92.4
92.6
92.9
92.7
88.0/–
86.8/90.8
85.3/89.8
86.8/90.9
87.0/90.9
87.1/90.9
86.9/90.8
87.0/90.9
87.1/90.9
87.1/90.9
87.1/90.9
87.2/90.9
87.1/90.9
87.2/90.9
87.2/90.9
87.8/91.1
87.3/90.9
Tavolo 4: Results on GLUE development sets, using BERT for model initialization. The result is the
median of five runs. Due to randomness of model initialization, our median performance is not the
same as that reported in Lan et al. (2019).
augmentation methods work equally well for con-
trastive learning on texts. Fifth, CSSL-Separate
and SimCSE outperform TAPT. In TAPT, IL
self-supervised task is masked token prediction.
This shows that contrastive learning is a more
effective self-supervised learning method than
masked token prediction. Sixth, CSSL-Separate
and SimCSE perform better than BERT, Quale
further demonstrates the effectiveness of con-
trastive learning. Seventh,
the improvement
achieved by our methods is more prominent on
smaller datasets such as WNLI and RTE. This is
because, for smaller datasets, it is more necessary
to leverage unlabeled data via contrastive learn-
ing to learn overfitting-resilient representations.
Eighth, for tasks that have similar data size, IL
improvement of our method is more prominent on
similarity tasks than on other types of tasks. For
esempio, QQP and MNLI have similar data size;
our method achieves better improvement on QQP,
which is a similarity task, than on MNLI, Quale
is an inference task.
Tables 4 E 5 show the results of BERT-based
experiments on the development sets of GLUE.
Our methods outperform all baselines on differ-
ent tasks. The analysis of reasons is similar to that
for Tables 2 E 3. Note that our method can
be applied to other text encoders besides BERT.
Per esempio, our method can be applied to
Turing-NLR-v5 (Bajaj et al., 2022). This encoder
achieves
state-of-the-art performance on the
GLUE leaderboard, with an average score of 91.2.
4.4.2 Results in RoBERTa-Based
Experiments
In RoBERTa-based experiments, text encoders
are initialized using RoBERTa, before contrastive
learning is performed. These experiments are con-
ducted on datasets used in Gururangan et al.
(2020). Tavolo 6 shows the results. Our methods
outperform all baselines. The analysis of reasons
is similar to that for results in Tables 2 E 3.
4.4.3 Ablation Studies
To verify whether the individual components in
our method are necessary, we perform the fol-
lowing ablation studies.
• No contrastive learning (No-CL). Contras-
tive learning is removed from the framework.
For each text-label pair (X, sì) ∈ D(tr)
cls , IL
input text x is fed into the four translation
models to generate two augmentations x(cid:4)
and x(cid:4)(cid:4). Then (X(cid:4), sì) E (X(cid:4)(cid:4), sì) are utilized
as augmented data to train the classification
model directly.
1333
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
5
2
1
2
0
6
2
1
6
1
/
/
T
l
UN
C
_
UN
_
0
0
5
2
1
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
MNLI-m/mm
(Precisione)
QQP
(Accuracy/F1)
STS-B (Pearson Corr./
Spearman Corr.)
WNLI
(Precisione)
BERT (from Lan et al.)
BERT (our run)
TAPT
SimCSE
CSSL-Separate
CSSL-MTL
No-CL
No-FT
No-BT
Domain-Reweight
Fix-Weight-Separate
Fix-Weight-MTL
DANN
CAN
Ours-Transformer-MT
Ours-SPS
Ours-HPS
86.6/–
86.2/86.0
85.6/85.5
86.4/86.1
86.6/86.4
86.6/86.5
86.3/86.1
86.6/86.5
86.7/86.6
86.6/86.5
86.7/86.6
86.7/86.6
86.6/86.4
86.6/86.5
86.7/86.8
86.9/87.0
86.9/86.9
91.3/–
91.3/88.3
91.5/88.7
91.2/88.1
91.4/88.5
91.5/88.6
91.3/88.4
91.4/88.4
91.5/88.7
91.5/88.6
91.5/88.7
91.6/88.8
91.5/88.5
91.6/88.5
91.6/88.9
91.9/89.2
91.7/89.1
90.0/–
90.4/90.0
90.6/90.2
90.5/90.1
90.2/89.9
90.4/90.1
90.5/90.2
90.4/90.0
90.7/90.3
90.4/90.0
90.4/90.0
90.5/90.2
90.3/90.1
90.2/90.1
90.7/90.3
91.0/90.8
90.8/90.5
–
56.3
53.5
56.3
56.3
56.3
56.3
56.3
56.3
56.3
56.3
56.3
56.3
56.3
56.3
56.3
56.3
Tavolo 5: Continuation of Table 4.
Dataset
RoBERTa TAPT SimCSE CSSL-Separate CSSL-MTL Ours-SPS Ours-HPS
CHEMPROT
ACL-ARC
SCIERC
HYPERPARTISAN
AGNEWS
HELPFULNESS
IMDB
81.91.0
87.20.1
63.05.8
77.31.9
86.6 0.9
93.90.2
65.13.4
95.00.2
82.60.4
87.70.1
67.41.8
79.31.5
90.45.2
94.50.1
68.51.9
95.50.1
83.20.2
87.60.1
69.52.6
80.50.7
90.92.7
94.70.1
68.71.7
95.70.1
82.90.5
87.60.1
68.73.1
80.91.3
91.43.3
94.30.1
69.20.5
95.60.1
83.40.3
87.70.1
70.83.7
81.20.9
90.71.9
94.50.1
69.51.1
95.30.1
84.50.4
88.00.1
75.62.9
82.11.1
92.32.1
95.10.1
70.90.2
96.00.1
84.20.3
87.90.1
75.31.5
81.90.6
91.81.6
94.90.1
70.40.4
95.80.1
Tavolo 6: Results of RoBERTa-based experiments on datasets used in Gururangan et al. (2020). IL
results of vanilla RoBERTa and TAPT are taken from Gururangan et al. (2020). Each method runs
four times with random initialization. In the xy formatted results, x denotes average and y denotes
standard deviation. Following Gururangan et al. (2020), the results on CHEMPROT and RCT are micro-
F1; the results on other datasets are macro-F1. The best results are bolded. The second best results
are bolded and underlined.
• No finetuning (No-FT). At the third stage,
the encoder V in the classification model is
set to U ∗(W ∗(UN)) without being finetuned.
• Replacing back-translation with SimCSE
(No-BT). Instead of learning machine trans-
lation models for text augmentation, noi usiamo
the augmentation method proposed in Sim-
CSE (Gao et al., 2021) for augmentation.
• Domain-Reweight. In CSSL-Separate, we
reweight self-supervised training examples
based on their domain relatedness to classifi-
cation data, and perform contrastive learning
on reweighted examples. Domain related-
ness is calculated using an H-divergence
based metric (Elsahar and Gall´e, 2019).
• Fix-Weight-Separate and Fix-Weight-MTL:
Learn translation sample weights using our
method, fix them, then run CSSL-Separate
and CSSL-MTL on reweighted translation
examples.
1334
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
5
2
1
2
0
6
2
1
6
1
/
/
T
l
UN
C
_
UN
_
0
0
5
2
1
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
• Compare with other domain adaptation meth-
ods, including DANN (Ganin et al., 2016) E
CAN (Kang et al., 2019). We use these meth-
ods to align the domains of input sentences
in the translation and classification datasets.
• Ours-Transformer-MT: In Ours-HPS, us-
ing Transformer
(specifically, pretrained
BERT) for machine translation, instead of
using attentional LSTM.
Tables 4 E 5 show results on GLUE develop-
ment sets, using BERT for model initialization.
We make the following observations. Primo, our
full methods work better than No-CL, No-FT,
and No-BT. In these three ablation baselines, one
component (which is contrastive learning, fine-
tuning classifier, back translation, rispettivamente) È
removed from our full methods, yielding simpler
metodi. These results show that each component
is useful and should not be removed, and simpler
methods do not perform comparably well. Secondo,
our methods work better than Domain-Reweight.
The reason is that our methods perform reweight-
ing of translation data together with performing
other tasks (including training translation models,
contrastive learning, finetuning, and validation),
in an end-to-end manner. In this way, pesi
of translation data are influenced by other tasks
and learned towards maximizing the classifica-
tion performance. In contrasto, Domain-Reweight
performs reweighting of self-supervised training
examples separately from other tasks. Weights
calculated in this way are not guaranteed to be op-
timal for maximizing classification performance.
D'altra parte, Domain-Reweight outperforms
CSSL-Separate, which shows that it is benefi-
cial to reweight self-supervised training examples
based on their domain relatedness to classifica-
tion data. Third, our methods work better than
Fix-Weight-Separate and Fix-Weight-MTL. IL
reason is that our methods generate augmented
sentences and perform CL end-to-end while
Fix-Weight-Separate and Fix-Weight-MTL per-
form these two tasks separately. In our end-to-end
framework, guided by the performance of CL,
the training of translation models is dynamically
changing to generate augmented sentences that
are better for improving CL. On the contrary, In
Fix-Weight-Separate and Fix-Weight-MTL, IL
generation of augmented examples is not influ-
enced by the CL task. Consequently, the gener-
ated augmentations may not be optimal for CL.
Method
E→G G→E
E→C
C→E
MTL
Ours
14.8
16.1
15.3
16.7
14.2
15.5
14.6
16.0
Tavolo 7: BLEU scores on test sets. E, G, C denote
English, German, Chinese, rispettivamente. E → G
denotes English-to-German translation.
D'altra parte, Fix-Weight-Separate and
Fix-Weight-MTL outperform CSSL-Separate and
CSSL-MTL, which further demonstrates the ben-
efits of reweighting translation examples based
on their domain similarity to classification data
and the weights learned by our methods can ac-
curately reflect domain similarity. Fourth, our
methods work better than the two domain adap-
tation methods DANN and CAN. The reason is
because many translation examples have large
domain discrepancies with classification texts; Esso
is difficult to adapt these translation examples
into the domain of classification data. Our meth-
ods learn to remove such examples instead of
forcefully adapting them. Fifth, comparing Ours-
Transformer-MT (using BERT for translation)
and Ours-HPS (using LSTM), we can see that
BERT works slightly worse than LSTM. While
BERT is more expressive, it has more weight pa-
rameters to learn than LSTM, which incurs higher
risk of overfitting.
We also check whether our framework can im-
prove machine translation (MT). Since the end
goal of our work is improving text classification,
we evaluate MT performance on selected transla-
tion examples that have large domain similarity
to classification data. Domain similarity is cal-
culated using H-divergence (Elsahar and Gall´e,
2019). Translation examples whose normalized
H-divergence is smaller than 0.5 are selected.
MT models are trained on selected training ex-
amples and evaluated on selected test examples.
Tavolo 7 compares the BLUE (Papineni et al.,
2002) scores (on test sets) of the four MT models
trained in our framework and those trained via
MTL (minimizing the sum of training losses of
the four models) without using our framework. As
can be seen, the models trained in our framework
perform better. The reason is that our framework
trains the translation models to generate linguisti-
cally meaningful augmented texts; by doing this,
the translation models are encouraged to generate
translations with higher linguistic quality.
1335
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
5
2
1
2
0
6
2
1
6
1
/
/
T
l
UN
C
_
UN
_
0
0
5
2
1
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
Sentence
Weight
The government recently called for more balanced
development, even proposing a ‘‘green index’’ to
measure growth.
President-elect Donald Trump’s campaign narrative
was based on the assumption that the US has fallen
from its former greatness.
Russia considers the agreements from the 1990s un-
just, based as they were on its weakness at the time,
and it wants to revise them.
Publicity for the new film claims that it is ‘‘the first
live-action film in the history of movies to star, E
be told from the point of view of, a sentient animal.’’
Gore told the world in his Academy Award-winning
movie (recently labeled ‘‘one-sided’’ and contain-
ing ‘‘scientific errors’’ by a British judge) to expect
20-foot sea-level rises over this century.
Jia’s movie is episodic; four loosely linked stories
about lone acts of extreme violence, mostly culled
from contemporary newspaper stories.
0
0
0
1
1
1
Tavolo 8: Some randomly sampled translation
examples with importance weights close to 0
E 1.
whose weights are close to one are more relevant
to movie reviews.
5 Conclusions, Discussion, E
Future Work
in questo documento, we propose an end-to-end frame-
work for learning language representations based
on contrastive learning. Different from existing
contrastive learning methods that perform data
augmentation and contrastive learning separately
and thus cannot guarantee that the augmented data
is optimal for contrastive learning, our method per-
forms data augmentation and contrastive learning
end-to-end in a unified framework so that data
augmentation models are specifically trained for
being suitable for contrastive learning. Our frame-
work consists of four learning stages: 1) training
machine translation models for text augmentation;
2) contrastive learning; 3) training a classification
modello; 4) updating weights of translation data by
minimizing the validation loss of the classification
modello. We evaluate our framework on 11 English
understanding tasks in the GLUE benchmark and
8 datasets in Gururangan et al. (2020). On both test
set and development set, the experimental results
demonstrate the effectiveness of our method.
One major limitation of our method is that it
has larger computational and memory costs, due
to the extra overhead of solving a four-level opti-
mization based problem and storing MT models.
Figura 5: How the accuracy of the RTE development
set changes with λ.
4.4.4 Parameter Sensitivity
Figura 5 shows how the classification accuracy
on the development set of RTE changes with
the tradeoff parameter λ of Ours-SPS. As can
be seen, when λ increases from 0 A 0.1, IL
accuracy increases. This is because a larger λ
encourages more knowledge transfer from the
CSSL-pretrained encoder to the classification
modello. The representations learned in the con-
trastive SSL task help the classification model to
Imparare. Tuttavia, as we continue to increase λ, IL
accuracy decreases. This is because the classifi-
cation model is too much biased to the CSSL-
pretrained encoder and is less tailored to the
classification data.
4.4.5 Qualitative Results
Tavolo 8 shows some randomly sampled translation
examples where the learned importance weights
(ai in Eq. (6)) are close to 0 O 1, when the classi-
fication task is SST-2 (the percentage of data gets
near-zero weights is 35.2%). Due to space lim-
itations, we only show the English sentences in
translation pairs. As can be seen, translation sen-
tences with near-zero weights have a large domain
discrepancy with the SST-2 data. SST-2 mainly
contains movie reviews while these zero-weight
sentences are mainly about politics. Due to this
domain discrepancy, these translation data is not
suitable to train data augmentation models for
SST-2. Our framework can effectively identify
such out-of-domain translation data and exclude
them from the training process. This is another
reason that our end-to-end framework achieves
better performance than baselines which lack the
mechanism of removing out-of-domain transla-
tion data. D'altra parte, in Table 8, sentences
1336
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
5
2
1
2
0
6
2
1
6
1
/
/
T
l
UN
C
_
UN
_
0
0
5
2
1
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
To reduce these costs, in addition to tying pa-
rameters, we will explore other techniques in
future, such as reducing the update frequencies
of MT models and MT data weights, applying
diversity-promoting regularizers (Xie et al., 2017)
to speed up convergence, performing core-set
based mini-batch selection (Sinha et al., 2020) A
speed up convergence, and so forth.
For future work, we plan to study more chal-
lenging loss functions for self-supervised learning.
We are interested in investigating a ranking-based
loss, where each sentence is augmented with a
ranked list of sentences that have decreasing dis-
crepancy with the original sentence. The auxiliary
task is to predict the order given the augmented
sentences. Predicting an order is presumably more
challenging than binary classification (as adopted
in existing contrastive SSL methods) and may
facilitate the learning of better representations.
Riferimenti
Payal Bajaj, Chenyan Xiong, Guolin Ke,
Xiaodong Liu, Di He, Saurabh Tiwary, Tie-Yan
Liu, Paul Bennett, Xia Song, and Jianfeng Gao.
2022. Metro: Efficient denoising pretraining
of large scale autoencoding language models
with model generated signals. arXiv preprint
arXiv:2204.06644.
Atılım G¨unes¸ Baydin, Robert Cornish, David
Mart´ınez Rubio, Mark Schmidt, and Frank
Legna. 2018. Online learning rate adaptation
with hypergradient descent. In Sixth Interna-
tional Conference on Learning Representations
(ICLR), Vancouver, Canada, April 30 – May 3,
2018.
Yoshua Bengio, R´ejean Ducharme, and Pascal
Vincent. 2000. A neural probabilistic language
modello. Advances in Neural Information Pro-
cessing Systems, 13.
Denny Britz, Anna Goldie, Minh-Thang Luong,
and Quoc Le. 2017. Massive exploration of
neural machine translation architectures. In
Atti del 2017 Conferenza sull'Em-
pirical Methods in Natural Language Process-
ing, pages 1442–1451, Copenhagen, Denmark.
Associazione per la Linguistica Computazionale.
Daniel Cer, Mona Diab, Eneko Agirre, I˜nigo
and Lucia Specia. 2017.
Lopez-Gazpio,
SemEval-2017 task 1: Semantic textual sim-
ilarity multilingual and crosslingual focused
evaluation. Negli Atti di
the 11th In-
ternational Workshop on Semantic Evalua-
zione (SemEval-2017), pages 1–14, Vancouver,
Canada. Association for Computational Lin-
guistics.
Ting Chen, Simon Kornblith, Mohammad
Norouzi, and Geoffrey Hinton. 2020. A simple
framework for contrastive learning of visual
representations. In Proceedings of the 37th In-
ternational Conference on Machine Learning,
volume 119 of Proceedings of Machine Learn-
ing Research, pages 1597–1607. PMLR.
Ido Dagan, Oren Glickman, and Bernardo
Magnini. 2005. The Pascal recognising text-
ual entailment challenge. In Machine Learn-
ing Challenges Workshop, pages 177–190.
Springer.
Stephan Dempe. 2002. Foundations of Bilevel
Programming. Springer Science & Business
Media.
Franck Dernoncourt and Ji Young Lee. 2017.
PubMed 200k RCT: A dataset for sequential
sentence classification in medical abstracts. In
Proceedings of the Eighth International Joint
Conferenza sull'elaborazione del linguaggio naturale
(Volume 2: Short Papers), pages 308–313,
Taipei, Taiwan. Asian Federation of Natural
Language Processing.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, E
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In NAACL.
William B. Dolan and Chris Brockett. 2005.
Automatically constructing a corpus of sen-
IL
tential paraphrases.
Third International Workshop on Paraphrasing
(IWP2005).
Negli Atti di
Hady Elsahar and Matthias Gall´e. 2019. To an-
notate or not? Predicting performance drop
under domain shift. Negli Atti di
IL
2019 Conference on Empirical Methods in
Natural Language Processing and the 9th
Internazionale
Joint Conference on Natu-
ral Language Processing (EMNLP-IJCNLP),
pages 2163–2173. https://doi.org/10
.18653/v1/D19-1222
1337
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
5
2
1
2
0
6
2
1
6
1
/
/
T
l
UN
C
_
UN
_
0
0
5
2
1
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
Matthias Feurer, Jost Springenberg, and Frank
Hutter. 2015. Initializing Bayesian hyperpa-
rameter optimization via meta-learning.
In
Proceedings of the AAAI Conference on Arti-
ficial Intelligence, volume 29. https://doi
.org/10.1609/aaai.v29i1.9354
Chelsea Finn, Pieter Abbeel, and Sergey Levine.
2017. Model-agnostic meta-learning for fast
adaptation of deep networks. Negli Atti
of the 34th International Conference on Ma-
chine Learning-Volume 70, pages 1126–1135.
JMLR. org.
Yaroslav Ganin, Evgeniya Ustinova, Hana
Ajakan, Pascal Germain, Hugo Larochelle,
Franc¸ois Laviolette, Mario Marchand, E
Victor Lempitsky. 2016. Domain-adversarial
training of neural networks. The Journal of
Machine Learning Research, 17(1):2096–2030.
Tianyu Gao, Xingcheng Yao, and Danqi Chen.
2021. SimCSE: Simple contrastive learning of
sentence embeddings. Negli Atti del
2021 Conference on Empirical Methods in Nat-
elaborazione del linguaggio urale, pages 6894–6910,
Online and Punta Cana, Dominican Republic.
Associazione per la Linguistica Computazionale.
Suchin Gururangan, Ana Marasovic, Swabha
Swayamdipta, Kyle Lo,
Iz Beltagy, Doug
Downey, and Noah A. Smith. 2020. Don’t
language models to
stop pretraining: Adapt
domains and tasks. In Proceedings of ACL.
https://doi.org/10.18653/v1/2020
.acl-main.740
Raia Hadsell, Sumit Chopra, and Yann LeCun.
2006. Dimensionality reduction by learning
an invariant mapping. In 2006 IEEE Com-
puter Society Conference on Computer Vision
and Pattern Recognition (CVPR’06), volume 2,
pages 1735–1742. IEEE.
Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie,
and Ross Girshick. 2020. Momentum contrast
for unsupervised visual representation learning.
In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition,
pages 9729–9738.
Olivier Henaff. 2020. Data-efficient image recog-
nition with contrastive predictive coding. In
International Conference on Machine Learn-
ing, pages 4182–4192. PMLR.
Eric Jang, Shixiang Gu, and Ben Poole. 2017.
Categorical reparameterization with gumbel-
softmax. ICLR.
David Jurgens, Srijan Kumar, Raine Hoover,
Daniel A. McFarland, and Dan Jurafsky. 2018.
Measuring the evolution of a scientific field
through citation frames. TACL.
Guoliang Kang, Lu Jiang, Yi Yang, and Alexander
G. Hauptmann. 2019. Contrastive adaptation
network for unsupervised domain adaptation.
In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition,
pages 4893–4902. https://doi.org/10
.1109/CVPR.2019.00503
Prannay Khosla, Piotr Teterwak, Chen Wang,
Aaron Sarna, Yonglong Tian, Phillip Isola,
Aaron Maschinot, Ce Liu, and Dilip Krishnan.
2020. Supervised contrastive learning. In Ad-
vances in Neural Information Processing Sys-
tems, volume 33, pages 18661–18673. Curran
Associates, Inc.
Johannes Kiesel, Maria Mestre, Rishabh Shukla,
Emmanuel Vincent, Payam Adineh, David
Corney, Benno Stein, and Martin Potthast.
2019. SemEval-2019 Task 4: Hyperpartisan
news detection. In SemEval. https://doi
.org/10.18653/v1/S19-2145
Tassilo Klein and Moin Nabi. 2020. Contrastive
self-supervised learning for commonsense rea-
the 58th Annual
soning. Negli Atti di
Riunione dell'Associazione per il Computazionale
Linguistica, pages 7517–7523.
Jens Kringelum, Sonny Kim Kjærulff, Søren
Brunak, Ole Lund, Tudor
IO. Oprea, E
Olivier Taboureau. 2016. ChemProt-3.0: UN
global chemical biology diseases mapping. In
Database. https://doi.org/10.1093
/database/bav123, PubMed: 26876982
Zhenzhong Lan, Mingda Chen, Sebastian
Goodman, Kevin Gimpel, Piyush Sharma, E
Radu Soricut. 2019. AlBERT: A lite bert for
self-supervised learning of language represen-
tations. In International Conference on Learn-
ing Representations.
Michael Laskin, Aravind Srinivas, and Pieter
Abbeel. 2020. CURL: Contrastive unsupervised
representations for reinforcement learning. In
Proceedings of the 37th International Confer-
ence on Machine Learning, volume 119 Di
1338
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
5
2
1
2
0
6
2
1
6
1
/
/
T
l
UN
C
_
UN
_
0
0
5
2
1
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
Proceedings of Machine Learning Research,
pages 5639–5650. PMLR.
Hector Levesque, Ernest Davis, and Leora
Morgenstern. 2012. The Winograd schema
challenge. In Thirteenth International Confer-
ence on the Principles of Knowledge Represen-
tation and Reasoning.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan
Ghazvininejad, Abdelrahman Mohamed, Omer
Levy, Veselin Stoyanov, e Luke Zettlemoyer.
2020. BART: Denoising sequence-to-sequence
pre-training for natural language generation,
translation, and comprehension. In Procedi-
ings di
IL
Associazione per la Linguistica Computazionale,
pages 7871–7880. https://doi.org/10
.18653/v1/2020.acl-main.703
the 58th Annual Meeting of
Hanxiao Liu, Karen Simonyan, and Yiming
Yang. 2018. Darts: Differentiable architecture
search. In International Conference on Learn-
ing Representations.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei
Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin
Stoyanov. 2019. RoBERTa: A robustly op-
timized BERT pretraining approach. arXiv
preprint arXiv:1907.11692.
Yi Luan, Luheng He, Mari Ostendorf, E
Hannaneh Hajishirzi. 2018. Multi-task identi-
fication of entities, relations, and coreference
for scientific knowledge graph construction.
Negli Atti di
IL 2018 Conference on
Empirical Methods in Natural Language Pro-
cessazione, pages 3219–3232, Brussels, Belgium.
Associazione per la Linguistica Computazionale.
https://doi.org/10.18653/v1/D18
-1360
Andrew L. Maas, Raymond E. Daly, Peter
T. Pham, Dan Huang, Andrea Y. Di, E
Christopher Potts. 2011. Learning word vectors
for sentiment analysis. In ACL.
C. Maddison, UN. Mnih, and Y. Teh. 2017. IL
concrete distribution: A continuous relaxation
of discrete random variables. Negli Atti
of the International Conference on Learning
Representations. International Conference on
Learning Representations.
based recommendations on styles and substi-
tutes. In ACM SIGIR. https://doi.org
/10.1145/2766462.2767755
Tomás Mikolov, Ilya Sutskever, Kai Chen, Greg
S. Corrado, and Jeff Dean. 2013. Distributed
representations of words and phrases and their
compositionality. Advances in neural informa-
tion processing systems, 26.
Lin Pan, Chung-Wei Hang, Avirup Sil, E
Saloni Potdar. 2022. Improved text classifica-
tion via contrastive adversarial training. AAAI.
Xiao Pan, Mingxuan Wang, Liwei Wu, and Lei Li.
2021. Contrastive learning for many-to-many
multilingual neural machine translation. Nel professionista-
ceedings of the 59th Annual Meeting of the
Association for Computational Linguistics and
the 11th International Joint Conference on Nat-
elaborazione del linguaggio urale (Volume 1: Long Pa-
pers), pages 244–258. https://doi.org
/10.18653/v1/2021.acl-long.21
Kishore Papineni, Salim Roukos, Todd Ward,
and Wei-Jing Zhu. 2002. BLEU: A method for
automatic evaluation of machine translation.
In Proceedings of the 40th Annual Meeting of
the Association for Computational Linguistics,
pages 311–318. Associazione per il calcolo
Linguistica. https://doi.org/10.3115
/1073083.1073135
Alec Radford, Karthik Narasimhan, Tim
Salimans, and Ilya Sutskever. 2018. Improv-
ing language understanding by generative pre-
training. Technical report, OpenAI.
Pranav Rajpurkar,
Jian Zhang, Konstantin
Lopyrev, and Percy Liang. 2016. Squad:
100,000+ questions
for machine compre-
hension of text. Negli Atti del 2016
Conference on Empirical Methods in Natural
Language Processing,
2383–2392.
https://doi.org/10.18653/v1/D16
-1264
pagine
Zhongzheng Ren, Raymond Yeh, and Alexander
Schwing. 2020. Not all unlabeled data are
equal: Learning to weight data in semi-
supervised learning. In Advances in Neural
Information Processing Systems, volume 33,
pages 21786–21797. Curran Associates, Inc.
Julian McAuley, Christopher Targett, Qinfeng
Shi, and Anton Van Den Hengel. 2015. Image-
Devendra Sachan and Graham Neubig. 2018.
Parameter sharing methods for multilingual
1339
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
5
2
1
2
0
6
2
1
6
1
/
/
T
l
UN
C
_
UN
_
0
0
5
2
1
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
self-attentional translation models. In Confer-
ence on Machine Translation.
Rico Sennrich, Barry Haddow, and Alexandra
Birch. 2016. Improving neural machine trans-
lation models with monolingual data. Nel professionista-
ceedings of the 54th Annual Meeting of the
Associazione per la Linguistica Computazionale
(Volume 1: Documenti lunghi), pages 86–96.
https://doi.org/10.18653/v1/P16
-1009
Jun Shu, Qi Xie, Lixuan Yi, Qian Zhao, Sanping
Zhou, Zongben Xu, and Deyu Meng. 2019.
Meta-weight-net: Learning an explicit map-
In Advances
ping for
in Neural Information Processing Systems,
pages 1919–1930.
sample weighting.
Samarth Sinha, Han Zhang, Anirudh Goyal,
Yoshua Bengio, Hugo Larochelle, and Augustus
Odena. 2020. Small-gan: Speeding up gan train-
ing using core-sets. In International Conference
on Machine Learning, pages 9005–9015.
PMLR.
Riccardo Socher, Alex Perelygin, Jean Wu, Jason
Chuang, Christopher D. Equipaggio, Andrea Y.
Di, and Christopher Potts. 2013. Recursive
deep models for semantic compositionality over
a sentiment treebank. Negli Atti del
2013 Conference on Empirical Methods in Nat-
elaborazione del linguaggio urale, pages 1631–1642.
Peng Su, Yifan Peng, and K. Vijay-Shanker.
2021. Improving bert model using contrastive
learning for biomedical relation extraction.
the 20th Workshop on
Negli Atti di
Biomedical Language Processing, pages 1–10.
Felipe Petroski Such, Aditya Rawal, Joel Lehman,
Kenneth Stanley, and Jeffrey Clune. 2020. Gen-
erative teaching networks: Accelerating neural
architecture search by learning to generate
synthetic training data. In International Confer-
ence on Machine Learning, pages 9206–9216.
PMLR.
Alex Wang, Amanpreet Singh, Julian Michael,
Felix Hill, Omer Levy, and Samuel R.
Bowman. 2018UN. GLUE: A multi-task bench-
mark and analysis platform for natural lan-
guage understanding. arXiv preprint arXiv:
1804.07461. https://doi.org/10.18653
/v1/W18-5446
Alex Wang, Amanpreet Singh, Julian Michael,
Felix Hill, Omer Levy, and Samuel R.
Bowman. 2018B. GLUE: A multi-task bench-
mark and analysis platform for natural language
In International Conference
understanding.
sulle rappresentazioni dell'apprendimento. https://doi
.org/10.18653/v1/W18-5446
Dong Wang, Ning Ding, Piji Li, and Haitao
Zheng. 2021. Cline: Contrastive learning with
semantic negative examples for natural lan-
guage understanding. Negli Atti del
59th Annual Meeting of
the Association
for Computational Linguistics and the 11th
International Joint Conference on Natural Lan-
guage Processing (Volume 1: Documenti lunghi),
pages 2332–2342. https://doi.org/10
.18653/v1/2021.acl-long.181
Yulin Wang, Jiayi Guo, Shiji Song, and Gao
Huang. 2020. Meta-semi: A meta-learning ap-
proach for semi-supervised learning. CoRR,
abs/2007.02394.
Alex Warstadt, Amanpreet Singh, and Samuel
R. Bowman. 2019. Neural network acceptabil-
ity judgments. Transactions of the Associa-
tion for Computational Linguistics, 7:625–641.
https://doi.org/10.1162/tacl a 00290
Adina Williams, Nikita Nangia, and Samuel
Bowman. 2018. A broad-coverage challenge
corpus for sentence understanding through in-
ference. Negli Atti del 2018 Confer-
ence of the North American Chapter of the
Associazione per la Linguistica Computazionale: Eh-
uomo Tecnologie del linguaggio, Volume 1 (Lungo
Carte), pages 1112–1122. https://doi
.org/10.18653/v1/N18-1101
Pengtao Xie, Aarti Singh, and Eric P. Xing.
2017. Uncorrelation and evenness: A new
In Interna-
diversity-promoting regularizer.
tional Conference on Machine Learning,
pages 3811–3820. PMLR.
Xiang Zhang, Junbo Jake Zhao, and Yann LeCun.
2015. Character-level convolutional networks
for text classification. In NeurIPS.
Guoqing Zheng, Ahmed Hassan Awadallah, E
Susan Dumais. 2021. Meta label correction
for learning with weak supervision. AAAI.
https://doi.org/10.1609/aaai.v35i12
.17319
1340
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
e
D
tu
/
T
UN
C
l
/
l
UN
R
T
io
C
e
–
P
D
F
/
D
o
io
/
.
1
0
1
1
6
2
/
T
l
UN
C
_
UN
_
0
0
5
2
1
2
0
6
2
1
6
1
/
/
T
l
UN
C
_
UN
_
0
0
5
2
1
P
D
.
F
B
sì
G
tu
e
S
T
T
o
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3