Uncertainty Estimation and Reduction of Pre-trained Models for
Text Regression
Yuxia Wang*
Daniel Beck*
Timothy Baldwin*
Karin Verspoor†*
* The University of Melbourne, Melbourne, Victoria, Australia
† RMIT University, Melbourne, Victoria, Australia
yuxiaw@student.unimelb.edu.au
d.beck@unimelb.edu.au
tb@ldwin.net
karin.verspoor@rmit.edu.au
Abstract
State-of-the-art classification and regression
models are often not well calibrated, and
cannot reliably provide uncertainty estimates,
limiting their utility in safety-critical applica-
tions such as clinical decision-making. While
recent work has focused on calibration of
classifiers, there is almost no work in NLP
on calibration in a regression setting. In this
paper, we quantify the calibration of pre-trained
language models for text regression,
both intrinsically and extrinsically. We fur-
ther apply uncertainty estimates to augment
training data in low-resource domains. Our
experiments on three regression tasks in both
self-training and active-learning settings show
that uncertainty estimation can be used to in-
crease overall performance and enhance model
generalization.
1
Introduction
Modern neural network models, particularly those
based on pre-training and fine-tuning, have
achieved impressive results across a broad spec-
trum of NLP tasks, in terms of evaluation metrics
such as classification accuracy or F-score for
classification tasks and mean squared error for
regression tasks. However, the standard training
regime fails to take model uncertainty into
account, and tends to result in over-fitting and
poor generalization, especially in limited training
data situations.
Moreover, these models have been empirically
demonstrated to have poor calibration: the
predictive probability does not reflect the true cor-
rectness likelihood, and they are generally over-
confident when they make wrong predictions (Guo
et al., 2017; Desai and Durrett, 2020; Jiang et al.,
2020). Put differently, the models do not know
what they don’t know. This is particularly the
case in low-resource settings. However, faithfully
assessing the uncertainty of model predictions is
as important as obtaining high accuracy in many
safety-critical applications, such as autonomous
driving or clinical decision support (Chen et al.,
2020; Kendall and Gal, 2017; Davis et al., 2017).
If models were able to more faithfully capture
their lack of certainty when they make erroneous
predictions, they could be used more reliably in
critical decision-making contexts, and avoid ca-
tastrophic errors.
In the context of text regression, we aim to al-
leviate over-fitting and improve generalizability
in low-resource settings by taking the uncer-
tainty sourced from both the data and model into
account. Specifically, we address: (1) data uncertainty,
by filtering noisy annotations from (either
pseudo or gold) labeled data based on predictive
confidence, preventing models from memorizing
out-of-distribution examples; and (2) model uncertainty,
by using uncertainty models to accurately estimate
both the target value and the predictive confidence,
providing more reliable and interpretable
predictions while effectively supporting the
denoising in (1).
Uncertainty estimation has been extensively
explored in the context of classification (Guo
et al., 2017; Vaicenavicius et al., 2019; Desai
and Durrett, 2020; Jiang et al., 2020), but is rela-
tively unexplored for regression tasks, due to the
complexities in dealing with a continuous target
space. The output of a classifier passed through a
softmax layer naturally provides a discrete prob-
ability distribution, while in a regression setting
the output is a single numerical value.
We compare four well-studied techniques for
uncertainty estimation, as applied to pre-trained
language models (LMs): Gaussian processes
(Shen et al., 2019; Camporeale and Carè, 2020),
Transactions of the Association for Computational Linguistics, vol. 10, pp. 680–696, 2022. https://doi.org/10.1162/tacl_a_00483
Action Editor: Dani Yogatama. Submission batch: 11/2021; Revision batch: 02/2022; Published 6/2022.
© 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Bayesian linear regression (Hernández-Lobato
and Adams, 2015), Bayes by backprop, and Monte
Carlo (MC) dropout. To comprehensively assess
uncertainty quality, we evaluate results intrin-
sically using various metrics, and extrinsically
with several downstream experiments. Our anal-
ysis shows that predictions are highly uncertain
and inaccurate in low-resource scenarios.
Two major types of uncertainty have been
identified: aleatoric uncertainty captures noise
inherent in the observations; and epistemic un-
certainty accounts for uncertainty in the model,
which can be explained away given enough data,
compensating for limited knowledge (Kendall and
Gal, 2017). In other words, uncertainty results primarily
from noisy human annotations, insufficient
labeled data, and out-of-domain text in practice
(Glushkova et al., 2021). We therefore propose
a simple method to filter noisy labels and se-
lect high-quality instances from an unlabeled data
pool based on the predictive confidence, which
on the one hand alleviates both aleatoric and epis-
temic uncertainty, and on the other hand, improves
accuracy and generalization thanks to increased
training data.
In this work, we explore how to estimate un-
certainty in a regression setting with pre-trained
language models, and evaluate estimation quality
both intrinsically and extrinsically. Intrinsic un-
certainty estimation provides the basis for our
proposed data selection strategy: By filtering
noise based on confidence thresholding, and mit-
igating exposure bias, our approach is shown
to be effective at improving both performance
and generalization in low-resource settings, in both
self-training and active learning settings.
2 Background
We first review approaches for estimating the
predictive uncertainty of deep neural networks
(DNNs) in a regression setting, then methods for
reducing uncertainty and improving
generalization.
2.1 Uncertainty Estimation in DNNs
Bayesian Estimation Bayesian approaches provide
a general framework for dealing with uncertainty
estimation, for example, in the form of
Gaussian processes (GPs: Camporeale and Carè,
2020; Shen et al., 2019) and Bayesian neural
networks (Hernández-Lobato and Adams, 2015).
However, prior work has either been based on
hand-crafted features, or based on small-scale neu-
ral networks with only one or two hidden layers,
which are far removed from modern pre-trained
LMs. How to combine deterministic pre-trained
LMs with Bayesian methods to achieve both high
accuracy and accurate uncertainty estimation is an
open problem, particularly in a regression setting.
While applying Bayesian estimation to all
model parameters in large-scale LMs is theo-
retically possible, in practice it is prohibitively
expensive in both model training and evaluation
(Xue et al., 2021). Concretely, the true Bayesian
posterior on the weights P (w|D) is generally
approximated by variational inference, minimizing
the KL divergence with a parameterized distribution
$q(w|\theta)$:
$$\theta^{*} = \arg\min_{\theta} \mathrm{KL}\left[q(w|\theta) \,\|\, P(w|D)\right] = \arg\min_{\theta} \int q(w|\theta) \log \frac{q(w|\theta)}{P(w)P(D|w)} \, dw$$
Deriving uncertainty estimates by integrating over
millions of model parameters, and initializing the
prior distribution for each are both non-trivial.
One simple strategy for combining them is Bayes
by backprop (BBB: Blundell et al., 2015), whereby
the following cost is minimized using unbiased Monte Carlo gradients:
$$\sum_{i=1}^{n} \log q(w^{(i)}|\theta) - \log P(w^{(i)}) - \log P(D|w^{(i)})$$
where $w^{(i)}$ denotes the $i$th Monte Carlo sample
drawn from the variational posterior $q(w^{(i)}|\theta)$.
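As an illustration of the cost above, the following minimal Python sketch estimates it by Monte Carlo for a toy one-parameter model y = w * x. It is not the implementation used in this work: the Gaussian variational posterior with θ = (mu, rho), the zero-mean Gaussian prior, the fixed noise level, and the toy data are all assumptions made for illustration, and the sum is divided by the number of samples for readability.

```python
import math
import random


def log_normal_pdf(x, mean, std):
    """Log density of a univariate normal N(mean, std^2) at x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def bbb_cost(mu, rho, data, n_samples=200, noise_std=0.5, prior_std=1.0, seed=0):
    """Monte Carlo estimate of the BBB cost for the toy model y = w * x.

    Assumed variational posterior: q(w | theta) = N(mu, softplus(rho)^2).
    """
    rng = random.Random(seed)
    sigma = math.log1p(math.exp(rho))  # softplus keeps the std positive
    total = 0.0
    for _ in range(n_samples):
        w = mu + sigma * rng.gauss(0.0, 1.0)  # reparameterised sample w^(i)
        log_q = log_normal_pdf(w, mu, sigma)
        log_prior = log_normal_pdf(w, 0.0, prior_std)
        log_lik = sum(log_normal_pdf(y, w * x, noise_std) for x, y in data)
        total += log_q - log_prior - log_lik
    return total / n_samples  # averaged rather than summed


# Hypothetical toy data generated from y = 2x.
data = [(k / 10.0, 2.0 * k / 10.0) for k in range(1, 11)]
good_fit = bbb_cost(mu=2.0, rho=-3.0, data=data)
poor_fit = bbb_cost(mu=-1.0, rho=-3.0, data=data)
print(good_fit < poor_fit)
```

Under these assumptions, a posterior centred near the slope that generated the data yields a lower cost than one centred far from it, which is the signal the gradient-based optimization follows.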
Ensemble Estimation Another approach is to
estimate uncertainty by ensemble, typically with
MC-dropout (Gal and Ghahramani, 2016) and
deep ensembles (Lakshminarayanan et al., 2017),
which are agnostic to model structure.
MC-dropout casts dropout training in DNNs as
approximate Bayesian inference in deep Gaussian
processes. The predictive probability of the deep
GP model (integrated with respect to the finite
rank covariance function parameters w) given
precision parameter $\tau > 0$ is:
$$p(y|x, D) = \int p(y|x, w)\, p(w|D)\, dw$$
$$p(y|x, w) = \mathcal{N}(y; \hat{y}(x, w), \tau^{-1} I_D)$$
Dropout is kept active during evaluation,
without changing either the model or the optimization
strategy. MC-dropout and its variants
have been extensively used to estimate regression
uncertainty due to their simplicity and scalability
in implementation (Zelikman et al., 2020; Laves
et al., 2020; Sicking et al., 2021).
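The sampling procedure behind MC-dropout can be sketched as follows. This is a toy illustration rather than the models evaluated in this paper: the linear regressor with dropout on its inputs, the weights, and the input vector are hypothetical, and the sketch reports only the variance across stochastic passes, omitting the $\tau^{-1}$ term of the predictive distribution above.

```python
import random
import statistics


def dropout_forward(x, weights, bias, rate, rng):
    """One stochastic forward pass of a linear regressor with input dropout."""
    keep = 1.0 - rate
    total = bias
    for xi, wi in zip(x, weights):
        if rng.random() < keep:        # keep this unit
            total += wi * xi / keep    # inverted-dropout rescaling
    return total


def mc_dropout_predict(x, weights, bias, rate=0.1, n_samples=30, seed=0):
    """Mean and variance over n_samples passes with dropout left on."""
    rng = random.Random(seed)
    preds = [dropout_forward(x, weights, bias, rate, rng) for _ in range(n_samples)]
    return statistics.fmean(preds), statistics.pvariance(preds)


# Hypothetical input features and parameters.
x = [0.5, -1.2, 0.3, 0.8]
weights = [1.0, 0.5, -2.0, 0.7]
bias = 2.5
mean, var = mc_dropout_predict(x, weights, bias)
det_mean, det_var = mc_dropout_predict(x, weights, bias, rate=0.0)
print(mean, var, det_mean, det_var)
```

With the dropout rate set to zero every pass is identical, so the variance collapses and the prediction reduces to an ordinary point estimate.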
The deep ensemble approach trains multiple
copies of the variance networks from different
network initializations to estimate predictive dis-
tributions. It operates similarly to sub-networks
of MC dropout, but is computationally more ex-
pensive due to the need to train multiple models.
In addition, the need to split the training data into
multiple folds to train different networks exacer-
bates overfitting in small-data scenarios. Given
our specific focus on low-data scenarios, we focus
exclusively on MC dropout in this paper.
The only work we are aware of for estimating
uncertainty with transformers in a regression set-
ting is Glushkova et al. (2021), who use ensemble
estimation of uncertainty for machine translation
quality evaluation, comparing the translated sentence
with a reference translation. In contrast, we
experiment in a cross-lingual setting, comparing a
source sentence and its translation directly.
2.2 Selecting Clean Instances
To reduce the uncertainty from both data and
model, we draw on approaches that can filter noisy
labels from labeled data, and select clean instances
from unlabeled data, thus eliminating aleatoric
uncertainty, and reducing epistemic uncertainty
due to the enhanced knowledge learned from the
augmented data. In brief, we need a method to
distinguish noisy and clean labels.
It has been shown that in data augmentation,
self-training, and zero-shot learning, using the
right sampling strategy is critical (Thakur et al.,
2020; Wang et al., 2020c). However, previous
work has mainly focused on label distribution
balance, and lexical and semantic similarity, but
not uncertainty.
In this work, we propose a simple method lever-
aging predictive confidence, to select high-quality
instances, which is related to uncertainty-based
sampling in active learning (Settles, 2009). However,
most work in active learning has focused
on classification rather than regression, either ex-
tracting the least probable or the most informative
examples with large entropy (Settles and Craven,
2008; Pinsler et al., 2019; Radmard et al., 2021).
Our approach also has a similar flavor to
self-paced curricular learning (Bengio et al., 2009;
Kumar et al., 2010; Wan et al., 2020), in which
the aim is to choose ‘‘hard’’ examples and gra-
dually increase the difficulty of learning con-
tent, differing from the criteria in our setting—
‘‘clean’’ ones.
According to a recent review of uncertainty
estimation for DNNs (Abdar et al., 2020), there
is little work on using aleatoric uncertainty for
denoising and sampling in NLP tasks. The most
relevant work is that by Miok et al. (2020), who
aim to guide the annotation process for the binary
classification task of hate speech detection.
3 Tasks and Notation
In this paper, we consider text regression across
three separate tasks, and a total of 10 datasets.
Tasks STS: Semantic textual similarity assesses
the degree of semantic equivalence between two
pieces of text (Corley and Mihalcea, 2005). The
aim is to predict a similarity score for a sentence
pair (S1, S2), generally in the range [0, 5], where
0 indicates complete dissimilarity and 5 indicates
equivalence in meaning. As an example:
S1: Total minutes spent in timed codes: 10 mins.
S2: Total minutes spent in timed codes: 33 mins.
might be labeled 4, as the two texts differ only in
very specific content (here, the number of minutes).
SA: Sentiment analysis rating involves predict-
ing a sentiment score for a review S, in the range
1 (extremely negative) to 5 (extremely positive).
DA: Machine translation quality estimation,
based on the direct assessment approach (Graham
et al., 2017), aims to predict a normalised quality
score for text pair (S1, S2), where S2 is machine
translated from S1. As such, it is similar to STS,
but differs in that it is cross-lingual.
Notation and Assumptions Throughout this
paper, raw examples, column vectors, and matrices
are denoted in lower-case italics, bold, and
upper-case italics, respectively (e.g., $x$, $\mathbf{x}$, and
$X$). $\theta_{encoder}$ and $\theta_{reg}$ represent parameters of the
encoder and task-specific regression layers, and
$f(\theta, \cdot)$ refers to the whole model. Take a dataset
$D = \{(x_1, y_1), \cdots, (x_i, y_i), \cdots, (x_N, y_N)\}$, where
$(x_i, y_i)$ is the $i$th instance, $y_i \in \mathbb{R}$, and $\mathbf{x}_i =
s(\theta_{encoder}, x_i)$ is the hidden state of $x_i$. The
Dataset             Size (train, test, dev)   Range      Domain

STS-B (2017)        5749, 1379, 1500          [0, 5]     general
MedSTS (2018)       750, 318, —               [0, 5]     clinical
N2C2-STS (2019)     1642, 412, —              [0, 5]     clinical
BIOSSES (2017)      100, —, —                 [0, 4]     biomedical
EBMSASS (2019)      700, 300, —               [1, 5]     biomedical

Yelp (2018)         7000, 1500, 1500          [1, 5]     product
PeerRead (2018)     713, 290, —               [1, 5]     paper

WMT en-zh (2020)    7000, 1000, 1000          [0, 100]   high-resource
WMT ru-en (2020)    7000, 1000, 1000          [0, 100]   medium-resource
WMT si-en (2020)    7000, 1000, 1000          [0, 100]   low-resource

Table 1: STS/SA rating/QE-DA datasets. Train,
Test, Dev Size = number of text pairs, range =
label range. In practice, QE-DA is normalised by
z-score.
loss function is the empirical risk of the mean
square error (MSE): $L = \frac{1}{N} \sum_{i=1}^{N} (f(\theta, x_i) - y_i)^2$.
Datasets We evaluate on different-sized datasets
across various domains for STS and SA, and
three same-sized datasets for DA, summarized
in Table 1.
For STS, we use: (1) one large-scale general
dataset, STS-B (Cer et al., 2017); (2) two small
clinical data sets, MedSTS (Wang et al., 2018) and
N2C2-STS (Wang et al., 2020a); and (3) two small
biomedical data sets, BIOSSES (Soğancıoğlu
et al., 2017) and EBMSASS (Hassanzadeh et al.,
2019), each of which is 5-way annotated.
For SA, we use: (1) a large-scale product review
dataset, Yelp (Sabnis, 2018); and (2) a small paper
review rating dataset, PeerRead (Kang et al.,
2018), augmented with 399 Spanish paper re-
views (Keith et al., 2017) machine-translated into
English.
For DA, we use the three language pairs from
WMT2020 (Lucia et al., 2020), en-zh, ru-en, and
si-en, corresponding to high-, medium-, and low-resource
settings in terms of the source language.
4 Method
In this section, we first introduce approaches
for estimating regression uncertainty based on
pre-trained LMs, then propose a simple method to
sample "clean" instances from unlabeled data to
augment training data based on predictive uncertainty.
The proposed methods can be applied
equally in semi-supervised and unsupervised settings
(including active learning and self-learning).
4.1 Bayesian Regression using LMs
We investigate two alternatives to combine pre-trained
transformer LMs with Bayesian estimation,
either in a pipeline approach, or end-to-end.
Figure 1 provides an overview.
Figure 1: Overview of pipeline and end-to-end training
workflow. Left: SBERT is fine-tuned separately with
STS/NLI labeled data using MSE/NLL loss; middle:
well-trained SBERT provides off-the-shelf sentence
embeddings to GP/Cosine similarity. End-to-end
(right): under MC-dropout, keep dropout on in inference;
in BBB, parameters of LR/HConv are stochastic
variables.
Pipeline Training To estimate probability distributions
for the regression task of document
quality assessment, Shen et al. (2019) used a
Gaussian process (GP) with Radial Basis Function
(RBF) kernel function over hand-crafted features.
We build on this by applying Bayesian linear regression
and sparse GP regression to pre-trained
sentence encoders, such as Sentence-BERT
(SBERT; Reimers and Gurevych, 2019). For text
input $x$, we generate $\mathbf{x} = s(\theta_{encoder}, x) \in \mathbb{R}^d$.
In this way, we leverage contextualized sentence
representations, while avoiding the complexity
of estimating uncertainty directly from a large-scale
Bayesian neural network.
Bayesian Linear Regression: The prior distribution
of a Bayesian linear layer with parameters
$\mathbf{w}$ and $b$ is set to be a Gaussian distribution:
$$\hat{y} = \mathbf{w}^{\top} \cdot \mathbf{x} + b + \varepsilon \tag{1}$$
$$\mathbf{w} \sim \mathcal{N}(m, \sigma^2 I); \quad b \sim \mathcal{N}(m, 1) \tag{2}$$
where $\hat{y}$ is the approximated value and $\varepsilon$ is the
observation noise, which is assumed to be an
independent and identically distributed random
variable $\varepsilon \sim \mathcal{N}(0, \sigma^2)$.
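To make the role of the Gaussian prior concrete, the sketch below computes the closed-form posterior of a Bayesian linear regression and the resulting predictive variance. It is an illustration under simplifying assumptions, not the pyro-based model used in the experiments: the prior mean is set to zero, the prior variance and noise variance are treated as known constants, the bias is folded into the weight vector, and the feature vectors stand in for sentence embeddings.

```python
import numpy as np


def bayes_linreg_posterior(X, y, prior_var=1.0, noise_var=0.25):
    """Closed-form Gaussian posterior over weights (bias folded into X)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append a bias column
    d = Xb.shape[1]
    precision = np.eye(d) / prior_var + Xb.T @ Xb / noise_var
    cov = np.linalg.inv(precision)
    mean = cov @ Xb.T @ y / noise_var
    return mean, cov


def predict(x, mean, cov, noise_var=0.25):
    """Predictive mean and variance for one feature vector x."""
    xb = np.append(x, 1.0)
    return float(xb @ mean), float(xb @ cov @ xb + noise_var)


# Hypothetical data: 50 three-dimensional feature vectors.
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(50, 3))
y = X @ true_w + 0.3 + 0.5 * rng.normal(size=50)

w_mean, w_cov = bayes_linreg_posterior(X, y)
mu_near, var_near = predict(np.zeros(3), w_mean, w_cov)
mu_far, var_far = predict(10.0 * np.ones(3), w_mean, w_cov)
print(var_near, var_far)
```

The predictive variance grows for inputs far from the training data, which is the behaviour that makes such a layer usable as an uncertainty estimate.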
Gaussian Processes (GPs) (Rasmussen and
Williams, 2005) are a natural way to generalize
the concept of a multivariate normal distribution
determined by a mean vector μ and covariance
matrix Σ,
to describe a real-valued function.
They provide a mathematically elegant framework
for Bayesian inference and offer principled un-
certainty estimates for regression problems with
a closed-form posterior (Leibfried et al., 2020).
Given (xi, yi), yi = f (xi) + εi, where f (·) is a
real-valued function with input xi that is sampled
from a GP, and where εi are scalar indepen-
dent and identically distributed random variables
corresponding to observation noise.
The prior on data generation can be encapsulated
in the distribution of $f(\cdot)$. We assume that
$f(\cdot)$ is distributed according to a GP, that is,
$$f(x) \sim \mathcal{GP}(m(x), k(x, x')) \tag{3}$$
where $m(x)$ is a mean function, and $k(x, x')$ is
a covariance or kernel function, corresponding
to $\mu$ and $\Sigma$ of a multivariate normal distribution.
Following common practice, we fix the mean
function to zero, and use an RBF as the kernel
function (Preoţiuc-Pietro and Cohn, 2013; Beck
et al., 2014; Bitvai and Cohn, 2015; Shen et al.,
2019).
Computing the exact posterior requires the storage
and inversion of an (N × N) matrix, which
has memory cost quadratic in the amount of training data N
and cubic computational complexity, both of
which are infeasible for large datasets. Thus we
use sparse GPs, which approximate an exact GP
by using a small set of latent inducing points
(Titsias, 2009), learned by variational inference.
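For reference, the exact computation that a sparse GP approximates can be sketched as below. This is exact GP regression with a zero mean function and an RBF kernel, not the sparse variational model used in the experiments; the lengthscale, kernel variance, noise variance, and toy one-dimensional data are assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """RBF kernel k(x, x') = variance * exp(-||x - x'||^2 / (2 * lengthscale^2))."""
    sq_dists = (np.sum(A ** 2, axis=1)[:, None]
                + np.sum(B ** 2, axis=1)[None, :]
                - 2 * A @ B.T)
    return variance * np.exp(-0.5 * np.maximum(sq_dists, 0.0) / lengthscale ** 2)


def gp_predict(X_train, y_train, X_test, noise_var=0.1):
    """Exact zero-mean GP posterior mean and latent variance at X_test."""
    K = rbf_kernel(X_train, X_train) + noise_var * np.eye(len(X_train))  # N x N
    K_s = rbf_kernel(X_train, X_test)
    L = np.linalg.cholesky(K)                      # the cubic-cost step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = np.diag(rbf_kernel(X_test, X_test)) - np.sum(v ** 2, axis=0)
    return mean, var


# Hypothetical toy data: noiseless samples of sin(x).
X_train = np.linspace(-3.0, 3.0, 20)[:, None]
y_train = np.sin(X_train).ravel()
X_test = np.array([[0.5], [10.0]])
gp_mean, gp_var = gp_predict(X_train, y_train, X_test)
print(gp_mean, gp_var)
```

The test point far from the training inputs reverts to the prior variance, while the point inside the training range has a much smaller variance. The Cholesky factorization of the N × N matrix is the step whose cost motivates inducing-point approximations.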
End-to-end Training Rather than pre-training
a LM and task-specific model separately, Xue
et al. (2021) jointly trained them by only applying
Bayesian estimation to a subset of the model
parameters. This requires training entirely from
scratch, while we seek to leverage pre-trained
LMs. We apply Bayesian inference to task-
specific layers, keeping parameters of the LM
deterministic and making task-specialised parameters
stochastic during fine-tuning. Importantly,
being deterministic is not equivalent to being fro-
zen: Parameters are updated as in non-Bayesian
optimization, rather than kept fixed during back-
propagation.
To increase randomness, we evaluate on two
task-specific networks with more stochastic pa-
rameters than a single-layer linear regression net-
work used in Pipeline Training, as detailed below.
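Bayesian Two-layer MLP: The linear regression
layers take the hidden state $\mathbf{h} \in \mathbb{R}^d$ through
a two-layer MLP with tanh activation function:
$$\mathbf{h}' = \tanh(W\mathbf{h} + \mathbf{b}); \quad \hat{y} = \mathbf{w}^{\top} \mathbf{h}' + b \tag{4}$$
where $\hat{y}$ is the approximated score, and
$W \in \mathbb{R}^{d \times d}$, $\mathbf{b}, \mathbf{w} \in \mathbb{R}^d$ and $b \in \mathbb{R}$ are trainable
parameters.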
Bayesian Hierarchical Convolution: Drawing
on the finding that a hierarchical convolution neu-
ral network (HConv) is effective in low-resource
settings (Wang et al., 2020b), and that increas-
ing the capacity of task-specific layers can boost
performance (Chung et al., 2020), we train a
large-capacity network as follows. HConv is structured
as a two-layer convolutional network, with
kernel size k = 2, 3, 4 in the first layer and k = 2
in the second (Wang et al., 2020b). The prior dis-
tributions of the weights and bias are based on
Eq. (2) for Bayesian inference, and the inference
method follows Bayes by Backprop (Blundell
et al., 2015).
4.2 Predictive Uncertainty-based Sampling
Given a pre-trained uncertainty model $f(\theta, \cdot)$,
and a (large-scale) unlabeled data pool $D_u =
\{x_1, x_2, \cdots, x_i, \cdots, x_U\}$, the distribution of the
predicted $y_i$ for input $x_i$ is:
$$P(y_i) = f_\theta(x_i) \sim \mathcal{N}(\mu_i, \sigma_i) \tag{5}$$
where μi and σi are the mean and standard
deviation of the normal distribution of yi.
Our aim is to sample a subset $D'_u$ from $D_u$
in which the uncertainty model is expected to
be sufficiently confident in predicting $D'_u$, that is, to
have a confidence interval as narrow as possible
under a given confidence level. For example,
under 99% confidence, the confidence interval
$[\mu_i - 2.58\sigma_i, \mu_i + 2.58\sigma_i]$ is expected to be narrow.
Put differently, the distribution is concentrated
around the mean with small standard deviation.
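The relation between the confidence level and the interval width can be checked with a few lines of Python. The function name and the example values of the mean and standard deviation are hypothetical; the only assumption beyond the text is that the predictive distribution is exactly normal.

```python
from statistics import NormalDist


def confidence_interval(mu, sigma, level=0.99):
    """Two-sided interval [mu - z * sigma, mu + z * sigma] of a normal prediction."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 2.58 for level = 0.99
    return mu - z * sigma, mu + z * sigma


# Hypothetical prediction with mean 3.2 and standard deviation 0.1.
low, high = confidence_interval(3.2, 0.1)
width = high - low
print(low, high, width)
```

Because the width is proportional to the standard deviation at any fixed confidence level, thresholding the standard deviation is equivalent to thresholding the interval width.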
Based on this, we propose a simple instance
selection method based on predictive uncertainty.
For each instance $x_i$ in $D_u$, if $\sigma_i < \tau$, select $x_i$:
$D'_u \leftarrow x_i$. The threshold $\tau$ is a global hyperparameter
tuned over the validation set, or in the case of
self-training and active learning, using a heuristic
strategy.1
1 We also experimented with a strategy for tuning τ
based on the principle of discarding the majority so that
remaining examples are as clean as possible. Specifically,
we set τ to the marginal value corresponding to the left
boundary of the peak of the std probability distribution, but
found little difference in results, so omit it from the paper.
The strategy is based on the observation that
the model can generally predict precisely for
instances of extreme polarity, such as labels in
the ranges [0, 1] and [4, 5] for STS. We posit that
cases whose predictive uncertainty is at the same
level as these well-predicted examples are also
predicted accurately. Formally, after inference,
the unlabeled data pool is Du = {(xi, μi, σi)},
i ∈ [1, U ], where U is the number of unlabeled
instances. The standard deviation of all well-
predicted examples can be vectorized as σ = [σi],
where σi is the std whose μi is at an extremum,
such as 0 ≤ μi ≤ 1 or 4 ≤ μi ≤ 5 for STS. We
then set τ = mean(σ).
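The heuristic threshold and the selection rule can be sketched together as follows. The function names, the instance identifiers, and the predicted means and standard deviations are hypothetical; the extreme ranges follow the STS example in the text.

```python
from statistics import fmean


def heuristic_threshold(predictions, extreme_ranges=((0.0, 1.0), (4.0, 5.0))):
    """tau = mean std of the predictions whose mean lies in an extreme range."""
    stds = [sigma for mu, sigma in predictions
            if any(lo <= mu <= hi for lo, hi in extreme_ranges)]
    return fmean(stds)


def select_confident(pool, predictions, tau):
    """Keep the instances whose predictive std is strictly below tau."""
    return [x for x, (mu, sigma) in zip(pool, predictions) if sigma < tau]


# Hypothetical unlabeled pool with (mean, std) predictions per instance.
pool = ["a", "b", "c", "d", "e"]
predictions = [(0.4, 0.10), (2.5, 0.45), (4.6, 0.20), (3.1, 0.12), (1.8, 0.30)]
tau = heuristic_threshold(predictions)
selected = select_confident(pool, predictions, tau)
print(tau, selected)
```

In this toy pool, instance "d" is selected even though its predicted score is not extreme, because its standard deviation falls below the level observed on the extreme predictions.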
5 Uncertainty Evaluation Metrics
Evaluating uncertainty estimates of predictions
is challenging in a regression setting, as the
‘‘ground truth’’ uncertainty is usually not avail-
able (Lakshminarayanan et al., 2017). To evalu-
ate model predictions, we consider four metrics.
Pearson Correlation: It is vital to assess the
predictive accuracy of the system, regardless of
the uncertainty estimate. We use Pearson corre-
lation r to evaluate the correlation between the
system’s average predictions and ground truth
quality scores.
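For completeness, Pearson correlation between mean predictions and gold scores can be computed as in the sketch below. The predicted and gold values are hypothetical and serve only to show the call.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Hypothetical mean predictions and gold scores.
predicted_means = [1.2, 2.9, 4.1, 0.3]
gold = [1.0, 3.0, 4.5, 0.0]
r = pearson_r(predicted_means, gold)
print(r)
```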
Calibration Error (CAL): One way to under-
stand if models can be trusted is by analysing
whether they are calibrated. Gneiting et al. (2007)
defined calibration in a regression setting as the
asymptotic consistency between the probabilistic
forecasts Fi and the true data-generating distri-
butions Gi, with the index i referring to each
example.
Practically, $F_i$ is the cumulative probability
distribution $P(Y \le y_i)$, and $G_i$ is generally estimated
by empirical distribution functions based
on the observations only. So calibration measures
whether the predictive confidence estimates are aligned
with the empirical correctness likelihoods. Given
a confidence level $p_j$, the empirical accuracy is
calculated as:
$$\hat{p}_j = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{ y_i \le F_i^{-1}(p_j) \}$$
where $F_i^{-1}$ denotes the quantile function
$F_i^{-1}(p) = \inf\{y : p \le F_i(y)\}$, that is, a mapping
from $[0, 1] \to Y$. The expected calibration
error $\mathrm{cal} = \sum_{j=1}^{m} w_j \cdot (p_j - \hat{p}_j)^2$, with $m$ confidence
levels $0 \le p_1 < \cdots < p_m \le 1$, is the
distance of predictive confidence away from the
empirical accuracy.
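The calibration error can be sketched for Gaussian predictive distributions as below. The grid of nine confidence levels and the uniform weights $w_j = 1$ are assumptions, since the text does not fix them, and the synthetic targets are drawn from a standard normal purely for illustration.

```python
import random
from statistics import NormalDist


def calibration_error(targets, means, stds, levels=None, weights=None):
    """cal = sum_j w_j * (p_j - p_hat_j)^2 for Gaussian predictive distributions."""
    if levels is None:
        levels = [j / 10.0 for j in range(1, 10)]   # assumed grid p_1 ... p_m
    if weights is None:
        weights = [1.0] * len(levels)               # assumed uniform w_j
    n = len(targets)
    cal = 0.0
    for p, w in zip(levels, weights):
        # Empirical accuracy: share of targets at or below the p-quantile.
        hits = sum(y <= NormalDist(m, s).inv_cdf(p)
                   for y, m, s in zip(targets, means, stds))
        cal += w * (p - hits / n) ** 2
    return cal


# Synthetic targets from N(0, 1), scored by two hypothetical predictors.
rng = random.Random(0)
targets = [rng.gauss(0.0, 1.0) for _ in range(2000)]
zeros = [0.0] * len(targets)
calibrated = calibration_error(targets, zeros, [1.0] * len(targets))
overconfident = calibration_error(targets, zeros, [0.1] * len(targets))
print(calibrated, overconfident)
```

A predictor whose standard deviation matches the data obtains an error close to zero, whereas the same means with a standard deviation ten times too small are penalized heavily.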
Negative Log-Probability Density (NLPD)
complements CAL's equal treatment of over- and
under-confidence. It penalises over-confidence
more strongly through logarithmic scaling:
$L_{\mathrm{NLPD}} = -\frac{1}{n} \sum_{i=1}^{n} \log p(y_i = t_i | x_i)$, favouring
under-confident models. In Gaussian predictive
distributions with mean $m_i$ and variance $v_i$, the
NLPD loss incurred for predicting at input $x_i$
with true associated target $t_i$ is given (up to an additive constant) by:
$$L_{\mathrm{NLPD}} = \frac{1}{2n} \sum_{i=1}^{n} \left[ \log v_i + \frac{(t_i - m_i)^2}{v_i} \right]$$
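The Gaussian form of the NLPD can be computed directly, as in the following sketch. The targets, means, and variances are hypothetical; the example contrasts an over-confident predictor with a more cautious one that makes the same errors.

```python
import math


def gaussian_nlpd(targets, means, variances):
    """(1 / 2n) * sum_i [log v_i + (t_i - m_i)^2 / v_i], constant term dropped."""
    n = len(targets)
    total = sum(math.log(v) + (t - m) ** 2 / v
                for t, m, v in zip(targets, means, variances))
    return total / (2 * n)


# Hypothetical targets and predictions: one miss by 1.0, one exact hit.
targets = [3.0, 1.0]
means = [2.0, 1.0]
overconfident_nlpd = gaussian_nlpd(targets, means, [0.01, 0.01])
cautious_nlpd = gaussian_nlpd(targets, means, [1.0, 1.0])
print(overconfident_nlpd, cautious_nlpd)
```

The squared error is divided by the predicted variance, so a confident miss dominates the score even though the over-confident predictor is rewarded on the instance it gets exactly right.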
Sharpness (SHP): The metrics above do not
account for the concentration of the predictive
distributions, which generally favours predictors
that produce wide and uninformative confidence
intervals. To guarantee useful uncertainty esti-
mation, confidence intervals should not only be
calibrated, but also sharp and ‘‘tight’’ around the
predicted value. The numerical width of prediction
intervals (Gneiting et al., 2007; Song et al., 2019)
and the mean of variance (Kuleshov et al., 2018;
Zelikman et al., 2020) are often used to quantify
sharpness. We apply the latter in our work, with
a lower score implying higher sharpness.
To interpret mixed results, for example when a
model attains the best sharpness but with infinitely
large NLPD, we suggest that Pearson correlation
(r) has primacy, followed by CAL and NLPD, then
SHP. That is, when models have comparable r,
the comparison of CAL/NLPD is more meaning-
ful, and if those are also similar, SHP should be
considered; otherwise, it’s largely meaningless.
6 Evaluation of Uncertainty Estimation
We expect that the incorporation of uncertainty
estimation should not harm predictive perfor-
mance compared to point estimation without un-
certainty, in both in- and out-of-domain scenarios.
Additionally, uncertainty estimates should reflect
‘‘what the model does not know’’, making it pos-
sible to determine whether a prediction can be
trusted based on the output distribution. This is
quantified intrinsically with CAL and NLPD (the
lower, the better), and extrinsically via instance
selection in Section 7.
6.1 Experimental Setup
Pipeline Training: We use SBERT as an off-
the-shelf sentence encoder. We fine-tune SBERT
separately over each STS corpus based on the pre-
trained bert-base-nli-mean-tokens, using
the same configuration as the original paper (4
epochs with training batch size of 16). For the
cross-lingual DA task, we use distiluse-base-
multilingual-cased-v1.
To represent a sentence pair (S1, S2) using
SBERT, we use the concatenation of the embed-
dings u ⊕ v, along with their absolute difference
|u − v| and element-wise multiplication v × t.
‘‘SBERT Bayesian LR’’ and ‘‘SBERT Sparse
GP Regression’’ indicates that features are fed
into Bayesian LR and sparse GP regression,
respectively, implemented in pyro.2
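The sentence-pair feature construction can be sketched as below, assuming that the element-wise product is taken between the two sentence embeddings u and v. The function name and the two-dimensional example vectors are hypothetical; real SBERT embeddings have several hundred dimensions.

```python
def pair_features(u, v):
    """Concatenate u, v, |u - v| and the element-wise product u * v."""
    diff = [abs(a - b) for a, b in zip(u, v)]
    prod = [a * b for a, b in zip(u, v)]
    return list(u) + list(v) + diff + prod


# Hypothetical two-dimensional embeddings of S1 and S2.
features = pair_features([1.0, -2.0], [0.5, 3.0])
print(features)
```

For embeddings of dimension d, the resulting feature vector has dimension 4d and is what the downstream regressor receives.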
End-to-End Training: We apply pre-trained
BERT as the LM encoder (Devlin et al., 2019), using
bert-base-uncased for monolingual tasks
and bert-base-multilingual-cased for cross-lingual
tasks. The input format is [CLS] S1
[SEP] S2 [SEP] for text pair (S1, S2), and [CLS]
S [SEP] for a single text S. BERT Bayesian LR
and BERT Bayesian ConvLR denote task-specific
networks based on a two-layer MLP and HConv,
respectively, implemented based on the Hugging Face
Transformers framework and blitz for BBB
estimation (Esposito, 2020).
MC-Dropout: We apply MC-dropout to base
models BERT LR and BERT ConvLR, with
dropout rate = 0.1 and 30 iterations of sampling.3
Point Estimation: In addition to the uncertainty
estimation approaches, we also compare with four
non-Bayesian methods: (1) cosine similarity; (2)
optimization of deterministic LR with SBERT
(SBERT LR); (3) fine-tuned BERT LR; and (4)
fine-tuned BERT ConvLR.
Training Configuration: The maximum se-
quence length is set to 128 for STS and DA, and
256 for SA. The learning rate (lr), training batch
size, and training epochs are optimized over the
validation set. In the situation that a validation set
is not available (i.e., EBMSASS and MedSTS), we
provisionally split the training data into 80%:20%
training:dev data, and tune hyperparameters over
the dev data. We then retrain the model over
the full training dataset, and evaluate on the test
set. Tuned hyperparameter settings of the pipeline
are shown in Table 3. End-to-end is based on
grid-searching over [8, 16, 32] × [1e-5, 2e-5] ×
[1, 2, 3, · · · , 10] for batch size, lr, and epochs,
respectively. Generally, the best setting is batch
size = 16, lr = 2e-5, and epochs = 3, although
BERT ConvLR based on BBB requires more
epochs to converge. Further details of the training
regimen and hyperparameter settings are provided
in our GitHub repository.4
2 https://pyro.ai/.
3 No significant difference was observed when sampling
20, 30, 40, or 50 times, so we report only on 30.
4 https://github.com/yuxiaw/Uncertainty-regression.
6.2 Sentence-Pair STS
In this section, we compare the various uncertainty
estimation approaches from Section 4.1
over STS, in terms of correlation and the metrics
for uncertainty estimation, aiming to empirically
establish:
1. Which uncertainty estimation strategy is
most accurate, most calibrated, and sharpest?
2. Which method performs best in out-of-domain
settings?
6.2.1 In-Domain Performance
To observe the influence of data size and domain
distribution on uncertainty estimation, we experiment
over the large-scale general-domain STS-B,
in addition to the smaller-scale domain-specific
MedSTS (clinical domain) and EBMSASS (biomedical
domain) datasets. There are three main
findings from the results in Table 2.
Uncertainty models do not degrade accuracy.
With SBERT, GP-based models have higher correlation
than either cosine similarity or LR. In
the case of BERT, estimation by MC-dropout is
competitive with corresponding point estimates.
Thus, they have comparable raw performance, in
addition to providing uncertainty estimates.
End-to-end training based on BERT results
in higher correlation and narrower confidence
intervals, but poorer calibration and NLPD.
Results over the three datasets show that end-to-end
training based on BERT overall performs
much better than pipeline training using SBERT,
but BERT-based models are poorly calibrated
compared to SBERT-based Bayesian linear regression
and sparse GP regression using fixed
sentence features (as can be seen in the higher
NLPD numbers for BERT-based models). This is
consistent with prior work (Guo et al., 2017).
MC-dropout is superior to BBB inference,
and sparse GP regression performs better than
SBERT Bayesian LR, regardless of data size
| Model | STS-B r ↑ | STS-B CAL ↓ | STS-B NLPD ↓ | STS-B SHP ↓ | EBMSASS r ↑ | EBMSASS CAL ↓ | EBMSASS NLPD ↓ | EBMSASS SHP ↓ | MedSTS r ↑ | MedSTS CAL ↓ | MedSTS NLPD ↓ | MedSTS SHP ↓ | Yelp r ↑ | Yelp CAL ↓ | Yelp NLPD ↓ | Yelp SHP ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SBERT Cosine similarity | 0.842 | N/A | N/A | N/A | 0.773 | N/A | N/A | N/A | 0.784 | N/A | N/A | N/A | - | N/A | N/A | N/A |
| SBERT LR | 0.835 | N/A | N/A | N/A | 0.743 | N/A | N/A | N/A | 0.776 | N/A | N/A | N/A | 0.666 | N/A | N/A | N/A |
| SBERT Bayesian LR | 0.810 | 0.046 | 0.648 | 1.632 | 0.688 | 0.443 | 1.095 | 2.156 | 0.740 | 0.101 | 0.801 | 2.092 | 0.671 | 0.019 | 0.447 | 0.753 |
| SBERT Sparse GP Regression | 0.847 | 0.065 | 0.614 | 1.621 | 0.788 | 0.195 | 0.541 | 1.627 | 0.781 | 0.073 | 0.499 | 1.453 | 0.689 | 0.049 | 0.573 | 1.507 |
| BERT LR | 0.868 | N/A | N/A | N/A | 0.914 | N/A | N/A | N/A | 0.858 | N/A | N/A | N/A | 0.826 | N/A | N/A | N/A |
| BERT ConvLR | 0.855 | N/A | N/A | N/A | 0.922 | N/A | N/A | N/A | 0.846 | N/A | N/A | N/A | 0.822 | N/A | N/A | N/A |
| BERT Bayesian LR (BBB) | 0.848 | 0.521 | +∞ | 0.005 | 0.914 | 0.669 | 1177.2 | 0.005 | 0.848 | 0.514 | 6594.3 | 0.006 | 0.827 | 0.531 | 3908.6 | 0.083 |
| BERT Bayesian ConvLR (BBB) | 0.849 | 0.495 | 2061.0 | 0.015 | 0.898 | 0.618 | 327.3 | 0.010 | 0.835 | 0.506 | 1037.2 | 0.017 | 0.797 | 1.513 | 119.2 | 0.089 |
| BERT LR MC dropout | 0.868 | 0.181 | 4.659 | 0.215 | 0.921 | 0.054 | 0.036 | 0.140 | 0.859 | 0.163 | 4.118 | 0.168 | 0.827 | 0.267 | 7.285 | 0.153 |
| BERT ConvLR MC dropout | 0.855 | 0.202 | 5.830 | 0.209 | 0.922 | 0.093 | 2.137 | 0.085 | 0.852 | 0.219 | 6.402 | 0.146 | 0.823 | 0.291 | 8.214 | 0.150 |

Table 2: Correlation r and uncertainty prediction quality metrics (CAL, NLPD, and SHP) on three STS
datasets (STS-B, EBMSASS, and MedSTS) and a SA rating dataset (Yelp), with SBERT and BERT
sentence embeddings with various task-specific layers: Cosine similarity = calculate cosine similarity
between vectors representing S1 and S2; LR = single-layer linear regression; Bayesian LR = Bayesian
linear regression; and Sparse GP Regression = Sparse Gaussian process regression. N/A indicates that
the method doesn’t produce an uncertainty estimate to apply the given metric to.
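| Dataset | LR lr | LR epoch | Bayes LR lr | Bayes LR epoch | GP Reg lr | GP Reg epoch |
|---|---|---|---|---|---|---|
| STS-B | 0.1 | 100 | 0.01 | 2500 | 0.1 | 25 |
| EBMSASS | 0.1 | 15 | 0.01 | 10000 | 0.1 | 25 |
| MedSTS | 0.1 | 100 | 0.01 | 8500 | 0.1 | 25 |
| Yelp | 0.1 | 600 | 0.01 | 2500 | 0.1 | 25 |
| en-zh | 0.1 | 50 | 0.03 | 300 | 0.1 | 200 |
| ru-en | 0.1 | 40 | 0.03 | 400 | 0.1 | 200 |
| si-en | 0.1 | 199 | 0.03 | 300 | 0.1 | 1000 |

Table 3: Learning rate (lr) and training epochs
(epoch) for pipeline training based on SBERT.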
and domain. Under both BERT LR and ConvLR,
MC-dropout achieves higher or equal correla-
tion, and much lower CAL and NLPD than BBB
in end-to-end training. Among methods based
on SBERT, sparse GP regression requires many
fewer iterations to converge, and outperforms
Bayesian LR in correlation and NLPD, and is
comparable for CAL and SHP.
6.2.2 Out-of-Domain Performance
Apart from in-domain evaluation, out-of-domain
performance is also an important concern. We
expect that a model trained on domain A will
generate more uncertain predictions on domain B,
with lower correlation, larger CAL and NLPD, and
a wider confidence interval (Lakshminarayanan
et al., 2017). Given two models trained on do-
main A with similar point-estimate performance
on domain B (that is, competitive r), the model with
the lower NLPD is arguably the better model, as
this indicates that the model gives sharper distri-
butions when the prediction is correct, and flatter
ones when wrong.
Using models fine-tuned over the general-
domain STS-B, we evaluate on the biomedical
EBMSASS and clinical MedSTS test sets. In
contrast with the results in Table 2, in which mod-
els have been fine-tuned with in-domain labeled
data, Table 4 shows a steep decline in r of more
than 10 points on average for EBMSASS, and
7 for MedSTS. Meanwhile, both CAL and NLPD
increase by a large margin.
MC-dropout is not always best. Interestingly,
we find that BERT Bayesian LR performs well
in this setting, obtaining the highest correlation
and smallest SHP on EBMSASS and PeerRead.
This suggests that BERT Bayesian LR has bet-
ter generalizability over these two domains, but
the substantially higher NLPD also reveals that
its predictions are over-confident. By and large,
MC-dropout stably offers accurate and calibrated
predictions in out-of-domain settings. ConvLR in
particular outperforms Bayesian inference across
all metrics.
BERT ConvLR tends to be inferior to BERT
LR in the out-of-domain setting. We speculate
this is because of its smaller capacity to memorize
task-specific knowledge, as eight layers of the
BERT encoder are frozen in BERT ConvLR.
6.3 Single-sentence Sentiment Rating
We perform in-domain SA evaluation on Yelp,
and out-of-domain evaluation by applying the
fine-tuned Yelp model to PeerRead test data.
We find:
Fine-tuned sentence embeddings are vital to
the performance of pipeline uncertainty esti-
mation. As shown in Table 2, performance over
Yelp, EBMSASS, and MedSTS based on SBERT
| Model | EBMSASS r ↑ | EBMSASS CAL ↓ | EBMSASS NLPD ↓ | EBMSASS SHP ↓ | MedSTS r ↑ | MedSTS CAL ↓ | MedSTS NLPD ↓ | MedSTS SHP ↓ | PeerRead r ↑ | PeerRead CAL ↓ | PeerRead NLPD ↓ | PeerRead SHP ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SBERT Cosine similarity | 0.716 | N/A | N/A | N/A | 0.731 | N/A | N/A | N/A | - | N/A | N/A | N/A |
| SBERT LR | 0.696 | N/A | N/A | N/A | 0.718 | N/A | N/A | N/A | 0.256 | N/A | N/A | N/A |
| SBERT Bayesian LR | 0.684 | 0.091 | 0.400 | 1.325 | 0.672 | 0.038 | 0.568 | 1.506 | 0.241 | 0.116 | 1.018 | 1.245 |
| SBERT Sparse GP Regression | 0.726 | 0.211 | 0.586 | 1.609 | 0.723 | 0.129 | 0.634 | 1.604 | 0.427 | 0.021 | 0.771 | 1.339 |
| BERT LR | 0.838 | N/A | N/A | N/A | 0.786 | N/A | N/A | N/A | 0.669 | N/A | N/A | N/A |
| BERT ConvLR | 0.806 | N/A | N/A | N/A | 0.776 | N/A | N/A | N/A | 0.627 | N/A | N/A | N/A |
| BERT Bayesian LR | 0.867 | 0.625 | 5165 | 0.005 | 0.768 | 0.619 | 11081 | 0.005 | 0.694 | 0.522 | 7606. | 0.009 |
| BERT Bayesian ConvLR | 0.811 | 0.714 | 1043. | 0.011 | 0.770 | 0.523 | 1527. | 0.017 | 0.608 | 0.990 | 189.0 | 0.086 |
| BERT LR MC dropout | 0.838 | 0.280 | 3.517 | 0.137 | 0.795 | 0.199 | 5.060 | 0.188 | 0.676 | 0.400 | 21.75 | 0.160 |
| BERT ConvLR MC dropout | 0.814 | 0.194 | 4.649 | 0.153 | 0.788 | 0.240 | 8.447 | 0.158 | 0.635 | 0.456 | 36.37 | 0.138 |

Table 4: Results on EBMSASS, MedSTS and PeerRead test sets using models trained on general-
purpose STS-B and Yelp for STS and SA, respectively.
| Model | en-zh r ↑ | en-zh CAL ↓ | en-zh NLPD ↓ | en-zh SHP ↓ | ru-en r ↑ | ru-en CAL ↓ | ru-en NLPD ↓ | ru-en SHP ↓ | si-en r ↑ | si-en CAL ↓ | si-en NLPD ↓ | si-en SHP ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SBERT Cosine similarity | 0.115 | N/A | N/A | N/A | 0.428 | N/A | N/A | N/A | 0.097 | N/A | N/A | N/A |
| SBERT LR | 0.270 | N/A | N/A | N/A | 0.616 | N/A | N/A | N/A | 0.397 | N/A | N/A | N/A |
| SBERT Bayesian LR | 0.280 | 0.025 | 0.155 | 0.908 | 0.625 | 0.013 | 0.223 | 0.771 | 0.371 | 0.013 | 0.193 | 0.934 |
| SBERT Sparse GP Regression | 0.384 | 0.026 | 0.143 | 0.892 | 0.626 | 0.007 | 0.207 | 0.776 | 0.366 | 0.010 | 0.191 | 0.931 |
| BERT LR | 0.395 | N/A | N/A | N/A | 0.621 | N/A | N/A | N/A | 0.504 | N/A | N/A | N/A |
| BERT ConvLR | 0.436 | N/A | N/A | N/A | 0.641 | N/A | N/A | N/A | 0.524 | N/A | N/A | N/A |
| BERT Bayesian LR | 0.385 | 0.726 | 11600 | 0.005 | 0.644 | 0.515 | 11666 | 0.005 | 0.506 | 0.568 | 10971 | 0.005 |
| BERT Bayesian ConvLR | 0.378 | 1.780 | 683.7 | 0.066 | 0.609 | 1.775 | 723.4 | 0.069 | 0.503 | 1.758 | 638.5 | 0.059 |
| BERT LR MC dropout | 0.407 | 0.250 | 9.216 | 0.190 | 0.637 | 0.315 | 17.00 | 0.126 | 0.527 | 0.178 | 6.578 | 0.200 |
| BERT ConvLR MC dropout | 0.441 | 0.268 | 13.33 | 0.127 | 0.649 | 0.333 | 22.28 | 0.106 | 0.530 | 0.275 | 10.19 | 0.133 |

Table 5: Results for DA-style quality estimation over the three WMT language pairs.
is substantially worse than with BERT. We spec-
ulate this is due to poor feature representations.
That is, on the STS task, we continue to fine-tune
sentence embeddings over each STS dataset. As
a result of being unable to fine-tune SBERT on
SA (as there is no paired data), the representations
for Yelp are pre-trained using SNLI only, which
is neither task- nor domain-specific. Compared
with the similarly sized STS-B where embeddings
are fine-tuned, the performance gap for Yelp be-
tween SBERT and BERT is more than 0.15, but
less than 0.02 for STS-B. Equally, though we
fine-tune SBERT for EBMSASS and MedSTS,
each has fewer than 1k training instances. Poor
domain-specific sentence embeddings result in
gaps of 0.15 and 0.07.
Meanwhile, for SBERT in the upper half of
Table 4, the out-of-domain correlation on Peer-
Read is extremely poor; the gap of 6 points on
EBMSASS and MedSTS relative to in-domain
results (0.78 in Table 2) further confirms our
hypothesis.
LR outperforms ConvLR in out-of-domain
SA. In both point and Bayesian estimates, LR
performs better than ConvLR (Table 4), similar to STS.
6.4 Cross-lingual Sentence-pair DA
We evaluate on machine translation quality esti-
mation (QE) over three language pairs using DA,
using 7,000 training instances in each case. The
results are shown in Table 5. We first observe
that using embeddings directly from pretrained
SBERT with cosine similarity underperforms
other methods that involve fine-tuning.
Traditional Bayesian LR and GP models
achieve results competitive with deep uncer-
tainty models when the input sentence embedding
is expressive enough, and with smaller CAL
and NLPD. Related uncertainty prediction work
(Glushkova et al., 2021) argued that GPs are not
competitive or easy to integrate with current neural
architectures. In contrast, our results demonstrate
that GPs can achieve comparable results to deep
neural networks, while also being better calibrated.
| Setting | Model | N2C2-STS r1 / r2 ↑ | N2C2-STS CAL ↓ | N2C2-STS NLPD ↓ | MedSTS r1 / r2 ↑ | MedSTS CAL ↓ | MedSTS NLPD ↓ | PeerRead r1 / r2 ↑ | PeerRead CAL ↓ | PeerRead NLPD ↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| Semi-supervised | BERT LR | 0.853 / 0.857 | 0.384 | 6.571 | 0.858 / 0.859 | 0.158 | 3.903 | 0.686 / 0.686 | 0.370 | 15.95 |
| Semi-supervised | + Du | 0.861 / 0.862 | 0.511 | 9.232 | 0.860 / 0.861 | 0.224 | 5.267 | 0.655 / 0.656 | 0.394 | 19.26 |
| Semi-supervised | + D′u | 0.860 / 0.864 | 0.493 | 8.476 | 0.863 / 0.866 | 0.181 | 4.758 | 0.720 / 0.720 | 0.340 | 19.89 |
| Semi-supervised | BERT ConvLR | 0.874 / 0.875 | 0.509 | 11.51 | 0.846 / 0.853 | 0.201 | 5.968 | 0.691 / 0.692 | 0.346 | 16.98 |
| Semi-supervised | + Du | 0.875 / 0.876 | 0.522 | 13.50 | 0.846 / 0.855 | 0.215 | 6.403 | 0.671 / 0.683 | 0.453 | 25.50 |
| Semi-supervised | + D′u | 0.875 / 0.879 | 0.535 | 11.44 | 0.857 / 0.864 | 0.222 | 6.129 | 0.699 / 0.697 | 0.374 | 21.78 |
| Zero-shot | BERT LR | 0.682 / 0.663 | 0.568 | 17.08 | 0.786 / 0.795 | 0.199 | 5.060 | 0.669 / 0.676 | 0.400 | 21.75 |
| Zero-shot | + Du | 0.687 / 0.673 | 0.624 | 40.10 | 0.796 / 0.797 | 0.266 | 11.94 | 0.023 / 0.006 | 1.728 | 387.3 |
| Zero-shot | + D′u | 0.743 / 0.729 | 0.630 | 23.67 | 0.793 / 0.792 | 0.296 | 8.907 | 0.678 / 0.675 | 0.495 | 52.72 |
| Zero-shot | BERT ConvLR | 0.728 / 0.722 | 0.612 | 21.06 | 0.776 / 0.788 | 0.240 | 8.447 | 0.627 / 0.635 | 0.456 | 36.37 |
| Zero-shot | + Du | 0.746 / 0.737 | 0.653 | 47.68 | 0.790 / 0.794 | 0.332 | 16.45 | 0.138 / 0.119 | 1.748 | 546.1 |
| Zero-shot | + D′u | 0.763 / 0.748 | 0.628 | 40.32 | 0.809 / 0.810 | 0.303 | 15.26 | 0.656 / 0.659 | 0.483 | 57.77 |

Table 6: Results on three low-resource regression datasets: clinical STS (MedSTS and N2C2-STS) and
PeerRead. r1 are the results without MC-dropout, while r2, CAL, and NLPD are based on applying 30
iterations of MC-dropout. There are two setups: (1) semi-supervised (upper half) = domain gold-labeled
data is available; and (2) zero-shot (bottom half). In each case, Du = unlabeled data pool selected
based on the model probability; D′u = unlabeled data pool selected based on hyperparameter τ over the
predicted std; and rows 1, 4, 7, and 10 = baseline for each setting.
ConvLR consistently outperforms LR for
BERT-based models. In the cross-lingual scenario,
SBERT models have smaller CAL and NLPD, and
larger SHP, analogous to the monolingual setting.
7 Instance Selection Through Uncertainty
In self-training, a model is first trained using la-
beled data, then used to predict labels for unlabeled
data instances. Instances with higher-probability
predictions are then adopted as pseudo-labels, and
used to re-train the model in conjunction with
the labeled training data. Active learning is similar,
except that instances are selected for explicit
human labelling rather than pseudo-labeled, of-
ten based on estimates of model confidence or
uncertainty. In both tasks, accurate estimation of
labelling (un)certainty is critical.
In this section, we evaluate the uncertainty-
based instance selection method from Section 4.2
in the settings of self-training and active learn-
ing, over the tasks of STS, SA rating, and cross-
lingual DA.
7.1 Self-training STS and SA
In self-training, we experiment
in both semi-
supervised (limited gold-standard training data)
and zero-shot scenarios, over three low-resource
datasets: MedSTS, N2C2-STS, and PeerRead.
Experimental Setup: As we require high cor-
relation to ensure high-quality pseudo-labels, and
lower CAL and NLPD to guarantee that predic-
tions are neither over- nor under-confident, we
employ MC-dropout over LR and ConvLR. Addi-
tionally, to alleviate domain data sparsity, we first
fine-tune the regressor on two general datasets—
STS-B for STS and Yelp for SA (general-purpose
STS/SA)—also providing the proxy for the zero-
shot setting. We continue to fine-tune on domain
training data in the semi-supervised scenario, and
predict (μ, σ) for Du by applying dropout 30
times. All results in Table 6 are obtained using
train batch size = 16, learning rate = 2e-5, and
training epochs = 3.
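The (μ, σ) estimate obtained from repeated dropout passes can be sketched as follows. This is a minimal illustration rather than the authors' implementation: a toy linear regressor stands in for the fine-tuned BERT model, and the function names `dropout_forward` and `mc_dropout_predict`, the dropout rate of 0.1, and the fixed seed are all assumptions made for the example.

```python
import math
import random


def dropout_forward(x, weights, bias, p_drop, rng):
    """One stochastic pass of a linear regressor with dropout on its inputs."""
    keep = 1.0 - p_drop
    total = bias
    for xi, wi in zip(x, weights):
        if rng.random() < keep:          # this feature survives dropout
            total += wi * xi / keep      # inverted-dropout rescaling
    return total


def mc_dropout_predict(x, weights, bias, p_drop=0.1, n_passes=30, seed=0):
    """Return (mu, sigma) over n_passes forward passes with dropout left on."""
    rng = random.Random(seed)
    samples = [dropout_forward(x, weights, bias, p_drop, rng)
               for _ in range(n_passes)]
    mu = sum(samples) / n_passes
    var = sum((s - mu) ** 2 for s in samples) / n_passes
    return mu, math.sqrt(var)


# Hypothetical feature vector and weights, 30 passes as in the setup above.
mu, sigma = mc_dropout_predict([1.0, 2.0, 3.0], [0.5, -0.25, 1.0], bias=0.1)
print(mu, sigma)
```

With `p_drop=0` every pass is identical and σ collapses to zero, so σ measures only how sensitive a prediction is to which units are dropped.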
Unlabeled Data Pool: For clinical STS, we
extract sentences from MIMIC-III covering topics
of medication, diagnosis, follow-up instructions,
and test, then synthetically balance across each
unit score interval, resulting in 1,534 sentence
pairs, which we denote as Du. For PeerRead, we
use 1,014 reviews from ICLR 2017 without labels
as Du. To expand Du in the zero-shot setting, we
remove the gold-standard labels and integrate the
resulting unlabeled data into Du.
Results and Analysis: As seen in Table 6,
semi-supervision improves correlation, at the cost
of being more uncertain and miscalibrated, with
larger CAL and NLPD. Predictive confidence
threshold selection can further improve the ac-
curacy. It also effectively calibrates the model,
resulting in much lower CAL and NLPD, com-
pared with directly incorporating unlabeled data
(‘‘+Du’’).
In the zero-shot setting, CAL and NLPD increase
for all tasks under both LR and ConvLR with Du,
making predictions less reliable, especially for
PeerRead where the model totally collapses. This
matches our intuition that the distribution of the
pseudo-labeled data differs from the true distribu-
tion, and that learning from this data impedes the
model. This problem is alleviated by retaining only
the highly confident subset D′u, as its distribution
is closer to the gold-standard for well-calibrated
models. This is also consistent with the observa-
tion that CAL and NLPD in the zero-shot setting
are much larger than in the semi-supervised set-
ting, as the latter benefits from the guidance of the
gold-standard distribution.
Note that if we merely assess the model with
Pearson correlation as in most previous work, we
can only observe the improvement due to data
augmentation, neglecting the risk of the model
being more miscalibrated, and producing less re-
liable predictions. Further, CAL and NLPD are
useful metrics to evaluate the effectiveness of the
data sampling strategy used in self-training.
7.2 Cross-lingual DA
We evaluate self-training and active-learning on
DA-based machine translation quality estimation
using BERT LR.
Experimental Setup: We use three language
pairs: WMT 2020 DA en-zh, ru-en, and si-en,
in each case splitting the original 7k training
instances into a training set D of 3k instances and
4k unlabeled data pool Du, keeping the original
validation and test sets. The lr is set to 2e-5, and
training epochs and batch size are tuned by grid
search over the validation set based on the range
[1,2,3,4,5] × [16, 32]. Other settings follow STS
and SA above, but without a general-purpose base
model. As a baseline, we fine-tune on D, select the
best configuration on the validation set, and evaluate
it on test.
| Training data | en-zh (high) rdev ↑ | en-zh (high) rtest ↑ | ru-en (medium) rdev ↑ | ru-en (medium) rtest ↑ | si-en (low) rdev ↑ | si-en (low) rtest ↑ |
|---|---|---|---|---|---|---|
| Baseline | 0.407 | 0.374 | 0.592 | 0.599 | 0.427 | 0.478 |
| + pseudo Du | 0.434 | 0.400 | 0.604 | 0.619 | 0.449 | 0.488 |
| + D′u | 0.438 | 0.404 | 0.606 | 0.603 | 0.443 | 0.482 |
| + D′u ∪ D′a | 0.445 | 0.422 | 0.615 | 0.628 | 0.466 | 0.496 |
| + gold Du | 0.453 | 0.395 | 0.600 | 0.621 | 0.466 | 0.504 |

Table 7: Results for DA-based quality estimation
in WMT 2020 (dev/test) for three language pairs:
en-zh, ru-en and si-en. Baseline = training with
3,000 gold-labeled instances. Row ‘‘+ D′u ∪ D′a’’
is active learning.
Results and Analysis: As shown in Table 7,
directly incorporating pseudo Du substantially
outperforms baselines for all three language pairs.
This differs from the results for STS and SA in
the semi-supervised setting, but is consistent with
the results in the zero-shot setting. It indicates that
a high-performance model requires high-quality
data to further gain improvements; lower-quality
models are more tolerant to lower data quality.
We select the most confident 1,904, 1,985, and
2,462 instances with τ = 0.15, 0.13 and 0.19
for en-zh, ru-en and si-en, respectively. Equal or
higher performance is achieved when this subset of
instances is added to the training data, as compared
to the complete Du.
Simulating active learning, we also explore the
annotation of Du − D′u with human gold scores,
i.e. D′u ∪ D′a. The results show that with D′a,
our model achieves results competitive with using
all of Du with gold labels. This reveals that it
is not necessary to annotate the entire dataset,
but we can focus on the subset where the model
is not confident. In this way, data annotation is
more efficient, and models generalize better over
unseen data.
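The two uses of the confidence threshold described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: an instance is treated as confident when its predicted std is at most τ (the exact comparison is an assumption), and the helper names `split_pool_by_confidence` and `build_training_set`, as well as the `annotate` callback standing in for a human annotator, are hypothetical.

```python
def split_pool_by_confidence(pool, tau):
    """Split an unlabeled pool by predicted standard deviation.

    pool: list of (instance_id, predicted_mean, predicted_std) tuples.
    Returns (confident, uncertain): confident instances keep their predicted
    mean as a pseudo-label; uncertain ones are ids to send for annotation.
    """
    confident = [(i, mu) for i, mu, std in pool if std <= tau]
    uncertain = [i for i, _, std in pool if std > tau]
    return confident, uncertain


def build_training_set(gold, pool, tau, annotate=None):
    """gold: list of (instance_id, score); annotate: optional id -> gold score."""
    confident, uncertain = split_pool_by_confidence(pool, tau)
    train = list(gold) + confident            # self-training: D plus confident pseudo-labels
    if annotate is not None:                  # simulated active learning on the rest
        train += [(i, annotate(i)) for i in uncertain]
    return train


# Hypothetical pool of three predictions and one gold-labeled instance.
pool = [("a", 0.40, 0.10), ("b", 0.70, 0.30), ("c", 0.20, 0.15)]
print(build_training_set([("g", 0.90)], pool, tau=0.15))
```

Passing an `annotate` function adds gold scores only for the instances the model was unsure about, which mirrors the idea of annotating a subset rather than the whole pool.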
8 Analysis
In this section, we conduct further analysis to
better understand the results of the experiments.
Qualitative Comparison: In both in-domain
and out-of-domain evaluation, end-to-end train-
ing based on BERT, particularly BBB estimation,
obtains much larger NLPD than pipeline train-
ing based on SBERT, especially GP regression.
We speculate that end-to-end uncertainty mod-
els are confident for both correct and incorrect
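| Model | STS-B r ↑ | STS-B CAL ↓ | STS-B NLPD ↓ | STS-B SHP ↓ | EBMSASS r ↑ | EBMSASS CAL ↓ | EBMSASS NLPD ↓ | EBMSASS SHP ↓ | MedSTS r ↑ | MedSTS CAL ↓ | MedSTS NLPD ↓ | MedSTS SHP ↓ | Yelp r ↑ | Yelp CAL ↓ | Yelp NLPD ↓ | Yelp SHP ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| simCSE Cosine | 0.833 | N/A | N/A | N/A | 0.700 | N/A | N/A | N/A | 0.696 | N/A | N/A | N/A | - | N/A | N/A | N/A |
| simCSE LR | 0.849 | N/A | N/A | N/A | 0.703 | N/A | N/A | N/A | 0.675 | N/A | N/A | N/A | 0.688 | N/A | N/A | N/A |
| simCSE Bayes LR | 0.850 | 0.051 | 0.381 | 0.891 | 0.738 | 0.048 | 0.102 | 0.900 | 0.693 | 0.002 | 0.295 | 0.885 | 0.668 | 0.005 | 0.377 | 0.846 |
| simCSE Sparse GP | 0.853 | 0.002 | 0.368 | 0.960 | 0.757 | 0.210 | 0.218 | 0.962 | 0.694 | 0.034 | 0.346 | 0.960 | 0.681 | 0.004 | 0.360 | 0.880 |

Table 8: Pipeline model results for the simCSE sentence encoder (Gao et al., 2021).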
Incorrect Predictions:
S1: You will want to clean the area first.
S2: You will also want to remove the seeds.
Gold score = 0
Prediction: SBERT Sparse GP = 2.22 ± 1.62; BERT Bayesian LR (BBB) = 1.95 ± 0.0037
Correct Predictions:
S1: He was referring to ..., ... last Sunday.
S2: Next week, ... Sunday ..., will take up his position.
Gold score = 4
Prediction: SBERT Sparse GP = 3.89 ± 1.58; BERT Bayesian LR (BBB) = 4.14 ± 0.0056
Table 9: Predictions for two STS-B examples by
GP regression and BBB.
predictions, i.e. have small variance over all in-
stances, thus resulting in the smaller SHP and
larger NLPD. Meanwhile, models with extremely
small NLPD are less confident in inaccurate pre-
dictions, and might also be under-confident in
correct predictions.
We score sentence pairs in the STS-B test set
using BERT Bayesian LR (BBB) and SBERT
GP.5 Overall, the incorrect predictions (> 1 from
the true score) by BBB have a much smaller
variance compared to those predicted by GP. For
correct predictions (≤ 1 of the true score), BBB
has a higher variance than for incorrect predic-
tion, which is counter-intuitive. Though the std
for SBERT GP regression on correct predictions
is much larger than BBB, it’s slightly less than
that for incorrect ones. This fits the expectation
that when a model is good at uncertainty pre-
diction, the model should be more confident for
correct predictions than incorrect ones. Examples
where both models are correct and incorrect are
presented in Table 9.
5 These two were chosen because they have similar r, but
one has the largest NLPD and the other has the smallest.
The near-zero variance of BBB (0.005 on average)
results in infinite NLPD because of the
$\frac{(t_i - m_i)^2}{v_i}$ element in the NLPD formula. Larger
SHP of GP tends to produce smaller NLPD in spite
of being under-confident on correct cases: the
variance of 1.57 is much larger than the true gap
of 0.01. So NLPD is not a perfect metric, favouring
under-confident models. We therefore suggest
a metric priority order of r, CAL, NLPD and SHP.
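The effect of a near-zero variance on NLPD can be checked numerically. The sketch below assumes the standard Gaussian form of NLPD, the mean of 0.5·log(2πv_i) + (t_i − m_i)²/(2v_i); the constant terms in the paper's own definition may differ. It also assumes that the "±" values in Table 9 are standard deviations, and the function name `gaussian_nlpd` is hypothetical.

```python
import math


def gaussian_nlpd(targets, means, variances):
    """Mean negative log predictive density under per-instance Gaussians."""
    total = 0.0
    for t, m, v in zip(targets, means, variances):
        # log-normalizer plus the squared error scaled by the predicted variance
        total += 0.5 * math.log(2 * math.pi * v) + (t - m) ** 2 / (2 * v)
    return total / len(targets)


# Incorrect STS-B example from Table 9 (gold score 0), reading "±" as a std.
nlpd_gp = gaussian_nlpd([0.0], [2.22], [1.62 ** 2])
nlpd_bbb = gaussian_nlpd([0.0], [1.95], [0.0037 ** 2])
print(f"GP: {nlpd_gp:.2f}  BBB: {nlpd_bbb:.0f}")
```

Both predictions miss the gold score by about two points, yet the BBB value is several orders of magnitude larger, because the squared error is divided by a variance close to zero.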
Impact of Sentence Embedding: The quality
of sentence embeddings is critical for uncertainty
training, affecting not only the correlation, but
also the uncertainty metrics. Instead of SBERT,
we also experimented with simCSE, the current
state-of-the-art sentence encoder (Gao et al.,
2021). We train three pipeline models with STS-B
training data based on sup-simcse-bert-base-uncased,
using the same settings as the first row of
Table 3, and evaluate on the STS-B, EBMSASS,
and MedSTS test sets. In Table 8, contrasting with
the results in Table 2 for STS-B and Yelp, and
results in Table 4 for EBMSASS and MedSTS,
the correlation improves for all datasets other
than MedSTS, and CAL and NLPD drop. This
suggests that better sentence encoders boost pipeline
performance.
High-disagreement Label Detection: A natu-
ral question to ask in the instance selection is what
types of instances are selected and discarded,
and how this correlates with the underlying la-
bel uncertainty in the data. When models are
well-calibrated, the predicted variance will reflect
the true label uncertainty, both aleatoric and epistemic.
As such, if we select instances with smaller
variance, we are effectively filtering out instances
with higher inherent label uncertainty, as should
be reflected in the labels assigned by independent
annotators. We verify this hypothesis below.
We apply the model fine-tuned on STS-B over
BIOSSES and EBMSASS (1000 instances each),
for which five raw annotations for each instance
can be accessed to approximate an empirical
label distribution. KL-Divergence (KL) is used
to measure the distance between the predicted
and empirical probability. In Table 10, the trend
in KL values on the two datasets is consistent
with CAL/NLPD across all estimation methods,
| Model | EBMSASS r ↑ | EBMSASS CAL / NLPD ↓ | EBMSASS KL1 / KL2 ↓ | BIOSSES r ↑ | BIOSSES CAL / NLPD ↓ | BIOSSES KL1 / KL2 ↓ |
|---|---|---|---|---|---|---|
| LR MC | 0.828 | 0.236 / 3.319 | 8.75 / 1.23 | 0.870 | 0.250 / 4.488 | 8.82 / 1.54 |
| ConvLR MC | 0.806 | 0.201 / 4.668 | 12.74 / 1.46 | 0.823 | 0.304 / 12.59 | 19.64 / 2.06 |
| LR BBB | 0.854 | 0.633 / 5351. | 16297 / 5.00 | 0.836 | 0.530 / 11972 | 16598 / 4.90 |
| ConvLR BBB | 0.806 | 0.736 / 1091. | 2373.7 / 4.13 | 0.804 | 0.923 / 2076. | 2631.2 / 5.01 |

Table 10: Intrinsic metrics results on EBMSASS
1000 and BIOSSES based on a model
trained on STS-B. KL1 = KL-Divergence(p‖q),
KL2 = KL-Divergence(q‖p): p = gold empirical
distribution; q = predicted distribution.
indirectly suggesting that CAL and NLPD are ef-
fective metrics in the absence of empirical label
distributions.
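One way to compute such a divergence is sketched below. The source does not say how the empirical label distribution is built from the five raw annotations, so this example assumes that both the gold and the predicted distribution are Gaussians and uses the closed-form KL between them; the helper names, the population std, and the std floor `min_std` are assumptions for illustration only.

```python
import math
import statistics


def kl_gaussian(mean_a, std_a, mean_b, std_b):
    """KL(N(mean_a, std_a^2) || N(mean_b, std_b^2)) in closed form."""
    return (math.log(std_b / std_a)
            + (std_a ** 2 + (mean_a - mean_b) ** 2) / (2 * std_b ** 2)
            - 0.5)


def empirical_gaussian(annotations, min_std=1e-3):
    """Fit a Gaussian to raw annotator scores; floor the std to keep KL finite."""
    return statistics.mean(annotations), max(statistics.pstdev(annotations), min_std)


# Hypothetical instance: five annotator scores (p) and an over-confident prediction (q).
p_mean, p_std = empirical_gaussian([3.0, 3.5, 2.5, 3.0, 3.0])
q_mean, q_std = 3.2, 0.005
kl1 = kl_gaussian(p_mean, p_std, q_mean, q_std)   # KL(p || q)
kl2 = kl_gaussian(q_mean, q_std, p_mean, p_std)   # KL(q || p)
print(f"KL1 = {kl1:.1f}, KL2 = {kl2:.2f}")
```

Under these assumptions an over-confident prediction gives a very large KL(p‖q) but a moderate KL(q‖p), the same asymmetry that the BBB rows of Table 10 show.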
Do large-variance instances selected by strategies
in Section 4.2 overlap with high-disagreement
instances? Without a ground truth of high-disagreement
annotations, they are identified by
two steps iteratively: (1) select labels whose std
is greater than α, beginning from 0.3; and (2)
manually check whether for all selected instances,
at least two out of the five annotations differ from
the others by ≥ 1.0; if not, α is increased by 0.1,
otherwise the procedure ends. This results in 137
and 31 label disagreements when α = 0.5 and 0.4,
for EBMSASS and BIOSSES, respectively.
Using BERT LR MC-dropout, a learned threshold
of τ = 0.162 results in Acc = 0.48, F1 = 0.28 in
high-disagreement label detection on EBMSASS.
For BIOSSES, τ = 0.1 leads to Acc = 0.37, F1 =
0.48. Under ConvLR MC, EBMSASS has Acc =
0.46, F1 = 0.31 as τ = 0.124; BIOSSES: τ = 0.157
with Acc = 0.45, F1 = 0.48.
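The evaluation behind these accuracy and F1 figures can be sketched as a thresholding exercise. The example assumes that an instance counts as high-disagreement when the std of its raw annotations exceeds α, and is flagged when the predicted std exceeds τ; the manual check in step (2) is not modeled, and the function name `detection_scores` and the use of the population std are assumptions.

```python
import statistics


def detection_scores(annotations, predicted_std, alpha, tau):
    """Score variance-based detection of high-disagreement labels.

    annotations: per-instance lists of raw annotator scores.
    predicted_std: the model's predicted std for each instance.
    Returns (accuracy, F1) with "high disagreement" as the positive class.
    """
    gold = [statistics.pstdev(a) > alpha for a in annotations]   # assumed labelling rule
    pred = [s > tau for s in predicted_std]                      # large-variance criterion
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Hypothetical data: four instances with five annotations each.
annotations = [[3, 3, 3, 3, 3], [1, 2, 3, 4, 5], [2, 2, 2, 2, 3], [0, 0, 4, 4, 2]]
print(detection_scores(annotations, [0.10, 0.30, 0.20, 0.05], alpha=0.5, tau=0.15))
```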
As such, high-disagreement labels can be detected
by the large-variance criterion, obtaining
Acc = 0.44, F1 = 0.39 on average. This is not good
as a binary classifier, since regarding all instances
as the majority-class ‘‘clean’’ performs better. But
in our context, it is effective as a data augmentation
strategy: selecting clean examples from an
out-of-domain corpus. Detecting noisy labels is
not just a binary classification task requiring high
accuracy; what matters is recognizing and filtering noisy
instances from a whole training corpus, even at
the cost of removing clean labels.
9 Conclusion
We comprehensively investigated a range of
uncertainty estimation methods over different re-
gression tasks, using pre-trained language models.
Bayesian linear regression and sparse Gaussian
process regression based on fixed features ob-
tain lower calibration error and NLPD compared
with fine-tuning large-capacity deep networks
end-to-end, but are inferior in terms of correlation.
When embeddings are sufficiently expressive,
they are comparable in performance to deep un-
certainty models.
To reduce uncertainty resulting from noisy la-
bels and limited labeled data in specific domains,
we proposed a simple instance selection method
based on uncertainty model predictive confidence.
This approach demonstrated consistent perfor-
mance improvements on three regression tasks in
both self-training and active-learning settings, underscoring
its effectiveness and generalizability.
Acknowledgments
We thank the anonymous reviewers and three
editors for their helpful comments. Yuxia Wang is
supported by scholarships from the University of
Melbourne and China Scholarship Council (CSC).
References
Moloud Abdar, Farhad Pourpanah, Sadiq Hussain,
Dana Rezazadegan, Li Liu, Mohammad
Ghavamzadeh, Paul Fieguth, Xiaochun Cao,
Abbas Khosravi, U. Rajendra Acharya,
Vladimir Makarenkov, and Saeid Nahav.
2020. A review of uncertainty quantification
in deep learning: Techniques, applications and
challenges. arXiv preprint arXiv:2011.06225.
https://doi.org/10.1016/j.inffus.2021.05.008
Daniel Beck, Trevor Cohn, and Lucia Specia.
2014. Joint emotion analysis via multi-task
Gaussian processes. In Proceedings of the
2014 Conference on Empirical Methods
in Natural Language Processing (EMNLP),
pages 1798–1803. https://doi.org/10.3115/v1/D14-1190
Yoshua Bengio, Jérôme Louradour, Ronan
Collobert, and Jason Weston. 2009. Curriculum
learning. In Proceedings of the 26th Annual
International Conference on Machine Learning,
ICML 2009, volume 382, pages 41–48.
ACM. https://doi.org/10.1145/1553374.1553380
Zsolt Bitvai and Trevor Cohn. 2015. Predict-
ing peer-to-peer loan rates using Bayesian
non-linear regression. In Proceedings of the
AAAI Conference on Artificial Intelligence.
Charles Blundell, Julien Cornebise, Koray
Kavukcuoglu, and Daan Wierstra. 2015. Weight
uncertainty in neural networks. In International
Conference on Machine Learning,
pages 1613–1622. http://proceedings.mlr.press/v37/blundell15.pdf.
Enrico Camporeale and Algo Carè. 2020. Estimation
of accurate and calibrated uncertainties in
deterministic models. CoRR, abs/2003.05103.
Daniel Cer, Mona Diab, Eneko Agirre, Iñigo
Lopez-Gazpio, and Lucia Specia. 2017.
SemEval-2017 task 1: Semantic textual similarity
multilingual and crosslingual focused evaluation.
In Proceedings of the 11th International
Workshop on Semantic Evaluation (SemEval-2017),
pages 1–14. Vancouver, Canada.
https://www.aclweb.org/anthology/S17-2001.
Chacha Chen, Junjie Liang, Fenglong Ma,
Lucas M. Glass, Jimeng Sun, and Cao Xiao.
2020. Unite: Uncertainty-based health risk prediction
leveraging multi-sourced data. arXiv
preprint arXiv:2010.11389. https://doi.org/10.1145/3442381.3450087
Hyung Won Chung, Thibault Févry, Henry
Tsai, Melvin Johnson, and Sebastian Ruder.
2020. Rethinking embedding coupling in pre-
trained language models. arXiv preprint arXiv:
2010.12821.
Courtney Corley and Rada Mihalcea. 2005. Measuring
the semantic similarity of texts. In Proceedings
of the ACL Workshop on Empirical
Modeling of Semantic Equivalence and Entailment,
pages 13–18. https://doi.org/10.3115/1631862.1631865
Sharon E. Davis, Thomas A. Lasko, Guanhua
Chen, Edward D. Siew, and Michael E.
Matheny. 2017. Calibration drift
in regres-
sion and machine learning models for acute
kidney injury. Journal of the American Medi-
cal Informatics Association, 24(6):1052–1061.
https://doi.org/10.1093/jamia/ocx030,
PubMed: 28379439
Shrey Desai and Greg Durrett. 2020. Calibra-
tion of pre-trained transformers. arXiv preprint
arXiv:2003.07892. https://doi.org/10.18653/v1/2020.emnlp-main.21
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In Proceedings of the 2019
Conference of the North American Chapter
of the Association for Computational Linguistics:
Human Language Technologies, Volume 1
(Long and Short Papers), pages 4171–4186.
Minneapolis, Minnesota. https://www.aclweb.org/anthology/N19-1423.
Piero Esposito. 2020. BLiTZ – Bayesian Lay-
ers in Torch Zoo (a Bayesian deep learing
library for Torch). https://github.com/piEsposito/blitz-bayesian-deep-learning/.
Yarin Gal and Zoubin Ghahramani. 2016. Dropout
as a Bayesian approximation: Representing
model uncertainty in deep learning. In Inter-
national Conference on Machine Learning,
pages 1050–1059. http://proceedings.mlr.press/v48/gal16.pdf.
Tianyu Gao, Xingcheng Yao, and Danqi Chen.
2021. SimCSE: Simple contrastive learning of
sentence embeddings. In Empirical Methods in
Natural Language Processing (EMNLP).
Taisiya Glushkova, Chrysoula Zerva, Ricardo
Rei, and André F. T. Martins. 2021. Uncertainty-aware
machine translation evaluation. CoRR,
abs/2109.06352. https://doi.org/10.18653/v1/2021.findings-emnlp.330
Tilmann Gneiting, Fadoua Balabdaoui, and
Adrian E. Raftery. 2007. Probabilistic forecasts,
calibration and sharpness. Journal of the
Royal Statistical Society: Series B (Statistical
Methodology), 69(2):243–268. https://doi.org/10.1111/j.1467-9868.2007.00587.x
Yvette Graham, Timothy Baldwin, Alistair Moffat,
and Justin Zobel. 2017. Can machine transla-
tion systems be evaluated by the crowd alone?
Natural Language Engineering, 23(1):3–30.
https://doi.org/10.1017/S1351324915000339
Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q.
Weinberger. 2017. On calibration of modern
neural networks. In International Conference
on Machine Learning, pages 1321–1330.
Hamed Hassanzadeh, Anthony Nguyen, and Karin
Verspoor. 2019. Quantifying semantic simi-
larity of clinical evidence in the biomedical
literature to facilitate related evidence synthesis.
Journal of Biomedical Informatics. https://doi.org/10.1016/j.jbi.2019.103321,
PubMed: 31676460
José Miguel Hernández-Lobato and Ryan
Adams. 2015. Probabilistic backpropagation
for scalable learning of Bayesian neural
networks. In International Conference on Ma-
chine Learning, pages 1861–1869. http://
proceedings.mlr.press/v37/hernandez
-lobatoc15.pdf.
Zhengbao Jiang, Frank F. Xu, Jun Araki, and
Graham Neubig. 2020. How can we know
what language models know? Transactions of
the Association for Computational Linguistics,
8:423–438. https://doi.org/10.1162
/tacl_a_00324
Dongyeop Kang, Waleed Ammar, Bhavana Dalvi,
Madeleine van Zuylen, Sebastian Kohlmeier,
Eduard Hovy, and Roy Schwartz. 2018. A
dataset of peer reviews (PeerRead): Collection,
insights and NLP applications. In Proceedings
of the 2018 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
Volume 1 (Long Papers), pages 1647–1661,
New Orleans, Louisiana. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/N18-1149
Brian Keith, Exequiel Fuentes, and Claudio
Meneses. 2017. A hybrid approach for senti-
ment analysis applied to paper. In Proceedings
of ACM SIGKDD Conference.
Alex Kendall and Yarin Gal. 2017. What
uncertainties do we need in Bayesian deep
learning for computer vision? In Advances
in Neural Information Processing Systems,
pages 5574–5584. https://proceedings
.neurips.cc/paper/2017/hash/2650d6
089a6d640c5e85b2b88265dc2b-Abstrac
t.html.
Volodymyr Kuleshov, Nathan Fenner, and
Stefano Ermon. 2018. Accurate uncertainties
for deep learning using calibrated regression. In
International Conference on Machine Learning,
pages 2796–2804. http://proceedings
.mlr.press/v80/kuleshov18a/kuleshov
18a.pdf.
M. Pawan Kumar, Benjamin Packer, and
Daphne Koller. 2010. Self-paced learning
for latent variable models. In Advances in
Neural Information Processing Systems,
volume 1. https://papers.nips.cc/paper
/2010/file/e57c6b956a6521b28495f2886c
a0977a-Paper.pdf.
Balaji Lakshminarayanan, Alexander Pritzel, and
Charles Blundell. 2017. Simple and scalable
predictive uncertainty estimation using deep
ensembles. In Advances in Neural Information
Processing Systems. https://arxiv.org
/pdf/1612.01474.pdf
Max-Heinrich Laves, Sontje Ihler, Jacob F.
Fast, Lüder A. Kahrs, and Tobias Ortmaier.
2020. Well-calibrated regression uncertainty in
medical imaging with deep learning. In Medical
Imaging with Deep Learning, pages 393–412.
http://proceedings.mlr.press/v121
/laves20a/laves20a.pdf.
Felix Leibfried, Vincent Dutordoir, S. T. John, and
Nicolas Durrande. 2020. A tutorial on sparse
Gaussian processes and variational inference.
arXiv preprint arXiv:2012.13962.
Specia Lucia, Fomicheva Marina, Blain Frédéric,
Guzmán Paco, Chaudhary Vishrav, Fonseca
Erick, and Martins André. 2020. WMT 2020
quality estimation dataset. https://www
.statmt.org/wmt20/qualityestimation
-task.html.
Kristian Miok, Gregor Pirs, and Marko Robnik-
Sikonja. 2020. Bayesian methods for semi-
supervised text annotation. arXiv preprint
arXiv:2010.14872.
Robert Pinsler, Jonathan Gordon, Eric Nalisnick,
and José Miguel Hernández-Lobato. 2019.
Bayesian batch active learning as sparse
subset approximation. In Advances in Neural
Information Processing Systems, volume 32,
pages 6359–6370. https://proceedings
.neurips.cc/paper/2019/file/84c2d48
60a0fc27bcf854c444fb8b400-Paper.pdf.
Daniel Preoţiuc-Pietro and Trevor Cohn. 2013.
A temporal model of text periodicities using
Gaussian processes. In Proceedings of the 2013
Conference on Empirical Methods in Natural
Language Processing, pages 977–988.
Puria Radmard, Yassir Fathullah, and Aldo Lipani.
2021. Subsequence based deep active learning
for named entity recognition. In Proceedings of
the 59th Annual Meeting of the Association for
Computational Linguistics and the 11th Inter-
national Joint Conference on Natural Language
Processing, ACL/IJCNLP 2021, (Volume 1:
Long Papers), Virtual Event, August 1–6, 2021,
pages 4310–4321. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/2021.acl-long.332
Nils Reimers and Iryna Gurevych. 2019. Sentence-
BERT: Sentence embeddings using Siamese
BERT-networks. In Proceedings of the 2019
Conference on Empirical Methods in Natural
Language Processing and the 9th International
Joint Conference on Natural Language Pro-
cessing (EMNLP-IJCNLP), pages 3982–3992,
Hong Kong, China. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/D19-1410
C. Rasmussen and C. Williams. 2005. Gaussian
processes for machine learning. https://doi
.org/10.7551/mitpress/3206.001.0001
Omkar Sabnis. 2018. Yelp review dataset.
https://www.kaggle.com/omkarsabnis
/yelp-reviews-dataset
Burr Settles. 2009. Active learning literature
survey. University of Wisconsin-Madison
Department of Computer Sciences. http://
burrsettles.com/pub/settles.active
learning.pdf.
Burr Settles and Mark Craven. 2008. An analysis
of active learning strategies for sequence la-
beling tasks. In 2008 Conference on Empiri-
cal Methods in Natural Language Processing,
EMNLP 2008, Proceedings of the Confer-
ence, 25–27 October 2008, Honolulu, Hawaii,
USA, A meeting of SIGDAT, a Special Interest
Group of the ACL, pages 1070–1079. ACL.
https://doi.org/10.3115/1613715
.1613855
Aili Shen, Daniel Beck, Bahar Salehi, Jianzhong
Qi, and Timothy Baldwin. 2019. Modelling
uncertainty in collaborative document quality
assessment. In Proceedings of the 5th Work-
shop on Noisy User-generated Text (W-NUT
2019), pages 191–201, Hong Kong, China.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/D19
-5525
Joachim Sicking, Maram Akila, Maximilian
Pintz, Tim Wirtz, Asja Fischer, and Stefan
Wrobel. 2021. A novel regression loss for
non-parametric uncertainty optimization. arXiv
preprint arXiv:2101.02726.
Gizem Soğancıoğlu, Hakime Öztürk, and Arzucan
Özgür. 2017. BIOSSES: A semantic sentence
similarity estimation system for the biomed-
ical domain. Bioinformatics, 33(14):i49–i58.
https://doi.org/10.1093/bioinforma
tics/btx238, PubMed: 28881973
Hao Song, Tom Diethe, Meelis Kull, and
Peter Flach. 2019. Distribution calibration
for regression. In International Confer-
ence on Machine Learning, pages 5897–5906.
http://proceedings.mlr.press/v97
/song19a/song19a.pdf
Nandan Thakur, Nils Reimers, Johannes
Daxenberger, and Iryna Gurevych. 2020. Aug-
mented SBERT: Data augmentation method for
improving bi-encoders for pairwise sentence
scoring tasks. arXiv preprint arXiv:2010.08240.
https://doi.org/10.18653/v1/2021
.naacl-main.28
Michalis Titsias. 2009. Variational learning
of inducing variables in sparse Gaussian
processes. In Artificial intelligence and statis-
tics, pages 567–574. http://proceedings
.mlr.press/v5/titsias09a/titsias
09a.pdf.
Juozas Vaicenavicius, David Widmann, Carl
Andersson, Fredrik Lindsten, Jacob Roll, and
Thomas Schön. 2019. Evaluating model
calibration in classification. In The 22nd Inter-
national Conference on Artificial Intelligence
and Statistics, pages 3459–3467. http://
proceedings.mlr.press/v89/vaicenavici
us19a/vaicenavicius19a.pdf.
Yu Wan, Baosong Yang, Derek F. Wong, Yikai
Zhou, Lidia S. Chao, Haibo Zhang, and Boxing
Chen. 2020. Self-paced learning for neural
machine translation. In Proceedings of the 2020
Conference on Empirical Methods in Natural
Language Processing, EMNLP 2020, Online,
November 16–20, 2020, pages 1074–1080.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020
.emnlp-main.80
Yanshan Wang, Naveed Afzal, Sunyang Fu, Liwei
Wang, Feichen Shen, Majid Rastegar-Mojarad,
and Hongfang Liu. 2018. MedSTS: A resource
for clinical semantic textual similarity. Lan-
guage Resources and Evaluation, pages 1–16.
https://doi.org/10.1007/s10579-018
-9431-1
Yanshan Wang, Sunyang Fu, Feichen Shen,
Sam Henry, Ozlem Uzuner, and Hongfang
Liu. 2020a. The 2019 n2c2/OHNLP track on
clinical semantic textual similarity: Overview.
JMIR Medical Informatics, 8(11). https://
doi.org/10.2196/23375
Yuxia Wang, Fei Liu, Karin Verspoor, and
Timothy Baldwin. 2020b. Evaluating the utility
of model configurations and data augmen-
tation on clinical semantic textual similar-
ity. In Proceedings of the 19th SIGBioMed
Workshop on Biomedical Language Processing,
pages 105–111, Online. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/2020.bionlp-1.11
Yuxia Wang, Karin Verspoor, and Timothy
Baldwin. 2020c. Learning from unlabeled data
for clinical semantic textual similarity. In Pro-
ceedings of the 3rd Clinical NLP Workshop,
Online. EMNLP. https://doi.org/10
.18653/v1/2020.clinicalnlp-1.25
Boyang Xue, Jianwei Yu, Junhao Xu, Shansong
Liu, Shoukang Hu, Zi Ye, Mengzhe Geng,
Xunying Liu, and Helen Meng. 2021. Bayes-
ian transformer language models for speech
recognition. arXiv preprint arXiv:2102.04754.
https://doi.org/10.1109/ICASSP39728
.2021.9414046
Eric Zelikman, Christopher Healy, Sharon Zhou,
and Anand Avati. 2020. Crude: Calibrating re-
gression uncertainty distributions empirically.
arXiv preprint arXiv:2005.12496.