Uncertainty Estimation and Reduction of Pre-trained Models for Text Regression

Yuxia Wang*   Daniel Beck*   Timothy Baldwin*   Karin Verspoor†*

*The University of Melbourne, Melbourne, Victoria, Australia
†RMIT University, Melbourne, Victoria, Australia

yuxiaw@student.unimelb.edu.au

d.beck@unimelb.edu.au

tb@ldwin.net

karin.verspoor@rmit.edu.au

Abstract

State-of-the-art classification and regression models are often not well calibrated, and cannot reliably provide uncertainty estimates, limiting their utility in safety-critical applications such as clinical decision-making. While recent work has focused on calibration of classifiers, there is almost no work in NLP on calibration in a regression setting. In this paper, we quantify the calibration of pre-trained language models for text regression, both intrinsically and extrinsically. We further apply uncertainty estimates to augment training data in low-resource domains. Our experiments on three regression tasks in both self-training and active-learning settings show that uncertainty estimation can be used to increase overall performance and enhance model generalization.

1 Introduction

Modern neural network models, particularly those based on pre-training and fine-tuning, have achieved impressive results across a broad spectrum of NLP tasks, in terms of evaluation metrics such as classification accuracy or F-score for classification tasks and mean squared error for regression tasks. However, the standard training regime fails to take model uncertainty into account, and tends to result in over-fitting and poor generalization, especially in limited training data situations.

Moreover, these models have been empirically demonstrated to have poor calibration—the predictive probability does not reflect the true correctness likelihood, and they are generally over-confident when they make wrong predictions (Guo et al., 2017; Desai and Durrett, 2020; Jiang et al., 2020). Put differently, the models do not know what they don't know. This is particularly the


case in low-resource settings. However, faithfully assessing the uncertainty of model predictions is as important as obtaining high accuracy in many safety-critical applications, such as autonomous driving or clinical decision support (Chen et al., 2020; Kendall and Gal, 2017; Davis et al., 2017). If models were able to more faithfully capture their lack of certainty when they make erroneous predictions, they could be used more reliably in critical decision-making contexts, and avoid catastrophic errors.

In the context of text regression, we aim to alleviate over-fitting and improve generalizability in low-resource settings by taking the uncertainty sourced from both the data and the model into account. Specifically, we address: (1) data uncertainty, by filtering noisy annotations from (either pseudo or gold) labeled data based on predictive confidence, preventing models from memorizing out-of-distribution examples; and (2) model uncertainty, by accurately estimating both the target value and the predictive confidence with uncertainty models, providing more reliable and interpretable predictions, while also effectively supporting the denoising in (1).

Uncertainty estimation has been extensively explored in the context of classification (Guo et al., 2017; Vaicenavicius et al., 2019; Desai and Durrett, 2020; Jiang et al., 2020), but is relatively unexplored for regression tasks, due to the complexities of dealing with a continuous target space. The output of a classifier passed through a softmax layer naturally provides a discrete probability distribution, while in a regression setting the output is a single numerical value.

We compare four well-studied techniques for
uncertainty estimation, as applied to pre-trained
language models (LMs): Gaussian processes
(Shen et al., 2019; Camporeale and Carè, 2020),

Transactions of the Association for Computational Linguistics, vol. 10, pp. 680–696, 2022. https://doi.org/10.1162/tacl_a_00483
Action Editor: Dani Yogatama. Submission batch: 11/2021; Revision batch: 02/2022; Published 6/2022.
© 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.


Bayesian linear regression (Hernández-Lobato and Adams, 2015), Bayes by backprop, and Monte Carlo (MC) dropout. To comprehensively assess
uncertainty quality, we evaluate results intrin-
sically using various metrics, and extrinsically
with several downstream experiments. Our anal-
ysis shows that predictions are highly uncertain
and inaccurate in low-resource scenarios.

Two major types of uncertainty have been identified: aleatoric uncertainty captures noise inherent in the observations; and epistemic uncertainty accounts for uncertainty in the model, which can be explained away given enough data, compensating for limited knowledge (Kendall and Gal, 2017). In other words, in practice uncertainty results primarily from noisy human annotations, insufficient labeled data, and out-of-domain text (Glushkova et al., 2021). We therefore propose a simple method to filter noisy labels and select high-quality instances from an unlabeled data pool based on the predictive confidence, which on the one hand alleviates both aleatoric and epistemic uncertainty, and on the other hand improves accuracy and generalization thanks to the increased training data.

In this work, we explore how to estimate uncertainty in a regression setting with pre-trained language models, and evaluate estimation quality both intrinsically and extrinsically. Intrinsic uncertainty estimation provides the basis for our proposed data selection strategy: by filtering noise based on confidence thresholding, and mitigating exposure bias, our approach is shown to be effective at improving both performance and generalization in low-resource scenarios, in both self-training and active learning settings.

2 Background

We first review approaches for estimating the predictive uncertainty of deep neural networks (DNNs) in a regression setting, then methods for reducing uncertainty and improving generalization.

2.1 Uncertainty Estimation in DNNs

Bayesian Estimation Bayesian approaches provide a general framework for dealing with uncertainty estimation, for example in the form of Gaussian processes (GPs: Camporeale and Carè, 2020; Shen et al., 2019) and Bayesian neural networks (Hernández-Lobato and Adams, 2015). However, prior work has either been based on hand-crafted features, or based on small-scale neural networks with only one or two hidden layers, which are far removed from modern pre-trained LMs. How to combine deterministic pre-trained LMs with Bayesian methods to achieve both high accuracy and accurate uncertainty estimation is an open problem, particularly in a regression setting.
While applying Bayesian estimation to all
model parameters in large-scale LMs is theo-
retically possible, in practice it is prohibitively
expensive in both model training and evaluation
(Xue et al., 2021). Concretely, the true Bayesian
posterior on the weights P (w|D) is generally
approximated by variational inference, minimiz-
ing the KL divergence with a parameterized dis-
tribution q(w|θ):

θ* = arg min_θ KL[ q(w|θ) ‖ P(w|D) ]

   = arg min_θ ∫ q(w|θ) log [ q(w|θ) / ( P(w) P(D|w) ) ] dw

Deriving uncertainty estimates by integrating over
millions of model parameters, and initializing the
prior distribution for each are both non-trivial.
One simple strategy for combining them is Bayes
by backprop (BBB: Blundell et al., 2015), where-
by unbiased Monte Carlo gradients are minimized:

∑_{i=1}^{N} [ log q(w^(i)|θ) − log P(w^(i)) − log P(D|w^(i)) ]

where w^(i) denotes the ith Monte Carlo sample drawn from the variational posterior q(w^(i)|θ).

Ensemble Estimation Another approach is to
estimate uncertainty by ensemble, typically with
MC-dropout (Gal and Ghahramani, 2016) y
deep ensembles (Lakshminarayanan et al., 2017),
which are agnostic to model structure.

MC-dropout casts dropout training in DNNs as
approximate Bayesian inference in deep Gaussian
processes. The predictive probability of the deep
GP model (integrated with respect to the finite
rank covariance function parameters w) given
precision parameter τ > 0 is:

p(y|x, D) = ∫ p(y|x, w) p(w|D) dw

p(y|x, w) = N( y; ŷ(x, w), τ⁻¹ I_D )


The dropout layers are kept on during evaluation,
without changing either the model or the opti-
mization strategy. MC-dropout and its variants
have been extensively used to estimate regression
uncertainty due to their simplicity and scalability
in implementation (Zelikman et al., 2020; Laves
et al., 2020; Sicking et al., 2021).
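To make the MC-dropout procedure concrete, the following is a minimal PyTorch-style sketch (not the authors' released code): dropout modules are switched back into training mode at inference time, and the mean and standard deviation over repeated stochastic forward passes serve as the prediction and its uncertainty. The `model` and `inputs` names are placeholders for any fine-tuned regressor and its batch.

```python
import torch

def mc_dropout_predict(model, inputs, n_samples=30):
    """MC-dropout inference: keep dropout stochastic and aggregate forward passes."""
    model.eval()  # fix batch-norm statistics, etc.
    for module in model.modules():
        if isinstance(module, torch.nn.Dropout):
            module.train()  # re-enable dropout at prediction time

    preds = []
    with torch.no_grad():
        for _ in range(n_samples):
            preds.append(model(inputs))  # one stochastic forward pass
    preds = torch.stack(preds, dim=0)  # shape: (n_samples, batch_size)

    # Predictive mean is the point estimate; std is the uncertainty estimate.
    return preds.mean(dim=0), preds.std(dim=0)
```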

The deep ensemble approach trains multiple
copies of the variance networks from different
network initializations to estimate predictive dis-
tributions. It operates similarly to sub-networks
of MC dropout, but is computationally more ex-
pensive due to the need to train multiple models.
Moreover, the need to split the training data into
multiple folds to train different networks exacer-
bates overfitting in small-data scenarios. Given
our specific focus on low-data scenarios, we focus
exclusively on MC dropout in this paper.

The only work we are aware of for estimating
uncertainty with transformers in a regression set-
ting is Glushkova et al. (2021), who use ensemble
estimation of uncertainty for machine translation
quality evaluation, comparing the translated sen-
tence with a reference translation. In contrast, we
experiment in a cross-lingual setting, comparing a
source sentence and its translation directly.

2.2 Selecting Clean Instances

To reduce the uncertainty from both data and model, we draw on approaches that can filter noisy labels from labeled data, and select clean instances from unlabeled data, thus eliminating aleatoric uncertainty, and reducing epistemic uncertainty
due to the enhanced knowledge learned from the
augmented data. In brief, we need a method to
distinguish noisy and clean labels.

It has been shown that in data augmentation,
self-training, and zero-shot learning, using the
right sampling strategy is critical (Thakur et al.,
2020; Wang et al., 2020c). However, previous work has mainly focused on label distribution balance, and lexical and semantic similarity, but not uncertainty.

In this work, we propose a simple method leveraging predictive confidence to select high-quality instances, which is related to uncertainty-based sampling in active learning (Settles, 2009). However, most work in active learning has focused
on classification rather than regression, either ex-
tracting the least probable or the most informative
examples with large entropy (Settles and Craven,
2008; Pinsler et al., 2019; Radmard et al., 2021).

Our approach also has a similar flavor to
self-paced curricular learning (Bengio et al., 2009;
Kumar et al., 2010; Wan et al., 2020), in which
the aim is to choose ‘‘hard’’ examples and gra-
dually increase the difficulty of learning con-
tent, differing from the criteria in our setting—
‘‘clean’’ ones.

According to a recent review of uncertainty
estimation for DNNs (Abdar et al., 2020), there
is little work on using aleatoric uncertainty for
denoising and sampling in NLP tasks. The most
relevant work is that by Miok et al. (2020), who
aims to guide the annotation process for the binary
classification task of hate speech detection.

3 Tasks and Notation

In this paper, we consider text regression across three separate tasks, and a total of 10 datasets.

Tasks STS: Semantic textual similarity assesses
the degree of semantic equivalence between two
pieces of text (Corley and Mihalcea, 2005). The aim is to predict a similarity score for a sentence pair (S1, S2), generally in the range [0, 5], where 0 indicates complete dissimilarity and 5 indicates equivalence in meaning. As an example:

S1: Total minutes spent in timed codes: 10 mins.
S2: Total minutes spent in timed codes: 33 mins.

might be labeled 4, as the two texts differ only in
very specific content (underlined).

SA: Sentiment analysis rating involves predict-
ing a sentiment score for a review S, in the range
1 (extremely negative) to 5 (extremely positive).
DA: Machine translation quality estimation, based on the direct assessment approach (Graham et al., 2017), aims to predict a normalised quality score for a text pair (S1, S2), where S2 is machine translated from S1. As such, it is similar to STS, but differs in that it is cross-lingual.

Notation and Assumptions Throughout this paper, raw examples, column vectors, and matrices are denoted in lower-case italics, bold, and upper-case italics, respectively (e.g., x, x, and X). θencoder and θreg represent parameters of the encoder and task-specific regression layers, and f(θ, ·) refers to the whole model. Take a dataset D = {(x1, y1), . . . , (xi, yi), . . . , (xN, yN)}, where (xi, yi) is the ith instance, yi ∈ R, and xi = s(θencoder, xi) is the hidden state of xi. The


Dataset              Size (train, test, dev)   Range      Domain

STS-B (2017)         5749, 1379, 1500          [0, 5]     general
MedSTS (2018)        750, 318, –               [0, 5]     clinical
N2C2-STS (2019)      1642, 412, –              [0, 5]     clinical
BIOSSES (2017)       100, –, –                 [0, 4]     biomedical
EBMSASS (2019)       700, 300, –               [1, 5]     biomedical

Yelp (2018)          7000, 1500, 1500          [1, 5]     product
PeerRead (2018)      713, 290, –               [1, 5]     paper

WMT en-zh (2020)     7000, 1000, 1000          [0, 100]   high-resource
WMT ru-en (2020)     7000, 1000, 1000          [0, 100]   medium-resource
WMT si-en (2020)     7000, 1000, 1000          [0, 100]   low-resource

Table 1: STS/SA rating/QE-DA datasets. Train, Test, Dev Size = number of text pairs, range = label range. In practice, QE-DA is normalised by z-score.

loss function is the empirical risk of the mean square error (MSE): L = (1/N) ∑_{i=1}^{N} ( f(θ, xi) − yi )².

Datasets We evaluate on different-sized datasets across various domains for STS and SA, and three same-sized datasets for DA, summarized in Table 1.

For STS, we use: (1) one large-scale general dataset, STS-B (Cer et al., 2017); (2) two small clinical data sets, MedSTS (Wang et al., 2018) and N2C2-STS (Wang et al., 2020a); and (3) two small biomedical data sets, BIOSSES (Soğancıoğlu et al., 2017) and EBMSASS (Hassanzadeh et al., 2019), each of which is 5-way annotated.

For SA, we use: (1) a large-scale product review dataset, Yelp (Sabnis, 2018); and (2) a small paper review rating dataset, PeerRead (Kang et al., 2018), augmented with 399 Spanish paper reviews (Keith et al., 2017) machine-translated into English.

For DA, we use the three language pairs from WMT2020 (Specia et al., 2020), en-zh, ru-en, and si-en, corresponding to high-, medium-, and low-resource settings in terms of the source language.

4 Method

Figure 1: Overview of pipeline and end-to-end training workflow. Left: SBERT is fine-tuned separately with STS/NLI labeled data using MSE/NLL loss; middle: well-trained SBERT provides off-the-shelf sentence embeddings to GP/Cosine similarity. End-to-end (right): under MC-dropout, keep dropout on in inference; in BBB, parameters of LR/HConv are stochastic variables.

In this section, we first introduce approaches for estimating regression uncertainty based on pre-trained LMs, then propose a simple method to sample ‘‘clean’’ instances from unlabeled data to augment training data based on predictive uncertainty. The proposed methods can be applied equally in semi-supervised and unsupervised settings (including active learning and self-learning).

4.1 Bayesian Regression using LMs

We investigate two alternatives for combining pre-trained transformer LMs with Bayesian estimation, either in a pipeline approach, or end-to-end. Figure 1 provides an overview.

Pipeline Training To estimate probability distributions for the regression task of document quality assessment, Shen et al. (2019) used a Gaussian process (GP) with a Radial Basis Function (RBF) kernel function over hand-crafted features. We build off this in applying Bayesian linear regression and sparse GP regression to pre-trained sentence encoders, such as Sentence-BERT (SBERT; Reimers and Gurevych, 2019). For text input x, we generate x = s(θencoder, x) ∈ Rᵈ. In this way, we leverage contextualized sentence representations, while avoiding the complexity of estimating uncertainty directly from a large-scale Bayesian neural network.

Bayesian Linear Regression: The prior distribution of a Bayesian linear layer with parameters w and b is set to be a Gaussian distribution:

ŷ = w⊤x + b + ε        (1)

w ∼ N(μ, σ²I);  b ∼ N(μ, 1)        (2)

where ŷ is the approximated value and ε is the observation noise, which is assumed to be an independent and identically distributed random variable ε ∼ N(0, σ²).
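For intuition, the sketch below implements the conjugate Gaussian case of Bayesian linear regression over fixed sentence embeddings, where the posterior over w is available in closed form. It is an illustrative stand-in (with placeholder names X and y) rather than the paper's Pyro-based implementation.

```python
import numpy as np

def fit_bayesian_linear_regression(X, y, alpha=1.0, noise_var=1.0):
    """Posterior over w for y = Xw + eps, with prior w ~ N(0, I/alpha)
    and noise eps ~ N(0, noise_var).  X: (N, d), y: (N,)."""
    d = X.shape[1]
    precision = alpha * np.eye(d) + X.T @ X / noise_var  # posterior precision
    cov = np.linalg.inv(precision)                       # posterior covariance
    mean = cov @ X.T @ y / noise_var                     # posterior mean
    return mean, cov

def predict(x, mean, cov, noise_var=1.0):
    """Predictive mean and std for a new embedding x: (d,)."""
    mu = x @ mean
    var = x @ cov @ x + noise_var  # epistemic (weight) + aleatoric (noise) variance
    return mu, np.sqrt(var)
```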

Gaussian Processes (GPs) (Rasmussen and Williams, 2005) are a natural way to generalize
the concept of a multivariate normal distribution
determined by a mean vector μ and covariance
matrix Σ,
to describe a real-valued function.
They provide a mathematically elegant framework
for Bayesian inference and offer principled un-
certainty estimates for regression problems with


a closed-form posterior (Leibfried et al., 2020).
Given (xi, yi), yi = f(xi) + εi, where f(·) is a
real-valued function with input xi that is sampled
from a GP, and where εi are scalar indepen-
dent and identically distributed random variables
corresponding to observation noise.

The prior on data generation can be encapsulated in the distribution of f(·). We assume that f(·) is distributed according to a GP, that is,

f(x) ∼ GP( m(x), k(x, x′) )        (3)

where m(x) is a mean function, and k(x, x′) is a covariance or kernel function, corresponding to μ and Σ of a multivariate normal distribution. Following common practice, we fix the mean function to zero, and use an RBF as the kernel function (Preoţiuc-Pietro and Cohn, 2013; Beck et al., 2014; Bitvai and Cohn, 2015; Shen et al., 2019).

Computing the exact posterior requires the storage and inversion of an (N × N) matrix, which is quadratic in the amount of training data N for storage and cubic in computation, both of which are infeasible for large datasets. Thus we
use sparse GPs, which approximate an exact GP
by using a small set of latent inducing points
(Titsias, 2009), learned by variational inference.
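As a rough illustration of this pipeline setting, a GP regressor with an RBF kernel can be fit directly on fixed sentence embeddings; the snippet below uses scikit-learn's exact GP with synthetic placeholder features, whereas the paper uses sparse variational GPs (in Pyro) to avoid the cubic cost of exact inference.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Placeholder stand-ins for SBERT sentence-pair features and gold scores.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 32))
scores = rng.uniform(0, 5, size=200)

# Zero-mean GP with an RBF kernel plus observation noise (exact inference,
# which scales cubically; sparse inducing-point GPs trade accuracy for speed).
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(embeddings, scores)

# Predictive mean and standard deviation for new inputs.
mean, std = gp.predict(embeddings[:5], return_std=True)
```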

End-to-end Training Rather than pre-training
a LM and task-specific model separately, Xue
et al. (2021) jointly trained them by only applying
Bayesian estimation to a subset of the model
parameters. This requires training entirely from
scratch, while we seek to leverage pre-trained
LMs. We apply Bayesian inference to task-
specific layers, keeping parameters of the LM
deterministic and making task-specialised param-
eters stochastic during fine-tuning. Importantly,
being deterministic is not equivalent to being fro-
zen: Parameters are updated as in non-Bayesian
optimization, rather than kept fixed during back-
propagation.

To increase randomness, we evaluate on two
task-specific networks with more stochastic pa-
rameters than a single-layer linear regression net-
work used in Pipeline Training, as detailed below.
Bayesian Two-layer MLP: The linear regres-
sion layers take the hidden state h ∈ Rᵈ, through a two-layer MLP with a tanh activation function:

h′ = tanh(Wh + b);  ŷ = wᵀh′ + b

(4)

where ŷ is the approximated score, and W ∈ Rᵈˣᵈ, b, w ∈ Rᵈ and b ∈ R are trainable parameters.

Bayesian Hierarchical Convolution: Drawing
on the finding that a hierarchical convolution neu-
ral network (HConv) is effective in low-resource
settings (Wang y cols., 2020b), and that increas-
ing the capacity of task-specific layers can boost
performance (Chung et al., 2020), we train a
large-capacity network as follows. HConv is struc-
tured as a two-layer convolutional network, with
kernel size k = 2, 3, 4 in the first layer and k = 2
in the second (Wang y cols., 2020b). The prior dis-
tributions of the weights and bias are based on
Eq. (2) for Bayesian inference, and the inference
method follows Bayes by Backprop (Blundell
et al., 2015).
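To sketch what a stochastic task-specific layer looks like under Bayes by backprop, the code below hand-rolls a mean-field Gaussian linear layer with the reparameterization trick and a KL penalty against a standard normal prior. It is a simplified, illustrative analogue of the blitz layers used in the paper, not their implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    """Linear layer with factorized Gaussian weights, trained by Bayes by backprop."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w_mu = nn.Parameter(torch.zeros(out_dim, in_dim))
        self.w_rho = nn.Parameter(torch.full((out_dim, in_dim), -5.0))  # std = softplus(rho)
        self.b_mu = nn.Parameter(torch.zeros(out_dim))
        self.b_rho = nn.Parameter(torch.full((out_dim,), -5.0))

    def forward(self, x):
        # Reparameterization trick: sample weights while keeping gradients w.r.t. mu/rho.
        w = self.w_mu + F.softplus(self.w_rho) * torch.randn_like(self.w_mu)
        b = self.b_mu + F.softplus(self.b_rho) * torch.randn_like(self.b_mu)
        return F.linear(x, w, b)

    def kl(self):
        # KL( N(mu, std^2) || N(0, 1) ), summed over all weights and biases.
        def term(mu, rho):
            std = F.softplus(rho)
            return 0.5 * (std.pow(2) + mu.pow(2) - 1.0 - 2.0 * std.log()).sum()
        return term(self.w_mu, self.w_rho) + term(self.b_mu, self.b_rho)

# Per-batch objective: data term (MSE) plus the KL complexity term scaled by dataset size:
#   loss = F.mse_loss(layer(h), y) + layer.kl() / num_training_examples
```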

4.2 Predictive Uncertainty-based Sampling
Given a pre-trained uncertainty model f(θ, ·), and a (large-scale) unlabeled data pool Du = {x1, x2, · · · , xi, · · · , xU}, the distribution of the predicted yi for input xi is:

P(yi) = fθ(xi) ∼ N(μi, σi)

(5)

where μi and σi are the mean and standard
deviation of the normal distribution of yi.
Our aim is to sample a subset D′u from Du on which the uncertainty model is expected to be sufficiently confident in predicting D′u, that is, to have a confidence interval as narrow as possible under a given confidence level. For example, under 99% confidence, the confidence interval [μi − 2.58σi, μi + 2.58σi] is expected to be narrow.
Put differently, the distribution is concentrated
around the mean with small standard deviation.
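A minimal sketch of the resulting selection rule (detailed in the next paragraph) is shown below: the threshold τ is taken as the mean predictive standard deviation of extreme-polarity predictions, and only instances whose std falls below τ are kept. The `mu` and `sigma` arrays are assumed outputs of an uncertainty model over the unlabeled pool.

```python
import numpy as np

def select_confident(mu, sigma, low=1.0, high=4.0):
    """Keep unlabeled instances whose predictive std is below tau, where tau is
    the mean std over extreme-polarity predictions (e.g., [0,1] and [4,5] for STS)."""
    extreme = (mu <= low) | (mu >= high)
    tau = sigma[extreme].mean()      # heuristic global threshold
    keep = sigma < tau               # confident subset D'_u
    return np.flatnonzero(keep), tau
```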

Based on this, we propose a simple instance
selection method based on predictive uncertainty.
For each instance xi in Du, if σi < τ, select xi: D′u ← xi. The threshold τ is a global hyperparameter tuned over the validation set, or in the case of self-training and active learning, using a heuristic strategy.1 The strategy is based on the observation that the model can generally predict precisely for instances of extreme polarity, such as labels in the ranges [0, 1] and [4, 5] for STS. We posit that cases whose predictive uncertainty is at the same level as these well-predicted examples are also predicted accurately. Formally, after inference, the unlabeled data pool is Du = {(xi, μi, σi)}, i ∈ [1, U], where U is the number of unlabeled instances. The standard deviations of all well-predicted examples can be vectorized as σ = [σi], where σi is the std whose μi is at an extremum, such as 0 ≤ μi ≤ 1 or 4 ≤ μi ≤ 5 for STS. We then set τ = mean(σ).

1We also experimented with a strategy for tuning τ based on the principle of discarding the majority so that the remaining examples are as clean as possible. Specifically, we set τ to the marginal value corresponding to the left boundary of the peak of the std probability distribution, but found little difference in results, so omit it from the paper.

5 Uncertainty Evaluation Metrics

Evaluating uncertainty estimates of predictions is challenging in a regression setting, as the ‘‘ground truth’’ uncertainty is usually not available (Lakshminarayanan et al., 2017). To evaluate model predictions, we consider four metrics.

Pearson Correlation: It is vital to assess the predictive accuracy of the system, regardless of the uncertainty estimate. We use Pearson correlation r to evaluate the correlation between the system's average predictions and ground truth quality scores.

Calibration Error (CAL): One way to understand if models can be trusted is by analysing whether they are calibrated. Gneiting et al. (2007) defined calibration in a regression setting as the asymptotic consistency between the probabilistic forecasts Fi and the true data-generating distributions Gi, with the index i referring to each example.

Practically, Fi is the cumulative probability distribution P(Y ≤ yi); Gi is generally estimated by empirical distribution functions based on the observations only. So calibration measures whether the predictive confidence estimates are aligned with the empirical correctness likelihoods. Given a confidence level pj, the empirical accuracy is calculated as:

p̂j = (1/n) ∑_{i=1}^{n} I[ yi ≤ Fi⁻¹(pj) ]

where Fi⁻¹ is used to denote the quantile function Fi⁻¹(p) = inf{y : p ≤ Fi(y)}, that is, a mapping from [0, 1] → Y. The expected calibration error, cal = ∑_{j=1}^{m} wj · (pj − p̂j)², with m confidence levels 0 ≤ p1 < · · · < pm ≤ 1, is the distance of predictive confidence away from the empirical accuracy.

Negative Log-Probability Density (NLPD) complements CAL's equal treatment of over- and under-confidence. It penalises over-confidence more strongly through logarithmic scaling, L_NLPD = −(1/n) ∑_{i=1}^{n} log p(yi = ti | xi), favouring under-confident ones. In Gaussian predictive
distributions with mean mi and variance vi, the NLPD loss incurred for predicting at input xi with true associated target ti is given by:

L_NLPD = (1/2n) ∑_{i=1}^{n} [ log vi + (ti − mi)²/vi ]

Sharpness (SHP): The metrics above do not account for the concentration of the predictive distributions, which generally favours predictors that produce wide and uninformative confidence intervals. To guarantee useful uncertainty estimation, confidence intervals should not only be calibrated, but also sharp and ‘‘tight’’ around the predicted value. The numerical width of prediction intervals (Gneiting et al., 2007; Song et al., 2019) and the mean of variance (Kuleshov et al., 2018; Zelikman et al., 2020) are often used to quantify sharpness. We apply the latter in our work, with a lower score implying higher sharpness.

To interpret mixed results, for example when a model attains the best sharpness but with infinitely large NLPD, we suggest that Pearson correlation (r) has primacy, followed by CAL and NLPD, then SHP. That is, when models have comparable r, the comparison of CAL/NLPD is more meaningful, and if those are also similar, SHP should be considered; otherwise, it is largely meaningless.

6 Evaluation of Uncertainty Estimation

We expect that the incorporation of uncertainty estimation should not harm predictive performance compared to point estimation without uncertainty, in both in- and out-of-domain scenarios. Additionally, uncertainty estimates should reflect ‘‘what the model does not know’’, making it possible to determine whether a prediction can be trusted based on the output distribution. This is quantified intrinsically with CAL and NLPD (the lower, the better), and extrinsically via instance selection in Section 7.

6.1 Experimental Setup

Pipeline Training: We use SBERT as an off-the-shelf sentence encoder. We fine-tune SBERT separately over each STS corpus based on the pre-trained bert-base-nli-mean-tokens, using the same configuration as the original paper (4 epochs with a training batch size of 16). For the cross-lingual DA task, we use distiluse-base-multilingual-cased-v1.

To represent a sentence pair (S1, S2) using SBERT, we use the concatenation of the embeddings u ⊕ v, along with their absolute difference |u − v| and element-wise multiplication u × v. ‘‘SBERT Bayesian LR’’ and ‘‘SBERT Sparse GP Regression’’ indicate that the features are fed into Bayesian LR and sparse GP regression, respectively, implemented in pyro.2

End-to-End Training: We apply pre-trained BERT as the LM encoder (Devlin et al., 2019), using bert-base-uncased for monolingual tasks and bert-base-multilingual-cased for cross-lingual tasks. The input format is [CLS] S1 [SEP] S2 [SEP] for a text pair (S1, S2), and [CLS] S [SEP] for a single text S. BERT Bayesian LR and BERT Bayesian ConvLR denote task-specific networks based on a two-layer MLP and HConv, respectively, implemented based on the Huggingface Transformer framework and blitz for BBB estimation (Esposito, 2020).
MC-Dropout: We apply MC-dropout to the base models BERT LR and BERT ConvLR, with dropout rate = 0.1 and 30 iterations of sampling.3

Point Estimation: In addition to the uncertainty estimation approaches, we also compare with four non-Bayesian methods: (1) cosine similarity; (2) optimization of deterministic LR with SBERT (SBERT LR); (3) fine-tuned BERT LR; and (4) fine-tuned BERT ConvLR.

Training Configuration: The maximum sequence length is set to 128 for STS and DA, and 256 for SA. The learning rate (lr), training batch size, and training epochs are optimized over the validation set. In the situation that a validation set is not available (i.e., EBMSASS and MedSTS), we provisionally split the training data into 80%:20% training:dev data, and tune hyperparameters over the dev data. We then retrain the model over the full training dataset, and evaluate on the test set. Tuned hyperparameter settings of the pipeline are shown in Table 3. End-to-end training is based on grid-searching over [8, 16, 32] × [1e-5, 2e-5] × [1, 2, 3, · · ·, 10] for batch size, lr, and epochs, respectively. Generally, the best setting is batch size = 16, lr = 2e-5, and epochs = 3, although BERT ConvLR based on BBB requires more epochs to converge. Further details of the training regimen and hyperparameter settings are provided in our Github repository.4

2https://pyro.ai/.
3No significant difference was observed when sampling 20, 30, 40, or 50 times, so we report only on 30.
4https://github.com/yuxiaw/Uncertainty-regression.

6.2 Sentence-Pair STS

In this section, we compare the various uncertainty estimation approaches from Section 4.1 over STS, in terms of correlation and the metrics for uncertainty estimation, aiming to empirically establish:

1. Which uncertainty estimation strategy is most accurate, most calibrated, and sharpest?

2. Which method performs best in out-of-domain settings?

6.2.1 In-Domain Performance

To observe the influence of data size and domain distribution on uncertainty estimation, we experiment over the large-scale general-domain STS-B, in addition to the smaller-scale domain-specific MedSTS (clinical domain) and EBMSASS (biomedical domain) datasets. There are three main findings from the results in Table 2.

Uncertainty models do not degrade accuracy. With SBERT, GP-based models have higher correlation than either cosine similarity or LR. In the case of BERT, estimation by MC-dropout is competitive with corresponding point estimates. Thus, they have comparable raw performance, in addition to providing uncertainty estimates.

End-to-end training based on BERT results in higher correlation and narrower confidence intervals, but poorer calibration and NLPD. Results over the three datasets show that end-to-end training based on BERT overall performs much better than pipeline training using SBERT, but BERT-based models are poorly calibrated compared to SBERT-based Bayesian linear regression and sparse GP regression using fixed sentence features (as can be seen in the higher NLPD numbers for BERT-based models). This is consistent with prior work (Guo et al., 2017).

MC-dropout is superior to BBB inference, and sparse GP regression performs better than SBERT Bayesian LR, regardless of data size
f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 STS-B test EBMSASS test MedSTS test Yelp test r ↑ CAL ↓ NLPD ↓ SHP↓ r ↑ CAL ↓ NLPD ↓ SHP↓ r ↑ CAL ↓ NLPD ↓ SHP↓ r ↑ CAL ↓ NLPD ↓ SHP↓ SBERT Cosine similarity SBERT LR SBERT Bayesian LR SBERT Sparse GP Regression BERT LR BERT ConvLR BERT Bayesian LR (BBB) BERT Bayesian ConvLR (BBB) BERT LR MC dropout BERT ConvLR MC dropout 0.842 0.835 0.810 0.847 0.868 0.855 0.848 0.849 0.868 0.855 N/A N/A 0.046 0.065 N/A N/A 0.648 0.614 N/A N/A N/A N/A +∞ 0.521 0.495 2061.0 0.181 0.202 4.659 5.830 N/A N/A 1.632 1.621 N/A N/A 0.005 0.015 0.215 0.209 0.773 0.743 0.688 0.788 0.914 0.922 0.914 0.898 0.921 0.922 N/A N/A 0.443 0.195 N/A N/A 1.095 0.541 N/A N/A N/A N/A 0.669 1177.2 0.618 327.3 0.054 0.093 0.036 2.137 N/A N/A 2.156 1.627 N/A N/A 0.005 0.010 0.140 0.085 0.784 0.776 0.740 0.781 0.858 0.846 0.848 0.835 0.859 0.852 N/A N/A 0.101 0.073 N/A N/A 0.801 0.499 N/A N/A N/A N/A 0.514 6594.3 0.506 1037.2 0.163 0.219 4.118 6.402 N/A N/A 2.092 1.453 N/A N/A 0.006 0.017 0.168 0.146 — 0.666 0.671 0.689 0.826 0.822 0.827 0.797 0.827 0.823 N/A N/A 0.019 0.049 N/A N/A 0.447 0.573 N/A N/A N/A N/A 0.531 3908.6 1.513 119.2 0.267 0.291 7.285 8.214 N/A N/A 0.753 1.507 N/A N/A 0.083 0.089 0.153 0.150 Table 2: Correlation r and uncertainty prediction quality metrics (CAL, NLPD, and SHP) on three STS datasets (STS-B, EBMSASS, and MedSTS) and a SA rating dataset (Yelp), with SBERT and BERT sentence embeddings with various task-specific layers: Cosine similarity = calculate cosine similarity between vectors representing S1 and S2; LR = single-layer linear regression; Bayesian LR = Bayesian linear regression; and Sparse GP Regression = Sparse Gaussian process regression. N/A indicates that the method doesn’t produce an uncertainty estimate to apply the given metric to. LR Bayes LR GP Reg lr epoch lr epoch lr epoch STS-B 0.1 EBMSASS 0.1 0.1 MedSTS 0.1 Yelp 0.1 en-zh 0.1 ru-en 0.1 si-en 100 15 100 600 50 40 199 0.01 2500 0.01 10000 8500 0.01 2500 0.01 300 0.03 400 0.03 300 0.03 25 0.1 25 0.1 25 0.1 25 0.1 200 0.1 200 0.1 0.1 1000 Table 3: Learning rate (lr) and training epochs (epoch) for pipeline training based on SBERT. and domain. Under both BERT LR and ConvLR, MC-dropout achieves higher or equal correla- tion, and much lower CAL and NLPD than BBB in end-to-end training. Among methods based on SBERT, sparse GP regression requires many fewer iterations to converge, and outperforms Bayesian LR in correlation and NLPD, and is comparable for CAL and SHP. 6.2.2 Out-of-Domain Performance Apart from in-domain evaluation, out-of-domain performance is also an important concern. We expect that a model trained on domain A will generate more uncertain predictions on domain B, with lower correlation, larger CAL and NLPD, and a wider confidence interval (Lakshminarayanan et al., 2017). Given two models trained on do- main A with similar point-estimate performance on domain B, that is competitive r, the model with the lower NLPD is arguably the better model, as this indicates that the model gives sharper distri- butions when the prediction is correct, and flatter ones when wrong. Using models fine-tuned over the general- domain STS-B, we evaluate on the biomedical EBMSASS and clinical MedSTS test sets. In contrast with the results in Table 2, in which mod- els have been fine-tuned with in-domain labeled data, Table 4 shows a steep decline in r of more than 10 points on average for EBMSASS, and 7 for MedSTS. Meanwhile, both CAL and NLPD increase by a large margin. 
MC-dropout is not always best. Interestingly, we find that BERT Bayesian LR performs well in this setting, obtaining the highest correlation and smallest SHP on EBMSASS and PeerRead. This suggests that BERT Bayesian LR has bet- ter generalizability over these two domains, but the substantially higher NLPD also reveals that its predictions are over-confident. By and large, MC-dropout stably offers accurate and calibrated predictions in out-of-domain settings. ConvLR in particular outperforms Bayesian inference across all metrics. BERT ConvLR tends to be inferior to BERT LR in the out-of-domain setting. We speculate this is because of its smaller capacity to memorize task-specific knowledge, as eight layers of the BERT encoder are frozen in BERT ConvLR. 6.3 Single-sentence Sentiment Rating We perform in-domain SA evaluation on Yelp, and out-of-domain evaluation by applying the fine-tuned Yelp model to PeerRead test data. We find: Fine-tuned sentence embeddings are vital to the performance of pipeline uncertainty esti- mation. As shown in Table 2, performance over Yelp, EBMSASS, and MedSTS based on SBERT 687 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 8 3 2 0 2 9 9 5 1 / / t l a c _ a _ 0 0 4 8 3 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 EBMSASS test r ↑ CAL ↓ NLPD ↓ SHP↓ MedSTS test r ↑ CAL ↓ NLPD ↓ SHP↓ PeerRead test r ↑ CAL ↓ NLPD ↓ SHP↓ N/A 0.716 SBERT Cosine similarity 0.696 N/A SBERT LR 0.684 0.091 SBERT Bayesian LR SBERT Sparse GP Regression 0.726 0.211 BERT LR BERT ConvLR BERT Bayesian LR BERT Bayesian ConvLR BERT LR MC dropout BERT ConvLR MC dropout N/A 0.838 0.806 N/A 0.867 0.625 0.811 0.714 0.838 0.280 0.814 0.194 N/A N/A 0.400 0.586 N/A N/A 5165 1043. 3.517 4.649 N/A 0.731 N/A 0.718 N/A N/A 0.672 0.038 1.325 1.609 0.723 0.129 N/A N/A 0.568 0.634 — N/A N/A 0.256 N/A N/A 1.506 0.241 0.116 1.604 0.427 0.021 N/A N/A N/A N/A 1.018 1.245 0.771 1.339 N/A 0.786 N/A 0.776 N/A N/A 0.005 0.768 0.619 0.011 0.770 0.523 0.795 0.199 0.137 0.788 0.240 0.153 N/A N/A N/A N/A 11081 0.005 0.017 1527. 0.188 5.060 0.158 8.447 N/A N/A 0.669 N/A 0.627 N/A 0.694 0.522 7606. 189.0 0.608 0.990 0.676 0.400 21.75 0.635 0.456 36.37 N/A N/A 0.009 0.086 0.160 0.138 Table 4: Results on EBMSASS, MedSTS and PeerRead test sets using models trained on general- purpose STS-B and Yelp for STS and SA, respectively. en-zh test r ↑ CAL ↓ NLPD ↓ SHP↓ ru-en test r ↑ CAL ↓ NLPD ↓ SHP↓ si-en test r ↑ CAL ↓ NLPD ↓ SHP↓ N/A 0.115 SBERT Cosine similarity 0.270 N/A SBERT LR 0.280 0.025 SBERT Bayesian LR SBERT Sparse GP Regression 0.384 0.026 N/A BERT LR BERT ConvLR N/A BERT Bayesian LR BERT Bayesian ConvLR BERT LR MC dropout BERT ConvLR MC dropout N/A N/A N/A N/A 0.155 0.908 0.143 0.892 N/A N/A 0.395 0.436 N/A N/A 0.385 0.726 11600 0.005 0.066 683.7 0.378 1.780 9.216 0.190 0.407 0.250 0.441 0.268 0.127 13.33 N/A N/A N/A 0.428 N/A N/A 0.616 N/A 0.223 0.771 0.625 0.013 0.207 0.776 0.626 0.007 N/A N/A 0.621 0.641 N/A N/A 0.644 0.515 11666 0.005 0.069 0.609 1.775 0.126 0.637 0.315 0.649 0.333 0.106 723.4 17.00 22.28 N/A N/A N/A N/A N/A 0.097 N/A N/A 0.397 N/A 0.193 0.934 0.371 0.013 0.191 0.931 0.366 0.010 N/A N/A N/A 0.504 0.524 N/A N/A N/A 0.506 0.568 10971 0.005 0.059 638.5 0.503 1.758 6.578 0.200 0.527 0.178 0.530 0.275 0.133 10.19 Table 5: Results for DA-style quality estimation over the three WMT language pairs. is substantially worse than with BERT. We spec- ulate this is due to poor feature representations. 
That is, on the STS task, we continue to fine-tune sentence embeddings over each STS dataset. As a result of being unable to fine-tune SBERT on SA (as there is no paired data), the representations for Yelp are pre-trained using SNLI only, which is neither task- nor domain-specific. Compared with the similarly sized STS-B where embeddings are fine-tuned, the performance gap for Yelp be- tween SBERT and BERT is more than 0.15, but less than 0.02 for STS-B. Equally, though we fine-tune SBERT for EBMSASS and MedSTS, each has fewer than 1k training instances. Poor domain-specific sentence embeddings result in gaps of 0.15 and 0.07. Meanwhile, for SBERT in the upper half of Table 4, the out-of-domain correlation on Peer- Read is extremely poor; the gap of 6 points on EBMSASS and MedSTS relative to in-domain results (0.78 in Table 2) further confirms our hypothesis. LR outperforms ConvLR in out-of-domain SA. In both point and Bayesian estimates, ConvLR performs better than LR (Table 4), similar to STS. 6.4 Cross-lingual Sentence-pair DA We evaluate on machine translation quality esti- mation (QE) over three language pairs using DA, using 7,000 training instances in each case. The results are shown in Table 5. We first observe that using embeddings directly from pretrained SBERT with cosine similarity underperforms other methods that involve fine-tuning. Traditional Bayesian LR and GP models achieve results competitive with deep uncer- tainty models when the input sentence embedding is expressive enough, and with smaller CAL and NLPD. Related uncertainty prediction work (Glushkova et al., 2021) argued that GPs are not competitive or easy to integrate with current neural architectures. In contrast, our results demonstrate that GPs can achieve comparable results to deep neural networks, while also being better calibrated. 688 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 8 3 2 0 2 9 9 5 1 / / t l a c _ a _ 0 0 4 8 3 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 N2C2-STS test MedSTS test PeerRead test r1 / r2 ↑ CAL ↓ NLPD ↓ r1 / r2 ↑ CAL ↓ NLPD ↓ r1 / r2 ↑ CAL ↓ NLPD ↓ 0.853 / 0.857 0.384 0.861 / 0.862 0.511 0.860 / 0.864 0.493 Semi-supervised: BERT LR +Du +D(cid:7) u BERT ConvLR 0.874 / 0.875 0.509 +Du 0.875 / 0.876 0.522 +D(cid:7) 0.875 / 0.879 0.535 u 0.682 / 0.663 0.568 0.687 / 0.673 0.624 0.743 / 0.729 0.630 Zero-shot: BERT LR + Du + D(cid:7) u BERT ConvLR 0.728 / 0.722 0.612 + Du 0.746 / 0.737 0.653 + D(cid:7) 0.763 / 0.748 0.628 u 6.571 9.232 8.476 11.51 13.50 11.44 17.08 40.10 23.67 21.06 47.68 40.32 0.858 / 0.859 0.860 / 0.861 0.863 / 0.866 0.846 / 0.853 0.846 / 0.855 0.857 / 0.864 0.158 0.224 0.181 0.201 0.215 0.222 3.903 5.267 4.758 5.968 6.403 6.129 0.686 / 0.686 0.655 / 0.656 0.720 / 0.720 0.691 / 0.692 0.671 / 0.683 0.699 / 0.697 0.370 0.394 0.340 0.346 0.453 0.374 15.95 19.26 19.89 16.98 25.50 21.78 0.786 / 0.795 0.796 / 0.797 0.793 / 0.792 0.199 0.266 0.296 0.776 / 0.788 0.240 0.790 / 0.794 0.332 0.809 / 0.810 0.303 5.060 11.94 8.907 8.447 16.45 15.26 0.669 / 0.676 0.023 / 0.006 0.678 / 0.675 0.400 1.728 0.495 0.456 0.627 / 0.635 0.138 / 0.119 1.748 0.483 0.656 / 0.659 21.75 387.3 52.72 36.37 546.1 57.77 Table 6: Results on three low-resource regression datasets: clinical STS: MedSTS, N2C2-STS, and PeerRead. r1 are the results without MC-dropout, while r2, CAL, and NLPD are based on applying 30 iterations of MC-dropout. 
There are two setups: (1) semi-supervised (upper half) = domain gold-labeled data is available; and (2) zero-shot (bottom half). In each case, Du = unlabeled data pool selected based on the model probability; D(cid:7) u = unlabeled data pool selected based on hyperparameter τ over the predicted std; and row 1,4,7,10 = baseline for each setting. ConvLR consistently outperforms LR for BERT-based models. In the cross-lingual scenario, SBERT models have smaller CAL and NLPD, and larger SHP, analogous to the monolingual setting. 7 Instance Selection Through Uncertainty In self-training, a model is first trained using la- beled data, then used to predict labels for unlabeled data instances. Instances with higher-probability predictions are then adopted as pseudo-labels, and used to re-train the model in conjunction with the labeled training data. Active learning is simi- lar, expect that instances are selected for explicit human labelling rather than pseudo-labeled, of- ten based on estimates of model confidence or uncertainty. In both tasks, accurate estimation of labelling (un)certainty is critical. In this section, we evaluate the uncertainty- based instance selection method from Section 4.2 in the settings of self-training and active learn- ing, over the tasks of STS, SA rating, and cross- lingual DA. 7.1 Self-training STS and SA In self-training, we experiment in both semi- supervised (limited gold-standard training data) and zero-shot scenarios, over three low-resource datasets: MedSTS, N2C2-STS, and PeerRead. Experimental Setup: As we require high cor- relation to ensure high-quality pseudo-labels, and lower CAL and NLPD to guarantee that predic- tions are neither over- nor under-confident, we employ MC-dropout over LR and ConvLR. Addi- tionally, to alleviate domain data sparsity, we first fine-tune the regressor on two general datasets— STS-B for STS and Yelp for SA (general-purpose STS/SA)—also providing the proxy for the zero- shot setting. We continue to fine-tune on domain training data in the semi-supervised scenario, and predict (μ, σ) for Du by applying dropout 30 times. All results in Table 6 are obtained using train batch size = 16, learning rate = 2e-5, and training epochs = 3. Unlabeled Data Pool: For clinical STS, we extract sentences from MIMIC-III covering topics of medication, diagnosis, follow-up instructions, and test, then synthetically balance across each unit score interval, resulting in 1,534 sentence pairs, which we denote as Du. For PeerRead, we use 1,014 reviews from ICLR 2017 without labels as Du. To expand Du in the zero-shot setting, we remove the gold-standard labels and integrate the resulting unlabeled data into Du. 689 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 8 3 2 0 2 9 9 5 1 / / t l a c _ a _ 0 0 4 8 3 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Results and Analysis: As seen in Table 6, semi-supervision improves correlation, at the cost of being more uncertain and miscalibrated, with larger CAL and NLPD. Predictive confidence threshold selection can further improve the ac- curacy. It also effectively calibrates the model, resulting in much lower CAL and NLPD, com- pared with directly incorporating unlabeled data (‘‘+Du’’). In the zero-shot setting, CAL and NLPD increase for all tasks under both LR and ConvLR with Du, making predictions less reliable, especially for PeerRead where the model totally collapses. 
This matches our intuition that the distribution of the pseudo-labeled data differs from the true distribu- tion, and that learning from this data impedes the model. This problem is alleviated by retaining only the highly confident subset D(cid:7) u, as its distribution is closer to the gold-standard for well-calibrated models. This is also consistent with the observa- tion that CAL and NLPD in the zero-shot setting are much larger than in the semi-supervised set- ting, as the latter benefits from the guidance of the gold-standard distribution. Note that if we merely assess the model with Pearson correlation as in most previous work, we can only observe the improvement due to data augmentation, neglecting the risk of the model being more miscalibrated, and producing less re- liable predictions. Further, CAL and NLPD are useful metrics to evaluate the effectiveness of the data sampling strategy used in self-training. 7.2 Cross-lingual DA We evaluate self-training and active-learning on DA-based machine translation quality estimation using BERT LR. Experimental Setup: We use three language pairs: WMT 2020 DA en-zh, ru-en, and si-en, in each case splitting the original 7k training instances into a training set D of 3k instances and 4k unlabeled data pool Du, keeping the original validation and test sets. The lr is set to 2e-5, and training epochs and batch size are tuned by grid search over the validation set based on the range [1,2,3,4,5] × [16, 32]. Other settings follow STS and SA above, but without a general-purpose base model. As a baseline, we use D fine-tuned on the validation set, and evaluate the best configuration on test. en-zh (high) rdev ↑ rtest ↑ ru-en (medium) rdev ↑ rtest ↑ si-en (low) rdev ↑ rtest ↑ Baseline 0.407 + pseudo Du 0.434 + D(cid:7) 0.438 u + D(cid:7) ∪ D(cid:7) 0.445 u a + gold Du 0.453 0.374 0.400 0.404 0.422 0.592 0.604 0.606 0.615 0.599 0.619 0.603 0.628 0.427 0.449 0.443 0.466 0.478 0.488 0.482 0.496 0.395 0.600 0.621 0.466 0.504 Table 7: Results for DA-based quality estimation in WMT 2020 (dev/test) for three language pairs: en-zh, ru-en and si-en. Baseline = training with 3,000 gold-labeled instances. Row ‘‘+ D(cid:7) a’’ u is active learning. ∪ D(cid:7) Results and Analysis: As shown in Table 7, directly incorporating pseudo Du substantially outperforms baselines for all three language pairs. This differs from the results for STS and SA in the semi-supervised setting, but is consistent with the results in the zero-shot setting. It indicates that a high-performance model requires high-quality data to further gain improvements; lower-quality models are more tolerant to lower data quality. We select the most confident 1,904, 1,985, and 2,462 instances with τ = 0.15, 0.13 and 0.19 for en-zh, ru-en and si-en, respectively. Equal or higher performance is achieved when this subset of instances is added to the training data, as compared to the complete Du. u Simulating active learning, we also explore the annotation of Du − D(cid:7) u with human gold scores, i.e. D(cid:7) ∪ D(cid:7) a. The results show that with D(cid:7) a, our model achieves results competitive with using all of Du with gold labels. This reveals that it is not necessary to annotate the entire dataset, but we can focus on the subset where the model is not confident. In this way, data annotation is more efficient, and models generalize better over unseen data. 8 Analysis In this section, we conduct further analysis to better understand the results of the experiments. 
Qualitative Comparison: In both in-domain and out-of-domain evaluation, end-to-end train- ing based on BERT, particularly BBB estimation, obtains much larger NLPD than pipeline train- ing based on SBERT, especially GP regression. We speculate that end-to-end uncertainty mod- els are confident for both correct and incorrect 690 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 8 3 2 0 2 9 9 5 1 / / t l a c _ a _ 0 0 4 8 3 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 STS-B test EBMSASS test MedSTS test Yelp test r ↑ CAL ↓ NLPD ↓ SHP↓ r ↑ CAL ↓ NLPD ↓ SHP↓ r ↑ CAL ↓ NLPD ↓ SHP↓ r ↑ CAL ↓ NLPD ↓ SHP↓ 0.833 N/A simCSE Cosine simCSE LR 0.849 N/A simCSE Bayes LR 0.850 0.051 simCSE Sparse GP 0.853 0.002 N/A N/A 0.700 N/A N/A N/A 0.703 N/A 0.381 0.891 0.738 0.048 0.368 0.960 0.757 0.210 N/A N/A 0.696 N/A N/A N/A 0.675 N/A 0.102 0.900 0.693 0.002 0.218 0.962 0.694 0.034 N/A — N/A N/A N/A 0.688 N/A N/A 0.295 0.885 0.668 0.005 0.346 0.960 0.681 0.004 N/A N/A N/A N/A 0.377 0.846 0.360 0.880 Table 8: Pipeline model results for the simCSE sentence encoder (Gao et al., 2021). SBERT Sparse GP BERT Bayesian LR (BBB) Incorrect Predictions: S1: You will want to clean the area first. S2: You will also want to remove the seeds. Gold score = 0 Prediction: 2.22 ± 1.62 1.95 ± 0.0037 Correct Predictions: S1: He was referring to ..., ... last Sunday. S2: Next week, ... Sunday ..., will take up his position. Gold score = 4 Prediction: 3.89 ± 1.58 4.14 ± 0.0056 Table 9: Predictions for two STS-B examples by GP regression and BBB. predictions, i.e. have small variance over all in- stances, thus resulting in the smaller SHP and larger NLPD. Meanwhile, models with extremely small NLPD are less confident in inaccurate pre- dictions, and might also be under-confident in correct predictions. We score sentence pairs in the STS-B test set using BERT Bayesian LR (BBB) and SBERT GP.5 Overall, the incorrect predictions (> 1 de
the true score) by BBB have a much smaller
variance compared to those predicted by GP. Para
correct predictions (≤ 1 of the true score), BBB
has a higher variance than for incorrect predic-
ciones, which is counter-intuitive. Though the std
for SBERT GP regression on correct predictions
is much larger than BBB, it’s slightly less than
that for incorrect ones. This fits the expectation
that when a model is good at uncertainty pre-
diction, the model should be more confident for
correct predictions than incorrect ones. Examples
where both models are correct and incorrect are
presented in Table 9.

The near-zero variance of BBB (0.005 on average) results in infinite NLPD because of the (ti − mi)²/vi element in the NLPD formula. Larger SHP of GP tends to produce smaller NLPD in spite

5These two were chosen because they have similar r, but one has the largest NLPD and the other has the smallest.

of being under-confident on correct cases—the
variance of 1.57 is much larger than the true gap
of 0.01. So NLPD is not a perfect metric, favour-
ing under-confident models. We therefore suggest
a metric priority order of r, CAL, NLPD and SHP.
Impact of Sentence Embedding: The quality
of sentence embeddings is critical for uncertainty
training, affecting not only the correlation, pero
also the uncertainty metrics. Instead of SBERT,
we also experimented with simCSE, the current
state-of-the-art sentence encoder
(Gao et al.,
2021). We train three pipeline models with STS-B
training data based on sup-simcse-bert-base-
uncased, using the same settings as the first row of
Mesa 3, and evaluate on the STS-B, ENMSASS,
and MedSTS test sets. En mesa 8, contrasting with
the results in Table 2 for STS-B and Yelp, y
results in Table 4 for EBMSASS and MedSTS,
the correlation improves for all datasets other
than MedSTS, and CAL and NLPD drop. Este
suggests that better sentence encoders boost pipe-
line performance.

High-disagreement Label Detection: A natu-
ral question to ask in the instance selection is what
types of instances are selected and discarded,
and how this correlates with the underlying la-
bel uncertainty in the data. When models are
well-calibrated, the predicted variance will reflect
the true label uncertainty, both aleatoric and epis-
temic. As such, if we select instances with smaller
variance, we are effectively filtering out instances
with higher inherent label uncertainty, as should
be reflected in the labels assigned by independent
annotators. We verify this hypothesis below.

We apply the model fine-tuned on STS-B over
BIOSSES and EBMSASS (1000 instances each),
for which five raw annotations for each instance
can be accessed to approximate an empirical
label distribution. KL-Divergence (KL) is used to measure the distance between the predicted and empirical probability. In Table 10, the trend
in KL values on the two datasets is consistent
with CAL/NLPD across all estimation methods,


                 EBMSASS                                    BIOSSES
             r ↑   CAL / NLPD ↓   KL1 / KL2 ↓          r ↑   CAL / NLPD ↓   KL1 / KL2 ↓

LR MC        0.828  0.236 / 3.319   8.75 / 1.23        0.870  0.250 / 4.488    8.82 / 1.54
ConvLR MC    0.806  0.201 / 4.668   12.74 / 1.46       0.823  0.304 / 12.59    19.64 / 2.06
LR BBB       0.854  0.633 / 5351.   16297 / 5.00       0.836  0.530 / 11972    16598 / 4.90
ConvLR BBB   0.806  0.736 / 1091.   2373.7 / 4.13      0.804  0.923 / 2076.    2631.2 / 5.01

Table 10: Intrinsic metrics results on EBMSASS 1000 and BIOSSES based on a model trained on STS-B. KL1 = KL-Divergence(p‖q), KL2 = KL-Divergence(q‖p): p = gold empirical distribution; q = predicted distribution.

indirectly suggesting that CAL and NLPD are ef-
fective metrics in the absence of empirical label
distributions.

Do the large-variance instances selected by the strategies in Section 4.2 overlap with high-disagreement instances? Without ground-truth high-disagreement annotations, we identify them iteratively in two steps, as sketched below: (1) select labels whose standard deviation across annotators is greater than α, starting from α = 0.3; and (2) manually check whether, for all selected instances, at least two of the five annotations differ from the others by ≥ 1.0; if not, set α += 0.1 and repeat, otherwise stop. This results in 137 and 31 label disagreements at α = 0.5 and 0.4, for EBMSASS and BIOSSES, respectively.
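
A minimal sketch of this selection loop is given below, assuming the raw annotations are available as five scores per instance; the check against the median is a programmatic stand-in for the manual inspection described above.

    import numpy as np

    def find_high_disagreement(annotations, alpha=0.3, step=0.1, gap=1.0):
        """Raise the std threshold alpha until every selected instance has at
        least two annotations that differ from the rest by >= gap.
        annotations: array of shape (n_instances, n_annotators)."""
        annotations = np.asarray(annotations, dtype=float)
        stds = annotations.std(axis=1)
        while True:
            selected = np.where(stds > alpha)[0]
            # Stand-in for the manual check: count annotations far from the median.
            ok = all(
                (np.abs(annotations[i] - np.median(annotations[i])) >= gap).sum() >= 2
                for i in selected
            )
            if ok:
                return selected, alpha
            alpha += step

    # Toy usage with made-up scores on a 0-5 similarity scale.
    toy = [[3, 3, 3, 3, 3], [1, 4, 4, 4, 1], [2, 2, 3, 2, 2]]
    idx, final_alpha = find_high_disagreement(toy)
    print(idx, final_alpha)
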

Using BERT LR MC-dropout, a learned threshold of τ = 0.162 results in Acc = 0.48 and F1 = 0.28 for high-disagreement label detection on EBMSASS. For BIOSSES, τ = 0.1 leads to Acc = 0.37 and F1 = 0.48. Under ConvLR MC, EBMSASS has Acc = 0.46 and F1 = 0.31 at τ = 0.124; for BIOSSES, τ = 0.157 gives Acc = 0.45 and F1 = 0.48.
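
These scores come from treating the predicted variance as the score of a binary high-disagreement detector; a hedged sketch of that evaluation is shown below (the inputs are made up, and scikit-learn is used only for the metrics).

    import numpy as np
    from sklearn.metrics import accuracy_score, f1_score

    def evaluate_variance_detector(pred_var, is_high_disagreement, tau):
        """Flag an instance as high-disagreement when its predicted variance exceeds tau."""
        flagged = (np.asarray(pred_var) > tau).astype(int)
        gold = np.asarray(is_high_disagreement)
        return accuracy_score(gold, flagged), f1_score(gold, flagged)

    # Toy inputs: predicted variances and gold disagreement flags (both made up).
    variances = [0.05, 0.20, 0.12, 0.31, 0.08]
    gold_flags = [0, 1, 0, 1, 0]
    acc, f1 = evaluate_variance_detector(variances, gold_flags, tau=0.162)
    print(f"Acc={acc:.2f}, F1={f1:.2f}")
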

As such, high-disagreement labels can be detected by the large-variance criterion, obtaining Acc = 0.44 and F1 = 0.39 on average. This is not strong as a binary classifier, since labeling all instances as the majority class ''clean'' performs better. In our context, however, it is effective as a data augmentation strategy for selecting clean examples from an out-of-domain corpus. Detecting noisy labels is not simply a binary classification task requiring high accuracy; what matters is recognizing and filtering noisy instances from the whole training corpus, even at the cost of discarding some clean labels.

9 Conclusion

We comprehensively investigated a range of
uncertainty estimation methods over different re-
gression tasks, using pre-trained language models.

Bayesian linear regression and sparse Gaussian
process regression based on fixed features ob-
tain lower calibration error and NLPD compared
with fine-tuning large-capacity deep networks
end-to-end, but are inferior in terms of correlation.
When embeddings are sufficiently expressive,
they are comparable in performance to deep un-
certainty models.

To reduce uncertainty resulting from noisy la-
bels and limited labeled data in specific domains,
we proposed a simple instance selection method
based on uncertainty model predictive confidence.
This approach demonstrated consistent perfor-
mance improvements on three regression tasks in
both self-training and active-learning settings, underscoring its effectiveness and generalizability.

Acknowledgments

We thank the anonymous reviewers and three
editors for their helpful comments. Yuxia Wang is
supported by scholarships from the University of
Melbourne and China Scholarship Council (CSC).

References

Moloud Abdar, Farhad Pourpanah, Sadiq Hussain, Dana Rezazadegan, Li Liu, Mohammad Ghavamzadeh, Paul Fieguth, Xiaochun Cao, Abbas Khosravi, U. Rajendra Acharya, Vladimir Makarenkov, and Saeid Nahavandi. 2020. A review of uncertainty quantification in deep learning: Techniques, applications and challenges. arXiv preprint arXiv:2011.06225. https://doi.org/10.1016/j.inffus.2021.05.008

Daniel Beck, Trevor Cohn, and Lucia Specia. 2014. Joint emotion analysis via multi-task Gaussian processes. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1798–1803. https://doi.org/10.3115/v1/D14-1190

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, volume 382, pages 41–48. ACM. https://doi.org/10.1145/1553374.1553380


Zsolt Bitvai and Trevor Cohn. 2015. Predicting peer-to-peer loan rates using Bayesian non-linear regression. In Proceedings of the AAAI Conference on Artificial Intelligence.

Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. 2015. Weight uncertainty in neural networks. In International Conference on Machine Learning, pages 1613–1622. http://proceedings.mlr.press/v37/blundell15.pdf.

Enrico Camporeale and Algo Carè. 2020. Estimation of accurate and calibrated uncertainties in deterministic models. CoRR, abs/2003.05103.

Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada. https://www.aclweb.org/anthology/S17-2001.

Chacha Chen, Junjie Liang, Fenglong Ma, Lucas M. Glass, Jimeng Sun, and Cao Xiao. 2020. UNITE: Uncertainty-based health risk prediction leveraging multi-sourced data. arXiv preprint arXiv:2010.11389. https://doi.org/10.1145/3442381.3450087

Hyung Won Chung, Thibault Févry, Henry Tsai, Melvin Johnson, and Sebastian Ruder. 2020. Rethinking embedding coupling in pre-trained language models. arXiv preprint arXiv:2010.12821.

Courtney Corley and Rada Mihalcea. 2005. Measuring the semantic similarity of texts. In Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, pages 13–18. https://doi.org/10.3115/1631862.1631865

Sharon E. Davis, Thomas A. Lasko, Guanhua Chen, Edward D. Siew, and Michael E. Matheny. 2017. Calibration drift in regression and machine learning models for acute kidney injury. Journal of the American Medical Informatics Association, 24(6):1052–1061. https://doi.org/10.1093/jamia/ocx030, PubMed: 28379439

Shrey Desai and Greg Durrett. 2020. Calibration of pre-trained transformers. arXiv preprint arXiv:2003.07892. https://doi.org/10.18653/v1/2020.emnlp-main.21

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. https://www.aclweb.org/anthology/N19-1423.

Piero Esposito. 2020. BLiTZ – Bayesian Layers in Torch Zoo (a Bayesian deep learning library for Torch). https://github.com/piEsposito/blitz-bayesian-deep-learning/.

Yarin Gal and Zoubin Ghahramani. 2016. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059. http://proceedings.mlr.press/v48/gal16.pdf.

Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple contrastive learning of sentence embeddings. In Empirical Methods in Natural Language Processing (EMNLP).

Taisiya Glushkova, Chrysoula Zerva, Ricardo Rei, and André F. T. Martins. 2021. Uncertainty-aware machine translation evaluation. CoRR, abs/2109.06352. https://doi.org/10.18653/v1/2021.findings-emnlp.330

Tilmann Gneiting, Fadoua Balabdaoui, and Adrian E. Raftery. 2007. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(2):243–268. https://doi.org/10.1111/j.1467-9868.2007.00587.x

Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2017. Can machine translation systems be evaluated by the crowd alone? Natural Language Engineering, 23(1):3–30. https://doi.org/10.1017/S1351324915000339

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. On calibration of modern neural networks. In International Conference on Machine Learning, pages 1321–1330.


Hamed Hassanzadeh, Anthony Nguyen, and Karin Verspoor. 2019. Quantifying semantic similarity of clinical evidence in the biomedical literature to facilitate related evidence synthesis. Journal of Biomedical Informatics. https://doi.org/10.1016/j.jbi.2019.103321, PubMed: 31676460

José Miguel Hernández-Lobato and Ryan Adams. 2015. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning, pages 1861–1869. http://proceedings.mlr.press/v37/hernandez-lobatoc15.pdf.

Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438. https://doi.org/10.1162/tacl_a_00324

Dongyeop Kang, Waleed Ammar, Bhavana Dalvi, Madeleine van Zuylen, Sebastian Kohlmeier, Eduard Hovy, and Roy Schwartz. 2018. A dataset of peer reviews (PeerRead): Collection, insights and NLP applications. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1647–1661, New Orleans, Louisiana. Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-1149

Brian Keith, Exequiel Fuentes, and Claudio Meneses. 2017. A hybrid approach for sentiment analysis applied to paper reviews. In Proceedings of the ACM SIGKDD Conference.

Alex Kendall and Yarin Gal. 2017. What uncertainties do we need in Bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems, pages 5574–5584. https://proceedings.neurips.cc/paper/2017/hash/2650d6089a6d640c5e85b2b88265dc2b-Abstract.html.

Volodymyr Kuleshov, Nathan Fenner, and Stefano Ermon. 2018. Accurate uncertainties for deep learning using calibrated regression. In International Conference on Machine Learning, pages 2796–2804. http://proceedings.mlr.press/v80/kuleshov18a/kuleshov18a.pdf.

M. Pawan Kumar, Benjamin Packer, and Daphne Koller. 2010. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, volume 1. https://papers.nips.cc/paper/2010/file/e57c6b956a6521b28495f2886ca0977a-Paper.pdf.

Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. 2017. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems. https://arxiv.org/pdf/1612.01474.pdf

Max-Heinrich Laves, Sontje Ihler, Jacob F. Fast, Lüder A. Kahrs, and Tobias Ortmaier. 2020. Well-calibrated regression uncertainty in medical imaging with deep learning. In Medical Imaging with Deep Learning, pages 393–412. http://proceedings.mlr.press/v121/laves20a/laves20a.pdf.

Felix Leibfried, Vincent Dutordoir, S. T. John, and Nicolas Durrande. 2020. A tutorial on sparse Gaussian processes and variational inference. arXiv preprint arXiv:2012.13962.

Lucia Specia, Marina Fomicheva, Frédéric Blain, Paco Guzmán, Vishrav Chaudhary, Erick Fonseca, and André Martins. 2020. WMT 2020 quality estimation dataset. https://www.statmt.org/wmt20/qualityestimation-task.html.

Kristian Miok, Gregor Pirs, and Marko Robnik-Sikonja. 2020. Bayesian methods for semi-supervised text annotation. arXiv preprint arXiv:2010.14872.

Robert Pinsler, Jonathan Gordon, Eric Nalisnick, and José Miguel Hernández-Lobato. 2019. Bayesian batch active learning as sparse subset approximation. In Advances in Neural Information Processing Systems, volume 32, pages 6359–6370. https://proceedings.neurips.cc/paper/2019/file/84c2d4860a0fc27bcf854c444fb8b400-Paper.pdf.

Daniel Preoţiuc-Pietro and Trevor Cohn. 2013. A temporal model of text periodicities using Gaussian processes. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 977–988.


Puria Radmard, Yassir Fathullah, and Aldo Lipani. 2021. Subsequence based deep active learning for named entity recognition. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1–6, 2021, pages 4310–4321. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-long.332

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1410

C. Rasmussen and C. Williams. 2005. Gaussian processes for machine learning. https://doi.org/10.7551/mitpress/3206.001.0001

Omkar Sabnis. 2018. Yelp review dataset. https://www.kaggle.com/omkarsabnis/yelp-reviews-dataset

Burr Settles. 2009. Active learning literature survey. University of Wisconsin-Madison Department of Computer Sciences. http://burrsettles.com/pub/settles.activelearning.pdf.

Burr Settles and Mark Craven. 2008. An analysis of active learning strategies for sequence labeling tasks. In 2008 Conference on Empirical Methods in Natural Language Processing, EMNLP 2008, Proceedings of the Conference, 25–27 October 2008, Honolulu, Hawaii, USA, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1070–1079. ACL. https://doi.org/10.3115/1613715.1613855

Aili Shen, Daniel Beck, Bahar Salehi, Jianzhong Qi, and Timothy Baldwin. 2019. Modelling uncertainty in collaborative document quality assessment. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pages 191–201, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-5525

Joachim Sicking, Maram Akila, Maximilian Pintz, Tim Wirtz, Asja Fischer, and Stefan Wrobel. 2021. A novel regression loss for non-parametric uncertainty optimization. arXiv preprint arXiv:2101.02726.

Gizem Soğancıoğlu, Hakime Öztürk, and Arzucan Özgür. 2017. BIOSSES: A semantic sentence similarity estimation system for the biomedical domain. Bioinformatics, 33(14):i49–i58. https://doi.org/10.1093/bioinformatics/btx238, PubMed: 28881973

Hao Song, Tom Diethe, Meelis Kull, and Peter Flach. 2019. Distribution calibration for regression. In International Conference on Machine Learning, pages 5897–5906. http://proceedings.mlr.press/v97/song19a/song19a.pdf

Nandan Thakur, Nils Reimers, Johannes Daxenberger, and Iryna Gurevych. 2020. Augmented SBERT: Data augmentation method for improving bi-encoders for pairwise sentence scoring tasks. arXiv preprint arXiv:2010.08240. https://doi.org/10.18653/v1/2021.naacl-main.28

Michalis Titsias. 2009. Variational learning of inducing variables in sparse Gaussian processes. In Artificial Intelligence and Statistics, pages 567–574. http://proceedings.mlr.press/v5/titsias09a/titsias09a.pdf.

Juozas Vaicenavicius, David Widmann, Carl Andersson, Fredrik Lindsten, Jacob Roll, and Thomas Schön. 2019. Evaluating model calibration in classification. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3459–3467. http://proceedings.mlr.press/v89/vaicenavicius19a/vaicenavicius19a.pdf.

Yu Wan, Baosong Yang, Derek F. Wong, Yikai Zhou, Lidia S. Chao, Haibo Zhang, and Boxing Chen. 2020. Self-paced learning for neural machine translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16–20, 2020, pages 1074–1080. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.80


Yanshan Wang, Naveed Afzal, Sunyang Fu, Liwei Wang, Feichen Shen, Majid Rastegar-Mojarad, and Hongfang Liu. 2018. MedSTS: A resource for clinical semantic textual similarity. Language Resources and Evaluation, pages 1–16. https://doi.org/10.1007/s10579-018-9431-1

Yanshan Wang, Sunyang Fu, Feichen Shen, Sam Henry, Ozlem Uzuner, and Hongfang Liu. 2020a. The 2019 n2c2/OHNLP track on clinical semantic textual similarity: Overview. JMIR Medical Informatics, 8(11). https://doi.org/10.2196/23375

Yuxia Wang, Fei Liu, Karin Verspoor, and Timothy Baldwin. 2020b. Evaluating the utility of model configurations and data augmentation on clinical semantic textual similarity. In Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing, pages 105–111, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.bionlp-1.11

Yuxia Wang, Karin Verspoor, and Timothy Baldwin. 2020c. Learning from unlabeled data for clinical semantic textual similarity. In Proceedings of the 3rd Clinical NLP Workshop, Online. EMNLP. https://doi.org/10.18653/v1/2020.clinicalnlp-1.25

Boyang Xue, Jianwei Yu, Junhao Xu, Shansong Liu, Shoukang Hu, Zi Ye, Mengzhe Geng, Xunying Liu, and Helen Meng. 2021. Bayesian transformer language models for speech recognition. arXiv preprint arXiv:2102.04754. https://doi.org/10.1109/ICASSP39728.2021.9414046

Eric Zelikman, Christopher Healy, Sharon Zhou, and Anand Avati. 2020. CRUDE: Calibrating regression uncertainty distributions empirically. arXiv preprint arXiv:2005.12496.
