Collective Human Opinions in Semantic Textual Similarity
Yuxia Wang♠
Shimin Tao♣
Timothy Baldwin♠♥
Ning Xie♣
Karin Verspoor♠♦
Hao Yang♣
♠ The University of Melbourne, Melbourne, Victoria, Australia
♣ Huawei TSC, Beijing, China
♥ MBZUAI, Abu Dhabi, UAE
♦RMIT University, Melbourne, Victoria, Australia
yuxiaw@student.unimelb.edu.au
karin.verspoor@rmit.edu.au
{taoshimin,nicolas.xie,yanghao30}@huawei.com
tb@ldwin.net
Abstract
Despite the subjective nature of seman-
tic textual similarity (STS) and pervasive
disagreements in STS annotation, existing
benchmarks have used averaged human rat-
ings as gold standard. Averaging masks the
true distribution of human opinions on exam-
ples of low agreement, and prevents models
from capturing the semantic vagueness that the
individual ratings represent. In this work, we
introduce USTS, the first Uncertainty-aware
STS dataset with ∼15,000 Chinese sentence
pairs and 150,000 labels,
to study collec-
tive human opinions in STS. Analysis reveals
that neither a scalar nor a single Gaussian
fits a set of observed judgments adequately.
We further show that current STS models
cannot capture the variance caused by hu-
man disagreement on individual instances, but
rather reflect the predictive confidence over
the aggregate dataset.
1 Introduction

Semantic textual similarity (STS) is a fundamental natural language understanding (NLU) task, involving the prediction of the degree of semantic equivalence between two pieces of text (S1, S2). STS has been approached in various ways, ranging from early efforts using string- or knowledge-based measures and count-based co-occurrence models (Resnik, 1999; Barrón-Cedeño et al., 2010; Matveeva et al., 2005), to modern neural networks.
Broadly speaking, the goal of the STS task
is to train models to make a similarity assess-
ment that matches what a human would make.
Gold-standard scores are typically assigned by
asking multiple raters to label a pair of sentences
and then taking the average (Agirre et al., 2012, 2013, 2014, 2015, 2016; Marelli et al., 2014; Soğancıoğlu et al., 2017; Wang et al., 2018). The
underlying assumption here is that there is a sin-
gle ‘‘true’’ similarity score between S1 and S2,
and that this label can be approximated by aver-
aging multiple—possibly noisy—human ratings.
While this assumption might be reasonable in set-
tings such as educational testing with well-defined
knowledge or norms (Trask and Trask, 1999), it
is not the case for more subjective NLU tasks.
Pavlick and Kwiatkowski (2019) show that in
natural language inference (NLI), disagreements
often persist even if more ratings are collected or
when the amount of context provided to raters is
increased. High disagreement has been observed
in a number of existing NLI datasets (Nie et al.,
2020). In STS, concerns about inconsistent judg-
ments have been raised, particularly for difficult
boundary cases in complex domains, where even
expert annotators can disagree about the ‘‘true’’
label (Wang et al., 2020; Olmin and Lindsten,
2022). Identifying and discarding ‘‘noisy’’ labels
during training can reduce generalization error
(Wang et al., 2022a,b). We reexamine whether the disagreement observed among raters should be attributed to ‘‘noise’’ and resolved by dismissing it, or should rather be treated as an inherent quality of the STS labels. Specifically, our primary
contributions are:
1. We develop USTS, the first Uncertainty-
aware STS dataset with a total of ∼15,000
Chinese sentence pairs and 150,000 labels.
We study the human assessments and in-
vestigate how best to integrate them into a
gold label across varying degrees of observed
human disagreement.
2. We show that state-of-the-art STS models
cannot capture disagreement when trained
using a single averaged rating, and argue that
STS evaluation should incentivize models to
predict distributions over human judgments,
especially for cases of low agreement.
3. We discuss the practicalities of transferring
labels across languages in building a multi-
lingual STS corpus, and present evidence to
suggest that this may be problematic in the
continuous labeling space.
2 Background
2.1 Semantic Textual Similarity Task
Data Collection and Annotation: As STS re-
quires a sentence pair, to construct a dataset,
ideally sentence pairs should be sampled to popu-
late the spectrum of differing degrees of semantic
equivalence, which is a huge challenge. If pairs
of sentences are taken at random, the vast major-
ity would be totally unrelated, and only a very
small fraction would have some degree of seman-
tic equivalence (Agirre et al., 2012). Accordingly,
previous work has either resorted to string simi-
larity metrics (e.g., edit distance or bag-of-word
overlap) (Agirre et al., 2013, 2014, 2015, 2016;
Soğancıoğlu et al., 2017; Wang et al., 2018),
or reused existing datasets from tasks related to
STS, such as paraphrasing based on news/video
descriptions
(Agirre et al., 2012) and NLI
(Marelli et al., 2014).
In terms of annotation, for general text (e.g.,
news, glosses, or image descriptions), it has mostly
been performed using crowdsourcing via plat-
forms such as Amazon Mechanical Turk with five
crowd workers (Cer et al., 2017). For knowledge-
rich domains such as clinical and biomedical text,
on the other hand, a smaller number of expert
annotators has been used, such as two clinical
experts for MedSTS (Wang et al., 2018). Raters
are asked to assess similarity independently on the
basis of semantic equivalence using a continuous
value in range [0, 5]. Then a gold label is computed
by averaging these human ratings.
Is Averaging Appropriate? Averaging has been the standard approach to generating gold labels since Lee et al. (2005). However, this approach relies on the assumption that there is a well-defined gold-standard interpretation + score, and that any variance in independent ratings is arbitrary rather than due to systematic differences in interpretation. An example of this
effect can be seen in case No. 1 in Table 1. In practice, however, high levels of disagreement can be observed among annotators in different domains.2

No. 1 (LOW HUMAN DISAGREEMENT)
S1: Kenya Supreme Court upholds election result.
S2: Kenya SC upholds election result.
Old label: 5.0
New label: N (μ = 4.9, σ = 0.1)
Annotations: [4.5, 4.7, 4.8, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0]
Prediction: 3.5
Reason: Lack of knowledge of the correspondence between Supreme Court and SC.

No. 2 (HIGH HUMAN DISAGREEMENT)
S1: A man is carrying a canoe with a dog.
S2: A dog is carrying a man in a canoe.
Old label: 1.8
New label: N (μ = 1.7, σ = 1.0)
Annotations: [0.0, 0.3, 0.5, 0.5, 1.2, 1.5, 1.5, 1.8, 2.0, 2.0, 2.0, 2.0, 2.5, 3.5, 3.5]
Prediction: 4.3
Reason: Uncertainty about the impact of key differences in event participants on instances of high lexical overlap.

No. 3 (HIGH HUMAN DISAGREEMENT)
S1: Someone is grating a carrot.
S2: A woman is grating an orange food.
Old label: 2.5
New label: N (μ = 2.4, σ = 1.1)
Annotations: [0.5, 1.0, 1.0, 1.8, 1.8, 1.8, 2.0, 2.2, 2.5, 3.0, 3.0, 3.2, 3.5, 3.6, 4.5]
Prediction: 0.6
Reason: Failure to associate carrot with orange food.

Table 1: Examples with varying levels of human disagreement from the STS-B validation set. ‘‘Old label’’ = gold label of STS-B; ‘‘New label’’ = full distribution aggregated from 15 new ratings; and ‘‘Prediction’’ = similarity score predicted by a SOTA STS model.1

1The individual annotations for STS-B are not available, so we collected new ratings from 15 PhD NLPers. bert-base fine-tuned on the STS-B training data (r = 0.91) is used for prediction, the same model as used in Section 3.1.1 for selection.
2σ > 0.5 for 9% and 11% of pairs in the biomedical STS corpora BIOSSES and EBMSASS; inter-annotator agreement Cohen's κ = 0.60/0.67 for two clinical datasets (Wang et al., 2020).
In such cases, a simple average fails to
capture the latent distribution of human opin-
ions/interpretations, and masks the uncertain
nature of subjective assessments. With Nos. 2 and
3 in Table 1, for example, the average scores μ of
1.7 and 2.4 do not convey the fact that the ratings
vary substantially (σ > 1.0). While the integrated
score may reflect the average opinion, it neither
captures the majority viewpoint nor exposes the
inherent disagreements among raters. Put differ-
ently, not all average scores of a given value
convey the same information. Consider three
scenarios that all average to 3.0: (3,3,3,3,3)/5,
(1,3.5,3.5,3.5,3.5)/5, and (2,4,2,4,3)/5. The inher-
ent level of human agreement varies greatly in
these three cases.
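As a quick check of this point, the following sketch computes the mean and spread of the three rating sets above (population standard deviation is used here purely for illustration; the paper does not state its convention):

    import statistics

    rating_sets = [
        [3, 3, 3, 3, 3],
        [1, 3.5, 3.5, 3.5, 3.5],
        [2, 4, 2, 4, 3],
    ]
    for ratings in rating_sets:
        mu = statistics.mean(ratings)
        sigma = statistics.pstdev(ratings)  # population standard deviation
        print(f"ratings={ratings}  mu={mu:.1f}  sigma={sigma:.2f}")

All three sets share μ = 3.0, but their standard deviations range from 0.0 to 1.0, which is exactly the information a single averaged label discards.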
Looking to the system predictions, the model
prediction of 3.5 for No. 1 in Table 1 is clearly
incorrect, as it lies well outside the (tight) range of
human annotations in the range [4.5, 5.0]. While
the model prediction of 4.3 for No. 2 also lies out-
side the annotation range of [0.0, 3.5], it is closer
to an extremum, and there is much lower agree-
ment here, suggesting that the prediction is better
than that for No. 1. No. 3 seems to be better again,
as the model prediction of 0.6 is both (just) within
the annotation range of [0.5, 4.5] and closer to the
average for a similarly low-agreement instance.
Based on the standard evaluation methodology in
STS research of calculating the Pearson correla-
tion over the mean rating, however, No. 1 would
likely be assessed as being a more accurate pre-
diction than Nos. 2 or 3, based solely on how close
the scalar prediction is to the annotator mean. A
more nuanced evaluation should take into con-
sideration the relative distribution of annotator
scores, and assuming a model which outputs a
score distribution rather than a simple scalar, the
relative fit between the two. We return to explore
this question in Section 5.
Based on these observations, we firstly study
how to aggregate a collection of ratings into a
representation which better reflects the ground
truth, and further go on to consider evaluation
metrics which measure the fit between the distri-
bution of annotations and score distribution of a
given model.
2.2 Human Disagreements in Annotations
Individual Annotation Uncertainty Past dis-
cussions of disagreement on STS have mostly
focused on uncertainty stemming from an indi-
vidual annotator and the noisiness of the data
collection process. They tend to attribute an out-
lier label to ‘‘inattentive’’ raters. This has led to
the design of annotation processes to control the
reliability of individual ratings and achieve high
inter-annotator agreement (Wang et al., 2018).
However, disagreements persist.
Inherent Disagreements Among Humans
Studies in NLI have demonstrated that disagree-
ments among annotations are reproducible signals
(Pavlick and Kwiatkowski, 2019). It has also been
acknowledged that disagreement is an intrinsic
property of subjective tasks (Nie et al., 2020;
Wang et al., 2022c; Plank, 2022).
Despite this, most work in STS still has
attributed high levels of disagreement
to
poor-quality data (Wang et al., 2022a), and has
focused on reducing the uncertainty in STS
modeling and providing reliable predictions
(Wang et al., 2022b). Little attention has been
paid to analyzing the inherent underlying varia-
tion in STS annotations on a continuous rating
scale, or how to fit the collective human opinions
to a mathematical representation. Does a real
value, Gaussian distribution, Gaussian mixture
model, or a more complicated distribution most
effectively approximate the latent truth?
The shortage of individual annotator labels in
STS has been a critical obstacle to in-depth anal-
ysis of disagreements among human judgments,
since only the averaged similarity scores are avail-
able to the public for almost all STS datasets, apart
from two small-scale biomedical benchmarks with
0.1k and 1k examples, respectively. To this end,
we first construct a large-scale STS corpus in this
work with 4-19 annotators for each of almost 15k
sentence pairs. We focus on analyzing disagree-
ments among annotators instead of the individual
uncertainty, presuming that each individual rater
is attentive under a quality-controlled annotation
process.
2.3 Chinese STS Corpus
Most progress on STS, driven by large-scale in-
vestment in datasets and advances in pre-training,
has centered around English.3 Efforts to build
comparable datasets for other languages have
largely focused on (automatically) translating existing English STS datasets (Huertas-García et al., 2021; Yang et al., 2019). However, this approach may come with biases (see Section 6). Our dataset is generated from Chinese rather than English sources, and we employ native Chinese speakers as annotators, producing the first large-scale Chinese STS dataset.4

3English STS models have achieved r = 0.91, while for Chinese the best results are markedly lower at r = 0.82 on the STS-B test set.
3 Data Collection
We collected STS judgments from multiple anno-
tators to estimate the distribution, for sentence
pairs drawn from three multilingual sources.
Sections 3.1 and 3.2 provide details of the collec-
tion, along with challenges in the annotation and
how we ensure data quality. All data and annota-
tions are available at https://github.com
/yuxiaw/USTS.
3.1 Data Sources
The first step is to gather sentence pairs. In re-
sponse to the rapid rise in STS performance and
insights into the shortcomings of current models
and limitations of existing datasets, we create a
new corpus that not only incorporates inherent
human disagreements in the gold label representa-
tion, but also includes more challenging examples,
on which state-of-the-art STS models tend to make
wrong predictions.
Common Errors: Our analysis over general
STS-B and clinical N2C2-STS exposes three ma-
jor error types. More than half of errors lie in
subsets where human agreement is low. High
uncertainty in STS labeling leads to pervasive
disagreement among human judgments.
Another is attributed to the lack of reasoning,
as Nos. 1 and 3 in Table 1 reveal: (1) matching
an abbreviation with its full name, e.g., Supreme
Court to SC; and (2) building connections be-
tween descriptions that are lexically divergent but
semantically related, e.g., carrot and orange food.
The other is the failure to distinguish pairs with
high lexical overlap but opposite meaning, due to
word substitution or reordering.
However, these types of examples account for
only a tiny proportion of existing test sets and
have minimal impact on results. Thus, our goal is
to gather more cases of high ambiguity, requiring
reasoning abilities and more semantic attention in
annotation.
As our data sources, we use sentences from TED
talks, and sentence pairs from NLI and paraphrase
corpora, as detailed below. The combined dataset
contains 14,951 pairs, over which we perform ba-
sic data cleaning to remove repeated punctuation
marks (e.g., multiple quotation marks, dashes, or
blank spaces).
3.1.1 TED-X
Compared to written texts such as essays, spo-
ken texts are more spontaneous and typically
less formal (Clark, 2002). Without any contex-
tual cues such as prosody or multi-modality to
help interpret utterances, readers may have trou-
ble understanding, especially for single sentences
out of context (Chafe, 1994), resulting in high
uncertainty in labeling. We therefore choose TED
speech transcriptions to gather high-ambiguity
examples.
Selecting Single Sentences TED2020 contains
a crawl of nearly 4000 TED and TED-X tran-
scripts, translated into more than 100 languages.
Sentences are aligned to create a parallel cor-
pus (Reimers and Gurevych, 2020). We extracted
157,047 sentences for zh-cn with character length
ranging between 20 and 100, and aligned it with
the other 8 languages of en, de, es, fr, it, ja, ko, ru,
and traditional zh.
Pairing by Retrieval Sentence pairs generated
by random sampling are prone to be seman-
tically distant. To avoid pairs with similarity
scores overwhelmingly distributed in the range
[0, 1], we use embedding-based retrieval. For
each sentence, we search for the two most sim-
ilar sentences based on faiss (Johnson et al.,
2017) using the SimCSE sentence embedding of
sup-simcse-bert-base-uncased (Gao et al., 2021),
obtaining 155,659 pairs after deduplication.5 That
is, we use (approximate) cosine similarity based
on contextualized sentence embeddings instead
of the surface string-based measures of previous
work to sample sentence pairs. This is expected to
find pairs with a higher level of semantic overlap,
rather than some minimal level of lexical match.
4Apart from translated STS-B, there are only two Chinese
corpora related to STS: BQ (Chen et al., 2018) and LCQMC
(Liu et al., 2018) for paraphrase detection (binary).
5Note that we base this on the English versions of each
sentence, due to the higher availability of pre-trained language
models and sentence encoders for English.
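The retrieval-based pairing above can be sketched as follows. This is a hedged illustration rather than the exact pipeline: it assumes the public sup-simcse-bert-base-uncased checkpoint, uses [CLS] pooling with L2-normalised vectors so that faiss inner-product search approximates cosine similarity, and keeps the two nearest non-identical neighbours of every sentence as candidate pairs.

    import faiss
    import torch
    from transformers import AutoModel, AutoTokenizer

    # Assumed input: the English sides of the aligned TED sentences.
    sentences = ["A man is carrying a canoe with a dog.", "Someone is grating a carrot."]

    name = "princeton-nlp/sup-simcse-bert-base-uncased"
    tok = AutoTokenizer.from_pretrained(name)
    enc = AutoModel.from_pretrained(name)

    def embed(batch):
        # [CLS] pooling, then L2 normalisation so inner product == cosine similarity.
        inputs = tok(batch, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            cls = enc(**inputs).last_hidden_state[:, 0]
        return torch.nn.functional.normalize(cls, dim=-1).numpy().astype("float32")

    emb = embed(sentences)
    index = faiss.IndexFlatIP(emb.shape[1])
    index.add(emb)

    # Retrieve 3 neighbours per sentence; the first hit is the sentence itself.
    _, ids = index.search(emb, min(3, len(sentences)))
    pairs = {tuple(sorted((i, int(j)))) for i, row in enumerate(ids) for j in row[1:] if int(j) != i}
    print(pairs)  # deduplicated candidate pairs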
Score   Description
5       The two sentences are completely equivalent, as they mean the same thing.
4       The two sentences are mostly equivalent, but some unimportant details differ.
3       The two sentences are roughly equivalent, but some important information differs/missing.
2       The two sentences are not equivalent, but share some details.
1       The two sentences are not equivalent, but are on the same topic.
0       The two sentences are completely dissimilar.

Table 2: Similarity scores with descriptions (Agirre et al., 2013).
Selecting Low-agreement Examples To se-
lect what we expect to be examples with low
agreement, we leverage the observation that
high-variance examples tend to be associated with
low human agreement (Nie et al., 2020). That is,
we keep pairs with large predictive variance, and
predictions that differ greatly between two agents.
We use a bert-base-uncased-based STS model
fine-tuned on the STS-B training data for pre-
diction. We obtain the mean μ and standard
deviation σ for each example from sub-networks
based on MC-Dropout, where μ is re-scaled to the same magnitude [0, 1] as the normalized L2 using SimCSE embedding x, and len_word(S_en) is the word-level length of the English sentence. We then select instances which satisfy the three criteria: (1) |μ/5 − (1.0 − L2(x1, x2))| ≥ 0.25; (2) σ ≥ 0.16; and (3) len_word(S_en) ≥ 12.6 This results in 9,462 sentence pairs.

6We tuned these threshold values empirically, until the majority of sampled instances fell into the range [1, 3], the score interval most associated with ambiguous instances.
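The selection step can be sketched roughly as below; mc_dropout_stats and simcse_similarity are hypothetical helpers standing in for the fine-tuned STS-B model with MC-Dropout and the normalized SimCSE similarity, and applying the length threshold to both sentences is an assumption (the text refers to the English sentence length).

    def select_low_agreement(pairs, mc_dropout_stats, simcse_similarity):
        """Keep pairs whose predicted score disagrees with embedding similarity,
        whose MC-Dropout predictions vary, and whose English side is long enough.

        pairs: iterable of (s1_en, s2_en) English sentence pairs.
        mc_dropout_stats(s1, s2) -> (mu, sigma): mean/std of STS predictions in [0, 5]
            over multiple stochastic forward passes (hypothetical helper).
        simcse_similarity(s1, s2) -> value in [0, 1], i.e., 1.0 - normalized L2 distance
            between SimCSE embeddings (hypothetical helper).
        """
        selected = []
        for s1, s2 in pairs:
            mu, sigma = mc_dropout_stats(s1, s2)
            sim = simcse_similarity(s1, s2)
            long_enough = min(len(s1.split()), len(s2.split())) >= 12
            if abs(mu / 5.0 - sim) >= 0.25 and sigma >= 0.16 and long_enough:
                selected.append((s1, s2))
        return selected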
3.1.2 XNLI
Though sentence pairs from SICK-R and UNLI
(Chen et al., 2020) are annotated with entailment
and contradiction relations and also continuous
labels, they don’t specifically address semantic
equivalence: The scores in SICK-R reflect seman-
tic relatedness rather than similarity, and in UNLI
the annotators were asked to estimate how likely
the situation described in the hypothesis sentence
would be true given the premise.
We use sentence pairs from Cross-lingual NLI
(XNLI; Conneau et al., 2018) where there is la-
bel disagreement (which we hypothesize reflects
ambiguity), noting that the dataset was annotated
for textual entailment in en, and translated into
14 languages: fr, es, de, el, bg, ru, tr, ar, vi, th,
zh, hi, sw, and ur. From the development (2,490)
and test sets (5,010), we select examples where
there is not full annotation agreement among the
five annotators, resulting in 3,259 sentence pairs
(1,097 dev and 2,162 test).
3.1.3 PAWS-X
We sample 2230 sentence pairs from PAWS-X
(Yang et al., 2019) which are not paraphrases
but have high lexical overlap. Note that this is an
extension of PAWS (Zhang et al., 2019) to include
six typologically diverse languages: fr, es, de, zh,
ja, and ko.
3.2 Annotation
We employ four professional human annotators
(all Chinese native speakers) to assign labels to
the 14,951 Chinese sentence pairs in the first
round, and an additional 15 annotators to provide
additional annotations for 6,051 examples of low
human agreement (as detailed below).
Annotation Guideline Table 2 shows
the
6-point ordinal similarity scale we use, plus
definitions.
Quality Control
It is difficult to ensure that
any divergences in annotations are more likely
due to task subjectivity or language ambiguity
than inattentiveness. We attempt to achieve this
by not using crowdsourced workers, but instead
training up in-house professional annotators with
expert-level knowledge in linguistics, and signif-
icant experience in data labeling. They were first
required to study the annotation guidelines and exemplars, and then asked to annotate up to 15 high-agreement instances pre-selected from the STS-B training set. For each example, the annotation is regarded as correct when the difference between the assigned and gold-standard label is <0.5. Failing this, the annotator is provided with the correct label and asked to annotate another instance.
This procedure was iterated for three rounds to familiarize the annotators with the task. On completion of the training, we only retain annotators who achieve a cumulative accuracy of ≥75%.

Source              TED-X     XNLI   PAWS-X     USTS
Amount
  raw                 9462     3259     2230    14951
  σ > 0.5             3458     1597      996     6051
  ratio              36.5%    49.0%    44.7%    40.5%
Length
  S1                  39.0     34.0     43.5     38.6
  S2                  39.2     16.9     43.3     34.9
  pair                39.1     25.4     43.4     36.8
Raters
  r                   0.48     0.61     0.49     0.74
  ρ                   0.50     0.58     0.41     0.68
  σ                   0.44     0.52     0.49     0.47
STSb-zh
  r                   0.41     0.48     0.32     0.70
  ρ                   0.43     0.50     0.18     0.63
  σ                   0.21     0.22     0.19     0.21

Table 3: Details of the USTS dataset. ‘‘r’’ = Pearson's correlation; ‘‘ρ’’ = Spearman's rank correlation; and ‘‘σ’’ = standard deviation.
3.3 Analysis of First-round Annotations
Dataset Breakdown Table 3 shows the break-
down of instances across the three component sets,
as well as the combined USTS dataset. In terms
of average length (zh character level), XNLI is the
shortest on average (esp. for S2, the hypothesis),
followed by TED-X and PAWS-X.
Inter-annotator Agreement The average Pear-
son (r) and Spearman (ρ) correlation between the
six pairings of annotators, and standard deviation
(σ) among the four annotators, are r = 0.74,
ρ = 0.68, σ = 0.47. These numbers reflect the
fact that there is high disagreement for a substan-
tial number of instances in USTS, in line with
the sampling criteria used to construct the dataset.
As such, aggregating ratings by averaging is not
able to capture the true nature of much of the
data. Two questions naturally arise: (1) at what
level of variance does averaging noticeably bias
the gold label? and (2) how should annotations be
aggregated to fit the latent truth most closely?
High vs. Low Agreement Figure 1 shows the first-round variance distribution, wherein σ ranges from 0.0 to 1.5, with 8,900 pairs being less than 0.5. It indicates that on ∼60% of examples, the assessments of the four annotators fluctuate around the average score in a small range (0.0–0.5 on average), while the judgments of the remaining 6,051 pairs are spread out over a wider range (0.5–1.5).

Figure 1: Standard deviation distribution of the four first-stage annotators (left) and model predictions (right).
We sample 100 examples and find that, when
σ ≤ 0.5, generally more than 10 out of 15 anno-
tators highly agree with each other. This basically
satisfies the assumption that makes averaging less
biased: Individual ratings do not vary significantly
(Lee et al., 2005). Less than half of the annotators
reach consensus when σ > 0.7, and less than 5
when σ ≥ 1.0 (referring back to our earlier exam-
ples in Table 1). Thus, we heuristically regard σ =
0.5 as a tipping point for distinguishing examples
of low (σ > 0.5) and high agreement (σ ≤ 0.5).
Accordingly, we split
the data into two
subsets, reflecting the different levels of disagree-
ment: cases where σ ≤ 0.5 are uncontroversial
(USTS-U); and cases where σ > 0.5 are
contentious (USTS-C).
Does the Model Agree with the Annotators?
We take bert-base-chinese and fine-tune it on the
Chinese STS-B training data7 with a learning rate
of 2e-5 for 3 epochs, obtaining r = 0.82/ρ = 0.82
on the validation set, and r = 0.80/ρ = 0.79 on the
test set; we refer to this model as ‘‘STSb-zh’’. We compute r and ρ between the model prediction and each of the four annotations, and present the average results in Table 3.

7Chinese STS-B has 5,231, 1,458 and 1,361 examples for training, validation, and test, respectively; see https://github.com/pluto-junzeng/CNSD.
Both r and ρ across TED-X, XNLI, and
PAWS-X are below 0.5, with PAWS-X being par-
ticularly bad with half of the pairs being predicted
to be in the range [4, 5]. Predictions over USTS are primarily concentrated in the range [1, 3], while the majority of annotations are in the range [0, 2].
This suggests it is non-trivial for current models
to perform well without training on USTS, and
models tend to over-assign high scores (Figure 1:
predictive σ is < 0.3 vs. annotator ˆσ = 0.47).
However, it also leads us to consider whether the
distribution estimated based on the four annotators
is adequate to generate a gold standard. To this
end, we investigate the question How does the
collective distribution vary when increasing the
number of annotators, on cases of uncontroversial
USTS-U and contentious USTS-C?
3.4 Collective Distribution Analysis
We measure the distributional variation through
(1) fluctuation of μ and σ; and (2) distributional
divergence between first-round and second-round
annotators.
Study Design: We sample 100 instances from
USTS-U and 100 from USTS-C, with a ratio of
4:3:3 from TED-X, XNLI, and PAWS-X, respec-
tively. We then had another 15 qualified Chinese
native annotators score the 200 Chinese sentence
pairs.
Formally, the annotation matrix A of size N × M represents a data set with N examples annotated by M annotators. In our setting, N = 100 and M = 19
for both USTS-U and USTS-C. We capture the
variation of μ and σ over 100 examples by av-
eraging μ = mean(A[:,:i], axis = 1) and σ =
std(A[:,:i], axis = 1), where i ranges from 4 to 19,
incorporating the new ratings incrementally.
The collective distribution for the first-round annotation A[:,:4] is denoted as p = N (μ1, σ1), and q = N (μ2, σ2) for A[:,4:4+j] as we add new annotators. We observe the KL-Divergence(p‖q) as we increase j.
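This measurement can be sketched as follows, assuming A is the 100 × 19 annotation matrix as a NumPy array with columns in collection order; the closed-form KL divergence between univariate Gaussians is averaged over examples, which is one plausible reading of the reported numbers.

    import numpy as np

    def incremental_stats(A):
        """Average mu and sigma over all examples as annotators are added one at a time."""
        for i in range(4, A.shape[1] + 1):
            mu = A[:, :i].mean(axis=1).mean()
            sigma = A[:, :i].std(axis=1).mean()
            print(f"first {i:2d} annotators: mu={mu:.2f}, sigma={sigma:.2f}")

    def gaussian_kl(mu_p, sigma_p, mu_q, sigma_q, eps=1e-6):
        """Closed-form KL( N(mu_p, sigma_p^2) || N(mu_q, sigma_q^2) ), element-wise."""
        sp, sq = sigma_p + eps, sigma_q + eps
        return np.log(sq / sp) + (sp**2 + (mu_p - mu_q)**2) / (2 * sq**2) - 0.5

    def first_vs_second_round_kl(A, j):
        """Average per-example KL between p (first 4 ratings) and q (next j ratings)."""
        p_mu, p_sigma = A[:, :4].mean(axis=1), A[:, :4].std(axis=1)
        q_mu, q_sigma = A[:, 4:4 + j].mean(axis=1), A[:, 4:4 + j].std(axis=1)
        return float(gaussian_kl(p_mu, p_sigma, q_mu, q_sigma).mean())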
Hypothesis: We hypothesize that the distribu-
tion will remain stable regardless of the number
of annotators on the uncontroversial USTS-U, but
change substantially on the contentious USTS-C.
Results: To plot the value of μ and σ in the
same figure, we re-scale μ by subtracting 0.9 in
Figure 2. We find that with an increased number
of annotators, μ of USTS-U remains stable with
minor perturbations, while μ of USTS-C declines
and steadily flattens out.
Figure 2: Average μ and σ over 100 examples
of USTS-U and USTS-C as we incorporate new
annotators.
On USTS-U, σ ascends slowly and converges
to 0.3. This matches our expectation that increas-
ing annotators will result in more variance. Yet
it still varies in the range [0.1, 0.3] due to the
high certainty of the uncontroversial examples.
In contrast, σ of USTS-C stays consistently high,
indicating that there are still strong disagreements
even with more annotators, because of the inher-
ent ambiguity of contentious cases. It fluctuates
in a larger range of [0.6, 1.0], with a steeper drop.
That is, combining more ratings results in large
variations in μ and σ for USTS-C, but less for
USTS-U.
Therefore, the distribution obtained from four
annotators is adequate for uncontroversial exam-
ples, but insufficient for USTS-C: More annotators
are needed to gain a representative distribution.
How Many Annotators Should Be Employed?
In Figure 2, μ and σ of USTS-C vary substan-
tially before M =15, then stabilize. The trend of
KL-Divergence in Table 4 demonstrates the same
phenomenon: KL declines as the number of anno-
tators increases, with a relatively small and stable
divergence when j > 10. Combining these two,
we employ 15 extra annotators to score the 6,051
cases for USTS-C in the second-round annotation.
j            4      6      8     10     14     15
USTS-U    4.26   2.58   0.98   1.03   0.91   0.93
USTS-C   12.83   5.08   5.45   3.51   2.99   2.82

Table 4: KL-Divergence between the first-round distribution and the second-round, for increasing j.
Figure 3: Distribution of σ (top) and μ1 − μ2 (bottom)
of the first- and second-round annotation distribution.
First-round vs. Second-round: We compare
σ and μ between the first-round (in green) and
second-round (in red) annotations in Figure 3
(top). The shape of the σ distributions is very
similar, but the green bars (σ1) move towards the
right by 0.3 or so, with respect to the red bars (σ2),
leading to the average ˆσ2 = 0.42 ≪ ˆσ1 = 0.76.
This indicates that the second-round distribution is
more stable, with less overall variance. Nonethe-
less, 87% of pairs exceed the average deviation
of 0.27 for USTS-U, reflecting the higher number
of disagreements. Additionally, the distribution of μ1 − μ2 in Figure 3 (bottom) is close to a normal distribution, within the range of [−1, 2]. The majority are to the right of zero, indicating that annotators in the first round tend to assign higher scores than in the second, resulting in a larger μ.

           #        Annotators   μ          σ      r      ρ
STS-B      8,085    5            0.0–5.0    –      –      –
USTS-U     8,900    4            0.0–5.0    0.27   0.91   0.73
USTS-C     6,051    19           0.2–4.4    0.56   0.72   0.63

Table 5: Statistical breakdown of STS-B (zh) and USTS-U/USTS-C; μ = the range of the integrated score.

Figure 4: Human judgment distributions of examples in Table 1, with uni-, tri-, and bi-modal Gaussian, respectively. The dotted black line shows the model fit when using a single Gaussian; the shaded curve shows the model learned when allowed to fit k components of a GMM.
3.5 The Resulting Corpus
USTS-U vs. USTS-C The number of examples
in USTS-U and USTS-C is 8,900 and 6,051,
respectively, with largely comparable μ range of
[0, 5] and [0.2, 4.4] (see Table 5). USTS-U has a
much smaller ˆσ of 0.27 than USTS-C (ˆσ = 0.56),
consistent with their inherent uncertainty level.
Analogously, USTS-U has a higher correlation of
r = 0.91 among annotators, compared to r = 0.72
for USTS-C.
4 Aggregation of Human Judgments
For the high-agreement cases of USTS-U, gold la-
bels can be approximated by aggregating multiple
annotations into either a scalar or a single Gaussian
distribution. However, for low-agreement exam-
ples, how to aggregate the human ratings remains
an open question.
Are All Distributions Unimodal Gaussian?
Though most distributions of human assessments
can be assumed to be sampled from an underly-
ing (generative) distribution defined by a single
Gaussian, we observed judgments that a unimodal
Gaussian struggles to fit. The annotations of ex-
amples No. 2 and 3 in Figure 4 exhibit clear bi-
or tri-modal distributions. How often, then, and to
what extent do multimodal distributions fit better?
We answer this question by fitting human judgments using a Gaussian Mixture Model (GMM), where the number of components is selected during training. This means the model can still choose to fit the distribution with only one Gaussian component where appropriate. If additional components yield a better fit to the judgments, i.e., a larger log likelihood is observed than with a unimodal distribution, we consider the human judgments to exhibit a multimodal distribution.

Experiments and Results We randomly split USTS-C into a training (4,051) and test set (2,000), and use the training data to fit a GMM with: (1) one component; or (2) the optimal number of components k. We compute the log likelihood assigned to each example in the test set in Figure 5 (left), with the unimodal results as the x-axis and the multimodal Gaussian as the y-axis. The majority of points fall on or above the diagonal line (y = x), with a multimodal distribution outperforming a unimodal Gaussian distribution for 83% of instances. However, does this suggest that most examples exhibit multiple peaks?

Figure 5: Left: Log likelihood of test data under the single-component Gaussian (x-axis) vs. the k-component GMM (y-axis). The darker the area, the more the examples concentrate. Right: Weights of the top-2 effective component distributions.

Effective Components: We count the effective components for each sentence pair based on the weight assigned by the GMM, in the form of a probability for each component. We see that, for 11.3% of pairs, there is a nontrivial second component (weight ≥ 0.2), and a third component on 3 pairs. Rarely are there more than three components with significant weights (see Table 6). Moreover, we find that the weight of the dominant component mostly (87%) lies above 0.8, and that the weight of the second effective component scatters across the range 0.25–0.5 (Figure 5, right).

                Testing                       Train
K      amount   prop(%)    ˆσ        amount   prop(%)    ˆσ
1        1772      88.6   0.55         3755      92.7   0.48
2         225      11.3   0.63          294       7.3   0.50
3           3       0.0   0.39            2       0.0   0.66

Table 6: The amount and averaged standard deviation ˆσ of examples with k = {1, 2, 3} effective components of human judgment distributions in the training and test splits.

This reveals that the GMM does not frequently use more than one effective component, with much lower weights on the second or third components. The majority of held-out human judgments fit a unimodal distribution well.

Gold Labels: Given that a minority of instances in USTS-C are bimodally distributed, and that even for these instances, the weight on the second component is low, we conservatively use a single Gaussian to aggregate human judgments for all cases in this work.
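For concreteness, the component-selection check described above can be sketched with scikit-learn as below; fitting each example's own ratings directly is an assumption (the exact train/test protocol is not restated here), and ‘‘effective’’ components are those with weight ≥ 0.2, as in the text.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fit_gmm(ratings, k):
        # One example's ratings are treated as 1-D samples of shape (n_ratings, 1).
        x = np.asarray(ratings, dtype=float).reshape(-1, 1)
        return GaussianMixture(n_components=k, random_state=0).fit(x)

    def unimodal_vs_multimodal(ratings, max_k=3):
        """Mean log likelihood of a single Gaussian vs. the best fit with k <= max_k."""
        x = np.asarray(ratings, dtype=float).reshape(-1, 1)
        uni = fit_gmm(ratings, 1).score(x)
        best = max(fit_gmm(ratings, k).score(x) for k in range(1, max_k + 1))
        return uni, best

    def effective_components(ratings, k=3, min_weight=0.2):
        """Count GMM components whose mixture weight is non-trivial."""
        return int((fit_gmm(ratings, k).weights_ >= min_weight).sum())

    # Example No. 3 from Table 1: a case with a visibly multimodal rating set.
    ratings = [0.5, 1.0, 1.0, 1.8, 1.8, 1.8, 2.0, 2.2, 2.5, 3.0, 3.0, 3.2, 3.5, 3.6, 4.5]
    print(unimodal_vs_multimodal(ratings), effective_components(ratings))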
5 Analysis of Model Predictions

Most STS models predict a pointwise similarity score rather than a distribution over values. Wang et al. (2022b) estimated the uncertainty for continuous labels by MC-Dropout and Gaussian process regression (GPR). However, due to the lack of gold distributions, they only evaluate outputs using expected calibration error (ECE) and negative log-probability density (NLPD), assessing the predictive reliability. It is unknown whether these uncertainty-aware models mimic human disagreements, i.e., whether the predicted deviation reflects the variance of human judgments.

To explore this, we experiment over USTS and incorporate distributional divergence (i.e., Kullback-Leibler Divergence [KL]) into the evaluation, to observe the fit between the distribution of collective human judgments and the model predictive probability. We also examine the ability of different models to capture the averaged score for low-agreement cases, and whether a well-calibrated model fits the distribution of annotations better.

Evaluation Metrics: For singular values, STS accuracy is generally evaluated with Pearson correlation (r) and Spearman rank correlation (ρ), measuring the linear correlation between model outputs and the average annotation, and the degree of monotonicity under ranking, respectively.
Model
r ↑
ρ ↑ ECE ↓ NLPD ↓
r ↑
ρ ↑ ECE ↓ NLPD ↓ KL ↓
r ↑
ρ ↑ ECE ↓ NLPD ↓ KL ↓
STS-B
USTS-U
USTS-C
SBERT-NLI
(1) SBERT-cosine 0.714 0.718
(2) SBERT-GPR
0.741 0.743
Domain-specific
BERT-lr
0.808 0.804
0.811 0.805
BERT-lr-MC
SBERT-cosine 0.779 0.781
SBERT-GPR
0.780 0.782
Domain-generalized
(3)
(4)
0.815 0.813
BERT-lr
0.814 0.811
BERT-lr-MC
SBERT-cosine 0.772 0.772
0.772 0.775
SBERT-GPR
Cross-domain
BERT-lr
0.675 0.667
0.678 0.671
BERT-lr-MC
SBERT-cosine 0.695 0.692
0.726 0.726
SBERT-GPR
(5)
N/A
0.001
N/A
0.167
N/A
0.053
N/A
0.179
N/A
0.017
N/A
0.348
N/A
0.001
N/A
0.532
N/A
4.709
N/A
0.917
N/A
5.865
N/A
0.645
N/A
12.90
N/A
0.555
0.597 0.383
0.709 0.433
N/A
0.020
N/A
N/A
0.033 2.233
0.572 0.442
0.656 0.455
N/A
0.139
N/A
N/A
−0.09 0.576
0.855 0.700
0.856 0.703
0.661 0.387
0.683 0.388
0.860 0.692
0.861 0.697
0.686 0.435
0.707 0.433
0.754 0.650
0.755 0.695
0.647 0.449
0.723 0.481
N/A
0.054
N/A
0.137
N/A
0.060
N/A
0.098
N/A
1.296
N/A
0.020
N/A
N/A
1.079 4.587
N/A
0.651 3.050
N/A
N/A
N/A
0.898 4.434
N/A
0.268 2.578
N/A
N/A
N/A
10.55 13.95
N/A
0.012 2.215
N/A
0.806 0.707
0.809 0.708
0.596 0.460
0.606 0.444
0.835 0.768
0.838 0.774
0.670 0.523
0.674 0.497
0.725 0.676
0.729 0.687
0.606 0.481
0.675 0.494
N/A
0.046
N/A
0.415
N/A
0.278
N/A
0.157
N/A
1.298
N/A
0.148
N/A
N/A
0.442 6.073
N/A
0.717 0.950
N/A
N/A
N/A
0.702 5.401
N/A
−0.04 0.955
N/A
N/A
N/A
8.956 12.62
N/A
−0.11 0.555
N/A
Table 7: Test set correlation (r/ρ), ECE, NLPD, and KL using end-to-end (BERT) and pipeline
(SBERT), over STS-B, USTS-U, and USTS-C, under five settings. The bold number is the best result
for BERT, and the underlined number is that for SBERT.
For uncertainty-aware outputs, ECE and NLPD
can be used to assess model reliability in the ab-
sence of gold distributions. ECE measures whether
the estimated predictive confidence is aligned
with the empirical correctness likelihoods. A
well-calibrated model should be less confident on
erroneous predictions and more confident on cor-
rect ones. NLPD penalizes over-confidence more
strongly through logarithmic scaling, favoring
under-confident models.
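For reference, the point-estimate metrics and the two distribution-aware quantities can be computed roughly as follows, under a Gaussian assumption for both the predictions and the aggregated human ratings; the exact ECE binning used in prior work is not reproduced in this sketch.

    import numpy as np
    from scipy.stats import norm, pearsonr, spearmanr

    def point_metrics(pred_mu, gold_mu):
        """Pearson r and Spearman rho between predicted scores and mean annotations."""
        return pearsonr(pred_mu, gold_mu)[0], spearmanr(pred_mu, gold_mu)[0]

    def nlpd(pred_mu, pred_sigma, gold_mu, eps=1e-6):
        """Negative log density of the mean human rating under N(pred_mu, pred_sigma^2)."""
        return float(-norm.logpdf(gold_mu, loc=pred_mu, scale=np.asarray(pred_sigma) + eps).mean())

    def mean_kl(gold_mu, gold_sigma, pred_mu, pred_sigma, eps=1e-6):
        """Average KL( N(gold) || N(pred) ) over the test set, closed form for Gaussians."""
        gs, ps = np.asarray(gold_sigma) + eps, np.asarray(pred_sigma) + eps
        gm, pm = np.asarray(gold_mu), np.asarray(pred_mu)
        return float((np.log(ps / gs) + (gs**2 + (gm - pm)**2) / (2 * ps**2) - 0.5).mean())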
5.1 Models and Setup

BERT with Two-layer MLP: The hidden state h of the BERT [CLS] token from the last layer (Devlin et al., 2019) is passed through a two-layer MLP with a tanh activation function. We refer to this model as BERT-lr when making deterministic predictions, and BERT-lr-MC when using MC-Dropout (Gal and Ghahramani, 2016) for uncertainty estimation.

SBERT with GPR: In contrast with end-to-end training, sparse GPR is applied to estimate distributions, taking encoded sentences from Sentence-BERT (SBERT; Reimers and Gurevych, 2019) as input. We also calculate the cosine similarity between S1 and S2 using SBERT, as a non-Bayesian counterpart.

Setup: bert-base-chinese is used with input format [CLS] S1 [SEP] S2 [SEP] for the text pair (S1, S2), implemented based on the HuggingFace Transformers framework. We fine-tune SBERT separately over each STS corpus based on bert-base-chinese-nli, using the same configuration as the original paper. We use the concatenation of the embeddings u ⊕ v, along with their absolute difference |u − v| and element-wise multiplication u × v, to represent a sentence pair, implemented in Pyro.8

8https://pyro.ai/.

We evaluate STS-B, USTS-U, and USTS-C under five training settings, as presented in Table 7:

1. Zero-shot: SBERT with no tuning;
2. GPR trained on sbert-nli;
3. Domain-specific: fine-tuned on each dataset separately;
4. Domain-generalized: fine-tuned using the three datasets combined;
5. Cross-domain: train with STS-B training data for USTS-U and USTS-C, and with USTS for STS-B.
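A rough sketch of the end-to-end regressor and its MC-Dropout variant follows; it is an illustration under stated assumptions (head width, dropout rate, number of stochastic passes, and the example sentences are placeholders), not the released implementation.

    import torch
    from torch import nn
    from transformers import AutoModel, AutoTokenizer

    class BertRegressor(nn.Module):
        def __init__(self, name="bert-base-chinese", hidden=768, dropout=0.1):
            super().__init__()
            self.bert = AutoModel.from_pretrained(name)
            self.head = nn.Sequential(
                nn.Dropout(dropout),
                nn.Linear(hidden, hidden), nn.Tanh(),
                nn.Dropout(dropout),
                nn.Linear(hidden, 1),
            )

        def forward(self, **enc):
            cls = self.bert(**enc).last_hidden_state[:, 0]   # [CLS] token state
            return self.head(cls).squeeze(-1)

    def mc_dropout_predict(model, tokenizer, s1, s2, T=20):
        """Mean and std of T stochastic forward passes with dropout kept active."""
        enc = tokenizer(s1, s2, return_tensors="pt", truncation=True)
        model.train()                      # leave dropout on at inference time
        with torch.no_grad():
            preds = torch.stack([model(**enc) for _ in range(T)])
        return preds.mean(0), preds.std(0)

    tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
    model = BertRegressor()
    mu, sigma = mc_dropout_predict(model, tokenizer, "这是一个示例句子。", "这是另一个示例句子。")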
5.2 Results and Analysis
USTS is Challenging.
In setting (1) of Table 7,
purely depending on pre-trained semantic rep-
resentation and cosine similarity, correlations
over USTS-U and USTS-C are much lower than
STS-B. This suggests that USTS is a challenging
dataset, but can be learned. USTS-U in particular achieves large improvements in performance after domain-specific training in experiments (3)–(4).

Model              USTS-U (ˆσH = 0.26)       USTS-C (ˆσH = 0.56)
                   r       ρ      ˆσM         r       ρ      ˆσM
(4) BERT-lr-MC     0.12    0.19   0.13        0.24    0.23   0.20
(5) SBERT-GPR     −0.07   −0.06   0.67       −0.05   −0.06   0.54

Table 8: Test set correlation between the predicted variance and collective human variance.
Critical Differences Exist Between Model
Outputs and Human Annotations. The models
can capture average opinion, resulting in reason-
able r/ρ between the predicted target value and
the averaged annotations. However, they cannot
capture the variance of human opinions. To quan-
tify how well the predicted variance σM captures
the variance σH of human judgments, we analyze
the outputs of the top-2 settings: BERT-lr-MC
from setting (4) and SBERT-GPR from setting
(5), for USTS-U and USTS-C. We compute the
correlation r and ρ between σM and σH in
Table 8, and visualize the σM with increasing
human disagreement in Figure 6.
There is no apparent correlation between σM
and σH . A given model displays similar devi-
ation σM regardless of the relative amount of
human disagreement. Different models concen-
trate on different parts of the spectrum, e.g.,
BERT-lr-MC is distributed in the range [0.1, 0.2]
while SBERT-GPR is distributed in the range
[0.5, 0.7], and neither follows the line of y = x.
This suggests that the uncertainty captured by
current models is not the uncertainty underly-
ing human disagreements. Rather, it may reflect
the model’s predictive confidence on the data
set as a whole. This finding is not surprising since
none of the models are optimized to capture col-
lective human opinions, but suggests an important
direction for future improvement.
Being Trustworthy is Orthogonal to Being
Accurate. We see that ECE and NLPD do not mir-
ror the results for r/ρ and distributional divergence
KL. This implies the ability required to improve
model reliability differs from that required to per-
form accurately, regardless of whether a target
value or a target distribution is predicted.
Figure 6: Predicted variance σM (y-axis) with increasing human disagreement (x-axis). The red and blue triangles = USTS-U and USTS-C from experiment setting (4) in Table 7, orange and green circles = USTS-U and USTS-C from experiment setting (5), and the black line is y = x. USTS-U disperses at the left of the x-axis and low-agreement USTS-C scatters to the right.

Low Human-agreement USTS is Detrimental to Training Sentence Embeddings.
Comparing the performance of experiment set-
tings (2) and (5) in Table 7,
tuning SBERT
on USTS hurts results over STS-B across the
board, while training on STS-B benefits both
USTS-U and USTS-C. We speculate that
the
examples in USTS with larger annotator vari-
ance are more ambiguous than STS-B. Forcing
networks to learn from high-ambiguity signals
may inhibit generalization, resulting in worse
representations.
Discussion For instances of high disagreement,
neither a scalar nor a single Gaussian fits a set
of observed judgments adequately. As a direction
for future work, we suggest exploring the direct
estimation of individual ratings (e.g., by few-shot
prompt-based prediction) and evaluating against
the raw collective opinions. This could circumvent
the ineffective training and evaluation caused by
aggregation.
6 Multilingual USTS
Before extending USTS into a multilingual bench-
mark, we question the validity of previous
approaches involving direct transfer of annota-
tions collected for one language to other languages
(Liu et al., 2021; Yang et al., 2019). This strat-
egy assumes that the nuanced semantics of the
component sentences is not changed under translation, and hence the label will be identical. To test whether this assumption is reasonable, we analyze the impact of language on the annotations, and discuss whether such ratings are transferable across languages.

Specifically, we establish whether the label distribution varies based on language, and how annotator proficiency affects the distribution given the same text.

en-rater     NT    +PT    +OS    +GU
USTS-U     0.69   0.67   0.53   0.38
USTS-C     0.94   0.78   0.73   0.68

Table 9: KL-divergence of labels as ratings from less proficient language speakers are incorporated.
Collecting Labels Taking English as a pivot
language, we employ native English speak-
ers (‘‘NT’’) and bilingual raters whose mother
language is Mandarin Chinese, including 5 pro-
fessional translators (‘‘PT’’), 5 overseas students
(‘‘OS’’), and 5 general users (‘‘GU’’). Each anno-
tator assigns labels to 100 examples sampled from
each of USTS-U and USTS-C (the same data set
used in Section 3.4), which have been manually
post-edited by professional translators to ensure
content alignment.
Results We average the KL between collective
distributions drawn from 19 raters given zh text,
and 5 native English speakers (NT) given en text.
Table 9 shows there is not a substantial distribu-
tional divergence. Differences decline further as
annotations of the other three groups of bilingual
raters are incorporated.
Detailed analysis of distributions across each
of these groups (Figure 7) reveals that the lan-
guage of the text affects the distribution of human
opinions. On both USTS-U and USTS-C, the
distribution differs substantially between native
Chinese speakers and native English speakers
when given zh and en sentence pairs, respectively.
While the zh annotations cluster in the lower σ
region, those for en are dispersed across a large σ
span.
Figure 7 also shows that the distribution of
professional translators mirrors that of English na-
tives, while general users differ substantially from
both these groups, but are similar to native-speaker
Chinese annotators who are given zh text. We sus-
pect that translators make judgments based on the
meaning of en text directly, but general users may
use translation tools to translate en text back to zh
to support their understanding, meaning they are in
fact rating a Chinese text pair. Intermediate-level
overseas students may mix strategies and thus are
somewhere in between these two extremes.
Discussion The differences we observe may
be attributed to bias introduced during manual
translation. Each sentence in a pair is translated
separately, so while a source pair may have lexical
overlap, this may not carry over under independent
translation. We examine this effect by calculating
the word overlap similarity as Eq (1) for zh/en
pairs, where T1 and T2 are whitespace-tokenised
words for English and based on the jieba segment
tool for Chinese. We calculate string similarity as:
Sim = (len(T1 ∩ T2) + 1) / (max(len(T1), len(T2)) + 1)    (1)
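Eq. (1) can be computed as below, assuming jieba segmentation for Chinese and whitespace tokenization for English; treating T1 and T2 as sets of unique tokens is an assumption.

    import jieba

    def overlap_similarity(s1, s2, lang):
        # Token sets: jieba segments for Chinese, whitespace tokens for English.
        if lang == "zh":
            t1, t2 = set(jieba.lcut(s1)), set(jieba.lcut(s2))
        else:
            t1, t2 = set(s1.split()), set(s2.split())
        return (len(t1 & t2) + 1) / (max(len(t1), len(t2)) + 1)

    print(overlap_similarity("A man is carrying a canoe with a dog.",
                             "A dog is carrying a man in a canoe.", lang="en"))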
As detailed in Table 10, the lexical overlap sim-
ilarity for en and zh is similar for USTS-U and
USTS-C, suggesting that inconsistencies under
translation are not a primary cause of the observed
discrepancy.
In summary The language of the text impacts
the distribution of human judgments. In our anal-
ysis, English results in higher-uncertainty labeling
than Chinese, for both uncontroversial and con-
tentious cases. This suggests that the previous
assumption that labels remain identical across lan-
guages as long as the meaning of the text is kept
the same, is potentially problematic, even though
pairwise lexical overlap remains similar.
7 Discussion
We focus on the STS task in this work. How-
ever, the methods we propose can be transferred
to other subjective textual regression tasks, such
as sentiment analysis (SA) rating and machine
translation quality estimation in the format of
direct assessment (DA). Similar findings stem-
ming from task subjectivity may be relevant to
other types of NLP tasks relying on human an-
notation. High disagreement among annotators
may occur due to ambiguous labeling, where
it is challenging to compile guidelines that are
widely accepted and consistently interpreted by
all individual annotators.
Figure 7: Scatter plot of 100 examples sampled from USTS-U (top) and USTS-C (bottom) annotated by Chinese
native, English native, professional translator (PT), overseas students (OS), and general users (GU). We plot (μ, σ)
as coordinate points.
lan    USTS-U   USTS-C   USTS
zh       0.42     0.45   0.44
en       0.40     0.43   0.41

Table 10: Lexical similarity between en and zh pairs sampled from USTS-U, USTS-C, and the combination of the two.
In practice, it may be difficult to estimate the
distribution of human annotations in instances
where multiple annotators are difficult to source,
such as occurs in clinical and biomedical STS
due to the need for highly specialized knowledge.
Transfer learning, which relies on patterns learned
from general-purpose USTS, provides a means
to predict such a distribution, if noisily. We pro-
pose to explore the direct estimation of individual
ratings by in-context learning based on large lan-
guage models (LLMs), e.g., GPT-3 (Brown et al.,
2020) and ChatGPT.9 LLMs are able to perform
in-context learn—perform a new task via infer-
ence alone, by conditioning on a few labeled pairs
as part of the input (Min et al., 2022).
ChatGPT appears to be highly effective at style
transfer and tailoring of content to specific au-
diences such as five-year old children or domain
experts, through learning about language style and
tone from interactional data and individual prefer-
ences. This allows it to generate more personalized
responses (Aljanabi et al., 2023). Deshpande et al.
(2023) show that assigning ChatGPT a persona
through the parameter system-role, such as a
bad/horrible person, can increase the toxicity of
generated outputs up to sixfold.
Additionally, Schick and Schütze (2021) show
that generative LLMs can be used to automati-
cally generate labeled STS datasets using targeted
instructions. This data can be utilized to improve
the quality of sentence embeddings. Together,
these imply that LLMs may have utility in
generating personalised semantic similarity as-
sessments, based on annotator meta data (e.g., age,
educational background, or domain expertise).
Simulating variation in judgments between in-
dividual annotators using synthetic personalized
ratings could mitigate ineffective training and
evaluation caused by aggregation, given that nei-
ther a scalar nor a single Gaussian fits the set of
observed judgments adequately for instances of
high disagreement.
8 Conclusion
We presented the first uncertainty-aware STS cor-
pus, consisting of 15k Chinese examples with
more than 150k annotations. The dataset is in-
tended to promote the development of STS
systems from the perspective of capturing inherent
disagreements in STS labeling, and establish less
biased and more nuanced gold labels when large
variances exist among individual ratings.
We additionally examine the models’ ability to
capture the averaged opinion and the distribution
of collective human judgments. Results show that
the uncertainty captured by current models is not
explained by the semantic uncertainty that results
in disagreements among humans. Rather, it tends
to reflect the predictive confidence over the whole
data set. We also found that the text language and
language proficiency of annotators affect labeling
consistency.
Acknowledgments
We thank the anonymous reviewers and editors
for their helpful comments; and Yanqing Zhao,
Samuel Luke Winfield D’Arcy, Yimeng Chen,
and Minghan Wang in Huawei TSC and NLP
Group colleagues in The University of Melbourne
for various discussions. Yuxia Wang is supported
by scholarships from The University of Melbourne
and China Scholarship Council (CSC).
References
Eneko Agirre, Carmen Banea, Claire Cardie,
Daniel M. Cer, Mona T. Diab, Aitor
Gonzalez-Agirre, Weiwei Guo, Iñigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea,
German Rigau, Larraitz Uria, and Janyce
Wiebe. 2015. Semeval-2015 task 2: Seman-
tic textual similarity, English, Spanish and
pilot on interpretability. In Proceedings of
the 9th International Workshop on Seman-
tic Evaluation, SemEval@NAACL-HLT 2015,
Denver, Colorado, USA, June 4–5, 2015,
pages 252–263. The Association for Com-
puter Linguistics. https://doi.org/10
.18653/v1/s15-2045
Eneko Agirre, Carmen Banea, Claire Cardie, Daniel M. Cer, Mona T. Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2014. Semeval-2014 task 10: Multilingual semantic textual similarity. In Proceedings of the 8th International Workshop on Semantic Evaluation, SemEval@COLING 2014, Dublin, Ireland, August 23–24, 2014, pages 81–91. The Association for Computer Linguistics. https://doi.org/10.3115/v1/s14-2010
Eneko Agirre, Carmen Banea, Daniel M. Cer, Mona T. Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2016. Semeval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2016, San Diego, CA, USA, June 16–17, 2016, pages 497–511. The Association for Computer Linguistics. https://doi.org/10.18653/v1/s16-1081
Eneko Agirre, Daniel M. Cer, Mona T. Diab,
Aitor Gonzalez-Agirre, and Weiwei Guo.
2013. *SEM 2013 shared task: Semantic tex-
tual similarity. In Proceedings of the Second
Joint Conference on Lexical and Computa-
tional Semantics, pages 32–43. Association for
Computational Linguistics.
Eneko Agirre, Mona Diab, Daniel Cer, and Aitor
Gonzalez-Agirre. 2012. Semeval-2012 task 6:
A pilot on semantic textual similarity. In Pro-
ceedings of
the First Joint Conference on
Lexical and Computational Semantics-Volume
1: Proceedings of the main conference and the
shared task, and Volume 2: Proceedings of
the Sixth International Workshop on Seman-
tic Evaluation, pages 385–393. Association for
Computational Linguistics.
Mohammad Aljanabi et al. 2023. ChatGPT: Future directions and open possibilities. Mesopotamian Journal of Cyber Security, 2023:16–17. https://doi.org/10.58496/MJCS/2023/003
Alberto Barrón-Cedeño, Paolo Rosso, Eneko Agirre, and Gorka Labaka. 2010. Plagiarism detection across distant language pairs. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 37–45. Tsinghua University Press.
Tom B. Brown, Benjamin Mann, Nick Ryder,
Melanie Subbiah,
Jared Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam,
Girish Sastry, Amanda Askell, Sandhini
Agarwal, Ariel Herbert-Voss, Gretchen
Krueger, Tom Henighan, Rewon Child, Aditya
Ramesh, Daniel M. Ziegler,
Jeffrey Wu,
Clemens Winter, Christopher Hesse, Mark
Chen, Eric Sigler, Mateusz Litwin, Scott
Gray, Benjamin Chess, Jack Clark, Christopher
Berner, Sam McCandlish, Alec Radford, Ilya
Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6–12, 2020, virtual.
Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada. https://doi.org/10.18653/v1/S17-2001
Wallace Chafe. 1994. Discourse, Consciousness, and Time: The Flow and Displacement of Conscious Experience in Speaking and Writing. University of Chicago Press.
Jing Chen, Qingcai Chen, Xin Liu, Haijun Yang, Daohe Lu, and Buzhou Tang. 2018. The BQ corpus: A large-scale domain-specific Chinese corpus for sentence semantic equivalence identification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4946–4951, Brussels, Belgium. Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-1536
Tongfei Chen, Zhengping Jiang, Adam Poliak, Keisuke Sakaguchi, and Benjamin Van Durme. 2020. Uncertain natural language inference. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8772–8779, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.774
Herbert H. Clark. 2002. Speaking in time. Speech
Communication, 36(1–2):5–13. https://doi
.org/10.1016/S0167-6393(01)00022-X
Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-1269
Ameet Deshpande, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, and Karthik Narasimhan. 2023. Toxicity in ChatGPT: Analyzing persona-assigned language models. arXiv preprint arXiv:2304.05335. https://doi.org/10.48550/arXiv.2304.05335
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In Proceedings of the 2019 Con-
ference of the North American Chapter of the
Association for Computational Linguistics: Hu-
man Language Technologies, Volume 1 (Long
and Short Papers), pages 4171–4186, Min-
neapolis, Minnesota. https://doi.org
/10.18653/v1/N19-1423
Yarin Gal and Zoubin Ghahramani. 2016. Dropout
as a Bayesian approximation: Representing
model uncertainty in deep learning. In In-
ternational Conference on Machine Learning,
pages 1050–1059.
Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6894–6910, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.552
Álvaro Huertas-García, Javier Huertas-Tato, Alejandro Martín, and David Camacho. 2021.
Countering misinformation through semantic-
aware multilingual models. In Intelligent Data
Engineering and Automated Learning–IDEAL
2021, pages 312–323, Cham. Springer Interna-
tional Publishing. https://doi.org/10
.1007/978-3-030-91608-4_31
Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734. https://doi.org/10.48550/arXiv.1702.08734
Michael D. Lee, Brandon Pincombe, and Matthew
Welsh. 2005. An empirical evaluation of models
of text document similarity. In Proceedings of
the Annual Meeting of the Cognitive Science
Society.
Fangyu Liu, Ivan Vulić, Anna Korhonen, and Nigel Collier. 2021. Learning domain-specialised representations for cross-lingual
biomedical entity linking. In Proceedings of
the 59th Annual Meeting of the Association
for Computational Linguistics and the 11th
International Joint Conference on Natural Lan-
guage Processing (Volume 2: Short Papers),
pages 565–574, Online. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/2021.acl-short.72
Xin Liu, Qingcai Chen, Chong Deng, Huajun
Zeng, Jing Chen, Dongfang Li, and Buzhou
Tang. 2018. LCQMC: A large-scale Chinese
question matching corpus. In Proceedings of
the 27th International Conference on Compu-
tational Linguistics, pages 1952–1962, Santa
Fe, New Mexico, USA. Association for
Computational Linguistics.
Marco Marelli, Luisa Bentivogli, Marco Baroni, Raffaella Bernardi, Stefano Menini, and Roberto Zamparelli. 2014. SemEval-2014 task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 1–8, Dublin, Ireland. Association for Computational Linguistics. https://doi.org/10.3115/v1/S14-2001
Irina Matveeva, G. Levow, Ayman Farahat, and
Christian Royer. 2005. Generalized latent se-
mantic analysis for term representation. In
Proceedings of RANLP, page 149.
Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel
Artetxe, Mike Lewis, Hannaneh Hajishirzi,
and Luke Zettlemoyer. 2022. Rethinking the
role of demonstrations: What makes in-context
learning work? In Proceedings of the 2022
Conference on Empirical Methods in Natu-
ral Language Processing, pages 11048–11064,
Abu Dhabi, United Arab Emirates. Association
for Computational Linguistics.
Yixin Nie, Xiang Zhou, and Mohit Bansal.
2020. What can we learn from collective hu-
man opinions on natural language inference
data? In Proceedings of the 2020 Conference
on Empirical Methods in Natural Language
Processing (EMNLP), pages 9131–9143, On-
line. Association for Computational Linguis-
tics. https://doi.org/10.18653/v1
/2020.emnlp-main.734
Amanda Olmin and Fredrik Lindsten. 2022. Ro-
bustness and reliability when training with noisy
labels. In International Conference on Artifi-
cial Intelligence and Statistics, pages 922–942.
PMLR.
Ellie Pavlick and Tom Kwiatkowski. 2019. Inher-
ent disagreements in human textual inferences.
Transactions of the Association for Computa-
tional Linguistics, 7:677–694. https://doi
.org/10.1162/tacl_a_00293
Barbara Plank. 2022. The ‘‘problem’’ of human
label variation: On ground truth in data, model-
ing and evaluation. In Proceedings of the 2022
Conference on Empirical Methods in Natu-
ral Language Processing, pages 10671–10682,
Abu Dhabi, United Arab Emirates. Association
for Computational Linguistics.
Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1410
Nils Reimers and Iryna Gurevych. 2020. Making
monolingual sentence embeddings multilingual
using knowledge distillation. In Proceedings
of the 2020 Conference on Empirical Methods
in Natural Language Processing. Association
for Computational Linguistics. https://doi
.org/10.18653/v1/2020.emnlp-main.365
Philip Resnik. 1999. Semantic similarity in a tax-
onomy: An information-based measure and its
application to problems of ambiguity in natural
language. Journal of Artificial Intelligence Re-
search, 11:95–130. https://doi.org/10
.1613/jair.514
Timo Schick and Hinrich Schütze. 2021. Generat-
ing datasets with pretrained language models. In
Proceedings of the 2021 Conference on Empir-
ical Methods in Natural Language Processing,
pages 6943–6951, Online and Punta Cana,
Dominican Republic. Association for Compu-
tational Linguistics. https://doi.org/10
.18653/v1/2021.emnlp-main.555
Gizem Soğancıoğlu, Hakime Öztürk, and Arzucan Özgür. 2017. BIOSSES: A semantic sentence similarity estimation system for the biomedical domain. Bioinformatics, 33(14):i49–i58. https://doi.org/10.1093/bioinformatics/btx238, PubMed: 28881973
Robert Lawrence Trask and Robert Lawrence
Trask. 1999. Key Concepts in Language and
Linguistics. Psychology Press.
Yanshan Wang, Naveed Afzal, Sunyang Fu, Liwei
Wang, Feichen Shen, Majid Rastegar-Mojarad,
and Hongfang Liu. 2018. MedSTS: A re-
source for clinical semantic textual simi-
larity. Language Resources and Evaluation,
pages 1–16. https://doi.org/10.1007
/s10579-018-9431-1
Yanshan Wang, Sunyang Fu, Feichen Shen, Sam
Henry, Ozlem Uzuner, and Hongfang Liu. 2020.
The 2019 n2c2/ohnlp track on clinical semantic
textual similarity: Overview. JMIR Med In-
form, 8(11). https://doi.org/10.2196
/23375, PubMed: 33245291
Yuxia Wang, Timothy Baldwin, and Karin
Verspoor. 2022a. Noisy label regularisation
for textual regression. In Proceedings of the
29th International Conference on Computa-
tional Linguistics, pages 4228–4240, Gyeongju,
Republic of Korea. International Committee on
Computational Linguistics.
Yuxia Wang, Daniel Beck, Timothy Baldwin, and Karin Verspoor. 2022b. Uncertainty estimation and reduction of pre-trained models for text regression. Transactions of the Association for Computational Linguistics, 10:1–17. https://doi.org/10.1162/tacl_a_00483
Yuxia Wang, Minghan Wang, Yimeng Chen,
Shimin Tao, Jiaxin Guo, Chang Su, Min
Zhang, and Hao Yang. 2022c. Capture hu-
man disagreement distributions by calibrated
networks for natural
language inference. In
Findings of the Association for Computational
Linguistics: ACL 2022, pages 1524–1535,
Dublin,
Ireland. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/2022.findings-acl.120
Yinfei Yang, Yuan Zhang, Chris Tar, and Jason Baldridge. 2019. PAWS-X: A cross-lingual adversarial dataset for paraphrase identification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3687–3692, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1382
Yuan Zhang, Jason Baldridge, and Luheng He.
2019. PAWS: Paraphrase adversaries from
word scrambling. In Proceedings of the 2019
Conference of the North American Chapter
of the Association for Computational Linguis-
tics: Human Language Technologies, Volume 1
(Long and Short Papers), pages 1298–1308,
Minneapolis, Minnesota. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/N19-1131