Sentence Similarity Based on Contexts

Xiaofei Sun♦, Yuxian Meng♣, Xiang Ao★, Fei Wu♦, Tianwei Zhang♥,
Jiwei Li♦♣, and Chun Fan♠

♦Zhejiang University, China, ♣Shannon.AI, China, ★Chinese Academy of Sciences, China,
♥Nanyang Technological University, Singapore, ♠Computer Center, Peking University, China,
♠National Biomedical Imaging Center, Peking University, China, ♠Peng Cheng Laboratory, China
{xiaofei_sun,yuxian_meng,jiwei_li}@shannonai.com, aoxiang@ict.ac.cn
wufei@zju.edu.cn, tianwei.zhang@ntu.edu.sg, fanchun@pku.edu.cn

Abstract

Existing methods to measure sentence similarity are faced with two
challenges: (1) labeled datasets are usually limited in size, making
them insufficient to train supervised neural models; and (2) there is a
training-test gap for unsupervised language modeling (LM) based models
to compute semantic scores between sentences, since sentence-level
semantics are not explicitly modeled at training. This results in
inferior performance on this task. In this work, we propose a new
framework to address these two issues. The proposed framework is based
on the core idea that the meaning of a sentence should be defined by its
contexts, and that sentence similarity can be measured by comparing the
probabilities of generating two sentences given the same context. The
proposed framework is able to generate a high-quality, large-scale
dataset with semantic similarity scores between two sentences in an
unsupervised manner, with which the train-test gap can be largely
bridged. Extensive experiments show that the proposed framework achieves
significant performance boosts over existing baselines under both the
supervised and unsupervised settings across different datasets.

1 Introduction

Measuring sentence similarity is a long-standing task in NLP (Luhn,
1957; Robertson et al., 1995; Blei et al., 2003; Peng et al., 2020). The
task aims at quantitatively measuring the semantic relatedness between
two sentences, and has wide applications in text search (Farouk et al.,
2018), natural language understanding (MacCartney and Manning, 2009),
and machine translation (Yang et al., 2019a).

One of the greatest challenges that existing methods face for sentence
similarity is the lack of large-scale labeled datasets, which contain
sentence pairs with labeled semantic similarity scores. The acquisition
of such a dataset is both labor-intensive and expensive. For example,
the STS benchmark (Cer et al., 2017) and SICK-Relatedness dataset
(Marelli et al., 2014) respectively contain 8.6K and 9.8K labeled
sentence pairs, the sizes of which are usually insufficient for training
deep neural networks.

Unsupervised learning methods are proposed to address this issue, where
word embeddings (Le and Mikolov, 2014) or BERT embeddings (Devlin et
al., 2018) are used to map sentences to fixed-length vectors in an
unsupervised manner. Then sentence similarity is computed based on the
cosine or dot product of these sentence representations. Our work
follows this thread, where sentence similarity is computed based on
fixed-length sentence representations, as opposed to comparing sentences
directly. The biggest issue with current unsupervised approaches is that
there exists a big gap between model training and testing (i.e.,
computing semantic similarity between two sentences). For example,
BERT-style models are trained at the token level by predicting words
given contexts, and at the training stage sentence semantics are not
explicitly modeled and no sentence embeddings are produced. But at test
time, sentence semantics need to be explicitly modeled to obtain
semantic similarity. The inconsistency results in a distinct discrepancy
between the objectives at the two stages and inferior performance on
textual semantic similarity tasks. For example, BERT embeddings yield
inferior performance on semantic similarity benchmarks (Reimers and
Gurevych, 2019), and even underperform naive methods such as averaging
GloVe (Pennington et al., 2014) embeddings.

Transactions of the Association for Computational Linguistics, vol. 10, pp. 573–588, 2022. https://doi.org/10.1162/tacl_a_00477
Action Editor: Chris Quirk. Submission batch: 6/2021; Revision batch: 11/2021; Published 5/2022.
© 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Li et al. (2020) investigated this problem and found that BERT always
induces a non-smooth anisotropic semantic space of sentences, and this
property significantly harms the performance of semantic similarity.

Just as word meanings are defined by neighboring words (Harris, 1954),
the meaning of a sentence is determined by its contexts. Given the same
context, there is a high probability of generating two similar
sentences. If there is a low probability of generating two sentences
given the same context, there is a gap between these two sentences in
the semantic space. Based on this idea, we propose a framework that
measures semantic similarity through the similarity of the probabilities
of generating two sentences given the same context in a fully
unsupervised manner. As for implementation, the framework consists of
the following steps: (1) we train a contextual model by predicting the
probability of a sentence fitting into the left and right contexts; (2)
we obtain sentence pair similarity by comparing scores assigned by the
contextual model across a large number of contexts. To facilitate
inference, we train a surrogate model to play the role of step 2, based
on the outputs from step 1. The surrogate model can be directly used for
sentence similarity prediction in an unsupervised setup, or used as
initialization to be further finetuned on downstream datasets in the
supervised setup. Note that the outcome from step 1 or the surrogate
model is a fixed-length vector for the input sentence. Each element in
the vector indicates how well the input sentence fits the context
corresponding to that element, and the vector itself can be viewed as
the overall semantics of the input sentence in the contextual space. We
then use the cosine similarity between two sentence vectors to compute
the semantic similarity.

The proposed framework offers the potential to fully address the two
challenges above: (1) the context regularization provides a reliable
means to generate a large-scale high-quality dataset with semantic
similarity scores based on an unlabeled corpus; and (2) the train-test
gap can be naturally bridged by training the model on the large-scale
similarity dataset, leading to significant performance gains compared to
utilizing pretrained models directly.

We conduct experiments on different datasets under both supervised and
unsupervised setups, and experimental results show that the proposed
framework significantly outperforms existing sentence similarity models.

2 Related Work

Statistics-based methods for measuring sentence similarity include
bag-of-words (BoW) (Li et al., 2006), term frequency inverse document
frequency (TF-IDF) (Luhn, 1957; Jones, 2004), BM25 (Robertson et al.,
1995), latent semantic indexing (LSI) (Deerwester et al., 1990), and
latent Dirichlet allocation (LDA) (Blei et al., 2003). Deep learning
based methods for sentence similarity rely on distributed
representations (Mikolov et al., 2013; Le and Mikolov, 2014) and can be
generally divided into the following three categories.

Matrix Based Methods

The first line of work for measuring sentence similarity is to construct
a similarity matrix between two sentences, each element of which
represents the similarity between the two corresponding units in the two
sentences. Then the matrix is aggregated in different ways to induce the
final similarity score. Pang et al. (2016) applied a two-layer
convolutional neural network (CNN) followed by a feed-forward layer to
the similarity matrix to derive the similarity score. He and Lin (2016)
used a deeper CNN to make the best use of the similarity matrix. Yin and
Schütze (2015) built a hierarchical architecture to model text
compositions at different granularities, so several similarity matrices
can be computed and combined for interactions. Other works proposed
using the attention mechanism as a way of computing the similarity
matrix (Rocktäschel et al., 2015; Wang et al., 2016; Parikh et al.,
2016; Seo et al., 2016; Shen et al., 2017; Lin et al., 2017; Gong et
al., 2017; Tan et al., 2018; Kim et al., 2019; Yang et al., 2019b).

Word Distance Based Methods

The second line of work to measure sentence similarity is to calculate
the cost of transforming one sentence into another; the smaller the cost
is, the more similar the two sentences are. This idea is implemented by
the Word Mover's Distance (WMD) (Kusner et al., 2015), which measures
the dissimilarity between two documents as the minimum amount of
distance that the embedded words of one document need to travel to reach
the embedded words of another document. Subsequent works improve WMD
by incorporating supervision from downstream tasks (Huang et al., 2016),
introducing hierarchical optimal transport over topics (Yurochkin et
al., 2019), addressing the complexity limitation of having to consider
each pair (Wu and Li, 2017; Wu et al., 2018; Backurs et al., 2020), and
combining graph structures with WMD to perform cross-domain alignment
(Chen et al., 2020). More recently, Yokoi et al. (2020) proposed
disentangling word vectors in Word Rotator's Distance (WRD), which has
shown significant performance boosts over vanilla WMD.
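
To make the transport idea concrete, the following is a minimal Python sketch of the relaxed lower bound on WMD described by Kusner et al. (2015), in which each word of one document moves all of its mass to the nearest word of the other document. It is an illustration only and not the exact WMD, which requires solving an optimal transport problem. The toy embedding table `emb`, the uniform per-token weights, and all function names are assumptions made for this example.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def one_sided_cost(doc_a, doc_b, emb):
    """Average distance from each word of doc_a to its nearest word in doc_b."""
    total = 0.0
    for w in doc_a:
        total += min(euclidean(emb[w], emb[x]) for x in doc_b)
    return total / len(doc_a)

def relaxed_wmd(doc_a, doc_b, emb):
    """Relaxed WMD: the larger of the two one-sided relaxations."""
    return max(one_sided_cost(doc_a, doc_b, emb),
               one_sided_cost(doc_b, doc_a, emb))

# Hypothetical 2-d word embeddings, chosen by hand for illustration.
emb = {
    "obama": [1.0, 0.0], "president": [0.9, 0.1],
    "speaks": [0.0, 1.0], "greets": [0.1, 0.9],
    "banana": [-1.0, -1.0],
}
doc1 = ["obama", "speaks"]
doc2 = ["president", "greets"]
doc3 = ["banana"]
near = relaxed_wmd(doc1, doc2, emb)  # related documents: small cost
far = relaxed_wmd(doc1, doc3, emb)   # unrelated documents: large cost
```

A smaller value indicates more similar documents, mirroring the "smaller cost, more similar" reading of WMD given above.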

Sentence Embedding Based Methods

Sentence embeddings are high-dimensional representations for sentences.
They are expected to contain rich sentence semantics so that the
similarity between two sentences can be computed by considering their
sentence embeddings via certain metrics such as cosine similarity. Le
and Mikolov (2014) introduced the paragraph vector, which is learned in
an unsupervised manner by predicting the words within the paragraph
using the paragraph vector. Subsequently, a line of sentence embedding
methods such as FastText, Skip-Thought vectors (Kiros et al., 2015),
Smooth Inverse Frequency (SIF) (Arora et al., 2017), Sequential
Denoising Autoencoders (SDAEs) (Hill et al., 2016), InferSent (Conneau
et al., 2017), Quick-Thought vectors (Logeswaran and Lee, 2018), and
Universal Sentence Encoder (Cer et al., 2018) have been proposed to
improve the sentence embedding quality with more efficiency. The great
success achieved by large-scale pretraining models (Devlin et al., 2018;
Liu et al., 2019) has recently stimulated a strand of work on producing
sentence embeddings based on the pretraining-finetuning paradigm using
large-scale unlabeled corpora. The cosine similarity between the
representations of two sentences produced by large-scale pretrained
models is treated as the semantic similarity (Reimers and Gurevych,
2019; Wang and Kuo, 2020; Li et al., 2020). Su et al. (2021) and Huang
et al. (2021) proposed regularizing the sentence representations by
whitening them, that is, enforcing the covariance to be an identity
matrix to address the non-smooth anisotropic distribution issue (Li et
al., 2020).

The BERT-based scores (Zhang et al., 2020; Sellam et al., 2020), though
serving as automatic metrics, also capture rich semantic information
regarding the sentence and have the potential for measuring semantic
similarity. Cer et al. (2018) proposed a method of encoding sentences
into embeddings that specifically target transfer learning to other NLP
tasks. Karpukhin et al. (2020) adopted two separate BERT encoder models
whose weights are optimized to maximize the dot product between the
representations of a question and its relevant passage. The most recent
line of work focuses on leveraging the contrastive learning framework to
tackle semantic textual similarity (Wu et al., 2020; Carlsson et al.,
2021; Kim et al., 2021; Yan et al., 2021; Gao et al., 2021), where two
similar sentences are pulled close and two random sentences are pushed
apart in the sentence representation space. This learning strategy helps
better separate sentences with different semantics.
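
As an illustration of the pull-close and push-apart behavior, the sketch below computes an InfoNCE-style contrastive loss for one anchor sentence in plain Python. It is a generic formulation and not the exact objective of any single cited work; the toy 2-d embeddings, the temperature value of 0.05, and the function names are assumptions made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def info_nce(anchor, positive, negatives, temperature=0.05):
    """Negative log-softmax of the positive pair among all candidates."""
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    logits = [s / temperature for s in sims]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[0]

anchor = [1.0, 0.0]
# Positive close to the anchor, negatives far away: low loss.
loss_good = info_nce(anchor, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
# Positive far from the anchor, one negative close: high loss.
loss_bad = info_nce(anchor, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
```

Minimizing this quantity increases the similarity of the positive pair relative to the random negatives.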

This work is motivated by learning word representations from their
contexts (Mikolov et al., 2013; Le and Mikolov, 2014) with the
assumption that the meaning of a word is determined by its context. Our
work is based on large-scale pretrained models and aims at learning
informative sentence representations for measuring sentence similarity.

3 Model

3.1 Overview

The key point of the proposed paradigm is to compute semantic similarity
between two sentences by measuring the probabilities of generating the
two sentences across a number of contexts.

We can achieve this goal based on the following steps: (1) we first need
to train a contextual model to predict the probability of a sentence
fitting into the left and right contexts. This goal can be achieved by
either a discriminative model, that is, predicting the probability that
the concatenation of a sentence with context forms a coherent text, or a
generative model, that is, predicting the probability of generating a
sentence given contexts; (2) next, given a pair of sentences, we can
measure their similarity by comparing their scores assigned by
contextual models given different contexts; (3) for step 2, for any pair
of sentences at test time, we need to sample different contexts to
compute scores assigned by contextual models, which is time-consuming.
We thus propose to train a surrogate model that takes a pair of
sentences as inputs and predicts the similarity assigned by the
contextual model. This enables faster inference, albeit at a small
sacrifice of accuracy; (4) the surrogate
model can be directly used for obtaining sentence similarity scores in
an unsupervised manner, or used as model initialization, which will be
further fine-tuned on downstream datasets in a supervised setting. We
will discuss the details of each module in order below.

3.2 Training Contextual Models

We need a contextual model to predict the probability of a sentence
fitting into left and right contexts. We combine a generative model and
a discriminative model to achieve this goal, allowing us to take
advantage of both to model text coherence (Li et al., 2017).

Notations Let $c_i$ denote the $i$-th sentence, which consists of a
sequence of words $c_i = \{c_{i,1}, \ldots, c_{i,n_i}\}$, where $n_i$
denotes the number of words in $c_i$. Let $c_{i:j}$ denote the $i$-th to
$j$-th sentences. $c_{<i}$ and $c_{>i}$ respectively denote the
preceding and subsequent context of $c_i$.

3.2.1 Discriminative Models
The discriminative model takes a sequence of consecutive sentences
$[c_{<i}, c_i, c_{>i}]$ as the input, and maps the input to a
probability indicating whether the input is natural and coherent. We
treat sentence sequences taken from the original articles written by
humans as positive examples and sequences with replacements of the
center sentence $c_i$ as negative ones. Half of the replacements of
$c_i$ come from the original document, and half come from random
sentences from the corpus. The concatenation of LSTM representations at
the last step (right-to-left and left-to-right) is used to represent the
sentence. Sentence representations for consecutive sentences are
concatenated and fed to the sigmoid function to obtain the final
probability:

$$p(y = 1 \mid c_i, c_{<i}, c_{>i}) = \mathrm{sigmoid}(h^{\top} [h_{<i}, h_i, h_{>i}]) \qquad (1)$$

where $h$ denotes learnable parameters and $[h_{<i}, h_i, h_{>i}]$ is
the concatenation of the sentence representations. We deliberately make
the discriminative model simple for two reasons: the discriminative
approach for coherence prediction is a relatively easy task and, more
importantly, it will be further used in the next selection stage for
screening, where faster speed is preferred.
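
The construction of training examples for the discriminative model can be sketched as follows. This is a minimal illustration under simplifying assumptions: the context is reduced to one sentence on each side, negatives alternate between the two sources to obtain the half-and-half split, and the function name `build_examples` and its parameters are hypothetical.

```python
import random

def build_examples(doc, corpus, i, n_neg=4, seed=0):
    """Return (sentence_triple, label) pairs for the window around doc[i].

    doc    -- list of sentences (the original document)
    corpus -- list of documents, each a list of sentences
    """
    rng = random.Random(seed)
    left, center, right = doc[i - 1], doc[i], doc[i + 1]
    examples = [((left, center, right), 1)]  # human-written text: positive
    # Replacement candidates from the same document (outside the window) ...
    same_doc = [s for j, s in enumerate(doc) if j not in (i - 1, i, i + 1)]
    # ... and from all other documents in the corpus.
    other_docs = [s for d in corpus if d is not doc for s in d]
    for k in range(n_neg):
        pool = same_doc if k % 2 == 0 else other_docs  # half from each source
        examples.append(((left, rng.choice(pool), right), 0))
    return examples

doc = ["a0", "a1", "a2", "a3", "a4"]
other = ["b0", "b1", "b2"]
examples = build_examples(doc, [doc, other], i=2)
```

Each returned triple would then be encoded and scored with Eq. (1); the encoder itself is omitted here.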

3.2.2 Generative Models
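Given contexts $c_{<i}$ and $c_{>i}$, the generative model predicts the
probability of generating each token in sentence $c_i$ sequentially
using SEQ2SEQ structures (Sutskever et al., 2014) as the backbone:

$$p(c_i \mid c_{<i}, c_{>i}) = \prod_{j=1}^{n_i} p(c_{i,j} \mid c_{<i}, c_{>i}, c_{i,<j}) \qquad (2)$$

We consider not only the forward probability of generating a sentence
given its contexts, $p(c_i \mid c_{<i}, c_{>i})$, but also the backward
probability of generating contexts given sentences. The
context-given-sentence probability can be modeled by predicting
preceding contexts given subsequent contexts, $p(c_{<i} \mid c_i,
c_{>i})$, and by predicting subsequent contexts given preceding
contexts, $p(c_{>i} \mid c_{<i}, c_i)$.

3.3 Scoring Sentence Pairs

Given context $[c_{<i}, c_{>i}]$, the score for $s_i$ fitting into the
context is the linear combination of scores from discriminative and
generative models:

$$S(s_i, c_{<i}, c_{>i}) = \lambda_1 \log p(y = 1 \mid s_i, c_{<i}, c_{>i}) + \lambda_2 \frac{1}{|s_i|} \log p(s_i \mid c_{<i}, c_{>i}) + \lambda_3 \frac{1}{|c_{<i}|} \log p(c_{<i} \mid s_i, c_{>i}) + \lambda_4 \frac{1}{|c_{>i}|} \log p(c_{>i} \mid c_{<i}, s_i) \qquad (3)$$

where $\lambda_1, \ldots, \lambda_4$ weight the four terms. For brevity,
we use $c$ to denote the context $c_{<i}, c_{>i}$; $S(s_i, c)$ is thus
equivalent to $S(s_i, c_{<i}, c_{>i})$.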
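
The linear combination of a coherence log-probability with three length-normalized generative log-probabilities can be sketched in a few lines of Python. The function name `context_score`, its argument names, and the default weights of 1.0 are assumptions made for illustration; the inputs are taken to be outputs of already trained discriminative and generative models.

```python
import math

def context_score(p_coherent, logp_sent, logp_left, logp_right,
                  len_sent, len_left, len_right,
                  lambdas=(1.0, 1.0, 1.0, 1.0)):
    """Score of a sentence fitting into a context.

    p_coherent -- discriminative probability p(y = 1 | sentence, context)
    logp_sent  -- log p(sentence | left, right)
    logp_left  -- log p(left | sentence, right)
    logp_right -- log p(right | left, sentence)
    len_*      -- token counts used for length normalization
    """
    l1, l2, l3, l4 = lambdas
    return (l1 * math.log(p_coherent)
            + l2 * logp_sent / len_sent
            + l3 * logp_left / len_left
            + l4 * logp_right / len_right)

# Toy numbers: 0 + (-4/4) + (-6/3) + (-8/4) = -5.0
score = context_score(1.0, -4.0, -6.0, -8.0, 4, 3, 4)
```

Dividing each generative log-probability by the length of the generated span keeps long sentences or contexts from dominating the score.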

Let $C$ denote a set of contexts, where $N_C$ is the size of $C$. For a
sentence $s$, its semantic representation $v_s$ is an $N_C$-dimensional
vector, with each individual value being $S(s, c)$ with $c \in C$. The
semantic similarity between two sentences $s_1$ and $s_2$ can be
computed based on $v_{s_1}$ and $v_{s_2}$ using different metrics such
as cosine similarity.

Constructing $C$ We need to pay special attention to the construction of
$C$. The optimal situation is to use all contexts, where $C$ is the
entire corpus. Unfortunately, this is computationally prohibitive as we
need to iterate over the entire corpus for each sentence $s$.

We propose the following workaround for tractable computation. For a
sentence $s$, rather than using the full corpus as $C$, we construct its
sentence-specific context set $C_s$ in such a way that $s$ can fit into
all constituent contexts in $C_s$. The
intuition is as follows. With respect to sentence $s_1$, contexts can be
divided into two categories: contexts that $s_1$ fits into, based on
which we will measure whether or not $s_2$ also fits in, and contexts
that $s_1$ does not fit into, for which we would measure whether or not
$s_2$ also does not fit in. We are mostly concerned with the former, and
can neglect the latter. The reason is as follows: The latter can be
further divided into two categories: contexts that fit neither $s_1$ nor
$s_2$, and contexts that do not fit $s_1$ but fit $s_2$. For contexts
that fit neither $s_1$ nor $s_2$, we can neglect them since two
sentences not fitting into the same context does not signify their
semantic relatedness; for contexts that do not fit $s_1$ but fit $s_2$,
we can leave them to when we compute $C_{s_2}$.

Practically, for a given sentence $s$, we first use TF-IDF weighted BoW
bi-gram vectors to perform primary screening on the whole corpus to
retrieve related text chunks (20K for each sentence). Next, we rank all
contexts using the discriminative model based on Eq. (1). For
discriminative models, we cache sentence representations in advance, and
compute model scores in the last neural layer, which is significantly
faster than the generative model. This two-step selection strategy is
akin to the pipelined selection system (Chen et al., 2017; Karpukhin et
al., 2020) in open-domain QA that contains document retrieval using IR
systems and fine-grained question answering using neural QA models.
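
The primary screening step can be sketched with a small pure-Python TF-IDF retriever over word bi-grams. This is an illustrative sketch only: whitespace tokenization, the smoothed IDF formula, the tiny `top_k`, and all function names are assumptions, and a real system would use a sparse index rather than scoring every chunk.

```python
import math
from collections import Counter

def bigrams(text):
    """Lowercased word bi-grams of a whitespace-tokenized text."""
    toks = text.lower().split()
    return [" ".join(toks[i:i + 2]) for i in range(len(toks) - 1)]

def tfidf_vectors(chunks):
    """Sparse TF-IDF vectors (dicts) for all chunks, plus the IDF table."""
    counts = [Counter(bigrams(c)) for c in chunks]
    df = Counter(g for c in counts for g in c)  # document frequency
    n = len(chunks)
    idf = {g: math.log((1 + n) / (1 + d)) + 1.0 for g, d in df.items()}
    return [{g: tf * idf[g] for g, tf in c.items()} for c in counts], idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(sentence, chunks, top_k=2):
    """Indices of the top_k chunks most similar to the sentence."""
    vecs, idf = tfidf_vectors(chunks)
    query = {g: tf * idf.get(g, 0.0)
             for g, tf in Counter(bigrams(sentence)).items()}
    ranked = sorted(range(len(chunks)),
                    key=lambda i: cosine(query, vecs[i]), reverse=True)
    return ranked[:top_k]

chunks = ["the cat sat on the mat",
          "stocks fell sharply on monday",
          "a cat sat on a sofa"]
hits = retrieve("the cat sat on the rug", chunks, top_k=2)
```

The retrieved chunks would then be passed to the discriminative model for ranking, as described above.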

$C_s$ is built by selecting the top-ranked contexts by Eq. (3). We use
an incremental construction strategy, adding one context at a time. To
promote diversity of $C_s$, each text chunk is allowed to contribute at
most one context, and the Jaccard similarity between the $(i-1)$-th
sentence in the context to select and those already selected should be
lower than 0.5.¹
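
The incremental construction with the two diversity constraints can be sketched as a greedy loop. In this sketch, candidates are assumed to be dictionaries carrying a chunk identifier, the preceding sentence of the context (`left`), and a precomputed score; these field names, the word-level Jaccard similarity, and the function names are assumptions made for illustration.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two sentences."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def build_context_set(candidates, size, max_jaccard=0.5):
    """Greedily pick top-scored contexts subject to diversity constraints."""
    selected, used_chunks = [], set()
    for cand in sorted(candidates, key=lambda c: c["score"], reverse=True):
        if len(selected) == size:
            break
        if cand["chunk"] in used_chunks:
            continue  # each text chunk contributes at most one context
        if any(jaccard(cand["left"], s["left"]) >= max_jaccard
               for s in selected):
            continue  # too similar to an already selected context
        selected.append(cand)
        used_chunks.add(cand["chunk"])
    return selected

candidates = [
    {"chunk": 0, "left": "he opened the door", "score": -1.0},
    {"chunk": 0, "left": "something else entirely", "score": -1.5},
    {"chunk": 1, "left": "he opened the door slowly", "score": -2.0},
    {"chunk": 2, "left": "rain fell all night", "score": -3.0},
]
context_set = build_context_set(candidates, size=2)
```

In this toy run the second candidate is skipped because its chunk is already used, and the third because its preceding sentence overlaps too strongly with the first.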

To compute semantic similarity between $s_1$ and $s_2$, we concatenate
$C_{s_1}$ and $C_{s_2}$ and use the concatenation as the context set
$C$. The semantic similarity score between $s_1$ and $s_2$ is given as
follows:

$$\begin{aligned}
v_{s_1} &= [S(s_1, c) \text{ for } c \in C_{s_1} + C_{s_2}] \\
v_{s_2} &= [S(s_2, c) \text{ for } c \in C_{s_1} + C_{s_2}] \\
\mathrm{sim}(s_1, s_2) &= \mathrm{cosine}(v_{s_1}, v_{s_2})
\end{aligned} \qquad (4)$$

¹This strategy can also remove text duplicates.
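
Eq. (4) can be illustrated with the following Python sketch, which scores both sentences against the concatenated context sets and compares the resulting vectors. The scoring function is passed in as a parameter; the `toy_score` used here (a shared-word count) is a hypothetical stand-in for the model-based score of Eq. (3) and is only meant to make the example runnable.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity(s1, s2, contexts_s1, contexts_s2, score):
    """Cosine between the context-score vectors of s1 and s2."""
    contexts = list(contexts_s1) + list(contexts_s2)  # C_{s1} + C_{s2}
    v1 = [score(s1, c) for c in contexts]
    v2 = [score(s2, c) for c in contexts]
    return cosine(v1, v2)

def toy_score(sentence, context):
    """Hypothetical scorer: number of words shared with the context."""
    return len(set(sentence.split()) & set(context.split()))

c1 = ["the cat is tired", "cat naps daily"]
c2 = ["a cat rests"]
sim_close = similarity("the cat sleeps", "the cat naps", c1, c2, toy_score)
sim_far = similarity("the cat sleeps", "stocks fell", c1, c2, toy_score)
```

Because both vectors are indexed by the same contexts, each dimension compares how well the two sentences fit one particular context.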


3.4 Training Surrogate Models

The method described in Section 3.3 provides a direct way to compute
scores for semantic relatedness. But it comes with the severe
shortcoming of slow speed at inference time: Given an arbitrary pair of
sentences, the model still needs to go through the entire corpus,
harvest the context set $C_s$, and iterate over all instances in $C_s$
for context score calculation based on Eq. (3), each of which is
time-consuming. To address this issue, we propose training a surrogate
model to accelerate inference.

Specifically, we first harvest similarity scores for sentence pairs
using methods in Section 3.3. We collect scores for 100M pairs in total,
which are further split into train/dev/test by 98/1/1. Next, by treating
harvested similarity scores as gold labels, we train a neural model that
takes a pair of sentences as input and predicts its similarity score.
The cosine similarity between the two sentence representations is the
predicted semantic similarity, and we minimize the L2 distance between
predicted and gold similarities. The Siamese structure makes it possible
for fixed-sized vectors for input sentences to be derived and stored,
allowing for fast semantic similarity search, which we will discuss in
detail in the ablation study section.
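
The training objective and the data split can be sketched as follows. This is an illustration under stated assumptions: the sentence embeddings are given as plain lists rather than produced by a Siamese encoder, the L2 objective is taken to be the mean squared error over pairs, and the function names `l2_loss` and `split_98_1_1` are hypothetical.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def l2_loss(pairs):
    """Mean squared error between predicted cosine and gold similarity.

    pairs -- list of (embedding_a, embedding_b, gold_similarity)
    """
    errors = [(cosine(a, b) - gold) ** 2 for a, b, gold in pairs]
    return sum(errors) / len(errors)

def split_98_1_1(items, seed=0):
    """Shuffle and split items into train/dev/test by 98/1/1."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_dev = n_test = len(items) // 100
    n_train = len(items) - n_dev - n_test
    return (items[:n_train],
            items[n_train:n_train + n_dev],
            items[n_train + n_dev:])

# Toy batch: first pair is predicted perfectly, second is off by 0.5.
loss = l2_loss([([1.0, 0.0], [1.0, 0.0], 1.0),
                ([1.0, 0.0], [0.0, 1.0], 0.5)])
train, dev, test = split_98_1_1(range(1000))
```

Since the prediction depends on the two sentences only through their embeddings, the embeddings can be computed once and stored, which is what enables fast similarity search.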

It is worth noting both the advantages and disadvantages of the
surrogate model. As for advantages, firstly, it can significantly speed
up inference as it avoids the time-consuming process of iterating over
the entire corpus to construct $C$. Secondly, the surrogate shares the
same structure with existing widely-used models such as BERT and
RoBERTa, and can thus later be easily fine-tuned on human-labeled
datasets in supervised learning; on the other hand, the original model
in Section 3.3 cannot be readily combined with other human-labeled
datasets. As for disadvantages, the surrogate model inevitably comes
with a cost of accuracy, as its upper bound is the original model in
Section 3.3.

4 Experiments

4.1 Experiment Settings

We evaluate the surrogate model on Semantic Textual Similarity (STS),
the Argument Facet Similarity (AFS) corpus (Misra et al., 2016), and
Wikipedia Sections Distinction (Ein Dor et al., 2018) tasks. We perform
both unsupervised and
supervised evaluations on these tasks. For unsupervised evaluations,
models are directly used for obtaining sentence representations. For
supervised evaluations, we use the training set to fine-tune all models
and use L2 regression as the objective function. In addition, we also
conduct partially supervised evaluation on STS benchmarks.

Implementation Details For the discriminative model in Section 3.2.1, we
use a single-layer bi-directional LSTM as the backbone with the size of
hidden states set to 300.

For the generative model in Section 3.2.2, we implement the above three
models, namely, $p(c_i \mid c_{<i}, c_{>i})$, $p(c_{<i} \mid c_i,
c_{>i})$, and $p(c_{>i} \mid c_{<i}, c_i)$. [...]

References

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018.
BERT: Pre-training of deep bidirectional transformers for language
understanding. arXiv preprint arXiv:1810.04805.

Liat Ein Dor, Yosi Mass, Alon Halfon, Elad Venezian, Ilya Shnayderman,
Ranit Aharonov, and Noam Slonim. 2018. Learning thematic similarity
metric from article sections using triplet networks. In Proceedings of
the 56th Annual Meeting of the Association for Computational Linguistics
(Volume 2: Short Papers), pages 49–54, Melbourne, Australia. Association
for Computational Linguistics. https://doi.org/10.18653/v1/P18-2009

Mamdouh Farouk, Mitsuru Ishizuka, and Danushka Bollegala. 2018. Graph
matching based semantic search engine. In Research Conference on
Metadata and Semantics Research, pages 89–100. Springer.
https://doi.org/10.1007/978-3-030-14401-2_8

Tianyu Gao, Xingcheng Yao, and Danqi Chen.
2021. SimCSE: Simple contrastive learning of
sentence embeddings. In Empirical Methods in
Natural Language Processing (EMNLP).

Yichen Gong, Heng Luo, and Jian Zhang. 2017. Natural language inference
over interaction space. arXiv preprint arXiv:1709.04348.

Zellig S. Harris. 1954. Distributional structure. Word, 10(2-3):146–162.
https://doi.org/10.1080/00437956.1954.11659520

Hua He and Jimmy Lin. 2016. Pairwise word interaction modeling with deep
neural networks for semantic similarity measurement. In Proceedings of
the 2016 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, pages 937–948,
San Diego, California. Association for Computational Linguistics.
https://doi.org/10.18653/v1/N16-1108

Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning distributed
representations of sentences from unlabelled data. In Proceedings of the
2016 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, pages 1367–1377,
San Diego, California. Association for Computational Linguistics.
https://doi.org/10.18653/v1/N16-1162

Gao Huang, Chuan Guo, Matt J. Kusner, Yu Sun, Fei Sha, and Kilian Q.
Weinberger. 2016. Supervised word mover's distance. Advances in Neural
Information Processing Systems, 29:4862–4870.

Junjie Huang, Duyu Tang, Wanjun Zhong, Shuai Lu, Linjun Shou, Ming Gong,
Daxin Jiang, and Nan Duan. 2021. WhiteningBERT: An easy unsupervised
sentence embedding approach. arXiv preprint arXiv:2104.01767.
https://doi.org/10.18653/v1/2021.findings-emnlp.23

Karen Spärck Jones. 2004. A statistical interpretation of term
specificity and its application in retrieval. Journal of Documentation,
60:493–502. https://doi.org/10.1108/eb026526

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu,
Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage
retrieval for open-domain question answering. In Proceedings of the 2020
Conference on Empirical Methods in Natural Language Processing (EMNLP),
pages 6769–6781, Online. Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020.emnlp-main.550

Seonhoon Kim, Inho Kang, and Nojun Kwak. 2019. Semantic sentence
matching with
densely-connected recurrent and co-attentive information. In Proceedings
of the AAAI Conference on Artificial Intelligence, volume 33, pages
6586–6593. https://doi.org/10.1609/aaai.v33i01.33016586

Taeuk Kim, Kang Min Yoo, and Sang-goo Lee. 2021. Self-guided contrastive
learning for BERT sentence representations. In Proceedings of the 59th
Annual Meeting of the Association for Computational Linguistics and the
11th International Joint Conference on Natural Language Processing
(Volume 1: Long Papers), pages 2528–2540, Online. Association for
Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-long.197

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic
optimization. arXiv preprint arXiv:1412.6980.

Ryan Kiros, Yukun Zhu, Russ R. Salakhutdinov, Richard Zemel, Raquel
Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors.
Advances in Neural Information Processing Systems, 28:3294–3302.

Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. 2015. From
word embeddings to document distances. In International Conference on
Machine Learning, pages 957–966.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of
sentences and documents. In International Conference on Machine
Learning, pages 1188–1196.

Bohan Li, Hao Zhou, Junxian He, Mingxuan Wang, Yiming Yang, and Lei Li.
2020. On the sentence embeddings from pre-trained language models. In
Proceedings of the 2020 Conference on Empirical Methods in Natural
Language Processing (EMNLP), pages 9119–9130, Online. Association for
Computational Linguistics.

Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan
Jurafsky. 2017. Adversarial learning for neural dialogue generation.
arXiv preprint arXiv:1701.06547.

Yuhua Li, David McLean, Zuhair A. Bandar, James D. O'Shea, and Keeley
Crockett. 2006. Sentence similarity based on semantic nets and corpus
statistics. IEEE Transactions on Knowledge and Data Engineering,
18(8):1138–1150. https://doi.org/10.1109/TKDE.2006.130

Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang,
Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive
sentence embedding. arXiv preprint arXiv:1703.03130.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen,
Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019.
RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint
arXiv:1907.11692.

Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for
learning sentence representations. arXiv preprint arXiv:1803.02893.

Hans Peter Luhn. 1957. A statistical approach to mechanized encoding and
searching of literary information. IBM Journal of Research and
Development, 1(4):309–317. https://doi.org/10.1147/rd.14.0309

Bill MacCartney and Christopher D. Manning. 2009. Natural Language
Inference. Citeseer.

Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella
Bernardi, and Roberto Zamparelli. 2014. A SICK cure for the evaluation
of compositional distributional semantic models. In Proceedings of the
Ninth International Conference on Language Resources and Evaluation
(LREC'14), pages 216–223, Reykjavik, Iceland. European Language
Resources Association (ELRA).

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean.
2013. Distributed representations of words and phrases and their
compositionality. Advances in Neural Information Processing Systems,
26:3111–3119.

Amita Misra, Brian Ecker, and Marilyn Walker. 2016. Measuring the
similarity of sentential arguments in dialogue. In Proceedings of the
17th Annual Meeting of the Special Interest Group on Discourse and
Dialogue, pages 276–287, Los Angeles. Association for Computational
Linguistics. https://doi.org/10.18653/v1/W16-3636

Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu,
Shengxian Wan, and Xueqi Cheng. 2016. Text
matching as image recognition. In Proceedings
of the AAAI Conference on Artificial
Intelligence, volume 30.

Ankur P. Parikh, Oscar Täckström, Dipanjan Das,
and Jakob Uszkoreit. 2016. A decomposable
attention model for natural language inference.
arXiv preprint arXiv:1606.01933.
https://doi.org/10.18653/v1/D16-1244

Shuang Peng, Hengbin Cui, Niantao Xie, Sujian
Li, Jiaxing Zhang, and Xiaolong Li. 2020.
Enhanced-rcnn: An efficient method for learning
sentence similarity. In Proceedings of
The Web Conference 2020, WWW ’20,
pages 2500–2506, New York, NY, USA.
Association for Computing Machinery.

Jeffrey Pennington, Richard Socher, and
Christopher D. Manning. 2014. Glove: Global
vectors for word representation. In Proceedings
of the 2014 Conference on Empirical Methods
in Natural Language Processing (EMNLP),
pages 1532–1543.
https://doi.org/10.3115/v1/D14-1162

Nils Reimers and Iryna Gurevych. 2019.
Sentence-bert: Sentence embeddings using
Siamese BERT-networks. arXiv preprint
arXiv:1908.10084.
https://doi.org/10.18653/v1/D19-1410

Stephen E. Robertson, Steve Walker, Susan Jones,
Micheline M. Hancock-Beaulieu, Mike Gatford,
et al. 1995. Okapi at trec-3. Nist Special
Publication Sp, 109:109.

Tim Rocktäschel, Edward Grefenstette, Karl
Moritz Hermann, Tomáš Kočiský, and Phil
Blunsom. 2015. Reasoning about entailment
with neural attention. arXiv preprint
arXiv:1509.06664.

Thibault Sellam, Dipanjan Das, and Ankur
Parikh. 2020. BLEURT: Learning robust metrics
for text generation. In Proceedings of the
58th Annual Meeting of the Association for
Computational Linguistics, pages 7881–7892,
Online. Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020.acl-main.704

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi,
and Hannaneh Hajishirzi. 2016. Bidirectional
attention flow for machine comprehension.
arXiv preprint arXiv:1611.01603.

Gehui Shen, Yunlun Yang, and Zhi-Hong
Deng. 2017. Inter-weighted alignment network
for sentence pair modeling. In Proceedings
of the 2017 Conference on Empirical
Methods in Natural Language Processing,
pages 1179–1189.
https://doi.org/10.18653/v1/D17-1122

Jianlin Su, Jiarun Cao, Weijie Liu, and Yangyiwen
Ou. 2021. Whitening sentence representations
for better semantics and faster retrieval. arXiv
preprint arXiv:2103.15316.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le.
2014. Sequence to sequence learning with
neural networks. In Advances in Neural
Information Processing Systems, pages 3104–3112.

Chuanqi Tan, Furu Wei, Wenhui Wang, Weifeng
Lv, and Ming Zhou. 2018. Multiway attention
networks for modeling sentence pairs. In IJCAI,
pages 4411–4417.

Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Łukasz Kaiser, and Illia Polosukhin. 2017.
Attention is all you need. In Advances
in Neural Information Processing Systems,
pages 5998–6008.

Bin Wang and C.-C. Jay Kuo. 2020. SBERT-wk:
A sentence embedding method by dissecting
bert-based word models. IEEE/ACM Transactions
on Audio, Speech, and Language
Processing, 28:2146–2157.
https://doi.org/10.1109/TASLP.2020.3008390

Zhiguo Wang, Haitao Mi, and Abraham
Ittycheriah. 2016. Sentence similarity learning
by lexical decomposition and composition. In
Proceedings of COLING 2016, the 26th
International Conference on Computational
Linguistics: Technical Papers, pages 1340–1349,
Osaka, Japan. The COLING 2016 Organizing
Committee.

Adina Williams, Nikita Nangia, and Samuel
Bowman. 2018. A broad-coverage challenge
corpus for sentence understanding through
inference. In Proceedings of the 2018 Conference
of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies, Volume 1
(Long Papers), pages 1112–1122, New Orleans,
Louisiana. Association for Computational
Linguistics.
https://doi.org/10.18653/v1/N18-1101

Lingfei Wu, Ian E. H. Yen, Kun Xu, Fangli
Xu, Avinash Balakrishnan, Pin-Yu Chen,
Pradeep Ravikumar, and Michael J. Witbrock.
2018. Word mover’s embedding: From word2vec
to document embedding. arXiv preprint
arXiv:1811.01713.

Xinhui Wu and Hui Li. 2017. Topic mover’s
distance based document classification.
In 2017 IEEE 17th International Conference
on Communication Technology (ICCT),
pages 1998–2002. IEEE.

Zhuofeng Wu, Sinong Wang, Jiatao Gu, Madian
Khabsa, Fei Sun, and Hao Ma. 2020. Clear:
Contrastive learning for sentence representation.
arXiv preprint arXiv:2012.15466.

Yuanmeng Yan, Rumei Li, Sirui Wang, Fuzheng
Zhang, Wei Wu, and Weiran Xu. 2021.
ConSERT: A contrastive framework for
self-supervised sentence representation transfer. In
Proceedings of the 59th Annual Meeting of
the Association for Computational Linguistics
and the 11th International Joint Conference on
Natural Language Processing (Volume 1: Long
Papers), pages 5065–5075, Online. Association
for Computational Linguistics.

Mingming Yang, Rui Wang, Kehai Chen, Masao
Utiyama, Eiichiro Sumita, Min Zhang, and
Tiejun Zhao. 2019a. Sentence-level agreement
for neural machine translation. In Proceedings
of the 57th Annual Meeting of the
Association for Computational Linguistics,
pages 3076–3082.
https://doi.org/10.18653/v1/P19-1296

Runqi Yang, Jianhai Zhang, Xing Gao, Feng Ji,
and Haiqing Chen. 2019b. Simple and effective
text matching with richer alignment features.
In Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics,
pages 4699–4709, Florence, Italy. Association
for Computational Linguistics.
https://doi.org/10.18653/v1/P19-1465

Wenpeng Yin and Hinrich Schütze. 2015.
MultiGranCNN: An architecture for general
matching of text chunks on multiple levels
of granularity. In Proceedings of the 53rd
Annual Meeting of the Association for
Computational Linguistics and the 7th International
Joint Conference on Natural Language
Processing (Volume 1: Long Papers), pages 63–73,
Beijing, China. Association for Computational
Linguistics.

Sho Yokoi, Ryo Takahashi, Reina Akama, Jun
Suzuki, and Kentaro Inui. 2020. Word rotator’s
distance. In Proceedings of the 2020 Conference
on Empirical Methods in Natural Language
Processing (EMNLP), pages 2944–2960,
Online. Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020.emnlp-main.236

Mikhail Yurochkin, Sebastian Claici, Edward
Chien, Farzaneh Mirzazadeh, and Justin M.
Solomon. 2019. Hierarchical optimal transport
for document representation. In Advances
in Neural Information Processing Systems,
pages 1601–1611.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian
Q. Weinberger, and Yoav Artzi. 2020.
BERTscore: Evaluating text generation with
BERT. In International Conference on Learning
Representations.
