MENLI: Robust Evaluation Metrics from Natural Language Inference
Yanran Chen1,2 and Steffen Eger2
1Technische Universität Darmstadt, Germany
2Natural Language Learning Group (NLLG), https://nl2g.github.io/
Faculty of Technology, Bielefeld University, Germany
yanran.chen@stud.tu-darmstadt.de, steffen.eger@uni-bielefeld.de
Abstract
Recently proposed BERT-based evaluation metrics for text generation perform well on standard benchmarks but are vulnerable to adversarial attacks, e.g., relating to information correctness. We argue that this stems (in part) from the fact that they are models of semantic similarity. In contrast, we develop evaluation metrics based on Natural Language Inference (NLI), which we deem a more appropriate modeling. We design a preference-based adversarial attack framework and show that our NLI based metrics are much more robust to the attacks than the recent BERT-based metrics. On standard benchmarks, our NLI based metrics outperform existing summarization metrics, but perform below SOTA MT metrics. However, when combining existing metrics with our NLI metrics, we obtain both higher adversarial robustness (15%–30%) and higher quality metrics as measured on standard benchmarks (+5% to 30%).
1 Introduction
Proper evaluation is key to fields such as machine learning and Natural Language Processing (NLP). Evaluation is particularly challenging for natural language generation (NLG) tasks, as there may be an infinitude of correct solutions (e.g., translations or summaries) for a given source text. While human evaluation is often considered the gold standard, it is slow and costly, thus researchers resort to automatic evaluation. Previously, this was done using simple lexical overlap metrics such as BLEU and ROUGE, but these exhibit low correlations with human judgments, particularly for state-of-the-art NLG systems (Mathur et al., 2020a; Peyrard, 2019). Thus, a popular recent trend is to design automatic evaluation metrics based on large language models such as BERT and its many extensions (Zhang et al., 2020; Zhao et al., 2019; Sellam et al., 2020; Wan et al., 2022).
However, these novel metrics also have key limitations. For example, Sai et al. (2021) and Kaster et al. (2021) show that they are not robust to various adversarial attacks including lexical overlap and factuality errors. Taking the currently most popular metric—BERTScore1—as an example, this adversarial vulnerability is unsurprising. BERTScore computes the semantic similarity between a reference and a system output (the candidate), using a simplified token matching procedure. However, a good candidate is typically not appropriately identified by semantic similarity. For example, a candidate "5 Ukrainian soldiers wounded in Russia" is not an adequate translation of a source corresponding to the reference "50000 Russian soldiers killed in Ukraine", although the two texts are of course semantically very similar.2
1 Published in 2020, BERTScore has more than 1700 citations as of March 2023.
2 That semantic similarity metrics are inherently incapable of identifying this puts them at great risk of being attacked by malicious agents, with serious real-world consequences, as the metrics cannot distinguish between truthful translations and semantically similar but factually incorrect translations.
Transactions of the Association for Computational Linguistics, vol. 11, pp. 804–825, 2023. https://doi.org/10.1162/tacl_a_00576. Action Editor: Benjamin Van Durme. Submission batch: 10/2022; Revision batch: 1/2023; Published 7/2023. © 2023 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
While there have been many attempts to improve BERTScore using better token matching, e.g., using Word Mover Distance (Zhao et al., 2019; Chen et al., 2020; Colombo et al., 2021), we argue that this line of research is a dead-end, as the underlying model of semantic similarity, originally proposed to address issues of lexical variation in BLEU/ROUGE, is simply not (fully) appropriate.
An intuitively more suitable idea to model evaluation metrics is via natural language inference (NLI) (Dagan et al., 2013). For example, in reference-based settings, in which candidates are compared to human references, a candidate is intuitively good if it is equivalent to a human reference via the concept of bi-implication. NLI systems are also promising alternatives because
NLI is one of the most researched upstream tasks
in NLP, where a lot of emphasis has been placed
on concepts such as biases, generalization and
adversarial conditions (Poliak et al., 2018; Utama
et al., 2020).
In this paper, we ask whether we can directly use pre-trained NLI models as evaluation metrics, thereby establishing a new paradigm (but with predecessors, as indicated in §2). Our contributions:
• We design: a novel preference-based adversarial test suite for MT and summarization metrics. Our adversarial benchmark does not need human annotators, is suitable for reference-free (where the candidate is directly compared to the source text, without human reference) and reference-based evaluation, and is challenging: e.g., BLEU, ROUGE, MoverScore, and BERTScore perform below or at random level.
• We explore: (i) how NLI metrics can be induced from existing NLI models; (ii) how they perform on benchmark and adversarial datasets, across (iii) two NLG problems, MT and summarization.
• We show: (iv) NLI metrics perform particularly well in summarization, but below standard metrics in MT. (v) They substantially outperform existing metrics on our adversarial attacks (e.g., ∼30%–50% margin over the best unsupervised standard metric in MT). (vi) Combining existing metrics with our NLI metrics yields both better (+5%–30%) and more robust metrics (+15%–30%).
We point out that some current metrics already leverage NLI systems—thus, we claim no novelty in this respect—but indirectly and thus (we argue) inadequately: e.g., MoverScore (Zhao et al., 2019) leverages BERT representations fine-tuned on NLI. Mathur et al. (2019) train (pre-BERT) NLI-inspired architectures on MT datasets. In contrast, we show that by directly leveraging NLI systems, much better adversarial and standard benchmark performances can be obtained. We call our novel metrics MENLI (MEtrics from NLI).3
3 Code+data: http://github.com/cyr19/MENLI.
Table 1: Different paradigms for metric induction proposed in recent years.

Concept               Examples
Semantic Similarity   BERTScore, MoverScore, BaryScore, ...
Text Generation       BARTScore, PRISM (Thompson and Post, 2020)
Question Answering    QAEval (Deutsch et al., 2021)
NLI                   MENLI
2 Related Work
Our work connects to evaluation metrics and NLI.
Evaluation Metrics for NLG In the last few years, researchers have come up with a plethora of different BERT-based metrics for varying tasks and setups: e.g., for MT and summarization, reference-based trained (Sellam et al., 2020; Rei et al., 2020a) and untrained approaches (Zhao et al., 2019; Zhang et al., 2020) have been suggested, and the same is true for reference-free setups, where both supervised (Ranasinghe et al., 2020) and unsupervised metrics have been explored (Zhao et al., 2020; Song et al., 2021; Belouadi and Eger, 2023). In our work, we consider both reference-based as well as reference-free metrics. Both setups have important differences: Reference-free setups are more challenging, as they require comparing text in different languages (in MT) or of vastly different lengths (in summarization). On the other hand, they are more 'resource-efficient', take humans out-of-the-loop, and promise web-scale evaluation. Both approaches are also different in terms of NLI. For example, while reference-based approaches require equivalence between reference and hypothesis, the concept of equivalence is not always appropriate in reference-free situations (e.g., in summarization, source and summary are intuitively not equivalent; rather, the source should entail the summary).
To realize metrics, different high-level approaches have been suggested, as we outline in Table 1 (e.g., metrics from semantic similarity, from text generation or from question answering). There are also some predecessor works on metrics from NLI which we discuss below.
Robustness of Evaluation Metrics has been a
central issue of recent interest: Sai et al. (2021)
test metrics across several CheckList (Ribeiro
et al., 2020) inspired templates, finding that most
common standard metrics are not robust even to simple perturbations. Kaster et al. (2021) probe metrics in an adversarial setting with lexical overlap, finding that they can be fooled by text that has high lexical overlap but low semantic similarity (indicating that the proposed BERT-based metrics are not even good models of semantic similarity). We combine the approaches of Sai et al. (2021) and Kaster et al. (2021): While Sai et al. (2021) use human crowd-workers to evaluate robustness, Kaster et al. (2021) use a simpler preference-based setup, which does not need human annotators. We will also use the preference-based setup, but our attacks are largely inspired by Sai et al. (2021).
More recently (contemporaneously with us and after the first Arxiv submission of our work), several other papers have explored the robustness of recent evaluation metrics. For example, He et al. (2022) develop stress test suites according to potential errors arising from certain choices of metric design and the pretrained language models used, showing that metrics are biased towards their underlying models—e.g., BARTScore assigns higher scores to texts generated by the models of the metric itself.4 Karpinska et al. (2022) explore the sensitivity of MT metrics to errors of different categories (regarding semantics, syntax, and morphology) and severity, using a preference-based setting; they show that recent metrics like BERTScore dramatically outperform lexical overlap-based metrics such as BLEU and ROUGE, mostly obtaining over 95% accuracy in their experiments. Our setups and those of Karpinska et al. (2022) and He et al. (2022) are differentiated by the tasks considered, the preference specifications, the results, and the solutions proposed. Karpinska et al. (2022) only evaluate metrics for MT while we consider both MT and summarization. They design their preferences in such a way that it would seem that recent metrics are quite robust, while our more elaborate preferences expose their weak spots much better. Finally, we propose solutions (e.g., metrics from NLI) to address the lack of robustness. Like us, He et al. (2022) also consider summarization and MT. Instead of designing preferences, however, they manually introspect how metric scores change as various perturbations are introduced. In this way, they expose blind spots of metrics. As remedies, they suggest combining heterogeneous metrics to shield against varying blind spots (without performing concrete experiments)—we show that combining metrics with NLI based metrics yields additional robustness.
4 Robustness is also related to model biases. For example, Sun et al. (2022) show that BERTScore encodes social biases such as gender biases. And Deutsch et al. (2022) claim that reference-free metrics are inherently biased, which implies that they have unreasonable preferences. Our results show that many current reference-based metrics also have unreasonable preferences. Robustness checks are also related to explainability (Leiter et al., 2022; Golovneva et al., 2023) of evaluation metrics as they help to understand metric limitations.
Finally, Rony et al. (2022) develop RoMe as a robust metric in the context of semantic similarity, fluency and grammatical variability. They evaluate it on an adversarial dataset with five phenomena (entity, adjective and random word replacement; as well as text transformation and passive forms) by correlating against human judgments. Their model is a rather complicated trained metric leveraging semantic and grammatical features—we compare to it in §6.
NLI NLI is one of the core upstream tasks in the NLP community. Due to its popularity, NLI has been investigated in-depth, where researchers found that trained models often overfit to low-level statistical cues instead of learning generalizable concepts of logical relationships between sentences (Poliak et al., 2018; Gururangan et al., 2018). As a consequence, many approaches to improve generalization have been investigated (e.g., Belinkov et al., 2019; Utama et al., 2020; Zhou and Bansal, 2020). We argue that a high-quality NLI model would be an excellent candidate for an evaluation metric and explore this in this work.
Like us, Mathur et al. (2019) note the similarity of (MT) evaluation and logical equivalence via NLI. They design supervised MT metrics leveraging different pre-BERT inspired architectures, including one from the NLI community called ESIM (Chen et al., 2017) (which performs on par with an LSTM with attention in their experiments). Thus, in contrast to us, they do not leverage NLI models out-of-the-box as evaluation metrics but only fine-tune an NLI-inspired architecture on human scores from MT. MoverScore (Zhao et al., 2019) fine-tunes BERT on NLI, which leads to better metrics. Thus, they, too, use NLI only indirectly. Dušek and Kasner (2020) use NLI to evaluate hallucinations and omissions in reference-free data-to-text generation scenarios.
Table 2: Examples of our adversarial test suite taken from WMT20de. Red words indicate specific adversarial perturbations of the words in green. candadv(ref-based) builds on ref, whereas candadv(ref-free) builds on r (indicated by corresponding coloring in the first column). The preferences we query for are given in Eq. (1).

Number error
  src:                            Der bilaterale Handel wurde auf über 100 Milliarden Dollar im Jahr gesteigert.
  ref:                            Bilateral trade has increased to more than $100 billion a year.
  r (Google translation of src):  Bilateral trade has increased to over $100 billion a year.
  candpara:                       Bilateral trade has increased to more than one hundred billion dollars a year.
  candadv (ref-based):            Bilateral trade has increased to more than $814 billion a year.
  candadv (ref-free):             Bilateral trade has increased to over $478 billion a year.

Negation error
  src:                            Die Wirtschaft der Entwicklungs- und Schwellenländer wird schwach bleiben.
  ref:                            Emerging economies will remain weak.
  r (Google translation of src):  The economies of developing and emerging countries will remain weak.
  candpara:                       Emerging markets will remain weak.
  candadv (ref-based):            Emerging economies won't remain weak.
  candadv (ref-free):             The economies of developing and emerging countries won't remain weak.
They do not compare to any other metrics and do not consider NLI as a general paradigm for evaluation metrics. While the summarization community uses NLI models for consistency evaluation (Fabbri et al., 2021; Laban et al., 2022), to our knowledge, we are the first to verify the usefulness of NLI systems as general evaluation metrics against a range of strong competitors, both in standard evaluation and adversarial attack settings.
3 Adversarial Setup
Following Sai et al. (2021) and others, we consider an array of adversarial attacks on evaluation metrics—we will motivate our attacks from the perspective of errors committed by real text generation systems below. In contrast to Sai et al. (2021) and similar to the later published work of Karpinska et al. (2022), we implement a preference-based setup, which does not need human annotators. The advantages of the preference-based setup are: (i) lower cost (e.g., no annotation costs), which (ii) can be especially relevant for non-English languages (e.g., in ref-free situations for MT), and (iii) the possibility of adversarial evaluation at larger scale, yielding more robust estimates of performance. The challenge of the preference setup is to cleverly determine text pairs to compare.
In our design, we use an anchor text (either the reference ref or the source src), a paraphrase candpara of the anchor text, and an adversarial text candadv which is maximally similar to the anchor text, but contains an adversarial attack. We expect a good metric m to prefer candpara over candadv:

    ref-based: m(ref, candpara) > m(ref, candadv)
    ref-free:  m(src, ref) > m(src, candadv)        (1)

The outcome of the preferences in Eq. (1) depends on how we choose candadv and candpara, which we will describe below. In general, a challenging test suite has candadv maximally similar to ref/src, but with a key error. In contrast, candpara should be maximally dissimilar to ref/src (e.g., on surface level) but meaning-equivalent. Table 2 illustrates the general structure of our adversarial test suite.
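The preference test in Eq. (1) reduces adversarial evaluation to counting how often a metric ranks candpara above candadv. The following is a minimal sketch of that accuracy computation; metric_fn and the argument names are illustrative placeholders, not the interface of any particular metric library.

    from typing import Callable, Sequence

    def preference_accuracy(
        metric_fn: Callable[[str, str], float],  # hypothetical: metric_fn(anchor, candidate) -> score
        anchors: Sequence[str],                  # ref (ref-based) or src (ref-free)
        cands_para: Sequence[str],               # paraphrases / references acting as the "good" candidate
        cands_adv: Sequence[str],                # adversarially perturbed candidates
    ) -> float:
        """Fraction of test cases where the metric prefers candpara over candadv (Eq. 1)."""
        wins = 0
        for anchor, good, bad in zip(anchors, cands_para, cands_adv):
            if metric_fn(anchor, good) > metric_fn(anchor, bad):
                wins += 1
        return wins / len(anchors)

A metric scoring at 0.5 under this protocol is no better than random guessing between the two candidates.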
candadv To obtain candadv, we consider the following attacks (nine regarding information adequacy/correctness in candidates and three regarding text fluency), which we deem (to a large degree) representative of errors in different NLG tasks:
• Addition: We randomly add a noun after an existing one and connect them with "and". For example, "I love dogs" → "I love dogs and cats."
• Omission: We use the framework of Sai et al.
(2021) to randomly drop ∼1%–20% words
in the sentence.
• Mismatch: We consider mismatching nouns, verbs, and adjectives, which can lead to misunderstanding of an entity, an action, and the speaker's emotion, respectively. Following
Chen et al. (2021), we replace a specific word
having the POS tag of noun/verb/adjective
with another word having the same POS tag
randomly selected from our collected words
for that POS tag.
• Negation: We use the perturbation tool of
Ribeiro et al. (2020) to add/remove negations
to/from the verb for generating candadv with
contrary claims.
• Number error: We replace all numbers (except for those related to dates) in the sentence with random numbers in the same format (e.g., integer to integer, decimal to decimal); a sketch of this perturbation is given after this list.
• Pronoun error: We replace all pronouns in the sentence with other ones without causing syntax errors (e.g., "he" to "she" and "us" to "them").
• Name error: We use the tool of Ribeiro
et al. (2020) to replace exactly one name
with a random one of the same gender.
• Fluency: We also include three phenomena from Sai et al. (2021) to examine metrics' robustness against attacks on text fluency: (i) Jumbling word order: randomly shuffle the word order in a sentence. (ii) Spelling error: add a typo to a word in a sentence. (iii) Subject-verb disagreement: make the subject and verb disagree (e.g., "He like dogs.").
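As an illustration of how such perturbation templates can be implemented, the sketch below applies the number-error attack. It is a simplified stand-in for the actual test-suite code (e.g., it does not exclude date-related numbers), and the function name is hypothetical.

    import random
    import re

    def perturb_numbers(sentence: str, seed: int = 0) -> str:
        """Replace each number with a random one of the same format: integers stay
        integers of similar magnitude, decimals keep their number of decimal places."""
        rng = random.Random(seed)

        def repl(match: re.Match) -> str:
            token = match.group(0)
            if "." in token:                                  # decimal -> random decimal, same precision
                decimals = len(token.split(".")[1])
                return f"{rng.uniform(1, 1000):.{decimals}f}"
            return str(rng.randint(1, 10 ** len(token)))      # integer -> random integer
        return re.sub(r"\d+(?:\.\d+)?", repl, sentence)

    # e.g., "Bilateral trade has increased to more than $100 billion a year."
    # ->    "Bilateral trade has increased to more than $814 billion a year." (value varies with the seed)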
For ref-based metrics, we apply the perturbation templates to ref to construct candadv. In contrast, for ref-free MT metrics, we first translate the source src using Google Translate into a translation r and then perturb r to obtain candadv. We introduce r to increase the similarity of candadv to src; e.g., we assume that Google Translate translates more literally, i.e., closer to word-by-word translations, than human translators. This may be important to construct challenging test cases, cf. §6 and our above discussion. For ref-free summarization, we apply the perturbation templates to a document r which is maximally similar to src; details follow.
candpara We use different ways to obtain
candpara, because different kinds of paraphrases
may yield more/less difficult test cases for metrics.
We will analyze this in §6.
In particular, we use data from (1) PAWS (Zhang et al., 2019), (2) PAWS-X (Yang et al., 2019), (3) WMT20-news-commentary-v15 German-to-English (Mathur et al., 2020b) to generate candpara for MT evaluation metrics, and (4) SummEval for summarization metrics. A summary with attributes is shown in Table 3.

Table 3: Adversarial datasets. "Yes/no" indicates whether the dataset supports ref-based/free adversarial evaluation. "ORI/BACK" denotes whether candpara (except for number error) is from the original datasets or backtranslation. "#examples" refers to the avg. number of examples per phenomenon. XPAWSx denotes XPAWSde/fr/zh/ja.

dataset    task  ref-based  ref-free  candpara  #examples
PAWSori    MT    yes        no        ORI       2,000
PAWSback   MT    yes        no        BACK      2,000
XPAWSx     MT    yes        yes       ORI       455–474
WMT20de    MT    yes        yes       BACK      200
SEadv      SUM   yes        yes       BACK      199
(1) PAWS contains sentence pairs created by
word swapping and backtranslation, labeled as
(non-)paraphrases by human raters. From sen-
tence pairs labeled as paraphrase, we derive two
datasets for ref-based evaluation metrics:
• PAWSori: We take the first sentence of a
PAWS sentence pair as ref and the second as
candpara.
• PAWSback: We use the first sentence of a PAWS sentence pair as ref and generate candpara based on ref using backtranslation (we use German as the pivot language), except for number error, for which we replace the numbers in ref with the corresponding words, using the Python library num2words (see the sketch after this list).
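The number-to-word replacement mentioned above can be done with a thin wrapper around num2words; the following sketch is illustrative (the wrapper name is ours, and only plain integers are handled).

    import re
    from num2words import num2words

    def numbers_to_words(reference: str) -> str:
        """Build candpara for the number-error phenomenon by spelling out each integer
        (a sketch; decimals and date-related numbers would need extra care)."""
        return re.sub(r"\d+", lambda m: num2words(int(m.group(0))), reference)

    # numbers_to_words("Bilateral trade has increased to more than 100 billion dollars a year.")
    # -> "Bilateral trade has increased to more than one hundred billion dollars a year."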
(2) PAWS-X is the multilingual version of PAWS, which includes PAWS sentence pairs in six languages, translated from English PAWS, allowing us to generate test suites for both ref-free and ref-based metrics. We use the first sentence in PAWS-X (e.g., German) as src and the corresponding first sentence (same ID) in English PAWS as ref. We select the data for two closer language pairs, German-to-English and French-to-English, and two more distant language pairs, Chinese-to-English and Japanese-to-English. Accordingly, we create 4 datasets: XPAWSde, XPAWSfr, XPAWSzh, and XPAWSja, each of which contains src (first sentence of the X-PAWS pair in the source language), ref (first sentence of the English PAWS pair), and candpara (second sentence of the English PAWS pair).
Table 4: Examples of errors in WMT MQM annotations for Chinese-to-English and English-to-German. Red texts are the annotated errors ("[…]" indicates the missing translation) and the green texts in the bracket refer to a more correct translation accordingly; the green texts in source sentences denote the parts being mistranslated or omitted.

  Mismatch/verb (zh-en)
    MT hypothesis: Pay attention to (Follow) Suning.com service account
  Mismatch/adj. (zh-en)
    MT hypothesis: Not bad, the picture quality of playing games is really fragrant (good)
  Pronoun/Addition (zh-en)
    MT hypothesis: Bought it for his (my) son, he said it was good.
  Name (zh-en)
    MT hypothesis: On the same day, US Secretary of Transportation Zhao Xiaolan (Elaine Lan Chao), US Congressman Meng Zhaowen (Grace Meng) and Dong Jiling (Chiling Tong), founding president of the International Leaders Base, spoke at the meeting respectively.
  Omission (en-de)
    Source: I'll review your account, one moment, please.
    MT hypothesis: Ich werde Ihr Konto […] (überprüfen), einen Moment bitte.
  Mismatch/noun (en-de)
    Source: "Listen, I don't want to make my people mad," she said.
    MT hypothesis: „Hör zu, ich will mein Volk (meine Leute) nicht verrückt machen", sagte sie.
  Pronoun (en-de)
    Source: Williams wasn't the only one who received a fine at this year's Wimbledon, though hers was the most costly.
    MT hypothesis: Williams war nicht die einzige, die beim diesjährigen Wimbledon eine Geldstrafe erhielt, obwohl sie (ihre) die teuerste war.
(3) WMT20-news-commentary-v15 contains
sentence pairs of source and human reference.
From this, we create WMT20de, directly taking
the source and reference sentences as src and ref.
We obtain candpara as in the case of PAWSback.
(4) SummEval (Fabbri et al., 2021) contains documents and references from CNN DailyMail (CNNDM) (Hermann et al., 2015), with 10 additional human references. We rank the 11 references using ROUGE-L (Lin, 2004) and use the reference r with the highest ROUGE score to generate candadv for the ref-free setting, while the remaining 10 references serve as ref. We refer to the adversarial dataset induced from SummEval as SEadv in the remainder. We obtain candpara as in the case of PAWSback.5
5 As we generate our adversarial test instances fully automatically from backtranslation or automatic tools, they may contain some errors (including upper-/lower-case). For example, we note that in candpara, ". . . billion dollars" is sometimes incorrectly formulated as ". . . dollars billion"; however, such cases occur only in ∼1% of all test cases for number error, which we argue is still an acceptable noise level.
Real-world Motivation of Attacks Modern text generation systems are prone to many of the errors we investigate in this work. For example, Freitag et al. (2021a,b, 2022) show, based on fine-grained human error annotations
(Lommel et al., 2014), that translations generated by state-of-the-art MT models still contain many accuracy-related errors (e.g., addition and omission of information, inappropriately informal pronouns) and sometimes even fluency-related errors (e.g., wrong spelling). Negation handling is also frequently discussed as an issue of modern MT systems (Bentivogli et al., 2016; Sennrich, 2017; Hossain et al., 2020; Tang et al., 2021). In summarization, system summaries are often factually inconsistent with source documents in terms of numbers, named entities, assigning quotations to a particular person, etc. (Falke et al., 2019; Kryscinski et al., 2020; Chen et al., 2021). More generally, hallucination (of which addition/mismatches/etc. may be considered special cases) is a particularly worrisome limitation of recent large language models (Ji et al., 2022). In Table 4, we show selected system translations from real MT systems with specific errors (following WMT MQM annotations) that are very similar to the ones we consider.6 The frequency of errors may differ for various source-target language pairs (e.g., depending on their language distance) and formal/informal context.
6 https://github.com/google/wmt-mqm-human-evaluation.
Table 5: Evaluation metrics explored in this work.

Task            Setup       Metrics
MT              ref-based   MoverScore (Zhao et al., 2019), BERTScore (Zhang et al., 2020), BARTScore (Yuan et al., 2021), SentSim (Song et al., 2021), COMET (Rei et al., 2020b), BLEURT (Sellam et al., 2020)
MT              ref-free    COMET, SentSim, XMoverScore (Zhao et al., 2020)
Summarization   ref-based   BARTScore, DiscoScore (Zhao et al., 2023), MoverScore, BERTScore
Summarization   ref-free    BARTScore, SUPERT (Gao et al., 2020)
Table 6: We use the to-English language pairs in the WMT15-17 datasets (Stanojević et al., 2015; Bojar et al., 2016, 2017). In segment-level evaluation on WMT20-21 (Mathur et al., 2020b; Freitag et al., 2021a,b), we use the data with MQM scores for zh-en, while in system-level evaluation, we correlate the metrics with DA scores for all to-English language pairs. The datasets for system-level evaluation before WMT20 are skipped, as all metrics mostly get very high correlations on them.

Task           Level          Datasets
MT             segment-level  WMT15-17, WMT20-21
MT             system-level   WMT20-21
MT             adversary      ref-based: PAWSori/back, WMT20de, XPAWSde; ref-free: XPAWSde/fr/zh/ja, WMT20de
Summarization  summary-level  RealSumm (Bhandari et al., 2020)
Summarization  system-level   RealSumm, SummEval
Summarization  adversary      SEadv, Rank19 (Falke et al., 2019) (ref-free only)
For example, when translating Chinese to English for news, the names are often directly translated to their Pinyin format (see the 4th row of Table 4) instead of the official translations; in contrast, this rarely happens in English-to-German translations. But even for such closely related languages, NLG systems may omit information, or choose wrong pronouns or mismatching nouns, particularly when a word has multiple senses.
4 Experimental Setup
4.1 Evaluation Metrics
We explore a large array of recent state-of-the-art transformer-based metrics, summarized in Table 5. The variants used are briefly introduced below; further details (e.g., model checkpoints and implementation) can be found on our Github.
We report BERTScore F1 employing a RoBERTa-large model. For MoverScore, we use the unigram variant with a BERT-base model fine-tuned on MNLI (Williams et al., 2018). We use two variants of BARTScore (Precision and F1) for ref-based MT and summarization and BARTScore-FN (FN stands for Faithfulness) for ref-free summarization. We consider two variants of XMoverScore with different remapping strategies for multilingual embeddings (CLP, UMD) and two variants of SentSim with different word matching paradigms (BERTScore, WMD). We report the DiscoScore variant with the feature 'Focus Frequency'.
4.2 Datasets & Evaluation Protocol
We summarize our used datasets in Table 6. To evaluate the metrics' robustness under adversarial conditions, we use the datasets introduced in §3 and additionally Rank19 (Falke et al., 2019) (only for ref-free summarization), which contains examples composed of documents paired with one correct and one incorrect candidate summary with real-world factuality errors. In general, we check the metrics' preference between the two candidates and calculate accuracy: the relative frequency with which the metrics correctly choose among the two alternatives.
On MT standard benchmarks, we evaluate the metrics at both segment level (where we correlate metric scores with human judgments for individual sentences/segments in the datasets) and system level (where we correlate the average metric scores with the average human scores over the segments generated by each system), using Pearson correlation as the performance indicator. On SummEval for summarization, we compute Kendall correlation with system-level human judgements on four criteria: coherence, consistency, fluency and relevance (we apply two aggregation methods for the multi-reference setting, max and mean). We calculate Pearson correlation with both summary-level (analogous to
segment-level in MT) and system-level LitePyra-
mids (Shapira et al., 2019) human ratings in
RealSumm.
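A minimal sketch of these two correlation levels is given below, assuming scipy and NumPy are available; the function names and input shapes are illustrative rather than taken from our released code.

    import numpy as np
    from scipy.stats import kendalltau, pearsonr

    def segment_level_pearson(metric_scores, human_scores):
        """Segment-level: correlate per-segment metric scores with per-segment human judgments."""
        return pearsonr(metric_scores, human_scores)[0]

    def system_level_pearson(metric_scores_per_system, human_scores_per_system):
        """System-level: correlate each system's average metric score with its average human score."""
        metric_means = [np.mean(s) for s in metric_scores_per_system]
        human_means = [np.mean(s) for s in human_scores_per_system]
        return pearsonr(metric_means, human_means)[0]

    def system_level_kendall(metric_means, human_means):
        """Kendall correlation, as used for the system-level SummEval criteria."""
        return kendalltau(metric_means, human_means)[0]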
4.3 NLI as a Metric
NLI systems yield probability distributions over Entailment, Contradiction, and Neutral. We denote the probability values as e, c, and n, where e + c + n = 1 and e, c, n ≥ 0. We first determine how to leverage the three values as NLI metrics.
To do so, we evaluate five simple formulas of their arithmetic combination in a heuristic way: (1) e, (2) -c, (3) e-n, (4) e-c, and (5) e-n-2c, and inspect their effect in three directions, which correspond to the entailment directions implication, reverse implication and bi-implication: (i) ref/src → cand, where ref or src acts as premise and cand as hypothesis; (ii) ref/src ← cand, where cand acts as premise and ref or src as hypothesis; and (iii) ref/src ↔ cand, as the arithmetic average over the two above cases.
For example, to obtain e-n from ref/src ↔ cand, we first average the three probability scores over the directions ref/src → cand and ref/src ← cand, then calculate e-n based on the averaged scores. We only consider the direction src → cand for ref-free summarization, since the hypothesis does not need to entail the source document. The various selections of formulas and directions result in 15 pooling strategies for NLI-based metrics.
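The 15 pooling strategies amount to a small amount of arithmetic over the probabilities of the two directions; a sketch with hypothetical function and argument names is given below.

    def pool_nli_scores(e_fwd, c_fwd, n_fwd, e_bwd=None, c_bwd=None, n_bwd=None,
                        formula="e", direction="bi"):
        """Turn NLI probabilities into a metric score.

        e/c/n_fwd are P(entailment/contradiction/neutral) for ref/src -> cand,
        e/c/n_bwd for the reverse direction cand -> ref/src (only needed for 'bwd'/'bi').
        Covers the 5 formulas x 3 directions = 15 pooling strategies."""
        if direction == "fwd":
            e, c, n = e_fwd, c_fwd, n_fwd
        elif direction == "bwd":
            e, c, n = e_bwd, c_bwd, n_bwd
        elif direction == "bi":  # average the probabilities over both directions, then apply the formula
            e, c, n = (e_fwd + e_bwd) / 2, (c_fwd + c_bwd) / 2, (n_fwd + n_bwd) / 2
        else:
            raise ValueError(direction)

        formulas = {
            "e": e,
            "-c": -c,
            "e-n": e - n,
            "e-c": e - c,
            "e-n-2c": e - n - 2 * c,
        }
        return formulas[formula]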
NLI Systems We explore both monolingual and cross-lingual NLI-based metrics. For each setup, we choose two NLI models, which are obtained from Hugging Face or fine-tuned by ourselves.
For monolingual NLI metrics, we choose (1) a RoBERTa-large model (Liu et al., 2019) fine-tuned on SNLI (Bowman et al., 2015), MNLI, Fever (Nie et al., 2019) and ANLI (Nie et al., 2020) by Nie et al. (2020) and (2) a DeBERTa-large model fine-tuned by He et al. (2021), using MNLI. We denote the NLI metrics induced from these two models as NLI-R and NLI-D. They will be used for ref-based MT evaluation, and for both ref-based and ref-free summarization evaluation tasks. Note that, while NLI-R has been fine-tuned on adversarial NLI (ANLI), which has been shown to increase robustness on (for example) negation and numerical reasoning, NLI-D has not been trained on ANLI. Cross-lingual NLI metrics must handle premises and hypotheses in different languages, so we select the multilingual versions of the underlying models of NLI-R/NLI-D.
Table 7: Winning frequency of different pooling strategies for NLI metrics on adversarial (first entry) and MT datasets (second entry). We only show non-zero entries.
(a) Reference-based (rows: ref→cand, ref←cand, ref↔cand; columns: e, -c, e-n, e-c, e-n-2c); non-zero entries: e: 3+0, 0+4; -c: 3+0; e-c: 2+0; e-n-2c: 0+3, 0+1, 0+2.
(b) Reference-free (rows: src→cand, src←cand, src↔cand; columns: e, -c, e-n, e-c, e-n-2c); non-zero entries: e: 0+1, 0+1; -c: 2+0; e-n-2c: 0+2, 4+6, 4+0.
(1) We fine-tune an XLM-RoBERTa-base model (Conneau et al., 2019), using the datasets for fine-tuning NLI-R as well as the XNLI dataset (Conneau et al., 2018). (2) We select an mDeBERTa-base model fine-tuned on MNLI and XNLI. We denote the corresponding cross-lingual NLI metrics as XNLI-R and XNLI-D.
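Obtaining e, c, and n from such models is straightforward with the Hugging Face transformers library. The sketch below uses a placeholder checkpoint name (not the exact checkpoints listed above) and reads the label order from the model config instead of hard-coding it.

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Hypothetical checkpoint name; any NLI classifier with entailment/neutral/contradiction
    # labels can be plugged in here.
    MODEL_NAME = "some-org/roberta-large-nli"

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
    model.eval()

    def nli_probabilities(premise: str, hypothesis: str) -> dict:
        """Return the label probabilities for one ordered (premise, hypothesis) pair."""
        inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
        with torch.no_grad():
            probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze(0)
        return {model.config.id2label[i].lower(): probs[i].item() for i in range(probs.shape[0])}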
5 Experiment Results
Before outlining our main results in §5.1 (MT)
and §5.2 (summarization), we first discuss good
pooling strategies for NLI metrics.
Pooling Strategy We determine the pooling
strategy for NLI metrics in MT evaluation from
(1) the accuracy on the adversarial datasets and
(2) the correlation with human judgements on
the standard (segment-level) MT datasets. We leverage the winning frequency of the pooling strategies to choose the best one; a strategy wins if it works best for an NLI metric among all 15 strategies. Overall, we find that the simple formula e from the direction src/ref↔cand is a good choice which works well for both standard and adversarial benchmarks, even though slightly better formulas could be chosen in selected subsettings (e.g., ref-based vs. ref-free evaluation); see Table 7 for examples.
For summarization, the situation is slightly more complex: (1) e-c from direction ref←cand performs best for ref-based NLI metrics; (2) -c from direction src→cand is the best strategy for ref-free NLI metrics.
Table 8: Pearson correlation with human judgments in WMT and accuracy (%) on our adversarial datasets, averaged over datasets. The performance of ref-based COMET is averaged over WMT20de and XPAWSde, since it also requires source texts as input. In bold: best results among all unsupervised metrics including the NLI-based metrics.

                  Adv. (ref-based)   Adv. (ref-free)   MT (ref-based)   MT (ref-free)
                  all     adeq.      all     adeq.     seg     sys      seg     sys
Supervised
  COMET           67.4    67.0       76.8    74.5      0.676   0.808    0.620   0.698
  BLEURT          74.8    79.8       –       –         0.708   0.807    –       –
Unsupervised
  sentBLEU        32.9    27.2       –       –         0.380   0.757    –       –
  Rouge           34.3    28.7       –       –         0.425   0.774    –       –
  MoverScore      48.3    46.9       –       –         0.567   0.806    –       –
  XMoverS(UMD)    –       –          74.5    71.7      –       –        0.400   0.672
  XMoverS(CLP)    –       –          73.8    70.9      –       –        0.422   0.673
  BERTS           65.3    60.9       –       –         0.620   0.799    –       –
  BARTS-P         67.4    64.2       –       –         0.587   0.761    –       –
  BARTS-F         78.4    77.8       –       –         0.593   0.802    –       –
  SentS(BERTS)    68.1    67.8       62.7    65.5      0.612   0.401    0.421   −0.021
  SentS(WMD)      62.1    61.9       63.0    65.8      0.607   0.427    –       –
NLI-based
  X(NLI)-R        85.0    92.1       70.5    75.8      0.451   0.756    0.221   0.335
  X(NLI)-D        86.6    92.3       79.3    85.8      0.439   0.770    0.149   0.581
Thus, we compare NLI metrics adopting these strategies with classic metrics.
Even though we only looked at global aggregate statistics, we still observe that our method of identifying the pooling strategies above leveraged the data on which we will later evaluate the NLI metrics. To avoid leaking information from the test set, we evaluate NLI metrics on each dataset with the pooling strategy selected from the remaining datasets for that task in §6.
5.1 Machine Translation
5.1.1 Adversarial Evaluation
We now compare our NLI metrics with the best
pooling strategy to our baseline metrics under
adversarial conditions.
From Table 8 (columns "Adv."), we observe that in the ref-based setup: (1) NLI metrics outperform the great majority of metrics by a huge margin: over 85% vs. 32%–78% (all phenomena) and 92% vs. 27%–80% (adequacy phenomena only) on average. (2) Further, the two NLI metrics perform similarly. In the ref-free setup, the best cross-lingual NLI metric (XNLI-D) is still the most robust under our attacks. However, NLI metrics do not outperform the other metrics as substantially as in the ref-based setup. A potential reason is that the cross-lingual NLI models underperform compared to the monolingual setup (the preferences we query for in the reference-free setup may also play a role). Nevertheless, when excluding the fluency-related phenomena from the adversarial datasets, XNLI-D is still on average 10 points better than the best standard metric, COMET (86% vs. 75%).
Moreover, our results reveal that: (1) most standard metrics are particularly incapable of detecting name error, number error, and pronoun error (∼29%–70%); (2) standard metrics, especially BLEURT and COMET, are most competitive regarding omission, addition, and jumbling (∼80%–100%); (3) NLI metrics are suboptimal for fluency attacks (mostly at random level), especially the reference-free NLI metrics; and (4) NLI metrics are much better at name error, negation, number error, pronoun error, and adj. mismatch than most of the other metrics, especially ref-based (>90% vs. ∼10%–80%), as shown in Figure 1.
Our observations are inconsistent with Karpinska et al. (2022), where the state-of-the-art MT metrics mostly obtain >95% accuracy in the preference-based evaluation. The reason is that our test suites are much more difficult for the evaluation metrics because we challenge them with lexical overlap between source/reference and candidate sentences during attacks: Metrics must choose between high lexical overlap adversarial candidates (with key errors) and low lexical overlap paraphrases. In contrast, in Karpinska et al. (2022), metrics are challenged to assign correct preferences for score(ref, t) vs. score(ref, t′), where t is a candidate and t′ the perturbed candidate. This is a much easier comparison because neither are ref and t maximally dissimilar (but meaning equivalent) nor are ref and t′ maximally similar. This is an important lesson: How we design the adversarial preferences may critically affect the assessment of whether recent metrics are robust or not.
5.1.2 Standard Benchmarks
Ref-based We give average results over all
datasets in Table 8 (columns ‘MT’;
individ-
ual results are available in our Github). Para
segment-level evaluation, we observe: (1) entrenado
métrica (COMET and BLEURT) sustancialmente
outperform the others, with average performance
Cifra 1: Average accuracy (values in each block) of all metrics per phenomenon over the adversarial datasets for
ref-based MT evaluation. Darker color indicates higher accuracy and vice versa.
(2) Unsupervised SOTA metrics have an average correlation of ∼0.6 Pearson; BERTScore is the best among them. (3) Our NLI-based metrics are not competitive, with correlations of ∼0.45 Pearson. When correlating with system-level human judgments, NLI metrics still underperform most of the SOTA metrics, but the margin is much smaller.
Ref-free Trained metrics also dominate in segment-level evaluation (>0.6 Pearson), whereas the two NLI-based metrics perform much worse than the others (0.15–0.22 Pearson). However, XNLI-D performs on par with COMET and better than the others on WMT20 at system level.
Overall, we conclude that our NLI metrics are not competitive with state-of-the-art evaluation metrics on standard MT datasets, especially at segment level and ref-free.
5.1.3 Combined Metrics
Observing that NLI metrics are strong in adversarial setups, but comparatively weaker in standard evaluation, we examine how to get more robust metrics which also perform well on standard benchmarks. To do so, we take the weighted average of NLI and classical metrics:

    C = wnli · N + (1 − wnli) · M        (2)

where wnli ∈ [0, 1] is the weight for the NLI metric N and M is a classical metric. Before combination, we rescale M and N to [0, 1], using min-max normalization.
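A minimal sketch of Eq. (2) with min-max normalization is given below (NumPy assumed; the function names are ours, and the default weight of 0.2 merely reflects one of the well-performing small weights discussed later).

    import numpy as np

    def min_max(scores):
        """Rescale a batch of metric scores to [0, 1]; assumes at least two distinct values
        (min-max normalization requires batch processing, see 'Rescaling' in §6)."""
        scores = np.asarray(scores, dtype=float)
        return (scores - scores.min()) / (scores.max() - scores.min())

    def combine(nli_scores, standard_scores, w_nli=0.2):
        """C = w_nli * N + (1 - w_nli) * M, after min-max normalizing both metrics (Eq. 2)."""
        return w_nli * min_max(nli_scores) + (1 - w_nli) * min_max(standard_scores)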
We illustrate the performance of the combined evaluation metrics with (X)NLI-R on both adversarial and standard benchmarks (segment-level) in Figure 2; the results for (X)NLI-D and for system-level are similar. The x-axis denotes the average accuracy over the adversarial datasets, while the y-axis is the average Pearson correlation over the standard benchmarks (MT datasets). Each dot in each graph shows the value C(wnli) for a specific weight wnli. As seen from Figure 2, the graphs show an intriguing concave curvature.
Cifra 2: Accuracy on adversarial datasets and Pearson
correlation with segment-level human judgements in
WMT datasets of combined metrics with (X)NLI-R,
averaged over datasets. The points on each path from
the original metric to the NLI metric indicate wnli =
0, 0.1, . . . , 1. The purple line denoting the combination
with ref-based COMET ends at another point since
the corresponding adversarial performance is averaged
over the 2 adversarial datasets containing source texts.
In standard MT evaluation, the combination boosts the metric performance when wnli is small (from 0.1 to 0.4) in virtually all cases. We then see a simultaneous
increase of adversarial robustness and quality on standard benchmarks. In the ref-based setup, e.g., for wnli = 0.2, we observe: (1) MoverScore and BARTScore-P improve most, with ∼8% (from 0.57/0.59 to 0.61/0.64 Pearson, respectively) and 21%–36% improvements on adversarial datasets (from 48%/67% to 66%/82% accuracy on average). (2) The best unsupervised metric on segment-level MT, BERTScore, increases ∼4% Pearson on standard benchmarks and ∼24% accuracy on adversarial datasets. (3) The most robust untrained metric, BARTScore-F, improves about ∼11% in robustness, whereas its performance on standard benchmarks also rises ∼5%. (4) The improvements on MT for trained metrics are smaller compared to those for untrained metrics, with COMET improving only 1.5% and BLEURT even becoming worse with the choice wnli = 0.2. Nevertheless, their performance in defending against adversarial attacks still improves ∼10%–20%. In ref-free setups, all metrics improve ∼6%–7% on adversarial datasets. Such a setting only substantially boosts XMoverScore's performance on standard benchmarks, with ∼6%–9%.
Figure 3: Improvements of all metrics on standard benchmarks and adversarial datasets for wnli = 0.1, ..., 0.9, averaged over all experiments. We show 95% confidence intervals.
We summarize the improvements for all combinations in Figure 3(a), which are averages over all experiments considered here. We can observe that the line denoting improvements on standard
benchmarks peaks at wnli = 0.2, and the average
improvements are positive when wnli ≤ 0.5. Fur-
ther, on the adversarial datasets, the improvement
monotonically increases with wnli and the gain is a
concave function of wnli which saturates as wnli be-
comes larger. The sweet spots are wnli ∈ [0.2, 0.3],
which leads to 5%–6% improvement on stan-
dard benchmarks and 14%–16% improvement in
adversarial robustness on average. When exclud-
ing the fluency phenomena from the adversarial
datasets, the combined metrics consistently gain
larger improvements in adversarial robustness,
with 20%–24% improvements at the sweet spots.
5.2 Summarization
Evaluation As Table 9 shows, similar to MT evaluation, NLI-based metrics exhibit much stronger robustness under adversarial conditions (our best NLI metrics have at least ∼8 points higher accuracy than the best standard metrics; right-most columns). The difference is that the vanilla NLI metrics are now also comparably effective to the SOTA metrics on standard benchmarks.
Table 9: Kendall correlation with system-level human judgments in SummEval. Pearson correlation with summary-/system-level litePyramid in RealSumm. Accuracy on adversarial benchmarks, averaged over phenomena in SEadv. We bold the best performance on each criterion. "max/mean" denotes the aggregation method used for the multi-reference setting in ref-based evaluation on SummEval.
(a) Reference-based: rows BLEU, Rouge, MoverS, BERTS, BARTS-F, BARTS-P, DiscoS, NLI-R, NLI-D; columns: SummEval coherence, consistency, fluency and relevance (mean and max) with their average, RealSumm litePyramid (summary- and system-level), and SEadv accuracy (all and adequacy-only phenomena).
(b) Reference-free: rows BARTS-FN, SUPERT, NLI-R, NLI-D; columns as in (a), plus Rank19 and an adversarial average. SummEval Kendall correlations: BARTS-FN 0.735 (coherence), 0.132 (consistency), 0.391 (fluency), 0.662 (relevance), 0.480 (avg); SUPERT 0.147, 0.603, 0.465, 0.279, 0.374; NLI-R 0.221, 0.235, 0.391, 0.500, 0.337; NLI-D 0.162, 0.647, 0.332, 0.324, 0.366.
For example, in the ref-based setup, NLI-D with max aggregation beats all metrics except for DiscoScore with mean on SummEval, and both NLI metrics highly correlate with system-level human ratings in RealSumm (above 0.8 Pearson), where most standard metrics obtain only 0.5–0.7 Pearson correlations. When considering all evaluation dimensions of SummEval and RealSumm, NLI-D outperforms all other metrics, followed by NLI-R. Besides, we observe that NLI metrics correlate much better with human judgments regarding consistency and (somewhat surprisingly) fluency in SummEval compared to the other metrics. For the ref-free setup, BARTScore-FN performs best on SummEval—it outperforms the other metrics by up to 0.1 Kendall on average. However, it does not correlate well with either summary-level or system-level human judgments in RealSumm. NLI metrics are comparable to or better than standard metrics at system level. For example, NLI-R performs best among the examined metrics and is about 0.06 Pearson better than the best standard metric (SUPERT) on system-level in RealSumm. However, reference-free NLI metrics also perform worse than the reference-based ones, as in MT; an explicit bottleneck for the two NLI metrics is that they were only trained on NLI data with short sentences, but reference-free summarization evaluation requires metrics to deal with source documents which contain many more sentences.
Combined Metrics In Figure 3(b), we summarize the median improvements of combined summarization metrics (the median smooths some outliers). In contrast to MT, the combination brings almost equal benefits to the performance of standard metrics on standard and adversarial benchmarks concerning only adequacy—we again observe a decrease in improvements on adversarial datasets when adding our fluency phenomena. We identify a best wnli, namely 0.8, with which the standard metrics gain about 25%–30% improvements in both types of performance (adversarial and standard).
6 Discussion & Analysis
Selected Failure Cases of Metrics: Table 10 shows selected failure cases of four popular metrics (BERTScore, BARTScore, BLEURT, COMET), where the NLI metrics are correct in each case.
Table 10: Sample instances in adversarial datasets where standard metrics failed while NLI-R succeeded; ref-based setup. For each example, we show [score assigned to candpara]: [score assigned to candadv] by the standard metric and by NLI-R; robust metrics should give candpara higher scores. Green bold texts indicate the anchor words/phrases to be perturbed and the red ones in candadv refer to the corresponding perturbed texts.

BERTScore (error: Pronoun)
  ref:      Although President George W. Bush says he believes in markets, in this case he has called for voluntary action.
  candpara: Although President George W. Bush says he believes in markets, he has demanded voluntary action in this case.
  candadv:  Although President George W. Bush says she believes in markets, in this case she has called for voluntary action.
  standard metric: 0.980: 0.982    NLI-R: 0.951: 0.000

BARTScore-F (error: Name)
  ref:      Reagan and I were nonetheless able to create a reservoir of constructive spirit through constant outreach and face-to-face interaction.
  candpara: However, Reagan and I were able to create a constructive climate through constant contact and personal interaction.
  candadv:  Nicole and I were nonetheless able to create a reservoir of constructive spirit through constant outreach and face-to-face interaction.
  standard metric: −2.104: −1.527    NLI-R: 0.943: 0.002

BLEURT (error: Name)
  ref:      In 2012, when Freedom House downgraded Mali to "not free," engagement declined by 7%.
  candpara: In 2012, when Freedom House classified Mali as unfree, the engagement fell by 7 percent.
  candadv:  In 2012, when Freedom House downgraded Melissa to "not free," engagement declined by 7%.
  standard metric: 0.787: 0.834    NLI-R: 0.983: 0.030

BLEURT (error: Num)
  ref:      This leads to heavy deforestation and lethal indoor air pollution, which kills 1.3 million people each year.
  candpara: This leads to heavy deforestation and lethal indoor air pollution, which kills one point three million people each year.
  candadv:  This leads to heavy deforestation and lethal indoor air pollution, which kills 6.9 million people each year.
  standard metric: 0.682: 0.767    NLI-R: 0.783: 0.000

COMET (error: Negation)
  ref:      Who serves as president of the United States is critically important for Mexicans.
  candpara: Anyone who serves as President of the United States is crucial to Mexicans.
  candadv:  Who serves as president of the United States is not critically important for Mexicans.
  standard metric: 1.067: 1.086    NLI-R: 0.974: 0.044
In the examples, BERTScore prefers text with the wrong gendered pronoun over a legitimate paraphrase, and even trained metrics like BLEURT fail on severe name changes such as "Melissa" (a person name) vs. "Mali" (a country name). Leveraging more subtle cases (e.g., mismatches based on wrong word senses instead of random mismatches with the same POS, or replacing names with names of the same 'type') would likely constitute even harder test cases for future metrics.
No Metric is Good Everywhere: Across distinct dimensions, different metrics perform differently, indicating that they capture varying aspects. For example, NLI metrics are not so good on fluency adversarial attacks, e.g., typos. This may be unsurprising, given that fluency is a low-level phenomenon while NLI concerns high-level logical relationships between sentences (some fluency phenomena would best be treated by switching to a lower-level representation space, such as character level [Vu et al., 2022]; this could seamlessly be integrated in existing NLI models). The NLI metrics are also weaker concerning segment-level MT evaluation on standard benchmarks. However, NLI metrics alone perform surprisingly well: In ref-based MT, they win on 7 out of 19 dimensions (12 adversarial phenomena and 7 standard datasets, evaluated at segment and system level), only beaten by BLEURT (8 wins); ref-free, they win 5 out of 19 dimensions, second only to COMET (11 wins). In ref-based summarization, they are clearly ahead of all standard metrics, winning not only 8 out of 12 adversarial dimensions, but also system-level LitePyramid, consistency and fluency (thus, 11 out of 18 wins), clearly ahead of BARTScore-P (4 out of 18); ref-free, they are also best and win 13 out of 18 dimensions. The best overall metrics, measured as average performance over standard and adversarial datasets, always include NLI: for ref-based MT,
this is BLEURT+0.2×NLI-R, for ref-free MT, it is COMET+0.3×NLI-D. For summarization, NLI-R alone and combined with BARTScore-F perform best on average.
Rescaling: The min-max normalization we used for metric combination (a standard technique for normalizing data in machine learning, typically applied to input features) requires batch processing. It is necessary to account for the different ranges of metrics, e.g., some metrics take negative values. An alternative would be to enforce more formal constraints on evaluation metrics, i.e., that they should take outputs in [0,1]. When applying our combined metrics in practice, one could also replace them by surrogate metrics trained on the outputs of the original combined metrics, or simply take the min-max values inferred from the datasets already evaluated on—the larger these datasets, the more reliably min and max are estimated.
Sensitivity to wnli: Having different weights wnli for different tasks is undesirable, because it requires considering each task individually. However, in our experiments, we found that all small wnli (below 0.5) yield good performances and are thus safe choices: They increase adversarial robustness and also lead to better metrics on standard benchmarks.
Adversarial Performance vs. Standard Performance: From our experiments, it might seem that adversarial and standard performance are anti-correlated: A metric with higher adversarial performance may have lower performance on standard benchmarks and vice versa. While this would not necessarily be a major surprise, as adversarial conditions oftentimes test phenomena that are otherwise not represented in standard benchmarks (Niven and Kao, 2019), a statistical analysis reveals that standard performance generally correlates positively with adversarial performance in our case, consistent with our earlier argument that existing NLG systems in the real world do commit errors similar to the ones we check for. To do so, we first convert the metrics' standard performance to rankings for each performance category (e.g., ref-based/-free segment/system-level MT performance, performance on SummEval/RealSumm), and then correlate the ranking-based standard performance with the corresponding adversarial performance rankings, obtaining 0.37 Spearman correlation. When excluding NLI metrics, the correlation increases to 0.60.
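A minimal sketch of this rank-correlation check, assuming scipy is available; the metric names and numbers below are made up for illustration:

```python
from scipy.stats import spearmanr

# Hypothetical overall performance per metric in one category (higher = better).
standard_perf    = {"BERTScore": 0.31, "BLEURT": 0.40, "NLI-R": 0.28, "COMET": 0.45}
adversarial_perf = {"BERTScore": 0.55, "BLEURT": 0.70, "NLI-R": 0.85, "COMET": 0.65}

metrics = sorted(standard_perf)
# spearmanr ranks the inputs internally, so passing raw scores is equivalent
# to correlating the two rankings.
rho, _ = spearmanr([standard_perf[m] for m in metrics],
                   [adversarial_perf[m] for m in metrics])
print(f"Spearman correlation: {rho:.2f}")
```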
The Choice of candpara Matters: As indicated in §3, we speculate that a good adversarial setting maximizes (surface) dissimilarity between ref and candpara (which can better trick the metrics). To investigate, we compute the normalized edit distance between ref and candpara;7 a larger edit distance means a greater dissimilarity. If our assumption is true, then larger edit distances represent harder test cases for the metrics. We find: (1) the average edit distance for the test cases where the metrics fail to defend against the adversarial attacks is 0.01–0.6 larger than that for the cases where they succeed, averaged over metrics; (2) for PAWSback and PAWSori (both induced from PAWS), where the candpara are obtained in different ways, all metrics achieve 0.02–0.15 lower accuracy on PAWSori, which in turn has 0.46 larger average edit distance than PAWSback. Both findings confirm our above assumption. In addition, we observe that NLI metrics have the smallest difference between the edit distances for failure and success cases (0.01–0.26) as well as the smallest difference between the accuracy on PAWSback and PAWSori (0.02) among all evaluation metrics. This implies that they are least affected by surface overlap and instead better consider the logical relationship between sentences. This is what makes them attractive as evaluation metrics.

7 Ref-free, the edit distance between r and ref is considered.
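A small sketch of the normalized edit distance used here; we show plain Levenshtein distance divided by the length of the longer string, which is one common normalization (the exact variant is an assumption of this sketch):

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance with insert/delete/substitute costs of 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalized_edit_distance(ref: str, cand: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length."""
    if not ref and not cand:
        return 0.0
    return levenshtein(ref, cand) / max(len(ref), len(cand))

# A paraphrase with heavy reordering has a large normalized edit distance.
print(normalized_edit_distance("the cat sat on the mat",
                               "on the mat, a cat was sitting"))
```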
The Choice of candadv Matters, Too: We evaluate on one complex attack combining Number error with Negation, which increases the difference between ref and candadv, based on the test cases for Number error in WMT20de. The accuracy increases by an average of 0.28 over all metrics. This confirms our assumption that maximizing the (surface) similarity between ref and candadv (but with key errors) leads to harder test suites and vice versa.
Ensembles with NLI Metrics Are More Effective: We compare the ensembles with NLI metrics to ensembles with standard metrics, i.e., w · A + (1 − w) · M, where A is a fixed standard metric and M is any of the remaining metrics. To do so, we combine standard metrics with the remaining metrics for each category of MT/summarization and ref-based/-free setting. We take the arithmetic
average of the accuracy on adversarial benchmarks and the correlations on standard benchmarks as the overall metric performance here. We calculate the mean/maximal improvement of the ensembles over the original metric M over w ∈ [0.1, 0.9] and observe: (i) the ensembles with standard metrics are better for ref-free MT metrics, because cross-lingual NLI metrics perform very poorly in our experiments; (ii) the monolingual NLI metrics lead to much better ensembles (17/15 points larger mean/max improvement) compared to the standard metrics; (iii) overall, the ensembles with NLI metrics yield 10/7 points larger mean/max improvement in overall performance than those with standard metrics (averaged over all 4 tasks: ref-based/-free MT/summarization). Thus, (monolingual) NLI metrics have unique properties, compared to standard metrics, making them attractive in ensembles.
To illustrate, Figure 4 shows the ensembles with BERTScore; these show minor or no improvements on standard benchmarks and also mixed (often negative) results for adversarial robustness.
Figure 4: Accuracy on adversarial datasets and Pearson correlation with segment-level human judgements on WMT datasets for metrics combined with BERTScore, averaged over datasets. The green line, denoting the combination with COMET, ends at another point since the corresponding adversarial performance is only averaged over the 2 adversarial datasets containing source texts.
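The weight sweep behind the mean/max improvements reported above can be reproduced along the following lines; the evaluation callback and the toy scores are placeholders for illustration, not our actual evaluation pipeline:

```python
import numpy as np

def improvement_over_base(eval_fn, scores_a, scores_m,
                          weights=np.arange(0.1, 1.0, 0.1)):
    """Mean and max gain of the ensemble w*A + (1-w)*M over M alone.

    eval_fn maps a vector of segment scores to one overall-performance number
    (e.g., an average of adversarial accuracy and standard correlation)."""
    scores_a, scores_m = np.asarray(scores_a), np.asarray(scores_m)
    base = eval_fn(scores_m)
    gains = [eval_fn(w * scores_a + (1 - w) * scores_m) - base for w in weights]
    return float(np.mean(gains)), float(np.max(gains))

# Toy usage: "performance" is just the Pearson correlation with dummy human scores.
human = np.array([0.1, 0.4, 0.35, 0.9, 0.7])
a = np.array([0.2, 0.5, 0.3, 0.8, 0.6])   # fixed metric A (already rescaled)
m = np.array([0.3, 0.2, 0.4, 0.7, 0.5])   # metric M to be ensembled (rescaled)
print(improvement_over_base(lambda s: float(np.corrcoef(s, human)[0, 1]), a, m))
```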
SummaCZS and Falsesum: In §5, we applied NLI systems on whole input texts, not taking into account the multi-sentence nature of source texts and outputs, especially in summarization. To remedy the mismatch between the granularities of the training data of NLI models and the input data of summarization evaluation, i.e., sentence- vs. document-level, Laban et al. (2022) propose both supervised and unsupervised NLI-based summarization metrics for inconsistency detection. We test their unsupervised variant (SummaCZS),8 which segments documents into sentence units and aggregates scores between pairs of sentences, with the underlying model of NLI-R. However, SummaCZS does not consistently outperform NLI-R across all datasets; in contrast, NLI-R performs much better in our adversarial test compared to SummaCZS (72% vs. 53%). Besides, to match the training data of NLI models with the task of factual inconsistency detection in summarization, Utama et al. (2022) introduce Falsesum, an augmented NLI dataset with task-oriented examples based on CNNDM; we evaluate three RoBERTa-large models finetuned on it and MNLI. Similar to SummaCZS, this also does not always yield better performance compared to simple NLI metrics (∼55%–68% vs. 72% on adversarial datasets). Overall, both approaches work well on SummEval, but not so well on RealSumm and our adversarial benchmark.

8 We do not compare to the supervised variant, as it is trained on a consistency dataset for the summarization task, for a fairer comparison.

Choice of Pooling Strategy: To examine the issue of data leakage discussed in §5, we now evaluate the NLI metrics on each dataset with the pooling strategy selected from the remaining datasets (excluding the one used for evaluation) based on winning frequency. For example, for the segment-level MT evaluation on WMT15, we choose the pooling strategy which wins most often on all MT datasets (including all standard datasets for both segment/system-level evaluation and the adversarial datasets) except for WMT15. We observe that this change in pooling strategy induction results in minor performance variation: −1.9% for segment-level evaluation, +0.8% for system-level evaluation, and −0.7% for adversarial evaluation. For summarization, as only one direction (i.e., src→cand) is considered for ref-free NLI metrics, we separately select the pooling strategy for ref-based and
ref-free NLI metrics. Overall, we have no performance change for the ref-free setting and −3.6% performance on average over all five criteria (correlations on SummEval with max/mean aggregation, summary/system-level correlations on RealSumm, and accuracy on SEadv) for the ref-based setting. Thus, the changes are again minor.
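A rough sketch of this leave-one-out selection by winning frequency; the dataset names and the recorded winners are invented for illustration:

```python
from collections import Counter

# Hypothetical record of which pooling strategy wins on each dataset.
winning_strategy = {
    "WMT15": "mean", "WMT16": "max", "WMT17": "max",
    "WMT20de_adv": "max", "WMT20zh_adv": "mean",
}

def select_pooling(eval_dataset: str, wins: dict) -> str:
    """Pick the strategy that wins most often on all other datasets, so the
    dataset being evaluated never influences the choice (no data leakage)."""
    counts = Counter(strategy for ds, strategy in wins.items() if ds != eval_dataset)
    return counts.most_common(1)[0][0]

print(select_pooling("WMT15", winning_strategy))  # -> 'max'
```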
Comparison to RoMe: As the authors of RoMe did not publish their adversarial dataset, we compare RoMe's performance with our metrics on one of our adversarial datasets, WMT20de, instead. RoMe has an average accuracy of 43%, with >90% accuracy only on the phenomena SVD and omission, which are the easiest for most standard metrics. In contrast, our NLI metrics have above 80% average accuracy. As RoMe does not evaluate on MT or summarization, we also evaluate our NLI metrics on one (randomly chosen) data-to-text generation dataset used in Rony et al. (2022), BAGEL (Mairesse et al., 2010). RoMe and our NLI metrics perform on par here (∼0.23 Spearman's ρ). Overall, this seems to imply that simple NLI models taken out of the box are better and more robust metrics than a specially trained approach such as RoMe.
7 Concluding Remarks
In this work, we explored NLI as a general paradigm for evaluation metrics. We showed that NLI metrics yield adversarial robustness and are also strong, though not always state-of-the-art, on standard metric evaluation benchmarks. By linearly interpolating established (BERT-based) metrics with our NLI metrics, we obtained high-quality metrics along both axes: adversarial robustness and standard benchmarks, with substantial gains over recent BERT-based metrics.
A potential reason why NLI-based metrics perform subpar on some standard benchmarks (especially in MT) is the training data mismatch, i.e., typical NLI datasets contain many artificial sentences of the type ''A girl is playing on a piano''. A further limitation is that cross-lingual NLI models are not yet of high enough quality and that most current NLI models are sentence-level, not document-level, with a few recent exceptions (Yin et al., 2021). Once these limitations of NLI are overcome, we believe that even better performance from NLI-based metrics can be expected; this, we believe, is one of the most promising directions for future high-quality and robust evaluation metric design. Future work should also consider NLI metrics for other text generation tasks; the NLI paradigm looks especially promising for tasks that require comparison with human references, which oftentimes involve the concept of logical equivalence.
Acknowledgments
We thank Zuojun Shi for conducting initial experiments related to this paper as part of her Bachelor thesis at TU Darmstadt. We appreciate the reviewers and editors from TACL for their time, effort, and greatly helpful comments. We also gratefully acknowledge support from the BMBF via the grant ''Metrics4NLG''. Steffen Eger is financed by DFG grant EG 375/5–1.
References
Yonatan Belinkov, Adam Poliak, Stuart Shieber, Benjamin Van Durme, and Alexander Rush. 2019. On adversarial removal of hypothesis-only bias in natural language inference. In Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019), pages 256–262, Minneapolis, Minnesota. Association for Computational Linguistics. https://doi.org/10.18653/v1/S19-1028
Jonas Belouadi and Steffen Eger. 2023. UScore: An effective approach to fully unsupervised evaluation metrics for machine translation. In EACL.
Luisa Bentivogli, Arianna Bisazza, Mauro Cettolo, and Marcello Federico. 2016. Neural versus phrase-based machine translation quality: A case study. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 257–267, Austin, Texas. Association for Computational Linguistics. https://doi.org/10.18653/v1/D16-1025
Manik Bhandari, Pranav Narayan Gour, Atabak Ashfaq, Pengfei Liu, and Graham Neubig. 2020. Re-evaluating evaluation in text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language
Processing (EMNLP), pages 9347–9359, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.751
Ondřej Bojar, Yvette Graham, and Amir Kamran. 2017. Results of the WMT17 metrics shared task. In Proceedings of the Second Conference on Machine Translation, pages 489–513, Copenhagen, Denmark. Association for Computational Linguistics. https://doi.org/10.18653/v1/W17-4755
Ondřej Bojar, Yvette Graham, Amir Kamran, and Miloš Stanojević. 2016. Results of the WMT16 metrics shared task. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 199–231, Berlin, Germany. Association for Computational Linguistics. https://doi.org/10.18653/v1/W16-2302
Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics. https://doi.org/10.18653/v1/D15-1075
Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1657–1668, Vancouver, Canada. Association for Computational Linguistics. https://doi.org/10.18653/v1/P17-1152
Xi Chen, Nan Ding, Tomer Levinboim, and Radu Soricut. 2020. Improving text generation evaluation with batch centering and tempered word mover distance. In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, pages 51–59, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.eval4nlp-1.6
Yiran Chen, Pengfei Liu, and Xipeng Qiu. 2021. Are factuality checkers reliable? Adversarial meta-evaluation of factuality in summarization. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2082–2095, Punta Cana, Dominican Republic. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.findings-emnlp.179
Pierre Colombo, Guillaume Staerman, Chloé Clavel, and Pablo Piantanida. 2021. Automatic text evaluation through the lens of Wasserstein barycenters. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10450–10466, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.817
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. CoRR, abs/1911.02116. https://doi.org/10.18653/v1/2020.acl-main.747
Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-1269
Ido Dagan, Dan Roth, Mark Sammons, and Fabio Massimo Zanzotto. 2013. Recognizing Textual Entailment: Models and Applications. Synthesis Lectures on Human Language Technologies, Morgan & Claypool Publishers. https://doi.org/10.1007/978-3-031-02151-0
Daniel Deutsch, Tania Bedrax-Weiss, and Dan Roth. 2021. Towards question-answering as an automatic metric for evaluating the content quality of a summary. Transactions of the Association for Computational Linguistics, 9:774–789. https://doi.org/10.1162/tacl_a_00397
Daniel Deutsch, Rotem Dror, and Dan Roth. 2022. On the limitations of reference-free evaluations
of generated text. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10960–10977, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Ondřej Dušek and Zdeněk Kasner. 2020. Evaluating semantic accuracy of data-to-text generation with natural language inference. In Proceedings of the 13th International Conference on Natural Language Generation, pages 131–137, Dublin, Ireland. Association for Computational Linguistics.
Alexander R. Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. 2021. SummEval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics, 9:391–409. https://doi.org/10.1162/tacl_a_00373
Tobias Falke, Leonardo F. R. Ribeiro, Prasetya Ajie Utama, Ido Dagan, and Iryna Gurevych. 2019. Ranking generated summaries by correctness: An interesting but challenging application for natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2214–2220. https://doi.org/10.18653/v1/P19-1213
Markus Freitag, George Foster, David Grangier, Viresh Ratnakar, Qijun Tan, and Wolfgang Macherey. 2021a. Experts, errors, and context: A large-scale study of human evaluation for machine translation. https://doi.org/10.1162/tacl_a_00437
Markus Freitag, Ricardo Rei, Nitika Mathur, Chi-kiu Lo, Craig Stewart, Eleftherios Avramidis, Tom Kocmi, George Foster, Alon Lavie, and André F. T. Martins. 2022. Results of WMT22 metrics shared task: Stop using BLEU – neural metrics are better and more robust. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 46–68, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
Markus Freitag, Ricardo Rei, Nitika Mathur, Chi-kiu Lo, Craig Stewart, George Foster, Alon Lavie, and Ondřej Bojar. 2021b. Results of the WMT21 metrics shared task: Evaluating metrics with expert-based human evaluations on TED and news domain. In Proceedings of the Sixth Conference on Machine Translation, pages 733–774, Online. Association for Computational Linguistics.
Yang Gao, Wei Zhao, and Steffen Eger. 2020. SUPERT: Towards new frontiers in unsupervised evaluation metrics for multi-document summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1347–1354, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.124
Olga Golovneva, Moya Peng Chen, Spencer Poff, Martin Corredor, Luke Zettlemoyer, Maryam Fazel-Zarandi, and Asli Celikyilmaz. 2023. ROSCOE: A suite of metrics for scoring step-by-step reasoning. In The Eleventh International Conference on Learning Representations.
Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 107–112, New Orleans, Louisiana. Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-2017
Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2021. DeBERTa: Decoding-enhanced BERT with disentangled attention. In International Conference on Learning Representations.
Tianxing He, Jingyu Zhang, Tianle Wang, Sachin Kumar, Kyunghyun Cho, James Glass, and Yulia Tsvetkov. 2022. On the blind spots of model-based evaluation metrics for text generation. arXiv preprint arXiv:2212.10020.
Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. Advances in Neural Information Processing Systems, 28.
Md Mosharaf Hossain, Antonios Anastasopoulos, Eduardo Blanco, and Alexis Palmer. 2020.
It's not a non-issue: Negation as a source of error in machine translation. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3869–3885, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.345
Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Andrea Madotto, and Pascale Fung. 2022. Survey of hallucination in natural language generation. ACM Computing Surveys.
Marzena Karpinska, Nishant Raj, Katherine Thai, Yixiao Song, Ankita Gupta, and Mohit Iyyer. 2022. DEMETR: Diagnosing evaluation metrics for translation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9540–9561, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Marvin Kaster, Wei Zhao, and Steffen Eger. 2021. Global explainability of BERT-based evaluation metrics by disentangling along linguistic factors. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 8912–8925, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.701
Wojciech Kryscinski, Bryan McCann, Caiming Xiong, and Richard Socher. 2020. Evaluating the factual consistency of abstractive text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9332–9346, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.750
Philippe Laban, Tobias Schnabel, Paul N. Bennett, and Marti A. Hearst. 2022. SummaC: Re-visiting NLI-based models for inconsistency detection in summarization. Transactions of the Association for Computational Linguistics, 10:163–177. https://doi.org/10.1162/tacl_a_00453
Christoph Leiter, Piyawat Lertvittayakumjorn, M. Fomicheva, Wei Zhao, Yang Gao, and Steffen Eger. 2022. Towards explainable evaluation metrics for natural language generation. ArXiv, abs/2203.11131.
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Arle Lommel, Aljoscha Burchardt, and Hans Uszkoreit. 2014. Multidimensional quality metrics (MQM): A framework for declaring and describing translation quality metrics. Tradumàtica: Tecnologies de la Traducció, 0:455–463. https://doi.org/10.5565/rev/tradumatica.77
François Mairesse, Milica Gašić, Filip Jurčíček, Simon Keizer, Blaise Thomson, Kai Yu, and Steve Young. 2010. Phrase-based statistical language generation using graphical models and active learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1552–1561, Uppsala, Sweden. Association for Computational Linguistics.
Nitika Mathur, Timothy Baldwin, and Trevor Cohn. 2019. Putting evaluation in context: Contextual embeddings improve machine translation evaluation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2799–2808, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1269
Nitika Mathur, Timothy Baldwin, and Trevor Cohn. 2020a. Tangled up in BLEU: Reevaluating the evaluation of automatic machine translation evaluation metrics. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4984–4997, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.448
Nitika Mathur, Johnny Wei, Markus Freitag, Qingsong Ma, and Ondřej Bojar. 2020b. Results of the WMT20 metrics shared task. In
Proceedings of the Fifth Conference on Machine Translation, pages 688–725, Online. Association for Computational Linguistics.
Yixin Nie, Haonan Chen, and Mohit Bansal. 2019. Combining fact extraction and verification with neural semantic matching networks. In Association for the Advancement of Artificial Intelligence (AAAI). https://doi.org/10.1609/aaai.v33i01.33016859
Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2020. Adversarial NLI: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.441
Timothy Niven and Hung-Yu Kao. 2019. Probing neural network comprehension of natural language arguments. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4658–4664, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1459
Maxime Peyrard. 2019. Studying summarization evaluation metrics in the appropriate scoring range. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5093–5100, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1502
Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis only baselines in natural language inference. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 180–191, New Orleans, Louisiana. Association for Computational Linguistics. https://doi.org/10.18653/v1/S18-2023
Tharindu Ranasinghe, Constantin Orasan, and Ruslan Mitkov. 2020. TransQuest: Translation quality estimation with cross-lingual transformers. In Proceedings of the 28th International Conference on Computational Linguistics, pages 5070–5081, Barcelona, Spain (Online). International Committee on Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-main.445
Ricardo Rei, Craig Stewart, Ana C. Farinha, and Alon Lavie. 2020a. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.213
Ricardo Rei, Craig Stewart, Ana C. Farinha, and Alon Lavie. 2020b. Unbabel's participation in the WMT20 metrics shared task. In Proceedings of the Fifth Conference on Machine Translation, pages 911–920, Online. Association for Computational Linguistics.
Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.442
Md Rashad Al Hasan Rony, Liubov Kovriguina, Debanjan Chaudhuri, Ricardo Usbeck, and Jens Lehmann. 2022. RoMe: A robust metric for evaluating natural language generation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5645–5657, Dublin, Ireland. Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.acl-long.387
Ananya B. Sai, Tanay Dixit, Dev Yashpal Sheth, Sreyas Mohan, and Mitesh M. Khapra. 2021. Perturbation checklists for evaluating NLG evaluation metrics. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
Thibault Sellam, Dipanjan Das, and Ankur P. Parikh. 2020. BLEURT: Learning robust metrics for text generation. In Proceedings of ACL. https://doi.org/10.18653/v1/2020.acl-main.704
Rico Sennrich. 2017. How grammatical is character-level neural machine translation?
Assessing MT quality with contrastive translation pairs. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 376–382, Valencia, Spain. Association for Computational Linguistics. https://doi.org/10.18653/v1/E17-2060
Ori Shapira, David Gabay, Yang Gao, Hadar Ronen, Ramakanth Pasunuru, Mohit Bansal, Yael Amsterdamer, and Ido Dagan. 2019. Crowdsourcing lightweight pyramids for manual summary evaluation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 682–687, Minneapolis, Minnesota. Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1072
Yurun Song, Junchen Zhao, and Lucia Specia. 2021. SentSim: Crosslingual semantic evaluation of machine translation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3143–3156. https://doi.org/10.18653/v1/2021.naacl-main.252
Miloš Stanojević, Amir Kamran, Philipp Koehn, and Ondřej Bojar. 2015. Results of the WMT15 metrics shared task. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 256–273, Lisbon, Portugal. Association for Computational Linguistics. https://doi.org/10.18653/v1/W15-3031
Tianxiang Sun, Junliang He, Xipeng Qiu, and Xuanjing Huang. 2022. BERTScore is unfair: On social bias in language model-based metrics for text generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3726–3739, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Gongbo Tang, Philipp Rönchen, Rico Sennrich, and Joakim Nivre. 2021. Revisiting negation in neural machine translation. Transactions of the Association for Computational Linguistics, 9:740–755. https://doi.org/10.1162/tacl_a_00395
Brian Thompson and Matt Post. 2020. Automatic machine translation evaluation in many languages via zero-shot paraphrasing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 90–121, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.8
Prasetya Ajie Utama, Joshua Bambrick, Nafise Sadat Moosavi, and Iryna Gurevych. 2022. Falsesum: Generating document-level NLI examples for recognizing factual inconsistency in summarization. https://doi.org/10.48550/arXiv.2205.06009
Prasetya Ajie Utama, Nafise Sadat Moosavi, and Iryna Gurevych. 2020. Mind the trade-off: Debiasing NLU models without degrading the in-distribution performance. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8717–8729, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.770
Doan Nam Long Vu, Nafise Sadat Moosavi, and Steffen Eger. 2022. Layer or representation space: What makes BERT-based evaluation metrics robust? In Proceedings of the 29th International Conference on Computational Linguistics, pages 3401–3411, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Yu Wan, Dayiheng Liu, Baosong Yang, Haibo Zhang, Boxing Chen, Derek Wong, and Lidia Chao. 2022. UniTE: Unified translation evaluation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8117–8127, Dublin, Ireland. Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.acl-long.558
Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-1101
Yinfei Yang, Yuan Zhang, Chris Tar, and Jason Baldridge. 2019. PAWS-X: A cross-lingual adversarial dataset for paraphrase identification. In Proceedings of EMNLP. https://doi.org/10.18653/v1/D19-1382
Wenpeng Yin, Dragomir Radev, and Caiming Xiong. 2021. DocNLI: A large-scale dataset for document-level natural language inference. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4913–4922, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.findings-acl.435
Weizhe Yuan, Graham Neubig, and Pengfei Liu. 2021. BARTScore: Evaluating generated text as text generation. Advances in Neural Information Processing Systems, 34:27263–27277.
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.
Yuan Zhang, Jason Baldridge, and Luheng He. 2019. PAWS: Paraphrase adversaries from word scrambling. In Proceedings of NAACL.
Wei Zhao, Goran Glavaš, Maxime Peyrard, Yang Gao, Robert West, and Steffen Eger. 2020. On the limitations of cross-lingual encoders as exposed by reference-free machine translation evaluation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1656–1671, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.151
Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, and Steffen Eger. 2019. MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 563–578, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1053
Wei Zhao, Michael Strube, and Steffen Eger. 2023. DiscoScore: Evaluating text generation with BERT and discourse coherence. In EACL.
Xiang Zhou and Mohit Bansal. 2020. Towards robustifying NLI models against lexical dataset biases. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8759–8771, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.773