MENLI: Robust Evaluation Metrics from Natural Language Inference

MENLI: Robust Evaluation Metrics from Natural Language Inference

Yanran Chen1,2 and Steffen Eger2
1达姆施塔特工业大学, 德国
2Natural Language Learning Group (NLLG), https://nl2g.github.io/
Faculty of Technology, Universit¨at Bielefeld, 德国

yanran.chen@stud.tu-darmstadt.de, steffen.eger@uni-bielefeld.de

抽象的

Recently proposed BERT-based evaluation
metrics for text generation perform well on
standard benchmarks but are vulnerable to ad-
versarial attacks, 例如, relating to information
correctness. We argue that this stems (部分地)
from the fact that they are models of semantic
相似. 相比之下, we develop evaluation
metrics based on Natural Language Inference
(NLI), which we deem a more appropriate
造型. We design a preference-based ad-
versarial attack framework and show that our
NLI based metrics are much more robust
to the attacks than the recent BERT-based
指标. On standard benchmarks, our NLI
based metrics outperform existing summariza-
tion metrics, but perform below SOTA MT
指标. 然而, when combining existing
metrics with our NLI metrics, we obtain both
higher adversarial robustness (15%–30%) 和
higher quality metrics as measured on standard
benchmarks (+5% 到 30%).

1

介绍

Proper evaluation is key to fields such as machine
learning and Natural Language Processing (自然语言处理).
Evaluation is particularly challenging for natural
language generation (NLG) 任务, as there many
be an infinitude of correct solutions (例如, transla-
tions or summaries) for a given source text. 尽管
human evaluation is often considered the gold
标准, it is slow and costly, thus researchers
resort to automatic evaluation. Previously, 这
was done using simple lexical overlap metrics
such as BLEU and ROUGE, but these exhibit low
correlations with human judgments, 特别
for state-of-the-art NLG systems (Mathur et al.,
2020A; Peyrard, 2019). 因此, a popular recent
trend is to design automatic evaluation metrics
based on large language models such as BERT
and its many extensions (张等人。, 2020; 赵
等人。, 2019; Sellam et al., 2020; Wan et al., 2022).

804

尽管如此,

these novel metrics also have
key limitations. 例如, Sai et al. (2021)
and Kaster et al. (2021) show that they are not
robust to various adversarial attacks including
lexical overlap and factuality errors. Taking the
currently most popular metric—BERTScore1—as
an example, this adversarial vulnerability is un-
surprising. BERTScore computes the semantic
similarity between a reference and a system output
(the candidate), using a simplified token matching
procedure. 然而, a good candidate is typically
not appropriately identified by semantic similarity.
例如, a candidate ‘‘5 Ukrainian soldiers
wounded in Russia’’ is not an adequate translation
of a source corresponding to the reference ‘‘50000
Russian soldiers killed in Ukraine’’, 虽然
two texts are of course semantically very similar.2
While there have been many attempts to improve
BERTScore using better token matching, 例如, 我们-
ing Word Mover Distance (赵等人。, 2019;
陈等人。, 2020; Colombo et al., 2021), we argue
that this line of research is a dead-end, as the un-
derlying model of semantic similarity, 起初
proposed to address issues of lexical variation in
BLEU/ROUGE, is simply not (完全) 合适的.
An intuitively more suitable idea to model
evaluation metrics is via natural language in-
参考 (NLI) (Dagan et al., 2013). 例如,
in reference-based settings, in which candidates
are compared to human references, a candidate
is intuitively good if it is equivalent to a human
reference via the concept of bi-implication. NLI
systems are also promising alternatives because

1Published in 2020, BERTScore has more than 1700

citations as of March 2023.

2That semantic similarity metrics are inherently incapable
of identifying this puts them at great risk of being attacked by
malicious agents, with serious real-world consequences, 作为
the metrics cannot distinguish between truthful translations
and semantically similar but factually incorrect translations.

计算语言学协会会刊, 卷. 11, PP. 804–825, 2023. https://doi.org/10.1162/tacl 00576
动作编辑器: Benjamin Van Durme. 提交批次: 10/2022; 修改批次: 1/2023; 已发表 7/2023.
C(西德:2) 2023 计算语言学协会. 根据 CC-BY 分发 4.0 执照.

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
t

A
C

/

A
r
t

C
e

p
d

F
/

d


/

.

1
0
1
1
6
2

/
t

A
C
_
A
_
0
0
5
7
6
2
1
4
3
2
9
7

/

/
t

A
C
_
A
_
0
0
5
7
6
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

NLI is one of the most researched upstream tasks
in NLP, where a lot of emphasis has been placed
on concepts such as biases, generalization and
adversarial conditions (Poliak et al., 2018; Utama
等人。, 2020).

在本文中, we ask whether we can di-
rectly use pre-trained NLI models as evaluation
指标,
thereby establishing a new paradigm
(but with predecessors, as indicated in §2). 我们的
contributions:

• We design: a novel preference-based adver-
sarial test suite for MT and summarization
指标. Our adversarial benchmark does
not need human annotators, is suitable for
reference-free (where the candidate is di-
rectly compared to the source text, 没有
human reference) and reference-based eval-
uation, and is challenging: 例如, 蓝线,
ROUGE, MoverScore, and BERTScore per-
form below or at random level.

• We explore: (我) how NLI metrics can be
induced from existing NLI models; (二) 如何
they perform on benchmark and adversarial
datasets, across (三、) two NLG problems, 公吨
and summarization.

• We show: (四号) NLI metrics perform par-
ticularly well in summarization, but below
standard metrics in MT. (v) They substan-
tially outperform existing metrics on our
adversarial attacks (例如, ∼30%–50% mar-
gin over the best unsupervised standard
metric in MT).
(六) Combining existing
metrics with our NLI metrics yields both
更好的 (+5%–30%) and more robust metrics
(+15%–30%).

We point out that some current metrics already
leverage NLI systems—thus, we do not include
new information with respect to them—but in-
directly and thus (we argue) inadequately: 例如,
MoverScore (赵等人。, 2019) leverages BERT
representations fine-tuned on NLI. Mathur et al.
(2019) 火车 (pre-BERT) NLI-inspired architec-
tures on MT datasets. 相比之下, we show that
by directly leveraging NLI systems, much better
adversarial and standard benchmark performances
can be obtained. We call our novel metrics MENLI
(MEtrics from NLI).3

3Code+data: http://github.com/cyr19/MENLI.

概念

Semantic Similarity
Text Generation

Question Answering
NLI

Examples

BERTScore, MoverScore, BaryScore, …
BARTScore, PRISM (Thompson and Post,
2020)
QAEval (Deutsch et al., 2021)
MENLI

桌子 1: Different paradigms for metric induction
proposed in recent years.

2 相关工作

Our work connects to evaluation metrics and NLI.

Evaluation Metrics for NLG In the last few
年, researchers have come up with a plethora of
different BERT-based metrics for varying tasks
and setups: 例如, for MT and summarization,
reference-based trained (Sellam et al., 2020; Rei
等人。, 2020A) and untrained approaches (赵
等人。, 2019; 张等人。, 2020) have been sug-
gested and the same is true for reference-free
setups, where both supervised (Ranasinghe et al.,
2020) and unsupervised metrics have been ex-
plored (赵等人。, 2020; Song et al., 2021;
Belouadi and Eger, 2023). In our work, weconsider
both reference-based as well as reference-free
指标. Both setups have important differences:
Reference-free setups are more challenging, 作为
they require to compare text in different lan-
guages (in MT) or of vastly different lengths (在
summarization). 另一方面, they are more
‘resource-efficient’, take humans out-of-the-loop,
and promise web-scale evaluation. Both ap-
proaches are also different
in terms of NLI.
例如, while reference-based approaches
require equivalence between reference and hy-
pothesis, the concept of equivalence is not always
appropriate in reference-free situations (例如, 在
summarization, source and summary are intu-
itively not equivalent; 相当, source should entail
summary).

To realize metrics, different high-level ap-
proaches have been suggested as we outline in
桌子 1 (例如, metrics from semantic similarity,
from text generation or from question answering).
There are also some predecessor works on metrics
from NLI which we discuss below.

Robustness of Evaluation Metrics has been a
central issue of recent interest: Sai et al. (2021)
test metrics across several CheckList (Ribeiro
等人。, 2020) inspired templates, finding that most

805

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
t

A
C

/

A
r
t

C
e

p
d

F
/

d


/

.

1
0
1
1
6
2

/
t

A
C
_
A
_
0
0
5
7
6
2
1
4
3
2
9
7

/

/
t

A
C
_
A
_
0
0
5
7
6
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

common standard metrics are not robust even to
simple perturbations. Kaster et al. (2021) probe
metrics in an adversarial setting with lexical over-
lap, finding that they can be fooled by text that has
high lexical overlap but low semantic similarity
(indicating that the proposed BERT-based metrics
are not even good models of semantic similarity).
We combine the approaches of Sai et al. (2021)
and Kaster et al. (2021): While Sai et al. (2021)
use human crowd-workers to evaluate robustness,
Kaster et al. (2021) use a simpler preference-based
setup, which does not need human annotators. 我们
will also use the preference-based setup, but our
attacks are largely inspired by Sai et al. (2021).

最近 (contemporaneously with us and
after the first Arxiv submission of our work),
several other papers have explored the robust-
ness of recent evaluation metrics. 例如,
He et al. (2022) develop stress test suites ac-
cording to potential errors arising from certain
choices of metric design and pretrained lan-
guage models used, showing that metrics are
biased towards their underlying models—e.g.,
BARTScore assigns higher scores to texts gener-
ated by the models of the metric itself.4 Karpinska
等人. (2022) explore the sensitivity of MT met-
rics to errors of different categories (regarding
语义学, syntax, and morphology) and sever-
性, using a preference-based setting; they show
that recent metrics like BERTScore dramatically
outperform lexical overlap-based metrics such as
BLEU and ROUGE, mostly obtaining over 95%
accuracy in their experiments. Our setups and that
of Karpinska et al. (2022) and He et al. (2022) 是
differentiated by the tasks considered, the prefer-
ence specifications, 结果, and the solutions
proposed. Karpinska et al. (2022) only evaluate
metrics for MT while we consider both MT and
summarization. They design their preferences in
such a way that it would seem that recent met-
rics are quite robust while our more elaborate
preferences expose their weak spots much bet-

4Robustness is also related to model biases. 例如,
Sun et al. (2022) show that BERTScore encodes social
biases such as gender biases. And Deutsch et al. (2022)
claim that reference-free metrics are inherently biased, 哪个
implies that they have unreasonable preferences. Our results
show that many current reference-based metrics also have
unreasonable preferences. Robustness checks are also related
to explainability (Leiter et al., 2022; Golovneva et al., 2023)
of evaluation metrics as they help to understand metric
局限性.

806

特尔. 最后, we propose solutions (例如, 指标
from NLI) to addressing lack of robustness. Like
我们, He et al. (2022) also consider summarization
and MT. Instead of designing preferences, 如何-
曾经, they manually introspect how metric scores
change as various perturbations are introduced. 在
这边走, they expose blind spots of metrics. 作为
remedies, they suggest to combine heterogeneous
metrics to shield against varying blind spots (和-
out performing concrete experiments)—we show
that combining metrics with NLI based metrics
yields additional robustness.

最后, Rony et al. (2022) develop RoMe
as a robust metric in the context of semantic
相似, fluency and grammatical variability.
They evaluate it on an adversarial dataset with
five phenomena (实体, adjective and random
word replacement; as well as text transforma-
tion and passive forms) by correlating against
human judgments. Their model is a rather com-
plicated trained metric leveraging semantic and
grammatical features—we compare to it in §6.

NLI NLI is one of the core upstream tasks in
the NLP community. Due to its popularity, NLI
has been investigated in-depth, where researchers
found that trained models often overfit to low-level
statistical cues instead of learning generalizable
concepts of logical relationships between sen-
时态 (Poliak et al., 2018; Gururangan et al.,
2018). 作为结果, many approaches to im-
prove generalization have been investigated (例如,
Belinkov et al., 2019; Utama et al., 2020; 周
and Bansal, 2020). We argue that a high-quality
NLI model would be an excellent candidate for an
evaluation metric and explore this in this work.

Like us, Mathur et al. (2019) note the similarity
的 (公吨) evaluation and logical equivalence via
NLI. They design supervised MT metrics lever-
aging different pre-BERT inspired architectures,
including one from the NLI community called
ESIM (陈等人。, 2017) (which performs on
par to an LSTM with attention in their exper-
瞬间). 因此,
他们不
leverage NLI models out-of-the-box as evalua-
tion metrics but only fine-tune an NLI-inspired
architecture on human scores from MT. Mover-
分数 (赵等人。, 2019) fine-tunes BERT on NLI,
which leads to better metrics. 因此, 他们, 也, 使用
NLI only indirectly. Duˇsek and Kasner (2020) 使用
NLI to evaluate hallucinations and omissions in
reference-free data-to-text generation scenarios.

相比之下

to us,

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
t

A
C

/

A
r
t

C
e

p
d

F
/

d


/

.

1
0
1
1
6
2

/
t

A
C
_
A
_
0
0
5
7
6
2
1
4
3
2
9
7

/

/
t

A
C
_
A
_
0
0
5
7
6
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

src

ref

r (google translation of src)

candpara

candadv( ref-based )

candadv( ref-free )

Number error
Der bilaterale Handel wurde auf ¨uber 100
Milliarden Dollar im Jahr gesteigert.
Bilateral trade has increased to more than
$100 billion a year. Bilateral trade has increased to over $100
billion a year.
Bilateral trade has increased to more than
one hundred billion dollars a year.
Bilateral trade has increased to more than
$814 billion a year. Bilateral trade has increased to over $478
billion a year.

Negation error

Die Wirtschaft der Entwicklungs- und Schwellenl¨ander
wird schwach bleiben.
Emerging economies will remain weak.

The economies of developing and emerging countries
will remain weak.
Emerging markets will remain weak.

Emerging economies won’t remain weak.

The economies of developing and emerging countries
won’t remain weak.

桌子 2: Examples of our adversarial test suite taken from WMT20de. Red words indicate specific adver-
sarial perturbations of the words in green. candadv(ref-based) builds on ref, whereas candadv(ref-free)
builds on r (indicated by corresponding coloring in the first column). The preferences we query for are
given in Eq. (1).

They do not compare to any other metrics and do
not consider NLI as a general paradigm for eval-
uation metrics. While the summarization commu-
nity uses NLI models for consistency evaluation
(Fabbri et al., 2021; Laban et al., 2022), to our
知识, we are the first to verify the useful-
ness of NLI systems as general evaluation met-
rics against a range of strong competitors, both in
standard evaluation and adversarial attack settings.

3 Adversarial Setup

Following Sai et al. (2021) 和别的, 我们骗-
sider an array of adversarial attacks on evaluation
metrics—we will give a motivation of our at-
tacks from the perspective of errors committed
by real text generation systems below. 在骗子-
trast to Sai et al. (2021) and similar to the later
published work of Karpinska et al. (2022), 我们
implement a preference-based setup, which does
not need human annotators. The advantages of the
preference-based setup are: (我) lower cost (例如, 不
annotation costs), (二) which can be especially rel-
evant for non-English languages (例如, in ref-free
situations for MT), 和 (三、) which allows adver-
sarial evaluation at larger scale, yielding more
robust estimates of performance. The challenge of
the preference setup is to cleverly determine text
pairs to compare.

In our design, we use an anchor text (任何一个
the reference ref or the source src), a paraphrase
candpara of the anchor text, and an adversarial
text candadv which is maximally similar to the
anchor text, but contains an adversarial attack. 我们

expect a good metric m to prefer candpara over
candadv:

ref-based : 米(ref , candpara) > m(ref , candadv)

ref-free : 米(src, ref ) > m(src, candadv)

(1)
The outcome of preferences in Eq. (1) 依靠
on how we choose candadvand candpara, 哪个
we will describe below. 一般来说, a challeng-
ing test suite has candadv maximally similar to
ref /src, but with a key error. 相比之下, candpara
should be maximally dissimilar to ref /src (例如,
on surface level) but meaning-equivalent. 桌子 2
illustrates the general structure of our adversarial
test suite.

candadv To obtain candadv, we consider the
following attacks (nine regarding information
adequacy/correctness in candidates and three re-
garding text fluency), which we deem (to a large
程度) representative for errors in different NLG
任务:

• Addition: We randomly add a noun after an
existing one and connect them with ‘‘and’’.
例如, ‘‘I love dogs’’ → ‘‘I love dogs
and cats.’’

• Omission: We use the framework of Sai et al.
(2021) to randomly drop ∼1%–20% words
in the sentence.

• Mismatch: We consider mismatching nouns,
动词, 和形容词, which can lead to mis-
understanding of an entity, an action, 和
speakers’ emotion, 分别. 下列的

807

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
t

A
C

/

A
r
t

C
e

p
d

F
/

d


/

.

1
0
1
1
6
2

/
t

A
C
_
A
_
0
0
5
7
6
2
1
4
3
2
9
7

/

/
t

A
C
_
A
_
0
0
5
7
6
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

Chen et al. (2021), we replace a specific word
having the POS tag of noun/verb/adjective
with another word having the same POS tag
randomly selected from our collected words
for that POS tag.

• Negation: We use the perturbation tool of
Ribeiro et al. (2020) to add/remove negations
to/from the verb for generating candadv with
contrary claims.

• Number error: We replace all numbers (前任-
cept for those related to dates) in the sentence
with random numbers in the same format
(例如, integer to integer, decimal to decimal).
• Pronoun error: We replace all pronouns in
the sentence with other ones without causing
syntax errors (例如, ‘‘he’’ to ‘‘she’’ and ‘‘us’’
to ‘‘them’’).

• Name error: We use the tool of Ribeiro
等人. (2020) to replace exactly one name
with a random one of the same gender.

• Fluency: We also include three phenomena
from Sai et al. (2021) to examine metrics’
robustness against attacks on text fluency: (我)
Jumbling word order: Randomly shuffle the
word order in a sentence. Spelling error: Add
a typo to a word in a sentence. Subject-verb
disagreement: Make the subject and verb
disagree (例如, ‘‘He like dogs.’’).

For ref-based metrics, we apply the perturbation
templates to ref to construct candadv. 相比之下,
for ref-free MT metrics, we first translate the
source src using Google Translate to a translation r
and then perturb r to obtain candadv. We introduce
r to increase the similarity of candadv to src; 例如,
we assume that Google Translate translates more
literally, IE。, closer to word-by-word translations,
than human translators. This may be important to
construct challenging test cases, 比照. §6 and our
above discussion. For ref-free summarization, 我们
apply the perturbation templates to a document r
which is maximally similar to src; details follow.

candpara We use different ways to obtain
candpara, because different kinds of paraphrases
may yield more/less difficult test cases for metrics.
We will analyze this in §6.

尤其, we use data from (1) PAWS
(2) PAWS-X (哪个
(张等人。, 2019),
等人。, 2019), (3) WMT20-news-commentary-v15
German-to-English (Mathur et al., 2020乙) to gen-

dataset

任务

ref-
基于

ref-
自由的

PAWSori MT
PAWSback MT
XPAWSx MT
WMT20de MT
SEadv

是的
是的
是的
是的
SUM yes



是的
是的
是的

candpara

#examples

ORI
BACK
ORI
BACK
BACK

2,000
2,000
455–474
200
199

桌子 3: Adversarial datasets. ‘‘Yes/no’’ indicates
whether the dataset supports ref-based/free adver-
sarial evaluation. ‘‘ORI/BACK’’ denotes whether
candpara (except for number error) is from
the original datasets or backtranslation. ‘‘#exam-
ples’’ refers to the avg. number of examples per
现象. XPAWSx denotes XPAWSde/fr/zh/ja.

erate candpara for MT evaluation metrics, 和
(4) SummEval for summarization metrics. A
summary with attributes is shown in Table 3.

(1) PAWS contains sentence pairs created by
word swapping and backtranslation, labeled as
(非)paraphrases by human raters. From sen-
tence pairs labeled as paraphrase, we derive two
datasets for ref-based evaluation metrics:

• PAWSori: We take the first sentence of a
PAWS sentence pair as ref and the second as
candpara.

• PAWSback: We use the first sentence of a
PAWS sentence pair as ref and generate
candpara based on ref using backtranslation
(we use German as the pivot language) 除了
for number error, for which we replace the
numbers in ref with the corresponding words,
using the Python library num2words.

(2) PAWS-X is the multilingual version of
PAWS, which includes PAWS sentence pairs in
six languages, translated from English PAWS,
allowing us to generate test suites for both
ref-free and ref-based metrics. We use the first
sentence in PAWS-X (例如, 德语) as src
and the second sentence with the same ID in
English PAWS as ref. We select the data for
two closer language pairs: German-to-English
and French-to-English, and two more dis-
language pairs: Chinese-to-English and
坦特
Japanese-to-English. 因此, 我们
创造
4 datasets: XPAWSde, XPAWSfr, XPAWSzh,
and XPAWSja, each of which contains src
(第一的
in source
语言), ref (first sentence of English PAWS

sentence of X-PAWS pair

808

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
t

A
C

/

A
r
t

C
e

p
d

F
/

d


/

.

1
0
1
1
6
2

/
t

A
C
_
A
_
0
0
5
7
6
2
1
4
3
2
9
7

/

/
t

A
C
_
A
_
0
0
5
7
6
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

Error

Mismatch/verb

Mismatch/adj.

Pronoun/Addition

姓名

Omission

Mismatch/noun

Pronoun

来源

MT hypothesis

Pay attention to (Follow) Suning.com service account

Not bad, the picture quality of playing games is
really fragrant (好的)

Bought it for his (我的) 儿子, he said it was good.
On the same day, US Secretary of Transportation Zhao
Xiaolan (Elaine Lan Chao), US Congressman Meng
Zhaowen (Grace Meng) and Dong Jiling (Chiling Tong),
founding president of the International Leaders
基础, spoke at the meeting respectively.
Ich werde Ihr Konto […] ( ¨uberpr ¨ufen), einen Moment
bitte.
,,H¨or zu, ich will mein Volk (meine Leute) nicht verr¨uckt
machen‘‘, sagte sie.
Williams war nicht die einzige, die beim diesj¨ahrigen
Wimbledon eine Geldstrafe erhielt, obwohl sie (ihre)
die teuerste war.

I’ll review your account, one moment,
请.
Listen, I don’t want to make my
people mad,” 她说.
Williams wasn’t the only one who
received a fine at this year’s Wimbledon,
though hers was the most costly.

桌子 4: Examples of errors in WMT MQM annotations for Chinese-to-English and English-to-German.
Red texts are the annotated errors (‘‘[…]’’ indicates the missing translation) and the green texts in the
bracket refer to a more correct translation accordingly; the green texts in source sentences denote the
parts being mistranslated or omitted.

pair), and candpara (second sentence of English
PAWS pair).

(3) WMT20-news-commentary-v15 contains
sentence pairs of source and human reference.
由此, we create WMT20de, directly taking
the source and reference sentences as src and ref.
We obtain candpara as in the case of PAWSback.
(4) SummEval (Fabbri et al., 2021) 包含
documents and references from CNN Daily-
Mail (CNNDM) (Hermann et al., 2015), 和
10 additional human references. We rank the 11
references using ROUGE-L (林, 2004) and use
the reference r with highest ROUGE score to
generate candadv for ref-free setting, 而
其余的 10 references serve as ref. We refer to
the adversarial dataset induced from SummEval
as SEadv in the remainder. We obtain candpara as
in the case of PAWSback.5

Real-world Motivation of Attacks Modern
text generation systems are prone to many of
the errors we investigate in this work. 对于前-
充足, Freitag et al.
(2021A,乙, 2022) 展示,
based on fine-grained human error annotations

5As we generate our adversarial test instances fully auto-
matically from backtranslation or automatic tools, they may
contain some errors (including upper-/lower-case). 对于前-
充足, we note that in candpara, ‘‘. . . billion dollars’’ is
sometimes incorrectly formulated as ‘‘. . . dollars billion’’;
然而, such cases occur only in ∼1% of all test cases for
number error, which we argue is still on an acceptable noise
等级.

(Lommel et al., 2014), that translations gener-
ated by state-of-the-art MT models still contain
many accuracy-related errors (例如, addition and
omission of information, inappropriately informal
pronouns) and sometimes even fluency-related er-
rors (例如, wrong spelling). Negation handling is
also frequently discussed as an issue of modern
MT systems (Bentivogli et al., 2016; Sennrich,
2017; Hossain et al., 2020; Tang et al., 2021).
In summarization, system summaries are often
factually inconsistent with source documents in
terms of numbers, named entities and assign-
ing quotations to a particular person, ETC. (Falke
等人。, 2019; Kryscinski et al., 2020; 陈等人。,
2021). 更普遍, hallucination (其中
addition/mismatches/etc. may be considered spe-
cial cases) is a particular worrisome limitation
of recent large language models (Ji et al., 2022).
表中 4, we show selected system translations
from real MT systems with specific errors (遵循-
lowing WMT MQM annotations) that are very
similar to the ones we consider.6 The frequency
of errors may differ for various source-target lan-
guage pairs (例如, depending on their language
distance) and formal/informal context. 考试用-
普莱, when translating Chinese to English for news,
the names are often directly translated to their
Pinyin format (see the 4th row) instead of the

6https://github.com/google/wmt-mqm-human

-评估.

809

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
t

A
C

/

A
r
t

C
e

p
d

F
/

d


/

.

1
0
1
1
6
2

/
t

A
C
_
A
_
0
0
5
7
6
2
1
4
3
2
9
7

/

/
t

A
C
_
A
_
0
0
5
7
6
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

任务

公吨

ref-based MoverScore (赵等人。, 2019), BERTScore (张等人。, 2020), BARTScore (袁等人。,

ref-free

2021), SentSim (Song et al., 2021), COMET (Rei et al., 2020乙), BLEURT (Sellam et al., 2020)
COMET, SentSim, XMoverScore (赵等人。, 2020)

指标

Summarization

ref-based BARTScore, DiscoScore (赵等人。, 2023), MoverScore, BERTScore
ref-free

BARTScore, SUPERT (Gao et al., 2020)

桌子 5: Evaluation metrics explored in this work.

任务

公吨

Summarization

segment-level
system-level
adversary

summary-level
system-level
adversary

数据集

WMT15-17, WMT20-21
WMT20-21
ref-based: PAWSori/back, WMT20de, XPAWSde; ref-free: XPAWSde/fr/zh/ja, WMT20de

RealSum (Bhandari et al., 2020)
RealSum, SummEval
SEadv, Rank19 (Falke et al., 2019) (ref-free only)

桌子 6: We use the to-English language pairs in WMT15-17 datasets (Stanojevi´c et al., 2015; Bojar
等人。, 2016, 2017). In segment-level evaluation on WMT20-21 (Mathur et al., 2020乙; Freitag et al.,
2021A,乙), we use the data with MQM scores for zh-en, while in system-level evaluation, we correlate
the metrics with DA scores for all to-English language pairs. The datasets for system-level evaluation
before WMT20 are skipped, as all metrics mostly get very high correlations on them.

official translations; 相比之下, this rarely hap-
pens in English-to-German translations. 但即使
for such closely related languages, NLG systems
may omit information, or choose wrong pronouns
or mismatching nouns, particularly when a word
has multiple senses.

4 实验装置

4.1 Evaluation Metrics

We explore a large array of recent state-of-the-
transformer based metrics, summarized in
艺术
桌子 5. The variants used are briefly introduced
以下; further details (例如, model checkpoints and
implementation) can be found on our Github.

我们

report BERTScore F1 employing a
RoBERTa-large model. For MoverScore, 我们用
the unigram variant with a BERT-base model
fine-tuned on MNLI (Williams et al., 2018). 我们
use two variants of BARTScore (Precision and
F1) for ref-based MT and summarization and
BARTScore-FN (FN stands for Faithfulness) 为了
ref-free summarization. We consider two variants
of XMoverScore with different remapping strate-
gies for multilingual embeddings (CLP, UMD)
and two variants of SentSim with different word
matching paradigms (BERTScore, 大规模杀伤性武器). We re-
port the DiscoScore variant with feature ‘Focus
Frequency’.

4.2 数据集 & Evaluation Protocol

We summarize our used datasets in Table 6. 到
evaluate the metrics’ robustness under adversar-
ial conditions, we use the datasets introduced in
§3 and additionally Rank19 (Falke et al., 2019)
(only for ref-free summarization), which contains
examples composed of documents paired with
one correct and one incorrect candidate summary
with real-world factuality errors. 一般来说, 我们
check the metrics’ preference between the two
candidates and calculate accuracy: the relative
frequency that the metrics correctly choose among
the two alternatives.

On MT standard benchmarks, we evaluate the
metrics on both segment-level (where we cor-
relate metrics scores to human judgments for
individual sentences/segments in the datasets)
and system-level (where we correlate the aver-
age metric scores to the average human scores
over the segments generated by each system),
using Pearson correlation as the performance in-
dicator. On SummEval for summarization, 我们
compute Kendall correlation with system-level
human judgements on four criteria: 连贯性,
一致性,
fluency and relevance (we apply
two aggregation methods for the multi-reference
环境, max and mean). We calculate Pearson cor-
relation with both summary-level (analogous to

810

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
t

A
C

/

A
r
t

C
e

p
d

F
/

d


/

.

1
0
1
1
6
2

/
t

A
C
_
A
_
0
0
5
7
6
2
1
4
3
2
9
7

/

/
t

A
C
_
A
_
0
0
5
7
6
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

segment-level in MT) and system-level LitePyra-
mids (Shapira et al., 2019) human ratings in
RealSumm.

4.3 NLI as a Metric

NLI systems yield probability distributions over
Entailment, Contradiction, and Neutral. Wedenote
the probability values as e, C, and n, where e + C +
n = 1 and e, C, n ≥ 0. We first determine how to
leverage the three values as NLI metrics.

这样做, we evaluate five simple formulas of
their arithmetic combination in a heuristic way: (1)
e, (2) -C, (3) e-n, (4) e-c, 和 (5) e-n-2c, and inspect
their effect in three directions, which correspond
to the entailment directions implication, reverse
implication and bi-implication: (我) ref /src → cand,
where ref or src act as premise and cand as
假设; (二) ref /src ← cand, where cand acts
as premise and ref or src act as hypothesis; 和
(三、) ref /src ↔ cand, as arithmetic average over
the two above cases.

例如, to obtain e-n from ref /src ↔ cand,
we first average the three probability scores over
direction ref /src → cand and ref /src ← cand, 然后
calculate e-n based on the averaged scores. 我们
only consider direction src → cand for ref-free
summarization, since hypothesis does not need to
entail source document. The various selections of
the formulas and directions result in 15 pooling
strategies for NLI-based metrics.

NLI Systems We explore both monolingual and
cross-lingual NLI-based metrics. For each setup,
we choose two NLI models, which are obtained
from Hugging Face or fine-tuning by ourselves.

For monolingual NLI metrics, we choose
(1) a RoBERTa-large model (刘等人。, 2019)
fine-tuned on SNLI (Bowman et al., 2015), MNLI,
Fever (Nie et al., 2019) and ANLI (Nie et al., 2020)
by Nie et al. (2020) 和 (2) a DeBERTa-large
model fine-tuned by He et al. (2021), using MNLI.
We denote the NLI metrics induced from these two
models as NLI-R and NLI-D. They will be used
for ref-based MT evaluation, and both ref-based
and -free summarization evaluation tasks. 笔记
那, while NLI-R has been fine-tuned on adversar-
ial NLI (ANLI), which has been shown to increase
robustness on (例如) negation and numer-
ical reasoning, NLI-D has not been trained on
ANLI. Cross-lingual NLI metrics should handle
premises and hypotheses in different languages,
so we select the multilingual versions of the under-

(A) Reference-based

ref→cand
ref←cand
ref↔cand

e
3+0

0+4

e-n

-C
3+0

e-c
2+0

e-n-2c

0+3

0+1

0+2

(乙) Reference-free

src→cand
src←cand
src↔cand

-C
2+0

e

0+1
0+1

e-n

e-c

e-n-2c

0+2
4+6

4+0

桌子 7: Winning frequency of different pooling
strategies for NLI metrics on adversarial (第一的
entry) and MT datasets (second entry). We only
show non-zero entries.

lying models of NLI-R/NLI-D. (1) We fine-tune
a XLM-RoBERTa-base model (Conneau et al.,
2019), using the datasets for fine-tuning NLI-R
as well as XNLI dataset (Conneau et al., 2018). (2)
We select an mDeBERTa-base model fine-tuned
on MNLI and XNLI. We denote the correspond-
ing cross-lingual NLI metrics as XNLI-R and
XNLI-D.

5 Experiment Results

Before outlining our main results in §5.1 (公吨)
and §5.2 (summarization), we first discuss good
pooling strategies for NLI metrics.

Pooling Strategy We determine the pooling
strategy for NLI metrics in MT evaluation from
(1) the accuracy on the adversarial datasets and
(2) the correlation with human judgements on
the standard (segment-level) MT datasets. 我们
leverage the winning frequency of the pooling
strategies to choose the best one; a strategy wins
if it works best for an NLI metric among all 15
策略. 全面的, we find that the simple for-
mula e from the direction src/ref↔cand is a good
choice which works well for both standard and ad-
versarial benchmarks, even though slightly better
formulas could be chosen in selected subsettings
(例如, ref-based vs. ref-free evaluation), 见表 7
举些例子.

For summarization, the situation is slightly
更复杂: (1) e-c from direction ref←cand
performs best for ref-based NLI metrics; (2) -C
from direction src→cand is the best strategy for

811

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
t

A
C

/

A
r
t

C
e

p
d

F
/

d


/

.

1
0
1
1
6
2

/
t

A
C
_
A
_
0
0
5
7
6
2
1
4
3
2
9
7

/

/
t

A
C
_
A
_
0
0
5
7
6
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

Adv.

公吨

ref-based
all adeq. all adeq.

ref-free

ref-based
系统
seg

ref-free
系统

seg

67.4 67.0 76.8 74.5 0.676 0.808 0.620
74.8 79.8

0.708 0.807

0.698

32.9 27.2
34.3 28.7
48.3 46.9





0.380 0.757
0.425 0.774
0.567 0.806

74.5 71.7
73.8 70.9






0.400
0.422




0.672
0.673



65.3 60.9
67.4 64.2
78.4 77.8
68.1 67.8 62.7 65.5 0.612 0.401 0.421 −0.021
62.1 61.9 63.0 65.8 0.607

0.620 0.799
0.587 0.761
0.593 0.802

0.427



85.0 92.1 70.5 75.8 0.451 0.756 0.221
86.6 92.3 79.3 85.8 0.439 0.770 0.149

0.335
0.581

Supervised
COMET
BLEURT
Unsupervised
sentBLEU
Rouge
MoverScore
XMoverS(UMD)
XMoverS(CLP)
BERTS
BARTS-P
BARTS-F
SentS(BERTS)
SentS(大规模杀伤性武器)
NLI-based
X(NLI)-右
X(NLI)-D

桌子 8: Pearson correlation with human judg-
ments in WMT and accuracy (%) on our
adversarial datasets, averaged over datasets. 这
performance of ref-based COMET is averaged
over WMT20de and XPAWSde, since it also re-
quires source texts as input. In bold: best results
among all unsupervised metrics including the
NLI-based metrics.

ref-free NLI metrics. 因此, we compare NLI met-
rics adopting these strategies with classic metrics.
Even though we only looked at global aggregate
统计数据, we still observe that our method of
identifying the pooling strategies above leveraged
the data on which we will later evaluate the NLI
指标. To avoid leaking information from the test
放, we evaluate NLI metrics on each dataset with
the pooling strategy selected from the remaining
datasets for that task in §6.

5.1 机器翻译

5.1.1 Adversarial Evaluation

We now compare our NLI metrics with the best
pooling strategy to our baseline metrics under
adversarial conditions.

From Table 8 (columns ‘‘Adv.’’), we observe
that in the ref-based setup: (1) NLI metrics out-
perform the great majority of metrics by a huge
margin: 超过 85% 与. 32%–78% (all phenomena)
和 92% 与. 27%–80% (adequacy phenomena
仅有的) 一般. (2) 更远, the two NLI metrics
perform similarly. In the ref-free setup, the best
cross-lingual NLI metric (XNLI-D) is still most
robust under our attacks. 然而, NLI metrics
do not as substantially outperform the other met-

812

rics as in the ref-based setup. A potential reason is
that the cross-lingual NLI models underperform
compared to the monolingual setup (the prefer-
ences we query for in the reference-free setup
may also play a role). 尽管如此, when ex-
cluding the fluency-related phenomena from the
adversarial datasets, XNLI-D is still on average
10 points better than the best standard metric,
COMET (86% 与. 75%).

而且, our results reveal

那: (1) 最多
standard metrics are particularly incapable of
detecting name error, number error, 和亲-
noun error (∼29%–70%); (2) standard metrics,
especially BLEURT and COMET, are most
competitive regarding omission, 添加, 和
jumbling (∼80%–100%); (3) NLI metrics are sub-
optimal for fluency attacks (mostly at random
等级), especially the reference-free NLI metrics;
和 (4) NLI metrics are much better at name er-
ror, 否定, number error, pronoun error, 和
adj. mismatch than most of the other metrics,
especially ref-based (>90% vs. ∼10%–80%), 作为
shown in Figure 1.

Our observations are inconsistent with Karpinska
等人.
(2022), where the state-of-the-art MT
metrics mostly obtain >95% accuracy in the
preference-based evaluation. The reason is that
our test suites are much more difficult for the
evaluation metrics because we challenge them
by lexical overlap between source/reference and
candidate sentences during attacks: Metrics must
choose between high lexical overlap adversar-
ial candidates (with key errors) over low lexical
overlap paraphrases. 相比之下, in Karpinska
等人. (2022), metrics are challenged to assign cor-
rect preferences for score(ref, t) 与. 分数(ref, t(西德:8))
where t is a candidate and t(西德:8) the perturbed can-
didate. This is a much easier comparison because
neither are ref and t maximally dissimilar (但
meaning equivalent) nor are ref and t(西德:8) maximally
相似的. This is an important lesson: How to de-
sign the adversarial preferences may critically
affect the assessment of whether recent metrics
are robust or not.

5.1.2 Standard Benchmarks
Ref-based We give average results over all
datasets in Table 8 (columns ‘MT’;
individ-
ual results are available in our Github). 为了
segment-level evaluation, we observe: (1) trained
指标 (COMET and BLEURT) substantially
outperform the others, with average performance

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
t

A
C

/

A
r
t

C
e

p
d

F
/

d


/

.

1
0
1
1
6
2

/
t

A
C
_
A
_
0
0
5
7
6
2
1
4
3
2
9
7

/

/
t

A
C
_
A
_
0
0
5
7
6
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

数字 1: Average accuracy (values in each block) of all metrics per phenomenon over the adversarial datasets for
ref-based MT evaluation. Darker color indicates higher accuracy and vice versa.

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
t

A
C

/

A
r
t

C
e

p
d

F
/

d


/

.

1
0
1
1
6
2

/
t

A
C
_
A
_
0
0
5
7
6
2
1
4
3
2
9
7

/

/
t

A
C
_
A
_
0
0
5
7
6
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

of ∼0.7 Pearson. (2) Unsupervised SOTA met-
rics have average correlation of ∼0.6 Pearson,
BERTScore is the best among them. (3) 我们的
NLI-based metrics are not competitive, with cor-
relations of ∼0.45 Pearson. When correlating with
system-level human judgments, NLI metrics still
underperform most of the SOTA metrics, 但是
margin is much smaller.

(>0.6

评估

Ref-free Trained metrics also dominate in
segment-level
皮尔逊),
whereas the two NLI-based metrics perform
much worse than the others (0.15-0.22 皮尔逊).
尽管如此, XNLI-D performs on par with
COMET and better than the others on WMT20 at
system-level.

全面的, we conclude that our NLI metrics are
not competitive with state-of-the-art evaluation
metrics on standard MT datasets, especially at
segment-level and ref-free.

5.1.3 Combined Metrics

Observing that NLI metrics are strong on adversar-
ial setups, but comparatively weaker in standard

评估, we examine how to get more robust
metrics which also perform well on standard
benchmarks. 这样做, we take the weighted
average of NLI and classical metrics:

C = wnli · N + (1 − wnli) · M

(2)

where wnli ∈ [0, 1] is the weight for NLI metric N
and M is a classical metric. Before combination,
we rescale M and N to [0, 1], using min-max
normalization.

We illustrate the performance of the com-
bined evaluation metrics with (X)NLI-R
on both adversarial and standard benchmarks
the results for
图中 2;
(segment-level)
(X)NLI-D and for system-level are similar. 这
x-axis denotes the average accuracy over the
adversarial datasets, while y-axis is the average
Pearson correlation over the standard benchmarks
(MT datasets). Each dot in each graph shows
the value C(wnli) for a specific weight wnli.
the graphs show an
As seen from Figure 2,
intriguing concave curvature. In standard MT
the combination boosts the metric
评估,

813

数字 2: Accuracy on adversarial datasets and Pearson
correlation with segment-level human judgements in
WMT datasets of combined metrics with (X)NLI-R,
averaged over datasets. The points on each path from
the original metric to the NLI metric indicate wnli =
0, 0.1, . . . , 1. The purple line denoting the combination
with ref-based COMET ends at another point since
the corresponding adversarial performance is averaged
超过 2 adversarial datasets containing source texts.

performance when wnli is small (从 0.1 到 0.4)
in virtually all cases. We then see a simultaneous
increase of adversarial robustness and quality on
standard benchmarks. In ref-based setup, 例如,
for wnli = 0.2, we observe: (1) MoverScore and
BARTScore-P improve most, with ∼8% (从
分别)
0.57/0.59 到 0.61/0.64 皮尔逊,
and 21%–36% improvements on adversarial
datasets (从 48%/67% 到 66%/82% 准确性
一般). (2) The best unsupervised metric
增加
on segment-level MT, BERTScore,
∼4% Pearson on standard benchmarks and
∼24% accuracy on adversarial datasets. (3) 这
robust untrained metric, BARTScore-F,
最多
improves about ∼11% in robustness, whereas its
performance on standard benchmarks also rises
∼5%. (4) The improvements on MT for trained
metrics are smaller compared to those untrained
指标, with COMET improving only 1.5%
and BLEURT even becoming worse with the
choice wnli = 0.2. 然而, their performance
in defending adversarial attacks still improves
∼10%–20%.
In ref-free setups, all metrics
improve ∼6%–7% on adversarial datasets. 这样的
setting only substantially boosts XMoverScore’s
performance on standard benchmarks, 和
∼6%–9%.

We summarize the improvements for all com-
binations in Figure 3(A), which are averages over
all experiments considered here. We can observe
that the line denoting improvements on standard

数字 3: Improvements of all metrics on standard
benchmarks and adversarial datasets for wnli = 0.1,
… 0.9, averaged over all experiments. We show 95%
confidence interval.

benchmarks peaks at wnli = 0.2, and the average
improvements are positive when wnli ≤ 0.5. 毛皮-
ther, on the adversarial datasets, the improvement
monotonously increases with wnli and the gain is a
concave function of wnli which saturates as wnli be-
comes larger. The sweet spots are wnli ∈ [0.2, 0.3],
which leads to 5%–6% improvement on stan-
dard benchmarks and 14%–16% improvement in
adversarial robustness on average. When exclud-
ing the fluency phenomena from the adversarial
datasets, the combined metrics consistently gain
larger improvements in adversarial robustness,
with 20%–24% improvements at the sweet spots.

5.2 Summarization

相似的

Evaluation As Table 9 节目,

MT evaluation, NLI-based metrics exhibit much
stronger robustness under adversarial conditions
(our best NLI metrics have at least ∼8 points
higher accuracy than the best standard met-
rics; right-most columns). The difference is that
the vanilla NLI metrics are now also compara-
bly effective to the SOTA metrics on standard
benchmarks. 例如, in ref-based setup,
NLI-D with max aggregation beats all metrics
except for DiscoScore with mean on SummEval
and both NLI metrics highly correlate with
system-level human ratings in RealSumm (多于

814

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
t

A
C

/

A
r
t

C
e

p
d

F
/

d


/

.

1
0
1
1
6
2

/
t

A
C
_
A
_
0
0
5
7
6
2
1
4
3
2
9
7

/

/
t

A
C
_
A
_
0
0
5
7
6
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

蓝线

Rouge
MoverS

BERTS

BARTS-F

DiscoS

NLI-based

NLI-R

NLI-D

metric

BARTS-FN

SUPERT

NLI-based
NLI-R

NLI-D

(A) Reference-based

metric

连贯性

一致性

SummEval
fluency

mean max mean
0.279
0.294

max

意思是
0.244

max
0.044 −0.029
0.229
0.088 −0.279 −0.037 −0.081
0.362
0.456

0.421

0.103

0.191
0.206

0.176
0.324

0.618

0.618

0.221
0.044
0.176 −0.044
0.250
0.206

0.647

0.515

0.676

0.279

0.279

0.676

0.273

0.376

0.317

0.539

0.185

0.185

0.450

0.554

关联

mean max
0.397 0.382

0.118 0.103
0.368 0.515

0.603 0.515

0.500 0.368

0.529 0.632

0.632 0.353

平均

max
意思是
0.245
0.215
0.090 −0.020
0.326
0.363

0.340

0.237

0.429

0.385

0.392

0.532

RealSumm
litePyr

Adv.
SEadv


0.480

0.540
0.585

0.574

0.478

系统
0.124

0.457
0.501

全部

adeq.
0.182 0.109

0.185 0.117
0.287 0.251

0.380

0.598 0.574

0.531

0.697 0.692

0.583

0.495
0.788 0.792
0.466 −0.199 −0.066 0.334 0.294

0.687

BARTS-P

0.485

0.441

0.147

0.074

0.632

0.250

0.265

0.706

0.676

0.750

0.494

0.568

0.450

0.613

0.279 0.206

0.471 0.397

0.388

0.499

0.352

0.506

0.525

0.489

0.856 0.864 0.905

0.840 0.806 0.843

(乙) Reference-free

SummEval

连贯性

一致性

fluency relevance

平均

0.735

0.147

0.221

0.162

0.132

0.603

0.235

0.647

0.391

0.465

0.391

0.332

0.662

0.279

0.500

0.324

0.480

0.374

0.337

0.366

RealSumm
litePyr

Adv.

SEadv

summary

系统
0.178 −0.023
0.522
0.626

全部

0.427

0.296

adeq.

0.389

0.273

Rank19

平均

0.796 0.612

0.668 0.482

0.300
−0.076

0.688

0.568

0.720

0.624

0.722

0.629

0.866 0.793

0.885 0.755

桌子 9: Kendall correlation with system-level human judgments in SummEval. Pearson correlation
with summary/system-level litePyramid in RealSumm. Accuracy on adversarial benchmarks, averaged
over phenomena in SEadv. We bold the best performance on each criterion. ‘‘max/mean’’ denotes the
aggregation method used for multi-reference setting in ref-based evaluation on SummEval.

0.8 皮尔逊), where most standard metrics ob-
tain only 0.5–0.7 Pearson correlations. 什么时候
considering all evaluation dimensions of Sum-
mEval and RealSumm, NLI-D outperforms all
other metrics, followed by NLI-R. Besides, 我们
observe that NLI metrics correlate much bet-
ter with human judgments regarding consistency
和 (somewhat surprisingly) fluency in Sum-
mEval compared to the other metrics. 为了
ref-free setup, BARTScore-FN performs best on
SummEval—it outperforms the other metrics by
多于 0.1 Kendall on average. 然而, it does
not correlate well with both summary-level and
system-level human judgments in RealSumm. NLI
metrics are comparable or better than standard
metrics on system-level. 例如, NLI-R
performs best among the examined metrics and is
关于 0.06 Pearson better than the best standard
metric (SUPERT) on system-level in RealSumm.
尽管如此, reference-free NLI metrics also
perform worse than the reference-based ones
as in MT; an explicit bottleneck for the two
NLI metrics is that they were only trained on
NLI data with short sentences, but reference-free

summarization evaluation requires metrics to deal
with source documents which contain many more
句子.

to MT,

Combined Metrics
图中 3(乙), we sum-
marize the median improvements of combined
summarization metrics (the median smooths some
异常值). 相比之下
the combination
brings almost equal benefits to performance of
standard metrics on standard and adversarial
benchmarks concerning only adequacy—we again
observe a decrease in improvements on adversarial
datasets when adding our fluency phenomena. 我们
identify a best wnli, 即, 0.8, with which the
standard metrics gain about 25%–30% improve-
ments in both types of performances (adversarial
and standard).

6 讨论 & 分析

Selected Failure Cases of Metrics: 桌子 10
shows selected failure cases of four popular
指标 (BERTScore, BARTScore, BLEURT,
COMET), where the NLI metrics are correct in

815

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
t

A
C

/

A
r
t

C
e

p
d

F
/

d


/

.

1
0
1
1
6
2

/
t

A
C
_
A
_
0
0
5
7
6
2
1
4
3
2
9
7

/

/
t

A
C
_
A
_
0
0
5
7
6
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

ref

candpara

candadv

scorepara:
scoreadv
(标准
metric)

scorepara:
scoreadv
(NLI-R)

错误

BERTScore
Although President George
瓦. Bush says he believes in
市场, in this case he has
called for voluntary action.

BARTScore-F
Reagan and I were nonetheless
able to create a reservoir
of constructive spirit
through constant outreach
and face-to-face interaction.

BLEURT
在 2012, when Freedom
House downgraded Mali to
‘‘not free,’’ engagement
拒绝了 7%.
This leads to heavy deforestation
and lethal indoor
air pollution, which kills 1.3
million people each year.

COMET
Who serves as president of
the United States is critically
important for
Mexicans.

Although President George
瓦. Bush says he believes in
市场, he has demanded
voluntary action in this case.

Although President George
瓦. Bush says she believes in
市场, in this case she has
called for voluntary action.

0.980:
0.982

0.951:
0.000

Pronoun

尽管如此, Reagan and
I were able to create a
constructive climate through
constantcontact and
personal interaction.

Nicole and I were nonetheless
able to create a reservoir
of constructive spirit
through constant outreach
and face-to-face interaction.

−2.104:
−1.527

0.943:
0.002

姓名

在 2012, when Freedom
House classified Mali as
unfree, the engagement fell by
7 百分.
This leads to heavy
Deforestation and lethal indoor
air pollution, which kills one
point three million people
each year.

在 2012, when Freedom
House downgraded Melissa
to ‘‘not free,’’ engagement
拒绝了 7%.
This leads to heavy
Deforestation and lethal indoor
air pollution, which kills
6.9 million people each year.

0.787:
0.834

0.983:
0.030

姓名

0.682:
0.767

0.783:
0.000

Num

Anyone who serves as
President of the United States is
crucial to Mexicans.

Who serves as president of
the United States is not
critically important for
Mexicans.

1.067:
1.086

0.974:
0.044

Negation

桌子 10: Sample instances in adversarial datasets where standard metrics failed while NLI-R succeeded;
ref-based setup. In the 4th and 5th columns, we show [score assigned to candpara]: [score assigned
to candadv] by standard metrics and NLI-R, 分别; robust metrics should give candpara higher
scores. Green bold texts indicate the anchor words/phrases to be perturbed and the red ones in
candadvrefer to the corresponding perturbed texts.

each case. In the examples, BERTScore prefers
text with the wrong gendered pronoun over a le-
gitimate paraphrase and even trained metrics like
BLEURT fail on severe name changes such as
‘‘Melissa’’ (a person name) 与. ‘‘Mali’’ (a coun-
try name). Leveraging more subtle cases (例如,
mismatches based on wrong word senses instead
of random mismatches with the same POS or re-
placing names with names of the same ‘type’)
would likely constitute even harder test cases for
future metrics.

No Metric is Good Everywhere: Across distinct
方面, different metrics perform differently,
indicating that they capture varying aspects. 为了
例子, NLI metrics are not so good on flu-
ency adversarial attacks, 例如, typos. This may be
unsurprising, given that fluency is a low-level phe-
nomenon while NLI concerns high-level logical
relationships between sentences (some fluency
phenomena would best be treated by switch-

ing to a lower-level representation space, 这样的
as character-level [Vu et al., 2022]; this could
seamlessly be integrated in existing NLI mod-
这). The NLI metrics are also weaker concerning
segment-level MT evaluation on standard bench-
marks. 然而, NLI metrics alone perform
surprisingly well: In ref-based MT, they win on 7
在......之外 19 方面 (12 adversarial phenomena
和 7 standard datasets, evaluated segment- 和
system-level), only beaten by BLEURT (8 wins);
ref-free, they win 5 在......之外 19 方面, 第二
only to COMET (11 wins). In ref-based summa-
rization, they are clearly ahead of all standard
指标, winning not only 8 在......之外 12 adversarial
方面, but also system-level LitePyramid,
consistency and fluency (因此, 11 在......之外 18 wins),
clearly ahead of BARTScore-P (4 的 18); ref-free,
they are also best and win 13 在......之外 18 的-
mensions. The best overall metrics, measured as
average performance over standard and adversar-
ial datasets, always include NLI: for ref-based MT,

816

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
t

A
C

/

A
r
t

C
e

p
d

F
/

d


/

.

1
0
1
1
6
2

/
t

A
C
_
A
_
0
0
5
7
6
2
1
4
3
2
9
7

/

/
t

A
C
_
A
_
0
0
5
7
6
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

this is BLEURT+0.2×NLI-R, for ref-free MT, 这是
COMET+0.3×NLI-D. For summarization, NLI-R
alone and combined with BARTScore-F perform
best on average.

Rescaling: The min-max normalization we used
(a standard technique for normalizing data in ma-
chine learning, typically applied to input features)
for metric combination requires batch processing.
It is necessary to account for the different ranges
of metrics, 例如, some metrics take negative val-
厄斯. An alternative would include to enforce more
formal constraints on evaluation metrics, IE。, 那
they should take outputs in [0,1]. When applying
our combined metrics in practice, one could also
replace them by surrogate metrics trained on the
outputs of the original combined metrics or simply
take the min-max values inferred from the datasets
already evaluated on—the larger these datasets the
more reliably are min and max estimated.

Sensitvity to wnli: Having different weights
wnli for different tasks is undesirable, 因为它
requires considering each task individually. 如何-
曾经, in our experiments, we found that all small
wnli (以下 0.5) yield good performances and
are thus safe choices: They increase adversar-
ial robustness and also lead to better metrics on
standard benchmarks.

Adversarial Performance vs. Standard Perfor-
曼斯: From our experiments, it might seem
that adversarial and standard performance are
anti-correlated: A metric with higher adversar-
ial performance may have lower performance on
standard benchmarks and vice versa. While this
would not necessarily be a major surprise as adver-
sarial conditions oftentimes test phenomena that
are otherwise not represented in standard bench-
marks (Niven and Kao, 2019), a statistical analysis
reveals that standard performance generally pos-
itively correlates to the adversarial performance
in our case, consistent with our earlier argument
that existing NLG systems in the real world do
commit similar errors as we check for. 这样做,
we first convert the metrics’ standard performance
to rankings for each performance category (例如,
ref-based/-free segment/system-level MT perfor-
曼斯, performance on SummEval/RealSumm),
then we correlate the ranking-based standard
performance to the corresponding adversarial per-
formance rankings, obtaining 0.37 Spearman.

When excluding NLI metrics,
increases to 0.60.

the correlation

The Choice of candpara Matters: As indi-
cated in §3, we speculate that a good adversarial
setting maximizes (surface) dissimilarity between
ref and candpara (which can better trick the met-
rics). To investigate, we compute the normalized
edit distance between ref and candpara;7 一个更大的
edit distance means a greater dissimilarity. If our
assumption is true, then larger edit distances rep-
resent harder test cases for the metrics. We find:
(1) the average edit distance for the test cases
where the metrics fail to defend against the ad-
versarial attacks is 0.01–0.6 larger than that for
where they succeed, averaged over metrics; (2)
for PAWSback and PAWSori (both induced from
PAWS) where the candpara are obtained in dif-
不同的方式, all metrics achieve 0.02-0.15 降低
accuracy on PAWSori, which has 0.46 larger av-
erage edit distance than PAWSback, 反过来. 两个都
findings confirm our above assumption. 在阿迪-
的, we observe that NLI metrics have the smallest
difference between the edit distances for failure
and success cases (0.01–0.26) as well as that be-
tween the accuracy on PAWSback and PAWSori
(0.02) among all evaluation metrics. This implies
that they are least affected by surface overlap
and instead better consider the logical relation-
ship between sentences. This is what makes them
attractive as evaluation metrics.

The Choice of candadv Matters, Too: 我们
evaluate on one complex attack combining Num-
ber error with Negation which increases the
difference between ref and candadvbased on
the test cases for Number error in WMT20de.
The accuracy increases by an average of 0.28 超过
all metrics. This confirms our assumption that
maximizing the (surface) similarity between ref
and candadv (but with key errors) leads to harder
test suites and vice versa.

Ensemble with NLI Metrics Are More Ef-
有感染力的: We compare the ensembles with NLI
metrics to ensembles with standard metrics, IE。,
w · A + (1 − w) · M , where A is a fixed standard
metric and M is any of the remaining metrics. 到
do so, we combine standard metrics with the rest
metrics for each category of MT/summarization
and ref-based/-free setting. We take the arithmetic

7Ref-free, the edit distance between r and ref isconsidered.

817

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
t

A
C

/

A
r
t

C
e

p
d

F
/

d


/

.

1
0
1
1
6
2

/
t

A
C
_
A
_
0
0
5
7
6
2
1
4
3
2
9
7

/

/
t

A
C
_
A
_
0
0
5
7
6
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

account the multi-sentence nature of source texts
and outputs, especially in summarization.

To remedy the mismatch between the gran-
ularities of the training data of NLI models
and the input data of summarization evalua-
的, IE。, 句子- 与. document-level, Laban
等人. (2022) propose both supervised and un-
supervised NLI-based summarization metrics for
inconsistency detection. We test their unsuper-
vised variant (SummaCZS),8 which segments
documents into sentence units and aggregates
scores between pairs of sentences, 与
underlying model of NLI-R. 然而, Sum-
maCZS does not consistently outperform NLI-R
across all datasets; 相比之下, NLI-R performs
much better in our adversarial
test compared

to SummaCZS (72% 与. 53%). Besides,
match the training data of NLI models with the
task of factual inconsistency detection in sum-
marization, Utama et al. (2022) introduce an
augmented NLI dataset with task-oriented exam-
ples based on CNNDM—FalseSum; we evaluate
three Roberta-large models finetuned on it and
MNLI. Similar to SummaCZS, this also does
not always yield better performance compared to
simple NLI metrics (∼55%–68% vs. 72% on ad-
versarial datasets). 全面的, both approaches work
well on SummEval, but not so well on RealSumm
and our adversarial benchmark.

Choice of Pooling Strategy: To examine the
issue of data leakage discussed in §5, we now
evaluate the NLI metrics on each dataset with
the pooling strategy selected from the remain-
ing datasets (excluding the one for evaluation)
based on winning frequency. 例如, 为了
the segment-level MT evaluation on WMT15,
we choose the pooling strategy which wins most
times on all MT datasets (including all standard
datasets for both segment/system-level evalua-
tion and the adversarial datasets) 除了
WMT15. We observe that this change in pooling
strategy induction results in minor performance
variation: −1.9% for segment-level evaluation,
+0.8% for system-level evaluation, and −0.7%
for adversarial evaluation. For summarization,
as only one direction—i.e., src→cand—is con-
sidered for ref-free NLI metrics, we separately
the pooling strategy for ref-based and
选择

数字 4: Accuracy on adversarial datasets and Pearson
correlation with segment-level human judgements in
WMT datasets of combined metrics with BERTScore,
averaged over datasets. The green line denoting the
combination with COMET ends at another point since
the corresponding adversarial performance is only av-
eraged over the 2 adversarial datasets containing source
文本.

average of the accuracy on adversarial bench-
marks and correlations on standard benchmarks
as the overall metric performance here. We calcu-
late the mean/maximal improvement of ensembles
to the original metric M over w ∈ [0.1, 0.9] 和
observe: (我) While the ensembles with standard
metrics are better for ref-free MT metrics because
cross-lingual NLI metrics perform very poorly
in our experiments, (二) the monolingual NLI
metrics lead to much better ensembles—17/15
points larger mean/max improvement—compared
to the standard metrics. (三、) 全面的, the ensem-
bles with NLI metrics yield 10/7 points larger
mean/max improvement in overall performance
than with standard metrics (averaged over all 4
任务: ref-based/-free MT/summarization). 因此,
(monolingual) NLI metrics have unique proper-
领带, compared to standard metrics, making them
attractive in ensembles.

To illustrate, 数字 4 shows ensembles with
BERTScore. These show minor or no improve-
ments on standard benchmarks and also mixed
(often negative) results for adversarial robustness.

In §5, we applied
SummaCZS and Falsesum:
NLI systems on whole input texts, not taking into

8We do not compare to the supervised one as it is trained
on a consistency dataset for summarization task, for a fairer
比较.

818

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
t

A
C

/

A
r
t

C
e

p
d

F
/

d


/

.

1
0
1
1
6
2

/
t

A
C
_
A
_
0
0
5
7
6
2
1
4
3
2
9
7

/

/
t

A
C
_
A
_
0
0
5
7
6
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

ref-free NLI metrics. 全面的, we have no perfor-
mance change for the ref-free setting and −3.6%
performance on average over all five criteria
(correlations on SummEval with max/mean ag-
gregation, summary/system-level correlations on
RealSumm, and accuracy on SEadv) ref-based.
因此, the changes are again minor.

Comparison to RoMe: As the authors of RoMe
did not publish their adversarial dataset, we com-
pare RoMe’s performance with our metrics on
one of our adversarial datasets, WMT20de, 在-
代替. RoMe has an average accuracy of 43%,
with > 90% accuracy only on the phenomena
SVD and omission, which are the easiest for most
standard metrics. 相比之下, our NLI metrics
have above 80% average accuracy. As RoMe
does not evaluate on MT or summarization, 我们
also evaluate our NLI metrics on one (randomly
选择的) data-to-text generation dataset used in
Rony et al. (2022)—BAGEL (Mairesse et al.,
2010). RoMe and our NLI metrics perform on
par here (∼0.23 Spearman’s ρ). 全面的,

seems to imply that simple NLI models taken
out of the box are better and more robust metrics
than a specially trained approach such as RoMe.

7 Concluding Remarks

在这项工作中, we explored NLI as a general
paradigm for evaluation metrics. 我们展示了
that NLI metrics yield adversarial robustness,
and are also strong—though not always state-of-
the-art—when it comes to standard metric eval-
uation benchmarks. By linearly interpolating
已确立的 (BERT-based) metrics with our NLI
指标, we obtained high-quality metrics along
both axes: adversarial robustness and standard
benchmarks, with substantial gains over recent
BERT-based metrics.

A potential reason why NLI based metrics
perform subpar on some standard benchmarks
(especially in MT) is the training data mismatch,
IE。, typical NLI datasets contain many artificial
sentences of the type ‘‘A girl is playing on a
piano’’. A further limitation is that cross-lingual
NLI models are not yet high-quality enough and
that most current NLI models are sentence-level,
not document-level—with a few recent exceptions
(Yin et al., 2021). Once these limitations of NLI
are overcome, we believe that even better perfor-
mances from NLI based metrics can be expected,

哪个, we believe, is one of the most promis-
ing directions for future high-quality and robust
evaluation metric design. Future work should also
consider NLI metrics for other text generation
任务; the NLI paradigm looks especially promis-
ing for tasks that require comparison with human
参考, which oftentimes involve the concept
of logical equivalence.

致谢

We thank Zuojun Shi for conducting initial exper-
iments related to this paper as part of her Bachelor
thesis at TU Darmstadt. We appreciate the re-
viewers and editors from TACL for their time,
努力, and greatly helpful comments. 我们也
thankfully acknowledge support from the BMBF
via the grant ‘‘Metrics4NLG’’. Steffen Eger is
financed by DFG grant EG 375/5–1.

参考

Yonatan Belinkov, Adam Poliak, Stuart Shieber,
and Alexander
Benjamin Van Durme,
匆忙. 2019. On adversarial
removal of
hypothesis-only bias in natural language in-
参考. In Proceedings of the Eighth Joint
Conference on Lexical and Computational
语义学
(*SEM 2019), pages 256–262,
明尼阿波利斯, Minnesota. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/S19-1028

Jonas Belouadi and Steffen Eger. 2023. Uscore:
An effective approach to fully unsupervised
evaluation metrics for machine translation. 在
EACL.

Luisa Bentivogli, Arianna Bisazza, Mauro
Cettolo, and Marcello Federico. 2016. 新-
ral versus phrase-based machine translation
质量: 案例研究. 在诉讼程序中
2016 Conference on Empirical Methods in Nat-
ural Language Processing, pages 257–267,
Austin, 德克萨斯州. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/D16-1025

Manik Bhandari, Pranav Narayan Gour, Atabak
Ashfaq, Pengfei Liu, and Graham Neubig. 2020.
Re-evaluating evaluation in text summariza-
的. 在诉讼程序中 2020 会议
on Empirical Methods in Natural Language

819

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
t

A
C

/

A
r
t

C
e

p
d

F
/

d


/

.

1
0
1
1
6
2

/
t

A
C
_
A
_
0
0
5
7
6
2
1
4
3
2
9
7

/

/
t

A
C
_
A
_
0
0
5
7
6
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

加工 (EMNLP), pages 9347–9359, 在-
线. Association for Computational Linguis-
抽动症. https://doi.org/10.18653/v1
/2020.emnlp-main.751

Ondˇrej Bojar, Yvette Graham, and Amir Kamran.
2017. Results of the WMT17 metrics shared
任务. In Proceedings of the Second Confer-
ence on Machine Translation, pages 489–513,
哥本哈根, 丹麦. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/W17-4755

Ondˇrej Bojar, Yvette Graham, Amir Kamran,
and Miloˇs Stanojevi´c. 2016. Results of the
WMT16 metrics shared task. 在诉讼程序中
the First Conference on Machine Translation:
体积 2, Shared Task Papers, pages 199–231,
柏林, 德国. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/W16-2302

Samuel R. Bowman, Gabor Angeli, Christopher
波茨, and Christopher D. 曼宁. 2015. A
large annotated corpus for learning natural lan-
guage inference. 在诉讼程序中 2015
Conference on Empirical Methods in Nat-
ural Language Processing, pages 632–642,
里斯本, Portugal. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/D15-1075

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei,
Hui Jiang, and Diana Inkpen. 2017. En-
hanced LSTM for natural language inference.
In Proceedings of the 55th Annual Meeting of
the Association for Computational Linguistics
(体积 1: Long Papers), pages 1657–1668,
Vancouver, 加拿大. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/P17-1152

Xi Chen, Nan Ding, Tomer Levinboim, 和
Radu Soricut. 2020. Improving text gener-
ation evaluation with batch centering and
In Pro-
tempered word mover distance.
ceedings of
the First Workshop on Eval-
uation and Comparison of NLP Systems,
pages 51–59, 在线的. Association for Compu-
tational Linguistics. https://doi.org/10
.18653/v1/2020.eval4nlp-1.6

Yiran Chen, Pengfei Liu, and Xipeng Qiu.
2021. Are factuality checkers reliable? Ad-
factuality in
versarial meta-evaluation of

summarization. In Findings of the Association
for Computational Linguistics: EMNLP 2021,
pages 2082–2095, Punta Cana, Dominican Re-
民众. Association for Computational Linguis-
抽动症. https://doi.org/10.18653/v1
/2021.findings-emnlp.179

Pierre Colombo, Guillaume Staerman, Chlo´e
Clavel, and Pablo Piantanida. 2021. Au-
tomatic text evaluation through the lens
In Proceed-
of Wasserstein barycenters.
ings of
这 2021 Conference on Empirical
Methods in Natural Language Processing,
pages 10450–10466, Online and Punta Cana,
Dominican Republic. Association for Compu-
tational Linguistics. https://doi.org/10
.18653/v1/2021.emnlp-main.817

Alexis Conneau, Kartikay Khandelwal, Naman
Goyal, Vishrav Chaudhary, Guillaume Wenzek,
Francisco Guzm´an, Edouard Grave, Myle Ott,
Luke Zettlemoyer, and Veselin Stoyanov.
2019. Unsupervised cross-lingual
代表-
tation learning at scale. CoRR, abs/1911.02116.
https://doi.org/10.18653/v1/2020
.acl-main.747

Alexis Conneau, 鲁西·里诺特, Guillaume Lample,
Adina Williams, Samuel R. Bowman, Holger
Schwenk, and Veselin Stoyanov. 2018. Xnli:
Evaluating cross-lingual sentence represen-
这 2018 骗局-
tations.
ference on Empirical Methods in Natural
语言处理. Association for Compu-
tational Linguistics. https://doi.org/10
.18653/v1/D18-1269

在诉讼程序中

Ido Dagan, Dan Roth, Mark Sammons, 和
Fabio Massimo Zanzotto. 2013. Recognizing
Textual Entailment: Models and Applica-
系统蒸发散, Synthesis Lectures on Human Lan-
guage Technologies, 摩根 & Claypool
出版商. https://doi.org/10.1007
/978-3-031-02151-0

Daniel Deutsch, Tania Bedrax-Weiss, and Dan
Roth. 2021. Towards question-answering as
an automatic metric for evaluating the con-
tent quality of a summary. Transactions of
the Association for Computational Linguistics,
9:774–789. https://doi.org/10.1162
/tacl_a_00397

Daniel Deutsch, Rotem Dror, and Dan Roth. 2022.
On the limitations of reference-free evaluations

820

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
t

A
C

/

A
r
t

C
e

p
d

F
/

d


/

.

1
0
1
1
6
2

/
t

A
C
_
A
_
0
0
5
7
6
2
1
4
3
2
9
7

/

/
t

A
C
_
A
_
0
0
5
7
6
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

of generated text. 在诉讼程序中 2022
Conference on Empirical Methods in Natu-
ral Language Processing, pages 10960–10977,
阿布扎比, United Arab Emirates. 协会
for Computational Linguistics.

Ondˇrej Duˇsek and Zdenˇek Kasner. 2020. Evaluat-
ing semantic accuracy of data-to-text generation
with natural language inference. In Proceed-
ings of the 13th International Conference on
Natural Language Generation, pages 131–137,
都柏林, 爱尔兰. Association for Computational
语言学.

Alexander R. Fabbri, Wojciech Kry´sci´nski, Bryan
McCann, Caiming Xiong, Richard Socher,
and Dragomir Radev. 2021. Summeval: 关于-
evaluating summarization evaluation. 反式-
actions of the Association for Computational
语言学, 9:391–409. https://doi.org
/10.1162/tacl_a_00373

Tobias Falke, Leonardo F. 右. Ribeiro, Prasetya
Ajie Utama, Ido Dagan, and Iryna Gurevych.
2019. Ranking generated summaries by correct-
内斯: An interesting but challenging application
language inference. In Proceed-
for natural
ings of

计算语言学协会,
pages 2214–2220. https://doi.org/10
.18653/v1/P19-1213

the 57th Annual Meeting of

Markus Freitag, George Foster, David Grangier,
Viresh Ratnakar, Qijun Tan, and Wolfgang
Macherey. 2021A. Experts, 错误, 和骗局-
文本: A large-scale study of human evaluation
for machine translation. https://doi.org
/10.1162/tacl_a_00437

Markus Freitag, Ricardo Rei, Nitika Mathur,
Chi-kiu Lo, Craig
斯图尔特, Eleftherios
Avramidis, Tom Kocmi, George Foster, 阿隆
Lavie, and Andr´e F. 时间. 马丁斯. 2022. 结果
of WMT22 metrics shared task: Stop using
BLEU – neural metrics are better and more ro-
bust. In Proceedings of the Seventh Conference
on Machine Translation (WMT), pages 46–68,
阿布扎比, United Arab Emirates (Hybrid).
计算语言学协会.

Markus Freitag, Ricardo Rei, Nitika Mathur,
Chi-kiu Lo, Craig Stewart, George Foster, 阿隆
Lavie, and Ondˇrej Bojar. 2021乙. Results of
the WMT21 metrics shared task: Evaluating
metrics with expert-based human evaluations

on TED and news domain. In Proceedings
of the Sixth Conference on Machine Trans-
关系, pages 733–774, 在线的. 协会
计算语言学.

Yang Gao, Wei Zhao, and Steffen Eger. 2020.
SUPERT: Towards new frontiers in unsuper-
vised evaluation metrics for multi-document
summarization. 在诉讼程序中
the 58th
the Association for
Annual Meeting of
计算语言学, pages 1347–1354,
在线的. Association for Computational Lin-
语言学. https://doi.org/10.18653
/v1/2020.acl-main.124

Olga Golovneva, Moya Peng Chen, Spencer
Poff, Martin Corredor, Luke Zettlemoyer,
Maryam Fazel-Zarandi, and Asli Celikyilmaz.
2023. ROSCOE: A suite of metrics for scor-
ing step-by-step reasoning. In The Eleventh
International Conference on Learning Repre-
句子.

Suchin Gururangan, Swabha Swayamdipta, Omer
征收, Roy Schwartz, Samuel Bowman, 和
诺亚A. 史密斯. 2018. Annotation artifacts in
natural language inference data. In Proceedings
的 2018 Conference of the North American
Chapter of the Association for Computational
语言学: 人类语言技术,
体积 2 (Short Papers), pages 107–112, 新的
Orleans, Louisiana. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/N18-2017

刘,

Pengcheng He, Xiaodong

Jianfeng
高, and Weizhu Chen. 2021. DeBERTa:
Decoding-enhanced BERT with disentangled
In International Conference on
注意力.
Learning Representations.

Tianxing He, Jingyu Zhang, Tianle Wang, Sachin
Kumar, Kyunghyun Cho, James Glass, 和
Yulia Tsvetkov. 2022. On the blind spots
of model-based evaluation metrics for text
一代. arXiv 预印本 arXiv:2212.10020.

Karl Moritz Hermann, Tomas Kocisky, 爱德华
格芬施泰特, Lasse Espeholt, Will Kay,
Mustafa Suleyman, and Phil Blunsom. 2015.
Teaching machines to read and comprehend.
神经信息处理的进展
系统, 28.

Md Mosharaf Hossain, Antonios Anastasopoulos,
Eduardo Blanco, and Alexis Palmer. 2020.

821

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
t

A
C

/

A
r
t

C
e

p
d

F
/

d


/

.

1
0
1
1
6
2

/
t

A
C
_
A
_
0
0
5
7
6
2
1
4
3
2
9
7

/

/
t

A
C
_
A
_
0
0
5
7
6
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

It’s not a non-issue: Negation as a source
of error in machine translation. In Findings

the Association for Computational Lin-
语言学: EMNLP 2020, pages 3869–3885,
在线的. Association for Computational Lin-
语言学. https://doi.org/10.18653
/v1/2020.findings-emnlp.345

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu,
Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang,
Andrea Madotto, and Pascale Fung. 2022.
Survey of hallucination in natural
语言
一代. ACM Computing Surveys.

Marzena Karpinska, Nishant Raj, Katherine Thai,
Yixiao Song, Ankita Gupta, and Mohit Iyyer.
2022. DEMETR: Diagnosing evaluation met-
rics for translation. 在诉讼程序中 2022
Conference on Empirical Methods in Natural
语言处理, pages 9540–9561, Abu
Dhabi, United Arab Emirates. 协会
计算语言学.

Marvin Kaster, Wei Zhao, and Steffen Eger.
2021. Global explainability of BERT-based
evaluation metrics by disentangling along lin-
guistic factors. 在诉讼程序中
这 2021
Conference on Empirical Methods in Natu-
ral Language Processing, pages 8912–8925,
Online and Punta Cana, Dominican Repub-
利克. Association for Computational Linguis-
抽动症. https://doi.org/10.18653/v1
/2021.emnlp-main.701

Wojciech Kryscinski, Bryan McCann, Caiming
Xiong, and Richard Socher. 2020. Evaluat-
ing the factual consistency of abstractive
text summarization. 在诉讼程序中

2020 实证方法会议
自然语言处理
(EMNLP),
pages 9332–9346, 在线的. 协会
计算语言学. https://土井
.org/10.18653/v1/2020.emnlp-main
.750

Philippe Laban, Tobias Schnabel, Paul N. Bennett,
and Marti A. Hearst. 2022. SummaC: 关于-
visiting NLI-based models for inconsistency
detection in summarization. Transactions of
the Association for Computational Linguis-
抽动症, 10:163–177. https://doi.org/10
.1162/tacl_a_00453

Christoph Leiter, Piyawat Lertvittayakumjorn,
中号. Fomicheva, Wei Zhao, Yang Gao, 和

Steffen Eger. 2022. Towards explainable eval-
uation metrics for natural language generation.
ArXiv, abs/2203.11131.

Chin-Yew Lin. 2004. Rouge: A package for
automatic evaluation of summaries. In Text
Summarization Branches Out, pages 74–81.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei
Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin
Stoyanov. 2019. RoBERTa: A robustly op-
timized BERT pretraining approach. arXiv
preprint arXiv:1907.11692.

Arle Lommel, Aljoscha Burchardt, and Hans
Uszkoreit. 2014. Multidimensional quality
指标 (mqm): A framework for declaring
and describing translation quality metrics.
Tradum`atica: Tecnologies de la Traducci´o,
0:455–463. https://doi.org/10.5565
/rev/tradumatica.77

Franc¸ois Mairesse, Milica Gaˇsi´c, Filip Jurˇc´ıˇcek,
Simon Keizer, Blaise Thomson, Kai Yu, 和
Steve Young. 2010. Phrase-based statistical lan-
guage generation using graphical models and
active learning. In Proceedings of the 48th
Annual Meeting of the Association for Com-
putational Linguistics, pages 1552–1561, Up-
psala, 瑞典. Association for Computational
语言学.

Nitika Mathur, Timothy Baldwin, and Trevor
Cohn. 2019. Putting evaluation in context:
Contextual embeddings improve machine trans-

lation evaluation.
57th Annual Meeting of the Association for
计算语言学, pages 2799–2808,
Florence,
意大利. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/P19-1269

在诉讼程序中

Nitika Mathur, Timothy Baldwin, and Trevor
Cohn. 2020A. Tangled up in BLEU: Reevaluat-
ing the evaluation of automatic machine trans-
lation evaluation metrics. 在诉讼程序中
58th Annual Meeting of the Association for
计算语言学, pages 4984–4997,
在线的. Association for Computational Lin-
语言学. https://doi.org/10.18653
/v1/2020.acl-main.448

Nitika Mathur, Johnny Wei, Markus Freitag,
Qingsong Ma, and Ondˇrej Bojar. 2020乙. 关于-
sults of the WMT20 metrics shared task. 在

822

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
t

A
C

/

A
r
t

C
e

p
d

F
/

d


/

.

1
0
1
1
6
2

/
t

A
C
_
A
_
0
0
5
7
6
2
1
4
3
2
9
7

/

/
t

A
C
_
A
_
0
0
5
7
6
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

Proceedings of the Fifth Conference on Ma-
chine Translation, pages 688–725, 在线的.
计算语言学协会.

(在线的). International Committee on Compu-
tational Linguistics. https://doi.org/10
.18653/v1/2020.coling-main.445

Yixin Nie, Haonan Chen, and Mohit Bansal.
2019. Combining fact extraction and verifica-
tion with neural semantic matching networks.
In Association for the Advancement of Artifi-
cial Intelligence (AAAI). https://doi.org
/10.1609/aaai.v33i01.33016859

Yixin Nie, Adina Williams, Emily Dinan,
Mohit Bansal, Jason Weston, and Douwe
Kiela. 2020. Adversarial NLI: A new bench-
mark for natural language understanding. 在
Proceedings of the 58th Annual Meeting of
the Association for Computational Linguis-
抽动症. Association for Computational Linguis-
抽动症. https://doi.org/10.18653/v1
/2020.acl-main.441

Timothy Niven and Hung-Yu Kao. 2019. 问题-
ing neural network comprehension of natural
language arguments. 在诉讼程序中

57th Annual Meeting of the Association for
计算语言学, pages 4658–4664,
Florence,
意大利. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/P19-1459

Maxime Peyrard. 2019. Studying summariza-
tion evaluation metrics in the appropriate
scoring range. 在诉讼程序中
the 57th
the Association for
Annual Meeting of
计算语言学, pages 5093–5100,
Florence,
意大利. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/P19-1502

Adam Poliak,

Jason Naradowsky, Aparajita
Haldar, Rachel Rudinger,
and Benjamin
Van Durme. 2018. Hypothesis only baselines
in natural language inference. In Proceedings
of the Seventh Joint Conference on Lexical and
Computational Semantics, pages 180–191, 新的
Orleans, Louisiana. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/S18-2023

Tharindu Ranasinghe, Constantin Orasan, 和
Ruslan Mitkov. 2020. TransQuest: 反式-
lation quality estimation with cross-lingual
the 28th
transformers.
国际计算会议
语言学, pages 5070–5081, 巴塞罗那, 西班牙

在诉讼程序中

Ricardo Rei, Craig Stewart, Ana C. Farinha,
and Alon Lavie. 2020A. COMET: A neural
framework for MT evaluation. In Proceedings
的 2020 Conference on Empirical Meth-
ods in Natural Language Processing (EMNLP),
pages 2685–2702, 在线的. 协会
计算语言学. https://doi.org
/10.18653/v1/2020.emnlp-main.213

Ricardo Rei, Craig Stewart, Ana C. Farinha, 和
Alon Lavie. 2020乙. Unbabel’s participation in
the WMT20 metrics shared task. In Proceed-
ings of the Fifth Conference on Machine Trans-
关系, pages 911–920, 在线的. 协会
计算语言学.

Marco Tulio Ribeiro, Tongshuang Wu, Carlos
Guestrin, and Sameer Singh. 2020. 超过
testing of NLP mod-
准确性: Behavioral
els with CheckList. 在诉讼程序中

58th Annual Meeting of the Association for
计算语言学, pages 4902–4912,
在线的. Association for Computational Lin-
语言学. https://doi.org/10.18653
/v1/2020.acl-main.442

Md Rashad Al Hasan Rony, Liubov Kovriguina,
Debanjan Chaudhuri, Ricardo Usbeck, 和
Jens Lehmann. 2022. RoMe: A robust met-
ric for evaluating natural language generation.
In Proceedings of the 60th Annual Meeting of
the Association for Computational Linguistics
(体积 1: Long Papers), pages 5645–5657,
爱尔兰. Association for Computa-
都柏林,
tional Linguistics. https://doi.org/10
.18653/v1/2022.acl-long.387

Ananya B. Sai, Tanay Dixit, Dev Yashpal Sheth,
Sreyas Mohan, and Mitesh M. Khapra. 2021.
Perturbation checklists for evaluating NLG
evaluation metrics. In Proceedings of the Con-
ference on Empirical Methods in Natural
语言处理 (EMNLP).

Thibault Sellam, Dipanjan Das, and Ankur P.
Parikh. 2020. BLEURT: Learning robust met-
rics for text generation. 在诉讼程序中
前交叉韧带. https://doi.org/10.18653/v1
/2020.acl-main.704

Rico Sennrich. 2017. How grammatical


character-level neural machine translation?

823

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
t

A
C

/

A
r
t

C
e

p
d

F
/

d


/

.

1
0
1
1
6
2

/
t

A
C
_
A
_
0
0
5
7
6
2
1
4
3
2
9
7

/

/
t

A
C
_
A
_
0
0
5
7
6
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

Assessing MT quality with contrastive translation
对. In Proceedings of the 15th Conference
of the European Chapter of the Association for
计算语言学: 体积 2, Short
文件, pages 376–382, Valencia, 西班牙.
for Computational Linguis-
协会
抽动症. https://doi.org/10.18653/v1
/E17-2060

Ori Shapira, David Gabay, Yang Gao, Hadar
Ronen, Ramakanth Pasunuru, Mohit Bansal,
Yael Amsterdamer, and Ido Dagan. 2019.
Crowdsourcing lightweight pyramids for man-
ual summary evaluation. 在诉讼程序中
2019 Conference of the North American Chapter
of the Association for Computational Linguis-
抽动症: 人类语言技术, 体积 1
(Long and Short Papers), pages 682–687,
明尼阿波利斯, Minnesota. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/N19-1072

Yurun Song, Junchen Zhao, and Lucia Specia.
2021. Sentsim: Crosslingual semantic evalua-
tion of machine translation. 在诉讼程序中
这 2021 Conference of the North American
Chapter of the Association for Computational
语言学: 人类语言技术,
pages 3143–3156. https://doi.org/10
.18653/v1/2021.naacl-main.252

Miloˇs Stanojevi´c, Amir Kamran, Philipp Koehn,
and Ondˇrej Bojar. 2015. Results of the WMT15
metrics shared task. In Proceedings of the Tenth
Workshop on Statistical Machine Translation,
pages 256–273, 里斯本, Portugal. 协会
for Computational Linguistics. https://
doi.org/10.18653/v1/W15-3031

Tianxiang Sun, Junliang He, Xipeng Qiu, 和
Xuanjing Huang. 2022. BERTScore is unfair:
On social bias in language model-based met-
rics for text generation. 在诉讼程序中
2022 Conference on Empirical Methods in Nat-
ural Language Processing, pages 3726–3739,
阿布扎比, United Arab Emirates. 协会
for Computational Linguistics.

Gongbo Tang, Philipp R¨onchen, Rico Sennrich,
and Joakim Nivre. 2021. Revisiting negation
in neural machine translation. Transactions of
the Association for Computational Linguistics,
9:740–755. https://doi.org/10.1162
/tacl_a_00395

Brian Thompson and Matt Post. 2020. 汽车-
matic machine translation evaluation in many
languages via zero-shot paraphrasing. In Pro-
ceedings of the 2020 Conference on Empiri-
cal Methods in Natural Language Processing
(EMNLP), pages 90–121, 在线的. 协会
for Computational Linguistics. https://土井
.org/10.18653/v1/2020.emnlp-main.8

Prasetya Ajie Utama, Joshua Bambrick, Nafise
Sadat Moosavi, and Iryna Gurevych. 2022.
Falsesum: Generating document-level nli ex-
amples for recognizing factual inconsistency
in summarization. https://doi.org/10
.48550/arXiv.2205.06009

Prasetya Ajie Utama, Nafise Sadat Moosavi, 和
伊琳娜·古列维奇. 2020. Mind the trade-off: 的-
biasing NLU models without degrading the
in-distribution performance. 在诉讼程序中
the 58th Annual Meeting of the Association for
计算语言学, pages 8717–8729,
在线的. Association for Computational Linguis-
抽动症. https://doi.org/10.18653/v1
/2020.acl-main.770

Doan Nam Long Vu, Nafise Sadat Moosavi, 和
Steffen Eger. 2022. Layer or representation
空间: What makes BERT-based evaluation
the 29th
metrics robust? 在诉讼程序中
国际计算会议
语言学, pages 3401–3411, Gyeongju, 关于-
public of Korea. International Committee on
计算语言学.

Yu Wan, Dayiheng Liu, Baosong Yang, Haibo
张, Boxing Chen, Derek Wong, and Lidia
Chao. 2022. UniTE: Unified translation eval-
uation. 在诉讼程序中
the 60th Annual
Meeting of
the Association for Computa-
tional Linguistics (体积 1: Long Papers),
pages 8117–8127, 都柏林, 爱尔兰. 协会
for Computational Linguistics. https://土井
.org/10.18653/v1/2022.acl-long.558

Adina Williams, Nikita Nangia, and Samuel
Bowman. 2018. A broad-coverage challenge
corpus for sentence understanding through in-
参考. 在诉讼程序中 2018 Confer-
ence of the North American Chapter of the
计算语言学协会:
人类语言技术, 体积 1 (长的
文件), pages 1112–1122. 协会
计算语言学. https://土井
.org/10.18653/v1/N18-1101

824

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
t

A
C

/

A
r
t

C
e

p
d

F
/

d


/

.

1
0
1
1
6
2

/
t

A
C
_
A
_
0
0
5
7
6
2
1
4
3
2
9
7

/

/
t

A
C
_
A
_
0
0
5
7
6
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

Yinfei Yang, Yuan Zhang, Chris Tar, 和
Jason Baldridge. 2019. PAWS-X: A cross-
for paraphrase
lingual adversarial dataset
ofEMNLP.
identification.
https://doi.org/10.18653/v1/D19
-1382

In Proceedings

Wenpeng Yin, Dragomir Radev, and Caiming
Xiong. 2021. DocNLI: A large-scale dataset
language infer-
for document-level natural
恩斯. In Findings of the Association for Com-
putational Linguistics: ACL-IJCNLP 2021,
pages 4913–4922, 在线的. 协会
计算语言学. https://doi.org
/10.18653/v1/2021.findings-acl.435

Weizhe Yuan, Graham Neubig, and Pengfei Liu.
2021. BARTscore: Evaluating generated text as
text generation. Advances in Neural Informa-
tion Processing Systems, 34:27263–27277.

Tianyi Zhang, Varsha Kishore, Felix Wu,
Kilian Q. 温伯格, and Yoav Artzi. 2020.
BERTscore: Evaluating text generation with
BERT. In International Conference on Learn-
ing Representations.

Yuan Zhang, Jason Baldridge, and Luheng He.
2019. PAWS: Paraphrase adversaries from
word scrambling. In Proceedings of NAACL.

Wei Zhao, Goran Glavaˇs, Maxime Peyrard,
Yang Gao, Robert West, and Steffen Eger.
2020. On the limitations of cross-lingual en-

coders as exposed by reference-free machine
translation evaluation. 在诉讼程序中
58th Annual Meeting of the Association for
计算语言学, pages 1656–1671,
在线的. Association for Computational Lin-
语言学. https://doi.org/10.18653
/v1/2020.acl-main.151

Wei Zhao, Maxime Peyrard, Fei Liu, 哪个
高, Christian M. 迈耶, and Steffen Eger.
2019. MoverScore: Text generation evaluat-
ing with contextualized embeddings and earth
mover distance. 在诉讼程序中 2019
Conference on Empirical Methods in Natural
Language Processing and the 9th Interna-
tional Joint Conference on Natural Language
加工 (EMNLP-IJCNLP), pages 563–578,
香港, 中国. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/D19-1053

Wei Zhao, Michael Strube, and Steffen Eger.
2023. Discoscore: Evaluating text generation
with BERT and discourse coherence. In EACL.

在诉讼程序中

Xiang Zhou and Mohit Bansal. 2020. 到-
wards robustifying NLI models against lex-

ical dataset biases.
58th Annual Meeting of the Association for
计算语言学, pages 8759–8771,
在线的. Association for Computational Linguis-
抽动症. https://doi.org/10.18653/v1
/2020.acl-main.773

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
t

A
C

/

A
r
t

C
e

p
d

F
/

d


/

.

1
0
1
1
6
2

/
t

A
C
_
A
_
0
0
5
7
6
2
1
4
3
2
9
7

/

/
t

A
C
_
A
_
0
0
5
7
6
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

825MENLI: Robust Evaluation Metrics from Natural Language Inference image
MENLI: Robust Evaluation Metrics from Natural Language Inference image
MENLI: Robust Evaluation Metrics from Natural Language Inference image
MENLI: Robust Evaluation Metrics from Natural Language Inference image
MENLI: Robust Evaluation Metrics from Natural Language Inference image
MENLI: Robust Evaluation Metrics from Natural Language Inference image

下载pdf