SUMMAC: Re-Visiting NLI-based Models for
Inconsistency Detection in Summarization

Philippe Laban (UC Berkeley, USA), Tobias Schnabel (Microsoft, USA), Paul N. Bennett (Microsoft, USA), Marti A. Hearst (UC Berkeley, USA)∗

Abstract

In the summarization domain, a key requirement for summaries is to be factually consistent with the input document. Previous work has found that natural language inference (NLI) models do not perform competitively when applied to inconsistency detection. In this work, we revisit the use of NLI for inconsistency detection, finding that past work suffered from a mismatch in input granularity between NLI datasets (sentence-level) and inconsistency detection (document-level). We provide a highly effective and lightweight method called SUMMACCONV that enables NLI models to be successfully used for this task by segmenting documents into sentence units and aggregating scores between pairs of sentences. We furthermore introduce a new benchmark called SUMMAC (Summary Consistency), which consists of six large inconsistency detection datasets. On this benchmark, SUMMACCONV obtains state-of-the-art results with a balanced accuracy of 74.4%, a 5% improvement compared with prior work.

1 Introduction

Recent progress in text summarization has been
remarkable, with ROUGE record-setting models
published every few months, and human eval-
uations indicating that automatically generated
summaries are matching human-written sum-
maries in terms of fluency and informativeness
(Zhang et al., 2020a).

A major limitation of current summarization
models is their inability to remain factually consis-
tent with the respective input document. Summary
inconsistencies are diverse—from inversions (i.e.,
negation) to incorrect use of an entity (i.e., subject,
object swapping), or hallucinations (i.e., intro-
duction of an entity not in the original document).
Recent studies have shown that in some scenarios,

∗Author emails: {phillab,hearst}@berkeley.edu, {Tobias.Schnabel,Paul.N.Bennett}@microsoft.com


even state-of-the-art pre-trained language mod-
els can generate inconsistent summaries in more
than 70% of all cases (Pagnoni et al., 2021). This
has led to accelerated research around summary
inconsistency detection.

A closely related task to inconsistency detection
is textual entailment, also referred to as Natural
Language Inference (NLI), in which a hypothesis
sentence must be classified as either entailed by,
neutral, or contradicting a premise sentence. En-
abled by the crowd-sourcing of large NLI datasets
such as SNLI (Bowman et al., 2015) and MNLI
(Williams et al., 2018), modern architectures have
achieved close to human performance at the task.
The similarity of NLI to inconsistency detec-
tion, as well as the availability of high-performing
NLI models, led to early attempts at using NLI to
detect consistency errors in summaries. These early at-
tempts were unsuccessful, finding that re-ranking
summaries according to an NLI model can lead
to an increase in consistency errors (Falke et al.,
2019), or that out-of-the-box NLI models obtain
52% accuracy at the binary classification task
of inconsistency detection, only slightly above
random guessing (Kryscinski et al., 2020).

In this work, we revisit this approach, showing
that NLI models can in fact successfully be used
for inconsistency detection, as long as they are
used at the appropriate granularity. Figure 1 shows
how crucial using the correct granularity as input
to NLI models is. An inconsistency checker should
flag the last sentence in the summary (shown right)
as problematic. When treating the entire document
as the premise and the summary as the hypothesis,
a competitive NLI model predicts with probability
of 0.91 that the summary is entailed by the docu-
ment. However, when splitting the document into
sentence premise-hypothesis pairs (visualized as
edges in Figure 1) the NLI model correctly deter-
mines that S3 is not supported by any document
sentence. This illustrates that working with sen-
tence pairs is crucial for making NLI models work
for inconsistency detection.

Transactions of the Association for Computational Linguistics, vol. 10, pp. 163–177, 2022. https://doi.org/10.1162/tacl_a_00453
Action Editor: Shay Cohen. Submission batch: 8/2021; Revision batch: 11/2021; Published 2/2022.
© 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.


2.1 Fact Checking and Verification

Fact checking is a related task in which a model
receives an input claim along with a corpus of
ground truth information. The model must then
retrieve relevant evidence and decide whether the
claim is supported, refuted, or if there is not
enough information in the corpus (Thorne et al.,
2018). The major difference to our task lies in the
different semantics of consistency and accuracy.
If a summary adds novel and accurate information
not present in the original document (e.g., adding
background information), the summary is accurate
but inconsistent. In the summary inconsistency
detection domain, the focus is on detecting any
inconsistency, regardless of its accuracy, as prior
work has shown that current automatic summariz-
ers are predominantly inaccurate when inconsist-
ent (Maynez et al., 2020).

2.2 Datasets for Inconsistency Detection

Several datasets have been annotated to evalu-
ate model performance in inconsistency detection,
typically comprising up to two thousand annotated
summaries. Datasets are most commonly crowd-
annotated with three judgements each, despite
some work showing that as many as eight anno-
tators are required to achieve high inter-annotator
Vereinbarung (Falke et al., 2019).

Reading the entire original document being
summarized is time-consuming, and to amortize
this cost, consistency datasets often contain multi-
ple summaries, generated by different models, for
the same original document.

Some datasets consist of an overall consistency
label for a summary (e.g., FactCC [Kryscinski
et al., 2020]), while others propose a finer-grained
typology with up to 8 types of consistency errors
(Huang et al., 2020).

We include the six largest summary consistency
datasets in the SUMMAC Benchmark, and describe
them in more detail in Section 4.

2.3 Methods for Inconsistency Detection

Due to data limitations, most inconsistency detec-
tion methods adapt NLP pipelines from other tasks
including QAG models, synthetic classifiers, and
parsing-based methods.

QAG methods follow three steps: (1) question
generation (QG), (2) question answering (QA)
with the document and the summary, and (3) matching
document and summary answers.
Figur 1: Example document with an inconsistent
summary. When running each sentence pair (Di, Sj)
through an NLI model, S3 is not entailed by any doc-
ument sentence. Jedoch, when running the entire
(document, summary) at once, the NLI model incor-
rectly predicts that the document highly entails the
entire summary.

Our contributions are two-fold. First, we intro-
duce a new approach for inconsistency detection
based on the aggregation of sentence-level entail-
ment scores for each pair of input document and
summary sentences. We present two model vari-
ants that differ in the way they aggregate sentence-
level scores into a single score. SUMMACZS
performs zero-shot aggregation by combining
sentence-level scores using max and mean op-
erators. SUMMACCONV is a trained model consisting
of a single learned convolution layer compiling the
distribution of entailment scores of all document
sentences into a single score.

Second, to evaluate our approach, we introduce
the SUMMAC Benchmark by standardizing existing
datasets. Because the benchmark contains the six
largest summary consistency datasets, it is more
comprehensive and includes a broader range of
inconsistency errors than prior work.

The SUMMAC models outperform existing in-
consistency detection models on the benchmark,
with SUMMACCONV obtaining an overall bal-
anced accuracy of 74.4%, 5% above prior work.
We publicly release the models and datasets.1

2 Related Work

We briefly survey existing methods and datasets
for fact checking, inconsistency detection, and
inconsistency correction.

1https://github.com/tingofurro/summac/.


A summary is considered consistent if few or no
questions have differing answers between the
document and the summary. A key design
choice for these methods lies in the source for
question generation. Durmus et al. (2020) generate
questions using the summary as a source, making
their FEQA method precision-oriented. Scialom
et al. (2019) generate questions with the document
as a source, creating a recall-focused measure.
Scialom et al. (2021) unite both in QuestEval, von
generating two sets of questions, sourced from the
summary and document respectively. We include
FEQA and QuestEval in our benchmark results.

Synthetic classifiers rely on large, synthetic
datasets of summaries with inconsistencies, Und
use those to train a classifier with the expectation
that the model generalizes to non-synthetic sum-
maries. To generate a synthetic dataset, Kryscinski
et al. (2020) propose a set of semantically invari-
ant (e.g., paraphrasing) and variant (e.g., sentence
negation) text transformations that they apply to a
large summarization dataset. FactCC-CLS, the
classifier obtained when training on the synthetic
dataset, is included in our benchmark results for
comparison.

Parsing-based methods generate relations
through parsing and compute the fraction of
summary relations that are compatible with docu-
ment relations as a precision measure of summary
factuality. Goodrich et al. (2019) extract (sub-
ject, relation, object) tuples, most
commonly using OpenIE (Etzioni et al., 2008). In
the recent DAE model, Goyal and Durrett (2020)
propose to use arc labels from a dependency
parser instead of relation triplets. We include the
DAE model in our benchmark results.

2.4 Methods for Consistency Correction

Complementary to inconsistency detection, some
work focused on the task of mitigating inconsis-
tency errors during summarization. Approaches
fall into two categories: Reinforcement Learning
(RL) methods to improve models and stand-alone
re-writing methods.

RL methods often rely on an out-of-the-box
inconsistency detection model and use reinforce-
ment
learning to optimize a reward with a
consistency component. Arumae and Liu (2019)
optimize a QA-based consistency reward, Und
Nan et al. (2021) streamline a QAG reward by
combining the QG and QA model, making it more
efficient for RL training. Pasunuru and Bansal

(2018) leverage an NLI-based component as
part of an overall ROUGE-based reward, Und
Zhang et al. (2020b) use a parsing-based measure
in the domain of medical report summarization.

Re-writing methods typically operate as a
modular component that is applied after an ex-
isting summarization model. Cao et al. (2020) use
a synthetic dataset of rule-corrupted summaries
to train a post-corrector model, but find that this
model does not transfer well to real summarizer
errors. Dong et al. (2020) propose to use a QAG
model to find erroneous spans, which are then
corrected using a post-processing model.

Since all methods discussed above for con-
sistency correction rely on a model
to detect
inconsistencies, they will naturally benefit from
more accurate inconsistency detectors.

3 SUMMAC Models

We now introduce our SUMMAC models for incon-
sistency detection. The first step common to all
models is to apply an out-of-the-box NLI model to
generate an NLI Pair Matrix for a (document,
summary) pair. The two models we present then
differ in the way they process this pair matrix
to produce a single consistency score for a given
summary. We also describe the SUMMAC evalua-
tion benchmark, a set of inconsistency detection
datasets, in Section 4. In Section 5, we measure the
performance of the SUMMAC models on this bench-
mark and investigate components of the models,
including which NLI model achieves highest per-
formance, which NLI categories should be used,
and what textual granularity is most effective.

3.1 Generating the NLI Pair Matrix

NLI datasets are predominantly represented at the
sentence level. In our pilot experiments, we found
that this causes the resulting NLI models to fail
in assessing consistency for documents with 50
sentences and more.

This motivates the following approach. We
generate an NLI Pair Matrix by splitting a (doc-
ument, summary) pair into sentence blocks.
The document is split into M blocks, each consid-
ered a premise labeled from D1, . . . , DM , and the
summary is split into N blocks, each considered
a hypothesis labeled from S1, . . . , SN .

Each (Di, Sj) combination is run through the
NLI model, which produces a probability distribu-
tion over the three NLI categories: entailment
(Eij), contradiction (Cij), and neutral (Nij).
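To make this step concrete, the following is a minimal Python sketch of the pair-matrix construction, assuming an off-the-shelf MNLI checkpoint from the HuggingFace hub (roberta-large-mnli, one of the models listed in Appendix A) and NLTK for sentence splitting; the unbatched loop and function name are illustrative rather than the authors' released implementation.

```python
# Hedged sketch: build the NLI Pair Matrix of entailment probabilities E_ij.
import nltk
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

nltk.download("punkt", quiet=True)                 # sentence splitter data
MODEL_NAME = "roberta-large-mnli"                  # assumed checkpoint (see Appendix A)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).eval()

def nli_pair_matrix(document: str, summary: str) -> torch.Tensor:
    """Return an M x N matrix of entailment probabilities for document
    sentences D_1..D_M (premises) and summary sentences S_1..S_N (hypotheses)."""
    doc_sents = nltk.sent_tokenize(document)
    sum_sents = nltk.sent_tokenize(summary)
    # the index of the "entailment" label depends on the checkpoint's config
    ent_idx = {v.lower(): k for k, v in model.config.id2label.items()}["entailment"]
    E = torch.zeros(len(doc_sents), len(sum_sents))
    with torch.no_grad():
        for i, premise in enumerate(doc_sents):
            for j, hypothesis in enumerate(sum_sents):
                enc = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
                E[i, j] = model(**enc).logits.softmax(dim=-1)[0, ent_idx]
    return E
```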


The first step of the SUMMACZS model consists of
retaining the score for the document sentence that
provides the strongest support for each summary
sentence. For the example in Figure 1:

max(Xpair, axis='col') = [0.98  0.99  0.04]

The second step consists of taking the mean
of the produced vector, reducing the vector to a
scalar which is used as the final model score. At
a high level, this step aggregates sentence-level
information into a single score for the entire sum-
mary. For example, in Figure 1, the score produced
by SUMMACZS would be 0.67. If we removed the
third sentence from the summary, the score would
increase to 0.985. We experiment with replacing
the max and mean operators with other operators
in Appendix B.
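The zero-shot aggregation can be written in a few lines; the sketch below assumes the pair matrix is available as a NumPy array and reproduces the 0.67 and 0.985 scores of the Figure 1 example.

```python
# Hedged sketch of the SummaC-ZS aggregation: column-wise max, then mean.
import numpy as np

def summac_zs_score(E: np.ndarray) -> float:
    per_summary_sentence = E.max(axis=0)   # strongest supporting document sentence
    return float(per_summary_sentence.mean())

E = np.array([[0.02, 0.02, 0.04],
              [0.98, 0.00, 0.00],
              [0.43, 0.99, 0.00],
              [0.00, 0.00, 0.01]])
print(round(summac_zs_score(E), 2))          # 0.67 = mean of [0.98, 0.99, 0.04]
print(round(summac_zs_score(E[:, :2]), 3))   # 0.985 when the third summary sentence is removed
```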

3.3 SUMMACCONV: Convolution

One limitation of SUMMACZS is that it is highly
sensitive to extrema, which can be noisy due to
the presence of outliers and the imperfect na-
ture of NLI models. In SUMMACCONV, we reduce
the reliance on extrema values by instead taking
into account the entire distribution of entailment
scores for each summary sentence. For each sum-
mary sentence, a learned convolutional layer is in
charge of converting the entire distribution into a
single score.

The first step of the SUMMACCONV algorithm is
to turn each column of the NLI Pair Matrix into
a fixed-size histogram that represents the distribu-
tion of scores for that given summary sentence.

We bin the NLI scores into H evenly spaced
bins (e.g., if H = 5, the bins are [0, 0.2),
[0.2, 0.4), [0.4, 0.6), [0.6, 0.8), [0.8, 1)). Thus the
first summary sentence of the example in Figure 1
would have the following histogram: [2, 0, 1, 0, 1],
because there are two values in [0.0, 0.2) in
the first column, one in [0.4, 0.6) and one in
[0.8, 1.0].

By producing one histogram for each summary
sentence, the binning process in the example of
Figure 1 would produce:

             [ 2  3  4 ]
             [ 0  0  0 ]
bin(Xpair) = [ 1  0  0 ]
             [ 0  0  0 ]
             [ 1  1  0 ]
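A short sketch of this binning step, assuming H evenly spaced bins over [0, 1]; it reproduces the histograms shown above for the Figure 1 pair matrix.

```python
# Hedged sketch: one H-bin histogram per summary sentence (per column of the matrix).
import numpy as np

def bin_pair_matrix(E: np.ndarray, H: int = 5) -> np.ndarray:
    edges = np.linspace(0.0, 1.0, H + 1)
    hists = [np.histogram(E[:, j], bins=edges)[0] for j in range(E.shape[1])]
    return np.stack(hists, axis=1)   # shape (H, N)

E = np.array([[0.02, 0.02, 0.04],
              [0.98, 0.00, 0.00],
              [0.43, 0.99, 0.00],
              [0.00, 0.00, 0.01]])
print(bin_pair_matrix(E, H=5))
# [[2 3 4]
#  [0 0 0]
#  [1 0 0]
#  [0 0 0]
#  [1 1 0]]
```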

Figure 2: Diagram of the SUMMACZS (top) and
SUMMACCONV (bottom) models. Both models utilize the
same NLI Pair Matrix (middle) but differ in their pro-
cessing to obtain a score. SUMMACZS is zero-shot
and does not have trained parameters. SUMMACCONV
uses a convolutional layer trained on a binned version
of the NLI Pair Matrix.

If not specified otherwise, the pair matrix
is an M × N matrix consisting of the entail-
ment scores Eij. In Section 5.3.3, we examine the
effect of granularity by splitting texts at the para-
graph level or binning two sentences at a time. In
Section 5.3.2, we explore the use of the contradic-
tion and neutral categories in our experiments.

The example in Figure 1 has M = 4 document
sentences and N = 3 summary sentences, and the
corresponding NLI Pair Matrix is the following:

        [ 0.02  0.02  0.04 ]
Xpair = [ 0.98  0.00  0.00 ]
        [ 0.43  0.99  0.00 ]
        [ 0.00  0.00  0.01 ]

The pair matrix can be interpreted as the weights
of a bipartite graph, which is also illustrated in
Figur 1 where the opacity of each edge (ich, J)
represents the entailment probability Eij.

The two SUMMAC models take as input the same
NLI Pair Matrix, but differ in the aggregation
method to transform the pair matrix into a score.
Figur 2 presents an overview of SUMMACZS and
SUMMACCONV.

3.2 SUMMACZS: Zero-Shot

In the SUMMACZS model, we reduce the pair ma-
trix to a one-dimensional vector by taking the
maximum (max) value of each column. Intuitively,
for each summary sentence, this retains the score
of the document sentence that provides the
strongest support.


Dataset                               Valid.   Test   % Positive   IAA   Source   # Summarizers   # Sublabels
CoGenSumm (Falke et al., 2019)         1281     400      49.8      0.65     C            3              0
XSumFaith (Maynez et al., 2020)        1250    1250      10.2      0.80     X            5              2
Polytope (Huang et al., 2020)           634     634       6.6        --     C           10              8
FactCC (Kryscinski et al., 2020)        931     503      85.0        --     C           10              0
SummEval (Fabbri et al., 2021)          850     850      90.6      0.7      C           23              4
FRANK (Pagnoni et al., 2021)            671    1575      33.2      0.53    C+X           9              7

Table 1: Statistics of the six datasets in the SUMMAC Benchmark. For each dataset, we report the
validation and test set sizes, the percentage of summaries with positive (consistent) labels (% Positive),
the inter-annotator agreement (when available, IAA), the source of the documents (Source: C for
CNN/DM, X for XSum), the number of summarizers evaluated, and the number of sublabels annotated.

The binned matrix is then passed through a 1-D
convolution layer with a kernel size of H. Der
convolution layer scans the summary histograms
one at a time, and compiles each into a scalar
value. Finally, the scores of
each summary sentence are averaged to obtain the
final summary-level score.

In order to learn the weights of the convolution
layer, we train the SUMMACCONV model end-to-
end with the synthetic training data in FactCC
(Kryscinski et al., 2020). The original training
dataset contains one million (document,
summary) pairs evenly distributed with con-
sistent and inconsistent summaries. Because we
are only training a small set of H parameters
(we use H = 50), we find that using a 10,000
sub-sample is sufficient. We train the model using
a cross-entropy loss, the Adam optimizer, a batch
size of 32, and a learning rate of 10−2. We perform
hyper-parameter tuning on a validation set from
the FactCC dataset.

The number of bins used in the binning process,
which corresponds to the number of parameters in
the convolution layer, is also a hyper-parameter
we tune on the validation set. We find that per-
formance increases up to 50 bins (i.e., a bin width
of 0.02) and then plateaus. We use 50 bins in all
our experiments.
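The following PyTorch sketch shows one way such a scoring head can be set up; it is not the released implementation, and the per-sentence sigmoid and the binary cross-entropy stand-in for the cross-entropy objective are assumptions consistent with the description above.

```python
# Hedged sketch of a SummaC-Conv style scoring head (H = 50 bins, Adam, lr = 1e-2).
import torch
import torch.nn as nn

class ConvScoringHead(nn.Module):
    def __init__(self, num_bins: int = 50):
        super().__init__()
        # one learned kernel spanning the full histogram of a summary sentence
        self.conv = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=num_bins)

    def forward(self, binned: torch.Tensor) -> torch.Tensor:
        # binned: (N, num_bins), one (normalized) histogram per summary sentence
        per_sentence = self.conv(binned.unsqueeze(1)).squeeze(-1).squeeze(-1)  # (N,)
        return torch.sigmoid(per_sentence).mean()  # average into a summary-level score

head = ConvScoringHead(num_bins=50)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-2)
loss_fn = nn.BCELoss()                # stand-in for the cross-entropy objective

binned = torch.rand(3, 50)            # e.g., three summary-sentence histograms
score = head(binned)                  # scalar consistency score in (0, 1)
```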

4 SUMMAC Benchmark

To rigorously evaluate the SUMMAC models on
a diverse set of summaries with consistency
judgements, we introduce a new large benchmark
dataset, the SUMMAC Benchmark. It comprises the
six largest available datasets for summary incon-

sistency detection, which we standardize to use
the same classification task.

4.1 Benchmark Standardization

We standardize the task of summary inconsistency
detection by casting it as a binary classification
Aufgabe. Each dataset contains (document, sum-
mary, label) samples, where the label can
either be consistent or inconsistent.

Each dataset is divided into a validation and
test split, with the validation being available for
parameter tuning. We used existing validation/test
splits created by dataset authors when available.
We did not find a split for XSumFaith, Poly-
tope, and SummEval, and created one by putting
even-indexed samples in a validation split, and
odd-indexed samples in the test split. This method
of splitting maintains similar class imbalance and
summarizer identity with the entire dataset.

We computed inter-annotator agreement cal-
culated with Fleiss’ Kappa (Fleiss, 1971) on the
dataset as an estimate for dataset quality, omit-
ting datasets for which summaries only had a
single annotator (Polytope and FactCC). Tisch 1
summarizes dataset statistics and properties.

4.2 Benchmark Datasets

We introduce each dataset
in the benchmark
chronologically, and describe the standardizing
procedure.

CoGenSumm (Correctness of Generated
Summaries, CGS) (Falke et al., 2019) is the first
introduced dataset for summary inconsistency
detection, based on models
trained on the
CNN/DM dataset (Nallapati et al., 2016). The


authors proposed that consistency detection
should be approached as a ranking problem:
Given a consistent and inconsistent summary for a
common document, a ranking model should score
the consistent summary higher. Although inno-
vative, other datasets in the benchmark do not
always have positive and negative samples for a
given document. We thus map the dataset to a
classification task by using all inconsistent and
consistent summaries as individual samples.

XSumFaith (eXtreme Summarization Faith-
fulness, XSF) (Maynez et al., 2020) is a data-
set with models trained on the XSum dataset
(Narayan et al., 2018), which consists of more
abstractive summaries than CoGenSumm. The
authors find that standard generators remain con-
sistent for only 20-30% of generated summaries.
The authors differentiate between extrinsic and
intrinsic hallucinations (which we call inconsis-
tencies in this work). Extrinsic hallucinations,
which involve words or concepts not in the original
document, can nonetheless be accurate or inaccu-
rate. In order for a summarizer to generate an
accurate extrinsic hallucination, the summarizer
must possess external world knowledge. Because
the authors found that the models are primarily
inaccurate in terms of extrinsic hallucinations, we
map both extrinsic and intrinsic hallucinations to
a common inconsistent label.

Polytope (Huang et al., 2020) introduces a
more extensive typology of summarization errors,
based on the Multi-dimensional Quality Metric
(Mariana, 2014). Each summary is annotated with
eight possible errors, as well as a severity level for
the error. We standardize this dataset by labeling
a summary as inconsistent if it was annotated with
any of the five accuracy errors (disregarding
the three fluency errors). Each summary in Poly-
tope was labeled by a single annotator, making it
impossible to measure inter-annotator agreement.
FactCC (Kryscinski et al., 2020) contains
validation and test splits that are entirely anno-
tated by authors of the paper, because attempts
at crowd-sourced annotation yielded low inter-
annotator agreement. Prior work (Gillick and Liu,
2010) shows that there can be divergence in
annotations between experts and non-experts in
summarization, and because the authors of the
paper are NLP researchers familiar with the lim-
itations of automatic summarizations, we expect
that FactCC annotations differ in quality from
other datasets. FactCC also introduces a synthetic

dataset by modifying consistent summaries with
semantically variant rules. We use a sub-portion
of this synthetic dataset to train the SUMMACCONV
Modell.

SummEval (Fabbri et al., 2021) contains sum-
marizer outputs from seven extractive models and
sixteen abstractive models. Each summary was
labeled by three annotators using a 5-point Likert
scale along four categories: coherence, consistency,
fluency, and relevance. We label summaries as
consistent if all annotators gave a score of 5 in
consistency, and inconsistent otherwise.
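As a small illustration, this rule can be written as follows, assuming per-annotator records with a 'consistency' field (the field name is an assumption about the data format):

```python
# Hedged sketch of the SummEval binarization rule described above.
def summeval_is_consistent(annotations) -> bool:
    """annotations: list of per-annotator dicts with a 1-5 'consistency' score."""
    return all(a["consistency"] == 5 for a in annotations)
```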

FRANK (Pagnoni et al., 2021) contains anno-
tations for summarizers trained on both CNN/DM
and XSum, with each summary annotated by three
crowd-workers. The authors propose a new ty-
pology with seven error types, organized into
semantic frame errors, discourse errors, and con-
tent verifiability errors. The authors confirm that
models trained on the more abstractive XSum
dataset generate a larger proportion of inconsis-
tent summaries, compared to models trained on
CNN/DM. We label summaries as consistent if
a majority of annotators labeled the summary as
containing no error.

4.3 Benchmark Evaluation Metrics

With each dataset in the SUMMAC Benchmark
converted to a binary classification task, we now
discuss the choice of appropriate evaluation met-
rics for the benchmark. Previous work on each
dataset in the benchmark used different evaluation
Methoden, falling into three main categories.

First, CoGenSumm proposes a re-ranking-based
measure, requiring pairs of consistent and incon-
sistent summaries for any document evaluated;
this information is not available in several datasets
in the benchmark.

Second, XSumFaith, SummEval, and FRANK
report on correlation of various metrics with
human annotations. Correlation has some advan-
tages, such as not requiring a threshold and being
compatible with the Likert-scale annotations of
SummEval; however, it is an uncommon choice
to measure performance of a classifier due to the
discrete and binary label.

Third, the authors of FactCC measured model per-
formance using binary F1 score, and balanced
accuracy, which corrects unweighted accuracy
with the class imbalance ratio, so that majority
class voting obtains a score of 50%.


The datasets have widely varying class imbal-
ances, ranging from 6% to 91% positive sam-
ples. Therefore, we select balanced accuracy
(Brodersen et al., 2010) as the primary evalua-
tion metric for the SUMMAC Benchmark. Balanced
accuracy is defined as:

BAcc = (1/2) * ( TP / (TP + FN) + TN / (TN + FP) )          (1)

where TP stands for true positive, FP false pos-
itive, TN true negative, and FN false negative.
The choice of metric is based on the fact that
accuracy is a conceptually simple, interpretable
metric, and that adjusting the class imbalance
out of the metric makes the score more uniform
across datasets.

The balanced accuracy metric requires models
to output a binary label (d.h., not a scalar score),
which for most models requires the selection of a
threshold in the score. The threshold is selected
using the validation set, allowing for a different
threshold for each dataset in the benchmark. Per-
formance on the benchmark is the unweighted
average of performance on the six datasets.
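A small sketch of this protocol, using scikit-learn's balanced accuracy implementation; the threshold search over unique validation scores is an illustrative choice rather than a documented detail.

```python
# Hedged sketch: tune a per-dataset threshold on validation, evaluate on test.
import numpy as np
from sklearn.metrics import balanced_accuracy_score

def pick_threshold(val_scores, val_labels) -> float:
    scores = np.asarray(val_scores, dtype=float)
    best_t, best_bacc = 0.5, -1.0
    for t in np.unique(scores):
        bacc = balanced_accuracy_score(val_labels, (scores >= t).astype(int))
        if bacc > best_bacc:
            best_t, best_bacc = float(t), bacc
    return best_t

def test_balanced_accuracy(threshold, test_scores, test_labels) -> float:
    preds = (np.asarray(test_scores, dtype=float) >= threshold).astype(int)
    return balanced_accuracy_score(test_labels, preds)
```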

We choose the Area Under the Receiver
Operating Characteristic curve (ROC-AUC) as a sec-
ondary evaluation metric, a common metric to
summarize a classifier’s performance at different
threshold levels (Bradley, 1997).

5 Results

We compared the SUMMAC models against a wide
array of baselines and state-of-the-art methods.

5.1 Comparison Models

We evaluated the following models on the
SUMMAC Benchmark:

NER Overlap uses the spaCy named entity
recognition (NER) Modell (Honnibal et al., 2020)
to detect when an entity present in the summary is
not present in the document. This model, adapted
from Laban et al. (2021), considers only a sub-
set of entity types as hallucinations (PERSON,
LOCATION, ORGANIZATION, etc.).
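A hedged sketch of this kind of baseline with spaCy is shown below; the entity-type subset and the decision rule are illustrative rather than the exact configuration used for NER Overlap.

```python
# Hedged sketch: flag a summary if it mentions an entity absent from the document.
import spacy

nlp = spacy.load("en_core_web_sm")               # assumes this spaCy model is installed
ENTITY_TYPES = {"PERSON", "GPE", "LOC", "ORG"}   # illustrative subset of entity types

def has_unsupported_entity(document: str, summary: str) -> bool:
    doc_ents = {e.text.lower() for e in nlp(document).ents if e.label_ in ENTITY_TYPES}
    sum_ents = {e.text.lower() for e in nlp(summary).ents if e.label_ in ENTITY_TYPES}
    return len(sum_ents - doc_ents) > 0          # True -> summary flagged as inconsistent
```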

MNLI-doc is a RoBERTa (Liu et al., 2019)
model finetuned on the MNLI dataset (Williams
et al., 2018). The document is used as the premise
and the summary as a hypothesis, and we use
the predicted probability of entailment as a score,

similar to prior work on using NLI models for
inconsistency detection (Kryscinski et al., 2020).
FactCC-CLS is a RoBERTa-base model fine-
tuned on the synthetic training portion of the
FactCC dataset. Although trained solely on artifi-
cially created inconsistent summaries, prior work
showed the model to be competitive on the FactCC
and FRANK datasets.

DAE (Goyal and Durrett, 2020) is a parsing-
based model using the default model and hyper-
parameters provided by the authors of the paper.2
FEQA (Durmus et al., 2020) is a QAG method,
using the default model and hyper-parameters
provided by the authors of the paper.3

QuestEval (Scialom et al., 2021) is a QAG
method taking both precision and recall
into
account. We use the default model and hyper-
parameters provided by the authors of the paper.4
The model has an option to use an additional
question weighter; however, experiments revealed
that the weighter lowered overall performance on
the validation portion of the SUMMAC Benchmark,
so we compare to the model without the weighter.

5.2 SUMMAC Benchmark Results

Balanced accuracy results are summarized in
Tisch 2. We find that the SUMMAC models achieve
the two best performances in the benchmark.
SUMMACCONV achieves the best benchmark per-
formance at 74.4%, 5 points above QuestEval, the
best method not involving NLI.

Looking at the models’ ability to generalize
across datasets and varying scenarios of inconsis-
tency detection provides interesting insights. For
example, the FactCC-CLS model achieves strong
performance on the FactCC dataset, but close to
lowest performance on FRANK and XSumFaith.
In comparison, SUMMAC model performance is
strong across the board.

The strong improvement from the SUMMACZS
to SUMMACCONV also shines a light on the im-
portance of considering the entire distribution of
document scores for each summary sentence, in-
stead of taking only the maximum score: the
SUMMACCONV model learns to look at the distribu-
tion and makes more robust decisions, leading to
gains in performance.

The table of results with the ROC-AUC metric,
the secondary metric of the SUMMAC Benchmark,

2https://github.com/tagoyal/dae-factuality.
3https://github.com/esdurmus/feqa.
4https://github.com/ThomasScialom/QuestEval.


                                      SUMMAC Benchmark Datasets
Model Type   Model Name      CGS    XSF    Polytope  FactCC   SummEval  FRANK   Overall   Doc./min.
Baseline     NER-Overlap     53.0   63.3    52.0      55.0      56.8     60.9     56.8      55,900
Baseline     MNLI-doc        57.6   57.5    61.0      61.3      66.6     63.6     61.3       6,200
Classifier   FactCC-CLS      63.1   57.6    61.0      75.9      60.1     59.4     62.8      13,900
Parsing      DAE             63.4   50.8    62.8      75.9      70.3     61.7     64.2         755
QAG          FEQA            61.0   56.0    57.8      53.6      53.8     69.9     58.7        33.9
QAG          QuestEval       62.6   62.1    70.3*     66.6      72.5     82.1     69.4        22.7
NLI          SUMMACZS        70.4*  58.4    62.0      83.8*     78.7     79.0     72.1*        435
NLI          SUMMACCONV      64.7   66.4*   62.7      89.5**    81.7**   81.6     74.4**       433

Table 2: Performance of Summary Inconsistency Detection models on the test set of the SUMMAC
Benchmark. Balanced accuracy is computed for each model on the six datasets in the benchmark, and
the average is computed as the overall performance on the benchmark. We obtain confidence intervals
comparing the SUMMAC models to prior work: * indicates an improvement with 95% confidence, and **
99% confidence (details in Section 5.2.1). The results of the throughput analysis of Section 5.2.2 are in
column Doc./min. (documents per minute).

is included in Table A2 (Appendix C), echoing the trends
seen with the balanced accuracy metric.

5.2.1 Statistical Testing

We aim to determine whether the performance im-
provements of the SUMMAC models are statistically
bedeutsam. For each dataset of the benchmark, Wir
perform two tests through bootstrap resampling
(Efron, 1982), comparing each of the SUMMAC
models to the best-performing model from prior
work. We perform interval comparison at two sig-
nificance levels: p = 0.05 and p = 0.01, and apply
the Bonferroni correction (Bonferroni, 1935), as
we perform several tests on each dataset. We
summarize which improvements are significant in
Tisch 2, and perform a similar testing procedure
for the ROC-AUC results in Table A2.
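A hedged sketch of a bootstrap comparison in this spirit is given below; the number of resamples and the win-rate criterion are illustrative, not the exact procedure behind the reported intervals.

```python
# Hedged sketch: bootstrap resampling to compare two models' balanced accuracy.
import numpy as np
from sklearn.metrics import balanced_accuracy_score

def bootstrap_win_rate(labels, preds_a, preds_b, n_resamples=1000, seed=0) -> float:
    """Fraction of resamples in which model A beats model B on balanced accuracy."""
    rng = np.random.default_rng(seed)
    labels, preds_a, preds_b = map(np.asarray, (labels, preds_a, preds_b))
    wins = 0
    for _ in range(n_resamples):
        idx = rng.integers(0, len(labels), size=len(labels))
        bacc_a = balanced_accuracy_score(labels[idx], preds_a[idx])
        bacc_b = balanced_accuracy_score(labels[idx], preds_b[idx])
        wins += int(bacc_a > bacc_b)
    return wins / n_resamples
```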

SUMMAC models lead to a statistically signifi-
cant improvement on CoGenSumm, XSumFaith,
FactCC, and SummEval. QuestEval outperforms
the SUMMAC models on Polytope at a confidence
of 95%. On the FRANK dataset, QuestEval and
SUMMACCONV achieve highest performance with no
statistical difference. Overall on the benchmark,
both SUMMAC models significantly outperform
prior work, SUMMACZS at a p = 0.05 significance
level and SUMMACCONV at p = 0.01.

5.2.2 Computational Cost Comparison

Computational cost of the method is an important
practical factor to consider when choosing a model
to use, as some applications such as training with
a generator with Reinforcement Learning might
require a minimum throughput from the model
(i.e., number of documents processed by the model
per unit of time).

A common method to compare algorithms is
using computational complexity analysis, com-
puting the amount of resources (time, space)
needed as the size of the input varies. Compu-
tational complexity analysis is impractical in our
Fall, as the units of analysis differ between mod-
els, and do not allow for a direct comparison.
More precisely, some of the models' complex-
ity scales with the number of sub-word units
in the document (MNLI-doc, FactCC-CLS),
some with the number of entities in a document
(NER-Overlap, DAE, QuestEval), and some
with number of sentences (the SUMMAC models).
We instead compare models by measuring
throughput on a fixed dataset using a common
hardware setup. More precisely, we measured the
processing time of each model on the 503 docu-
ments in the test set of FactCC (with an average
of 33.2 sentences per document), running a single
Quadro RTX 8000 GPU. For prior work, we used
implementations publicly released by the authors,
and made a best effort to use the model at an
appropriate batch size for a fair comparison.

The result of the throughput analysis is included
in Table 2 (column Docs./min.). SUMMAC mod-
els are able to process around 430 documents
per minute, which is much lower than some of
the baselines capable of processing more than
10,000 documents per minute. Jedoch, QAG
methods are more than 10 times slower than
SUMMAC models, processing only 20-40 docu-
ments per minute.


Architecture   NLI Dataset        ZS     Conv
Dec. Attn      SNLI               56.9   56.4
BERT Base      SNLI               66.6   64.0
BERT Base      MNLI               69.5   69.8
BERT Base      MNLI+VitaminC      67.9   71.2
BERT Large     SNLI               66.6   62.4
BERT Large     SNLI+MNLI+ANLI     69.9   71.7
BERT Large     VitaminC           71.1   72.8
BERT Large     MNLI               70.9   73.0
BERT Large     MNLI+VitaminC      72.1   74.4

Table 3: Effect of NLI model choice on SUMMAC
model performance. For each NLI model, we in-
clude the balanced accuracy scores of SUMMACZS
and SUMMACCONV. BERT X corresponds to a BERT
or other pre-trained model of similar size.

5.3 Further Results

We now examine how different components and
design choices affect SUMMAC model performance.

5.3.1 Choice of NLI Model
SUMMAC models rely on an NLI model at their
core, which consists of choosing two main com-
ponents: a model architecture, and a dataset to
train on. We investigate the effect of both of these
choices on the performance of SUMMAC models
on the benchmark.

Regarding model architectures, we experi-
ment with the decomposable attention model
(Parikh et al., 2016), which is a pre-Transformer
architecture model that was shown to achieve high
performance on SNLI, as well as Transformer base
and Transformer Large architectures.

With respect to datasets, we include models
trained on standard NLI datasets such as SNLI
(Bowman et al., 2015) and MNLI (Williams
et al., 2018), as well as more recent datasets
such as Adversarial NLI (Nie et al., 2019) Und
Vitamin C (Schuster et al., 2021).

Results are summarized in Table 3, and we em-
phasize three trends. Erste, the low performance of
the decomposable attention model used in experi-
ments in prior work (Falke et al., 2019) confirms
that less recent NLI models did not transfer well
to summary inconsistency detection.

Zweite, NLI models based on pre-trained
Transformer architectures all achieve strong per-
formance on the benchmark, with an average

increase of 1.3 percentage points when going
from a base to a large architecture.

Dritte, the choice of NLI dataset has a strong
influence on overall performance. SNLI leads to
lowest performance, which is expected as its tex-
tual domain is based on image captions, which
are dissimilar to the news domain. MNLI and
Vitamin C trained models both achieve close to
the best performance, and training on both jointly
leads to the best model, which we designate as the
default NLI model for the SUMMAC models (i.e.,
the model included in Table 2).

The latter two trends point to the fact that
improvements in the field of NLI lead to improve-
ments in the SUMMAC models, and we can expect
that future progress in the NLI community will
translate to gains of performance when integrated
into the SUMMAC model.

We relied on trained models available in Hug-
gingFace’s Model Hub (Wolf et al., 2020). Details
in Appendix A.

5.3.2 Choice of NLI Category

The NLI task is a three-way classification task,
yet most prior work has limited usage of the
model to the use of the entailment probability for
inconsistency detection (Kryscinski et al., 2020;
Falke et al., 2019). We run a systematic experiment
by training multiple SUMMACCONV models that have
access to varying subsets of the NLI labels, Und
measure the impact on overall performance. Re-
sults are summarized in Table 4. Using solely the
entailment category leads to strong performance
for all models. Jedoch, explicitly including the
contradiction label as well leads to small boosts in
performance for the ANLI and MNLI models.

With future NLI models being potentially more
nuanced and calibrated, it is possible that incon-
sistency detector models will be able to rely on
scores from several categories.

5.3.3 Choice of Granularity

So far, we’ve reported experiments primarily with
a sentence-level granularity, as it matches the
granularity of NLI datasets. One can imagine
cases where sentence-level granularity might be
limiting. Zum Beispiel, in the case of a summary
performing a sentence fusion operation, an NLI
model might not be able to correctly predict en-
tailment of the fused sentence, seeing only one
sentence at a time.


Categories          SUMMACCONV Performance
                 VITC+MNLI   ANLI   MNLI
E                   74.4     69.2   72.6
N                   71.2     55.8   66.4
C                   72.5     69.2   72.6
E, N                73.1     69.6   72.6
E, C                74.0     70.2   73.0
N, C                72.5     69.2   72.6
E, N, C             74.0     69.7   73.0

Table 4: Effect of NLI category inclusion on
SUMMACCONV performance. Models had access
to different subsets of the three category predic-
tions (Entailment, Neutral, Contradiction), with
performance measured in terms of balanced ac-
curacy. Experiments were performed with 3 NLI
models: Vitamin C+MNLI, ANLI, and MNLI.

Granularity                     MNLI            MNLI+VitC
Document     Summary        ZS     Conv       ZS     Conv
Full         Full           56.4    --        72.1    --
Full         Sentence       57.4    --        73.1    --
Paragraph    Full           59.8   61.8       69.8   71.2
Paragraph    Sentence       65.2   64.7       72.6   74.3
Two Sent.    Full           64.0   63.8       69.7   71.3
Two Sent.    Sentence       71.2   73.5       72.5   74.7
Sentence     Full           58.7   61.1       68.4   69.4
Sentence     Sentence       70.3   73.0       72.1   74.4

Table 5: Effect of granularity choice on
SUMMAC model performance. We tested four
granularities on the document side: full, para-
graph, two sentence, and sentence, and two
granularities on the summary side: full and sen-
tence. Performance of the four models is measured
in balanced accuracy on the benchmark test set.

To explore this facet further, we experiment
with modifying the granularity of both the docu-
ment and the summary. With regard to document
granularity, we consider four granularities: (1)
full text, the text is treated as a single block,
(2) paragraph-level granularity, the text is sep-
arated into paragraph blocks, (3) two-sentence
granularity, the text is separated into blocks of
contiguous sentences of size two (i.e., block 1
contains sentence 1-2, block 2 contains sentence
3-4), Und (4) sentence-level, splitting text at in-
dividual sentences. For the summary granularity,
we only consider two granularities: (1) full text,
and (2) sentence, because other granularities are
less applicable since summaries usually consist of
three sentences or fewer.
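A small helper sketch for the document-side granularities is shown below; treating blank lines as paragraph boundaries is an assumption about preprocessing rather than a documented detail.

```python
# Hedged sketch of the four document-side granularity options.
import nltk

def split_blocks(text: str, granularity: str = "sentence"):
    if granularity == "full":
        return [text]
    if granularity == "paragraph":
        return [p.strip() for p in text.split("\n\n") if p.strip()]
    sentences = nltk.sent_tokenize(text)
    if granularity == "two_sentence":
        # block 1 = sentences 1-2, block 2 = sentences 3-4, ...
        return [" ".join(sentences[i:i + 2]) for i in range(0, len(sentences), 2)]
    return sentences  # "sentence"
```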

We study a total of 8 (document, sum-
mary) granularity combinations with the two
best-performing NLI models of Table 3: MNLI
and Vitamin C, each included as SUMMACZS and
SUMMACCONV models.5

Results for the granularity experiments are sum-
marized in Table 5. Overall, finer granularities
lead to better performance, with (sentence,
sentence) and (two sent., sentence)
achieving highest performance across all four
models.

The MNLI-only trained model achieves lowest
performance when used with full text granularity
on the document level, and performance steadily
increases from 56.4% to 73.5% as granularity is
made finer both on the document and summary
side. Results for the MNLI+VitaminC model vary
less with changing granularity, showcasing that
the model is perhaps more robust to different
granularity levels. However the (two sent,
Satz) Und (Satz,Satz)
settings achieve highest performance, implying
that finer granularity remains valuable.

For all models, performance degrades in cases
where granularity on the document level is
finer than summary granularity. For example,
the (sentence, full) or (two sent.,
full) combinations lead to some of the low-
est performance. This is expected, as in cases in
which summaries have several sentences, it is un-
likely that they will fully be entailed by a single
document sentence. This implies that granularity
on the document side should be coarser or equal
the summary’s granularity.

Gesamt, we find that finer granularity for the
document and summary is beneficial in terms of
performance and recommend the use of a (sen-
tence, sentence) granularity combination.

6 Discussion and Future Work

Improvements on the Benchmark. The models
we introduced in this paper are just a first step
towards harnessing NLI models for inconsistency
detection. Future work could explore a number
of improvements: combining the predictions of
multiple NLI models, or combining multiple gran-
ularity levels—for example, through multi-hop
reasoning (Zhao et al., 2019).

5We skip SUMMACCONV experiments involving full text
granularity on the document-side, as that case reduces the
binning process to having a single non-zero value.


Interpretability of Model Output. If a model
can pinpoint which portion of a summary is in-
consistent, some work has shown that corrector
models can effectively re-write the problem-
atic portions and often remove the inconsistency
(Dong et al., 2020). Furthermore, fine-grained
consistency scores can be incorporated into visual
analysis tools for summarization such as Summ-
Vis (Vig et al., 2021). The SUMMACZS model is
directly interpretable, whereas the SUMMACCONV
is slightly more opaque, due to the inability to
trace back a low score to a single sentence in the
document being invalidated. Improving the inter-
pretability of the SUMMACCONV model is another
open area for future work.

Beyond News Summarization. The six
datasets in the SUMMAC Benchmark contain
summaries from the news domain, one of the
most common applications of summarization tech-
nology. Recent efforts to expand the application
of summarization to new domains such as legal
(Kornilova and Eidelman, 2019) or scholarly
(Cachola et al., 2020) text will hopefully lead to
the study of inconsistency detection in these novel
domains, and perhaps even out of summarization
on tasks such as text simplification, or code
Generation.

Towards Consistent Summarization. Incon-
sistency detection is but a first step in eliminating
inconsistencies from summarization. Future work
can include more powerful inconsistency detec-
tors in the training of next generation summarizers
to reduce the prevalence of inconsistencies in
generated text.

7 Conclusion

We introduce SUMMACZS and SUMMACCONV, two
NLI-based models for summary inconsistency de-
tection based on the key insight that NLI models
require sentence-level input to work best. Both
models achieve strong performance on the SUM-
MAC Benchmark, a new diverse and standardized
collection of the six largest datasets for inconsis-
tency detection. SUMMACCONV outperforms all prior
work with a balanced accuracy score of 74.4%, an
improvement of five absolute percentage points
over the best baseline. To the best of our knowl-
edge, this is the first successful attempt at adapting
NLI models for inconsistency detection, and we
believe that there are many exciting opportuni-

ties for further improvements and applications of
our methods.

Acknowledgments

We would like to thank Katie Stasaski, Dongyeop
Kang, and the TACL reviewers and editors for
their helpful comments, as well as Artidoro
Pagnoni for helpful pointers during the project.
This work was supported by a Microsoft BAIR
Commons grant as well as a Microsoft Azure
Sponsorship.

References

Kristjan Arumae and Fei Liu. 2019. Guiding extractive summarization with question-answering rewards. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2566–2577. https://doi.org/10.18653/v1/N19-1264

Carlo E. Bonferroni. 1935. Il calcolo delle assi-
curazioni su gruppi di teste. Studi in onore del
professore salvatore ortu carboni, pages 13–60.

Samuel Bowman, Gabor Angeli, Christopher
Potts, and Christopher D. Manning. 2015. A
large annotated corpus for learning natural lan-
guage inference. In Proceedings of the 2015
Conference on Empirical Methods in Natural
Language Processing, pages 632–642.

Andrew P. Bradley. 1997. The use of the area
under the ROC curve in the evaluation of
machine learning algorithms. Pattern Recogni-
tion, 30(7):1145–1159. https://doi.org
/10.1016/S0031-3203(96)00142-2

Kay Henning Brodersen, Cheng Soon Ong, Klaas Enno Stephan, and Joachim M. Buhmann. 2010. The balanced accuracy and its posterior distribution. In 2010 20th International Conference on Pattern Recognition, pages 3121–3124. IEEE. https://doi.org/10.1109/ICPR.2010.764


Isabel Cachola, Kyle Lo, Arman Cohan, Und
Daniel S. Weld. 2020. Tldr: Extreme summa-
rization of scientific documents. In Proceedings
of the 2020 Conference on Empirical Meth-
ods in Natural Language Processing: Findings,
pages 4766–4777. https://doi.org/10
.18653/v1/2020.findings-emnlp.428

Meng Cao, Yue Dong, Jiapeng Wu, and Jackie Chi
Kit Cheung. 2020. Factual error correction for
abstractive summarization models. In Proceed-
ings of the 2020 Conference on Empirical Meth-
ods in Natural Language Processing (EMNLP),
pages 6251–6258. https://doi.org/10
.18653/v1/2020.emnlp-main.506

Yue Dong, Shuohang Wang, Zhe Gan, Yu
Cheng, Jackie Chi Kit Cheung, and Jingjing
Liu. 2020. Multi-fact correction in abstrac-
tive text summarization. In Proceedings of
Die 2020 Conference on Empirical Methods
in Natural Language Processing (EMNLP),
pages 9320–9331. https://doi.org/10
.18653/v1/2020.emnlp-main.749

Esin Durmus, He He, and Mona Diab. 2020. FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5055–5070. https://doi.org/10.18653/v1/2020.acl-main.454

Bradley Efron. 1982. The jackknife, the boot-
strap and other resampling plans. In CBMS-
NSF Regional Conference Series in Applied
Mathematik. https://doi.org/10.1145
/1409360.1409378

Oren Etzioni, Michele Banko, Stephen Soderland,
and Daniel S. Weld. 2008. Open information
extraction from the web. Communications of
the ACM, 51(12):68–74.

Alexander R. Fabbri, Wojciech Kry´sci´nski, Bryan
McCann, Caiming Xiong, Richard Socher,
and Dragomir Radev. 2021. Summeval: Re-
evaluating summarization evaluation. Trans-
actions of the Association for Computational
Linguistik, 9:391–409. https://doi.org
/10.1162/tacl_a_00373

Tobias Falke, Leonardo F. R. Ribeiro, Prasetya Ajie Utama, Ido Dagan, and Iryna Gurevych. 2019. Ranking generated summaries by correctness: An interesting but challenging application for natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2214–2220.

Joseph L. Fleiss. 1971. Measuring nominal scale
agreement among many raters. Psychological
Bulletin, 76(5):378.

Dan Gillick and Yang Liu. 2010. Non-expert
evaluation of summarization systems is risky.
In Proceedings of
the NAACL HLT 2010
Workshop on Creating Speech and Lan-
guage Data with Amazon’s Mechanical Turk,
pages 148–151.

Ben Goodrich, Vinay Rao, Peter J. Liu, Und
Mohammad Saleh. 2019. Assessing the fac-
tual accuracy of generated text. In Proceedings
of the 25th ACM SIGKDD International Con-
ference on Knowledge Discovery & Data
Mining, pages 166–175. https://doi.org
/10.1145/3292500.3330955

Tanya Goyal and Greg Durrett. 2020. Evaluating
factuality in generation with dependency-level
entailment. arXiv preprint arXiv:2010.05478.

Matthew Honnibal, Ines Montani, Sofie Van
Landeghem, and Adriane Boyd. 2020. spaCy:
Industrial-strength Natural Language Pro-
cessing in Python. https://doi.org/10
.18653/v1/2020.findings-emnlp.322

Dandan Huang, Leyang Cui, Sen Yang,
Guangsheng Bao, Kun Wang, Jun Xie, Und
Yue Zhang. 2020. What have we achieved
on text summarization? In Proceedings of
Die 2020 Conference on Empirical Methods
in Natural Language Processing (EMNLP),
pages 446–469. https://doi.org/10.18653/v1/2020.emnlp-main.33

Anastassia Kornilova and Vladimir Eidelman.
2019. Billsum: A corpus for automatic sum-
marization of US legislation. In Proceedings
of the 2nd Workshop on New Frontiers in
Summarization, pages 48–56.

Wojciech Kryscinski, Bryan McCann, Caiming
Xiong, and Richard Socher. 2020. Eval-
uating the factual consistency of abstrac-
tive text summarization. In Proceedings of


Die 2020 Conference on Empirical Methods
in Natural Language Processing (EMNLP),
pages 9332–9346. https://doi.org/10
.18653/v1/2020.emnlp-main.750

Philippe Laban, Tobias Schnabel, Paul Bennett,
and Marti A. Hearst. 2021. Keep it simple:
Unsupervised simplification of multi-paragraph
Text. In Proceedings of the 59th Annual Meet-
the Association for Computational
ing of
Linguistics and the 11th International Joint
Conference on Natural Language Processing
(Volumen 1: Long Papers), pages 6365–6378,
Online. Association for Computational Linguis-
Tics. https://doi.org/10.18653/v1
/2021.acl-long.498

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei
Von, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin
Stoyanov. 2019. RoBERTa: A robustly op-
timized BERT pretraining approach. arXiv
preprint arXiv:1907.11692.

Valerie R. Mariana. 2014. The Multidimensional
Quality Metric (MQM) framework: A new
framework for translation quality assessment.
Brigham Young University.

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factuality in abstractive summarization. arXiv preprint arXiv:2005.00661. https://doi.org/10.18653/v1/2020.acl-main.173

Ramesh Nallapati, Bowen Zhou, Cicero dos
Santos, Çağlar Gulçehre, and Bing Xiang.
2016. Abstractive text summarization using
sequence-to-sequence RNNs and beyond. In Pro-
ceedings of The 20th SIGNLL Conference
on Computational Natural Language Learn-
ing, pages 280–290. https://doi.org/10
.18653/v1/K16-1028

Feng Nan, Cicero Nogueira dos Santos, Henghui
Zhu, Patrick Ng, Kathleen McKeown, Ramesh
Nallapati, Dejiao Zhang, Zhiguo Wang,
Andrew O. Arnold, and Bing Xiang. 2021.
Improving factual consistency of abstractive
summarization via question answering. arXiv
preprint arXiv:2105.04623. https://doi
.org/10.18653/v1/2021.acl-long.536


Shashi Narayan, Shay B. Cohen, and Mirella
Lapata. 2018. Don’t give me the details, just
the summary! Topic-aware convolutional neu-
ral networks for extreme summarization. In
Verfahren der 2018 Conference on Empir-
ical Methods in Natural Language Processing,
pages 1797–1807. https://doi.org/10
.18653/v1/D18-1206

Yixin Nie, Adina Williams, Emily Dinan, Mohit
Bansal, Jason Weston, and Douwe Kiela. 2019.
Adversarial NLI: A new benchmark for nat-
ural language understanding. arXiv preprint
arXiv:1910.14599. https://doi.org/10
.18653/v1/2020.acl-main.441

Artidoro Pagnoni, Vidhisha Balachandran, Und
Yulia Tsvetkov. 2021. Understanding factual-
ity in abstractive summarization with FRANK: A
benchmark for factuality metrics. In NAACL.

Ankur Parikh, Oscar Täckström, Dipanjan Das,
and Jakob Uszkoreit. 2016. A decomposable at-
tention model for natural language inference. In
Verfahren der 2016 Conference on Empir-
ical Methods in Natural Language Processing,
pages 2249–2255.

Ramakanth Pasunuru and Mohit Bansal. 2018.
Multi-reward reinforced summarization with
saliency and entailment. In Proceedings of
Die 2018 Conference of the North American
Chapter of the Association for Computational
Linguistik: Human Language Technologies,
Volumen 2 (Short Papers), pages 646–653.
https://doi.org/10.18653/v1/N18
-2102

Tal Schuster, Adam Fisch, and Regina Barzilay.
2021. Get your Vitamin C! Robust fact verifica-
tion with contrastive evidence. In Proceedings
of the 2021 Conference of the North American
Chapter of
the Association for Computa-
tional Linguistics: Human Language Technolo-
gies, pages 624–643. https://doi.org
/10.18653/v1/2021.naacl-main.52

Thomas Scialom, Paul-Alexis Dray, Patrick Gallinari, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano, and Alex Wang. 2021. QuestEval: Summarization asks for fact-based evaluation. arXiv preprint arXiv:2103.12693. https://doi.org/10.18653/v1/2021.emnlp-main.529


Thomas Scialom, Sylvain Lamprier, Benjamin
Piwowarski, and Jacopo Staiano. 2019. Ein-
swers unite! unsupervised metrics for rein-
forced summarization models. In Proceedings
of the 2019 Conference on Empirical Meth-
ods in Natural Language Processing and the
9th International Joint Conference on Natu-
ral Language Processing (EMNLP-IJCNLP),
pages 3246–3256. https://doi.org/10
.18653/v1/D19-1320

James Thorne, Andreas Vlachos, Christos
Christodoulopoulos, and Arpit Mittal. 2018.
Fever: A large-scale dataset for fact extraction
and verification. In Proceedings of the 2018
Conference of the North American Chapter
of the Association for Computational Linguis-
Tics: Human Language Technologies, Volumen 1
(Long Papers), pages 809–819. https://
doi.org/10.18653/v1/N18-1074

Jesse Vig, Wojciech Kryscinski, Karan Goel, Und
Nazneen Fatema Rajani. 2021. Summvis: Inter-
active visual analysis of models, Daten, and eval-
uation for text summarization. arXiv preprint
arXiv:2104.07605. https://doi.org/10
.18653/v1/2021.acl-demo.18

Adina Williams, Nikita Nangia, and Samuel
Bowman. 2018. A broad-coverage challenge
corpus for sentence understanding through in-
Referenz. In Proceedings of the 2018 Conference
of the North American Chapter of the Asso-
ciation for Computational Linguistics: Human
Language Technologies, Volumen 1 (Long Pa-
pers), pages 1112–1122. https://doi.org
/10.18653/v1/N18-1101

Thomas Wolf, Lysandre Debut, Victor Sanh,
Julien Chaumond, Clement Delangue, Anthony
Moi, Pierric Cistac, Tim Rault, R´emi Louf,
Morgan Funtowicz, Joe Davison, Sam Shleifer,
Patrick von Platen, Clara Ma, Yacine Jernite,
Julien Plu, Canwen Xu, Teven Le Scao,
Sylvain Gugger, Mariama Drame, Quentin
Lhoest, and Alexander M. Rush. 2020. Trans-
formers: State-of-the-art natural language pro-
Abschließen. In Proceedings of the 2020 Conference
on Empirical Methods in Natural Language
Processing: Systemdemonstrationen, pages 38–45,
Online. Association for Computational Lin-
guistics. https://doi.org/10.18653
/v1/2020.emnlp-demos.6

Jingqing Zhang, Yao Zhao, Mohammad Saleh,
and Peter Liu. 2020A. Pegasus: Pre-training
with extracted gap-sentences for abstractive
summarization. In International Conference
on Machine Learning, pages 11328–11339.
PMLR.

Yuhao Zhang, Derek Merck, Emily Tsai,
Christopher D. Manning, and Curtis Langlotz.
2020B. Optimizing the factual correctness of
a summary: A study of summarizing radiology
Berichte. In Proceedings of the 58th Annual
Meeting of the Association for Computational
Linguistik, pages 5108–5120. https://doi
.org/10.18653/v1/2020.acl-main.458

Chen Zhao, Chenyan Xiong, Corby Rosset, Xia
Song, Paul Bennett, and Saurabh Tiwary.
2019. Transformer-XH: Multi-evidence reason-
ing with extra hop attention. In International
Conference on Learning Representations.

Appendix

A NLI Model Origin

We list the NLI models we used throughout the
Papier, which can be retrieved on HuggingFace’s
model hub.6 BERT stands for any Pre-trained
bi-directional Transformer of an equivalent size:

• boychaboy/SNLI roberta-base

BERT Base+SNLI

• microsoft/deberta-base-mnli

BERT Base+MNLI

• tals/albert-base-vitaminc-mnli

BERT Base + MNLI + VitaminC

• boychaboy/SNLI roberta-large

BERT Large+SNLI

• tals/albert-xlarge-vitaminc

Bert Large+VitaminC

• roberta-large-mnli

Bert Large+MNLI

• tals/albert-xlarge-vitaminc-mnli

BERT Large+MNLI+VitaminC

6https://huggingface.co/models.
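For reference, a short usage sketch for one of the checkpoints above; the printed label names come from the checkpoint's configuration and may be generic depending on the model card.

```python
# Hedged usage sketch for a listed NLI checkpoint.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "tals/albert-xlarge-vitaminc-mnli"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

enc = tokenizer("The cat sat on the mat.", "A cat is on a mat.", return_tensors="pt")
probs = model(**enc).logits.softmax(dim=-1)[0]
print({model.config.id2label[i]: round(float(p), 3) for i, p in enumerate(probs)})
```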


B SUMMACZS Operator Choice

Table A1 measures the effect of the choice of
the two operators in the SUMMACZS model. Wir
explore three options (min, mean, and max)
for each operator. We find that the choice of max
for Operator 1 and mean for Operator 2 achieves
the highest performance and use these choices in
our model.

C SUMMAC Benchmark ROC-AUC Results

Table A2 details results of models on the
benchmark according to the ROC-AUC metric,
confirming that the SUMMAC models achieve the
two best accuracy results on the benchmark.

                   Operator 2
Op. 1       Min     Mean    Max
Min         53.1    55.7    57.4
Mean        60.5    62.8    62.0
Max         68.8    72.1    69.1

Table A1: Effect of operator choice on the per-
formance of the SUMMACZS model, measured
in terms of balanced accuracy. Operator 1
reduces the row dimension of the NLI Pair
Matrix, and Operator 2 reduces the column
dimension.

                               SUMMAC Benchmark Datasets
Model Type   Model Name      CGS    XSF   Polytope  FactCC   SummEval  FRANK   Overall
Baseline     NER-Overlap     53.0   61.7    51.6     53.1      56.8     60.9     56.2
Baseline     MNLI-doc        59.4   59.4    62.6     62.1      70.0     67.2     63.4
Classifier   FactCC-CLS      65.0   59.2    63.5     79.6      61.4     62.7     65.2
Parsing      DAE             67.8   41.3    64.1     82.7      77.4     64.3     66.3
QAG          FEQA            60.8   53.4    54.6     50.7      52.2     74.8     57.7
QAG          QuestEval       64.4   66.4    72.2     71.5      79.0     87.9     73.6
NLI          SUMMACZS        73.1   58.0    60.3     83.7      85.5     85.3     74.3
NLI          SUMMACCONV      67.6   70.2    62.4     92.2**    86.0*    88.4     77.8**

Table A2: Performance of Summary Inconsistency Detection models on the test portion of the SUMMAC
Benchmark in terms of ROC-AUC metric. The metric is computed for each model on the six datasets in
the benchmark, and the average is computed as the overall performance on the benchmark. Confi-
dence intervals comparing the SUMMAC models to prior work: * indicates an improvement with 95%
confidence, Und ** 99% confidence (details in Section 5.2.1).
