SUMMAC: Re-Visiting NLI-based Models for
Inconsistency Detection in Summarization
Philippe Laban, Tobias Schnabel, Paul N. Bennett, Marti A. Hearst
UC Berkeley, USA; Microsoft, USA; Microsoft, USA; UC Berkeley, USA∗
∗Author emails: {phillab,hearst}@berkeley.edu, {Tobias.Schnabel,Paul.N.Bennett}@microsoft.com
Abstract

In the summarization domain, a key requirement for summaries is to be factually consistent with the input document. Prior work has found that natural language inference (NLI) models do not perform competitively when applied to inconsistency detection. In this work, we revisit the use of NLI for inconsistency detection, finding that past work suffered from a mismatch in input granularity between NLI datasets (sentence level) and inconsistency detection (document level). We provide a highly effective and light-weight method called SUMMACCONV that enables NLI models to be successfully used for this task by segmenting documents into sentence units and aggregating scores between pairs of sentences. We furthermore introduce a new benchmark called SUMMAC (Summary Consistency), which consists of six large inconsistency detection datasets. On this benchmark, SUMMACCONV obtains state-of-the-art results with a balanced accuracy of 74.4%, a 5% improvement compared with prior work.
1 Introduction

Recent progress in text summarization has been remarkable, with ROUGE record-setting models published every few months, and human evaluations indicating that automatically generated summaries are matching human-written summaries in terms of fluency and informativeness (Zhang et al., 2020a).
A major limitation of current summarization models is their inability to remain factually consistent with the respective input document. Summary inconsistencies are diverse, ranging from inversions (i.e., negation) to incorrect use of an entity (i.e., subject or object swapping) or hallucinations (i.e., the introduction of an entity not in the original document). Recent studies have shown that in some scenarios, even state-of-the-art pre-trained language models can generate inconsistent summaries in more than 70% of all cases (Pagnoni et al., 2021). This has led to accelerated research around summary inconsistency detection.
A closely related task to inconsistency detection is textual entailment, also referred to as Natural Language Inference (NLI), in which a hypothesis sentence must be classified as entailed by, neutral with respect to, or contradicting a premise sentence. Enabled by the crowd-sourcing of large NLI datasets such as SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018), modern architectures have achieved close to human performance at the task.

The similarity of NLI to inconsistency detection, as well as the availability of high-performing NLI models, led to early attempts at using NLI to detect consistency errors in summaries. These early attempts were unsuccessful, finding that re-ranking summaries according to an NLI model can lead to an increase in consistency errors (Falke et al., 2019), or that out-of-the-box NLI models obtain 52% accuracy at the binary classification task of inconsistency detection, only slightly above random guessing (Kryscinski et al., 2020).
In this work, we revisit this approach, showing that NLI models can in fact successfully be used for inconsistency detection, as long as they are used at the appropriate granularity. Figure 1 shows how crucial using the correct granularity as input to NLI models is. An inconsistency checker should flag the last sentence in the summary (shown right) as problematic. When treating the entire document as the premise and the summary as the hypothesis, a competitive NLI model predicts with probability 0.91 that the summary is entailed by the document. However, when splitting the document into sentence premise-hypothesis pairs (visualized as edges in Figure 1), the NLI model correctly determines that S3 is not supported by any document sentence. This illustrates that working with sentence pairs is crucial for making NLI models work for inconsistency detection.
Transactions of the Association for Computational Linguistics, vol. 10, pp. 163–177, 2022. https://doi.org/10.1162/tacl_a_00453
Action Editor: Shay Cohen. Submission batch: 8/2021; Revision batch: 11/2021; Published 2/2022.
© 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Figure 1: Example document with an inconsistent summary. When running each sentence pair (Di, Sj) through an NLI model, S3 is not entailed by any document sentence. However, when running the entire (document, summary) pair at once, the NLI model incorrectly predicts that the document highly entails the entire summary.

Our contributions are two-fold. First, we introduce a new approach for inconsistency detection based on the aggregation of sentence-level entailment scores for each pair of input document and summary sentences. We present two model variants that differ in the way they aggregate sentence-level scores into a single score. SUMMACZS performs zero-shot aggregation by combining sentence-level scores using max and mean operators. SUMMACCONV is a trained model consisting of a single learned convolution layer compiling the distribution of entailment scores of all document sentences into a single score.

Second, to evaluate our approach, we introduce the SUMMAC Benchmark by standardizing existing datasets. Because the benchmark contains the six largest summary consistency datasets, it is more comprehensive and includes a broader range of inconsistency errors than prior work.

The SUMMAC models outperform existing inconsistency detection models on the benchmark, with SUMMACCONV obtaining an overall balanced accuracy of 74.4%, 5% above prior work. We publicly release the models and datasets.1

1https://github.com/tingofurro/summac/.

2 Related Work

We briefly survey existing methods and datasets for fact checking, inconsistency detection, and inconsistency correction.

2.1 Fact Checking and Verification

Fact checking is a related task in which a model receives an input claim along with a corpus of ground-truth information. The model must then retrieve relevant evidence and decide whether the claim is supported, refuted, or whether there is not enough information in the corpus (Thorne et al., 2018). The major difference to our task lies in the different semantics of consistency and accuracy. If a summary adds novel and accurate information not present in the original document (e.g., adding background information), the summary is accurate but inconsistent. In the summary inconsistency detection domain, the focus is on detecting any inconsistency, regardless of its accuracy, as prior work has shown that current automatic summarizers are predominantly inaccurate when inconsistent (Maynez et al., 2020).

2.2 Datasets for Inconsistency Detection

Several datasets have been annotated to evaluate model performance in inconsistency detection, typically comprising up to two thousand annotated summaries. Datasets are most commonly crowd-annotated with three judgements each, despite some work showing that as many as eight annotators are required to achieve high inter-annotator agreement (Falke et al., 2019).

Reading the entire original document being summarized is time-consuming, and to amortize this cost, consistency datasets often contain multiple summaries, generated by different models, for the same original document.

Some datasets consist of an overall consistency label for a summary (e.g., FactCC [Kryscinski et al., 2020]), while others propose a finer-grained typology with up to 8 types of consistency errors (Huang et al., 2020).

We include the six largest summary consistency datasets in the SUMMAC Benchmark, and describe them in more detail in Section 4.

2.3 Methods for Inconsistency Detection

Due to data limitations, most inconsistency detection methods adapt NLP pipelines from other tasks, including QAG models, synthetic classifiers, and parsing-based methods.

QAG methods follow three steps: (1) question generation (QG), (2) question answering (QA) with the document and the summary, and (3) matching document and summary answers. A summary is
considered consistent if few or no questions have differing answers between the document and the summary. A key design choice for these methods lies in the source for question generation. Durmus et al. (2020) generate questions using the summary as a source, making their FEQA method precision-oriented. Scialom et al. (2019) generate questions with the document as a source, creating a recall-focused measure. Scialom et al. (2021) unite both in QuestEval, by generating two sets of questions, sourced from the summary and the document respectively. We include FEQA and QuestEval in our benchmark results.

Synthetic classifiers rely on large, synthetic datasets of summaries with inconsistencies, and use those to train a classifier with the expectation that the model generalizes to non-synthetic summaries. To generate a synthetic dataset, Kryscinski et al. (2020) propose a set of semantically invariant (e.g., paraphrasing) and variant (e.g., sentence negation) text transformations that they apply to a large summarization dataset. FactCC-CLS, the classifier obtained when training on the synthetic dataset, is included in our benchmark results for comparison.

Parsing-based methods generate relations through parsing and compute the fraction of summary relations that are compatible with document relations as a precision measure of summary factuality. Goodrich et al. (2019) extract (subject, relation, object) tuples, most commonly using OpenIE (Etzioni et al., 2008). In the recent DAE model, Goyal and Durrett (2020) propose to use arc labels from a dependency parser instead of relation triplets. We include the DAE model in our benchmark results.
2.4 Methods for Consistency Correction

Complementary to inconsistency detection, some work has focused on the task of mitigating inconsistency errors during summarization. Approaches fall into two categories: Reinforcement Learning (RL) methods to improve models, and stand-alone re-writing methods.

RL methods often rely on an out-of-the-box inconsistency detection model and use reinforcement learning to optimize a reward with a consistency component. Arumae and Liu (2019) optimize a QA-based consistency reward, and Nan et al. (2021) streamline a QAG reward by combining the QG and QA models, making it more efficient for RL training. Pasunuru and Bansal (2018) leverage an NLI-based component as part of an overall ROUGE-based reward, and Zhang et al. (2020b) use a parsing-based measure in the domain of medical report summarization.

Re-writing methods typically operate as a modular component that is applied after an existing summarization model. Cao et al. (2020) use a synthetic dataset of rule-corrupted summaries to train a post-corrector model, but find that this model does not transfer well to real summarizer errors. Dong et al. (2020) propose to use a QAG model to find erroneous spans, which are then corrected using a post-processing model.

Since all the methods discussed above for consistency correction rely on a model to detect inconsistencies, they will naturally benefit from more accurate inconsistency detectors.
3 SUMMAC Models

We now introduce our SUMMAC models for inconsistency detection. The first step common to all models is to apply an out-of-the-box NLI model to generate an NLI Pair Matrix for a (document, summary) pair. The two models we present then differ in the way they process this pair matrix to produce a single consistency score for a given summary. We also describe the SUMMAC evaluation benchmark, a set of inconsistency detection datasets, in Section 4. In Section 5, we measure the performance of the SUMMAC models on this benchmark and investigate components of the models, including which NLI model achieves the highest performance, which NLI categories should be used, and what textual granularity is most effective.
3.1 Generating the NLI Pair Matrix

NLI datasets are predominantly represented at the sentence level. In our pilot experiments, we found that this causes the resulting NLI models to fail in assessing consistency for documents with 50 sentences or more.

This motivates the following approach. We generate an NLI Pair Matrix by splitting a (document, summary) pair into sentence blocks. The document is split into M blocks, each considered a premise labeled D1, ..., DM, and the summary is split into N blocks, each considered a hypothesis labeled S1, ..., SN.

Each (Di, Sj) combination is run through the NLI model, which produces a probability distribution over the three NLI categories (Eij, Cij, Nij)
for entailment, contradiction, and neutral, respectively. If not specified otherwise, the pair matrix is an M × N matrix consisting of the entailment scores Eij. In Section 5.3.3, we examine the effect of granularity by splitting texts at the paragraph level or binning two sentences at a time. In Section 5.3.2, we explore the use of the contradiction and neutral categories in our experiments.

The example in Figure 1 has M = 4 document sentences and N = 3 summary sentences, and the corresponding NLI Pair Matrix is the following:

Xpair =
[ 0.02  0.02  0.04 ]
[ 0.98  0.00  0.00 ]
[ 0.43  0.99  0.00 ]
[ 0.00  0.00  0.01 ]

The pair matrix can be interpreted as the weights of a bipartite graph, which is also illustrated in Figure 1, where the opacity of each edge (i, j) represents the entailment probability Eij.

The two SUMMAC models take as input the same NLI Pair Matrix, but differ in the aggregation method used to transform the pair matrix into a score. Figure 2 presents an overview of SUMMACZS and SUMMACCONV.

Figure 2: Diagram of the SUMMACZS (top) and SUMMACCONV (bottom) models. Both models utilize the same NLI Pair Matrix (middle) but differ in their processing to obtain a score. SUMMACZS is zero-shot and does not have trained parameters. SUMMACCONV uses a convolutional layer trained on a binned version of the NLI Pair Matrix.

3.2 SUMMACZS: Zero-Shot

In the SUMMACZS model, we reduce the pair matrix to a one-dimensional vector by taking the maximum (max) value of each column. On an intuitive level, for each summary sentence, this step consists of retaining the score of the document sentence that provides the strongest support for that summary sentence. For the example in Figure 1:

max(Xpair, axis='col') = [ 0.98  0.99  0.04 ]

The second step consists of taking the mean of the produced vector, reducing the vector to a scalar which is used as the final model score. At a high level, this step aggregates sentence-level information into a single score for the entire summary. For example, in Figure 1, the score produced by SUMMACZS would be 0.67. If we removed the third sentence from the summary, the score would increase to 0.985. We experiment with replacing the max and mean operators with other operators in Appendix B.

3.3 SUMMACCONV: Convolution

One limitation of SUMMACZS is that it is highly sensitive to extrema, which can be noisy due to the presence of outliers and the imperfect nature of NLI models. In SUMMACCONV, we reduce the reliance on extrema values by instead taking into account the entire distribution of entailment scores for each summary sentence. For each summary sentence, a learned convolutional layer is in charge of converting the entire distribution into a single score.

The first step of the SUMMACCONV algorithm is to turn each column of the NLI Pair Matrix into a fixed-size histogram that represents the distribution of scores for that given summary sentence. We bin the NLI scores into H evenly spaced bins (e.g., if H = 5, the bins are [0, 0.2), [0.2, 0.4), [0.4, 0.6), [0.6, 0.8), [0.8, 1)). Thus the first summary sentence of the example in Figure 1 would have the following histogram: [2, 0, 1, 0, 1], because there are two values in [0.0, 0.2) in the first column, one in [0.4, 0.6), and one in [0.8, 1.0].

By producing one histogram for each summary sentence, the binning process in the example of Figure 1 would produce:

bin(Xpair) =
[ 2  3  4 ]
[ 0  0  0 ]
[ 1  0  0 ]
[ 0  0  0 ]
[ 1  1  0 ]
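For concreteness, the following is a minimal sketch of the pair-matrix construction and the SUMMACZS aggregation described above, using NumPy and NLTK for sentence splitting. The nli_entailment_prob helper is a hypothetical stand-in for any sentence-level NLI model returning an entailment probability (see Appendix A for concrete checkpoints); it is an illustration rather than the released implementation.

import numpy as np
from nltk.tokenize import sent_tokenize  # assumes NLTK's 'punkt' tokenizer data is installed

def nli_entailment_prob(premise: str, hypothesis: str) -> float:
    """Hypothetical stand-in: return P(entailment) from a sentence-level NLI model."""
    raise NotImplementedError  # e.g., wrap one of the NLI checkpoints listed in Appendix A

def nli_pair_matrix(document: str, summary: str) -> np.ndarray:
    """Build the M x N matrix of entailment scores E_ij for (document sentence i, summary sentence j)."""
    doc_sents = sent_tokenize(document)   # premises D_1 ... D_M
    sum_sents = sent_tokenize(summary)    # hypotheses S_1 ... S_N
    return np.array([[nli_entailment_prob(d, s) for s in sum_sents] for d in doc_sents])

def summac_zs_score(pair_matrix: np.ndarray) -> float:
    """Zero-shot aggregation: max over document sentences, then mean over summary sentences."""
    strongest_support = pair_matrix.max(axis=0)  # one value per summary sentence
    return float(strongest_support.mean())

# Reproducing the Figure 1 example: mean(0.98, 0.99, 0.04) = 0.67
x_pair = np.array([[0.02, 0.02, 0.04],
                   [0.98, 0.00, 0.00],
                   [0.43, 0.99, 0.00],
                   [0.00, 0.00, 0.01]])
print(round(summac_zs_score(x_pair), 2))  # 0.67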
Dataset                            Valid.  Test   % Positive  IAA    Source  # Summarizers  # Sublabels
CoGenSumm (Falke et al., 2019)     1281    400    49.8        0.65   C       3              0
XSumFaith (Maynez et al., 2020)    1250    1250   10.2        0.80   X       5              2
Polytope (Huang et al., 2020)      634     634    6.6         −      C       10             8
FactCC (Kryscinski et al., 2020)   931     503    85.0        −      C       10             0
SummEval (Fabbri et al., 2021)     850     850    90.6        0.70   C       23             4
FRANK (Pagnoni et al., 2021)       671     1575   33.2        0.53   C+X     9              7

Table 1: Statistics of the six datasets in the SUMMAC Benchmark. For each dataset, we report the validation and test set sizes, the percentage of summaries with positive (consistent) labels (% Positive), the inter-annotator agreement (when available, IAA), the source of the documents (Source: C for CNN/DM, X for XSum), the number of summarizers evaluated, and the number of sublabels annotated.
The binned matrix is then passed through a 1-D convolution layer with a kernel size of H. The convolution layer scans the summary histograms one at a time and compiles each into a scalar value for each summary sentence. Finally, the scores of the summary sentences are averaged to obtain the final summary-level score.

In order to learn the weights of the convolution layer, we train the SUMMACCONV model end-to-end with the synthetic training data in FactCC (Kryscinski et al., 2020). The original training dataset contains one million (document, summary) pairs evenly distributed between consistent and inconsistent summaries. Because we are only training a small set of H parameters (we use H = 50), we find that using a 10,000-sample subset is sufficient. We train the model using a cross-entropy loss, the Adam optimizer, a batch size of 32, and a learning rate of 10^-2. We perform hyper-parameter tuning on a validation set from the FactCC dataset.

The number of bins used in the binning process, which corresponds to the number of parameters in the convolution layer, is also a hyper-parameter that we tune on the validation set. We find that performance increases until 50 bins (i.e., a bin width of 0.02) and then plateaus. We use 50 bins in all our experiments.
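To make the binning and convolution steps concrete, here is a minimal sketch assuming NumPy and PyTorch. The convolution layer below is randomly initialized; in the actual SUMMACCONV model its weights are learned on the FactCC synthetic data as described above, so this is an illustration of the architecture rather than the trained scorer.

import numpy as np
import torch
import torch.nn as nn

def bin_pair_matrix(pair_matrix: np.ndarray, n_bins: int = 50) -> np.ndarray:
    """Turn each column (summary sentence) of the NLI Pair Matrix into a histogram of its scores."""
    hists = [np.histogram(col, bins=n_bins, range=(0.0, 1.0))[0] for col in pair_matrix.T]
    return np.stack(hists).astype(np.float32)  # shape: (N summary sentences, n_bins)

class SummaCConvScorer(nn.Module):
    """A single 1-D convolution with kernel size H compiles each histogram into one scalar."""
    def __init__(self, n_bins: int = 50):
        super().__init__()
        self.conv = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=n_bins)

    def forward(self, histograms: torch.Tensor) -> torch.Tensor:
        # histograms: (N, n_bins) -> (N, 1, n_bins); convolution output: (N, 1, 1)
        per_sentence = self.conv(histograms.unsqueeze(1)).squeeze()
        return per_sentence.mean()  # average over summary sentences -> summary-level score

# The Figure 1 example with H = 5 bins for readability:
x_pair = np.array([[0.02, 0.02, 0.04],
                   [0.98, 0.00, 0.00],
                   [0.43, 0.99, 0.00],
                   [0.00, 0.00, 0.01]])
histograms = bin_pair_matrix(x_pair, n_bins=5)  # first row is [2, 0, 1, 0, 1], as in the text
score = SummaCConvScorer(n_bins=5)(torch.from_numpy(histograms))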
4 SUMMAC Benchmark

To rigorously evaluate the SUMMAC models on a diverse set of summaries with consistency judgements, we introduce a new large benchmark dataset, the SUMMAC Benchmark. It comprises the six largest available datasets for summary inconsistency detection, which we standardize to use the same classification task.

4.1 Benchmark Standardization

We standardize the task of summary inconsistency detection by casting it as a binary classification task. Each dataset contains (document, summary, label) samples, where the label can be either consistent or inconsistent.

Each dataset is divided into a validation and test split, with the validation split being available for parameter tuning. We used existing validation/test splits created by the dataset authors when available. We did not find a split for XSumFaith, Polytope, and SummEval, and created one by putting even-indexed samples in a validation split, and odd-indexed samples in the test split. This method of splitting maintains similar class imbalance and summarizer identity with the entire dataset.
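As an illustration, a minimal sketch of this alternating split (load_samples is a hypothetical loader returning a dataset's list of (document, summary, label) samples in their released order):

samples = load_samples()          # hypothetical: ordered list of (document, summary, label) tuples
validation_split = samples[0::2]  # even-indexed samples
test_split = samples[1::2]        # odd-indexed samples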
We computed inter-annotator agreement with Fleiss' Kappa (Fleiss, 1971) over each dataset as an estimate of dataset quality, omitting datasets for which summaries only had a single annotator (Polytope and FactCC). Table 1 summarizes dataset statistics and properties.
4.2 Benchmark Datasets

We introduce each dataset in the benchmark chronologically, and describe the standardization procedure.

CoGenSumm (Correctness of Generated Summaries, CGS) (Falke et al., 2019) is the first introduced dataset for summary inconsistency detection, based on models trained on the CNN/DM dataset (Nallapati et al., 2016).
The authors proposed that consistency detection should be approached as a ranking problem: Given a consistent and an inconsistent summary for a common document, a ranking model should score the consistent summary higher. Although innovative, other datasets in the benchmark do not always have positive and negative samples for a given document. We thus map the dataset to a classification task by using all inconsistent and consistent summaries as individual samples.

XSumFaith (eXtreme Summarization Faithfulness, XSF) (Maynez et al., 2020) is a dataset with models trained on the XSum dataset (Narayan et al., 2018), which consists of more abstractive summaries than CoGenSumm. The authors find that standard generators remain consistent for only 20-30% of generated summaries. The authors differentiate between extrinsic and intrinsic hallucinations (which we call inconsistencies in this work). Extrinsic hallucinations, which involve words or concepts not in the original document, can nonetheless be accurate or inaccurate. In order for a summarizer to generate an accurate extrinsic hallucination, the summarizer must possess external world knowledge. Because the authors found that the models are primarily inaccurate in terms of extrinsic hallucinations, we map both extrinsic and intrinsic hallucinations to a common inconsistent label.

Polytope (Huang et al., 2020) introduces a more extensive typology of summarization errors, based on the Multi-dimensional Quality Metric (Mariana, 2014). Each summary is annotated with eight possible errors, as well as a severity level for the error. We standardize this dataset by labeling a summary as inconsistent if it was annotated with any of the five accuracy errors (disregarding the three fluency errors). Each summary in Polytope was labeled by a single annotator, making it impossible to measure inter-annotator agreement.

FactCC (Kryscinski et al., 2020) contains validation and test splits that are entirely annotated by the authors of the paper, because attempts at crowd-sourced annotation yielded low inter-annotator agreement. Prior work (Gillick and Liu, 2010) shows that there can be divergence in annotations between experts and non-experts in summarization, and because the authors of the paper are NLP researchers familiar with the limitations of automatic summarization, we expect that FactCC annotations differ in quality from other datasets. FactCC also introduces a synthetic dataset created by modifying consistent summaries with semantically variant rules. We use a sub-portion of this synthetic dataset to train the SUMMACCONV model.
SummEval (Fabbri et al., 2021) contains summarizer outputs from seven extractive models and sixteen abstractive models. Each summary was labeled by three annotators using a 5-point Likert scale along four categories: coherence, consistency, fluency, and relevance. We label summaries as consistent if all annotators gave a score of 5 on consistency, and inconsistent otherwise.

FRANK (Pagnoni et al., 2021) contains annotations for summarizers trained on both CNN/DM and XSum, with each summary annotated by three crowd-workers. The authors propose a new typology with seven error types, organized into semantic frame errors, discourse errors, and content verifiability errors. The authors confirm that models trained on the more abstractive XSum dataset generate a larger proportion of inconsistent summaries, compared to models trained on CNN/DM. We label summaries as consistent if a majority of annotators labeled the summary as containing no error.
4.3 Benchmark Evaluation Metrics

With each dataset in the SUMMAC Benchmark converted to a binary classification task, we now discuss the choice of appropriate evaluation metrics for the benchmark. Previous work on each dataset in the benchmark used different evaluation methods, falling into three main categories.

First, CoGenSumm proposes a re-ranking-based measure, requiring pairs of consistent and inconsistent summaries for any document evaluated; this information is not available in several datasets in the benchmark.

Second, XSumFaith, SummEval, and FRANK report on the correlation of various metrics with human annotations. Correlation has some advantages, such as not requiring a threshold and being compatible with the Likert-scale annotations of SummEval; however, it is an uncommon choice for measuring the performance of a classifier due to the discrete and binary label.

Third, the authors of FactCC measured model performance using binary F1 score and balanced accuracy, which corrects unweighted accuracy with the class imbalance ratio, so that majority class voting obtains a score of 50%.
The datasets have widely varying class imbalances, ranging from 6% to 91% positive samples. Therefore, we select balanced accuracy (Brodersen et al., 2010) as the primary evaluation metric for the SUMMAC Benchmark. Balanced accuracy is defined as:

BAcc = 1/2 * ( TP / (TP + FN) + TN / (TN + FP) )     (1)

where TP stands for true positive, FP false positive, TN true negative, and FN false negative. The choice of metric is based on the fact that accuracy is a conceptually simple, interpretable metric, and that adjusting the class imbalance out of the metric makes the score more uniform across datasets.

The balanced accuracy metric requires models to output a binary label (i.e., not a scalar score), which for most models requires the selection of a threshold on the score. The threshold is selected using the validation set, allowing for a different threshold for each dataset in the benchmark. Performance on the benchmark is the unweighted average of performance on the six datasets.

We choose the Area Under the Curve of the Receiver Operating Characteristic (ROC-AUC) as a secondary evaluation metric, a common metric to summarize a classifier's performance at different threshold levels (Bradley, 1997).
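To make the evaluation protocol concrete, here is a minimal sketch assuming scikit-learn and NumPy, with hypothetical arrays of model scores and binary labels for the validation and test splits of one dataset:

import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

def pick_threshold(val_scores, val_labels):
    """Select the threshold that maximizes balanced accuracy on the validation split."""
    candidates = np.unique(val_scores)
    return max(candidates, key=lambda t: balanced_accuracy_score(val_labels, val_scores >= t))

def evaluate_dataset(val_scores, val_labels, test_scores, test_labels):
    """Return (balanced accuracy, ROC-AUC) on the test split, with the threshold tuned on validation."""
    val_scores, test_scores = np.asarray(val_scores), np.asarray(test_scores)
    threshold = pick_threshold(val_scores, np.asarray(val_labels))
    bacc = balanced_accuracy_score(test_labels, test_scores >= threshold)
    auc = roc_auc_score(test_labels, test_scores)
    return bacc, auc

The benchmark score is then the unweighted mean of the six per-dataset balanced accuracies.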
5 Results

We compared the SUMMAC models against a wide array of baselines and state-of-the-art methods.

5.1 Comparison Models

We evaluated the following models on the SUMMAC Benchmark:

NER Overlap uses the spaCy named entity recognition (NER) model (Honnibal et al., 2020) to detect when an entity present in the summary is not present in the document. This model, adapted from Laban et al. (2021), considers only a subset of entity types as hallucinations (PERSON, LOCATION, ORGANIZATION, etc.); a minimal sketch of this check is given at the end of this subsection.

MNLI-doc is a RoBERTa (Liu et al., 2019) model finetuned on the MNLI dataset (Williams et al., 2018). The document is used as the premise and the summary as the hypothesis, and we use the predicted probability of entailment as a score, similar to prior work on using NLI models for inconsistency detection (Kryscinski et al., 2020).

FactCC-CLS is a RoBERTa-base model finetuned on the synthetic training portion of the FactCC dataset. Although trained solely on artificially created inconsistent summaries, prior work showed the model to be competitive on the FactCC and FRANK datasets.

DAE (Goyal and Durrett, 2020) is a parsing-based model using the default model and hyper-parameters provided by the authors of the paper.2

FEQA (Durmus et al., 2020) is a QAG method, using the default model and hyper-parameters provided by the authors of the paper.3

QuestEval (Scialom et al., 2021) is a QAG method taking both precision and recall into account. We use the default model and hyper-parameters provided by the authors of the paper.4 The model has an option to use an additional question weighter; however, experiments revealed that the weighter lowered overall performance on the validation portion of the SUMMAC Benchmark, and we compare to the model without the weighter.
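The following is a rough sketch of the NER-overlap check described above, using spaCy; the exact entity types and matching rules used by Laban et al. (2021) may differ, so this is an approximation rather than the released baseline:

import spacy

nlp = spacy.load("en_core_web_sm")  # an English pipeline with an NER component
CHECKED_TYPES = {"PERSON", "GPE", "LOC", "ORG"}  # assumed subset of entity types

def ner_overlap_consistent(document: str, summary: str) -> bool:
    """Flag the summary as inconsistent if it mentions a checked entity absent from the document."""
    doc_entities = {ent.text.lower() for ent in nlp(document).ents if ent.label_ in CHECKED_TYPES}
    return all(ent.text.lower() in doc_entities
               for ent in nlp(summary).ents if ent.label_ in CHECKED_TYPES)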
5.2 SUMMAC Benchmark Results
Balanced accuracy results are summarized in
Mesa 2. We find that the SUMMAC models achieve
the two best performances in the benchmark.
SUMMACCONV achieves the best benchmark performance at 74.4%, 5 points above QuestEval, the best method not involving NLI.

Looking at the models' ability to generalize across datasets and varying scenarios of inconsistency detection provides interesting insights. For example, the FactCC-CLS model achieves strong performance on the FactCC dataset, but close to the lowest performance on FRANK and XSumFaith. In comparison, SUMMAC model performance is strong across the board.

The strong improvement from SUMMACZS to SUMMACCONV also shines a light on the importance of considering the entire distribution of document scores for each summary sentence, instead of taking only the maximum score: the SUMMACCONV model learns to look at the distribution and makes more robust decisions, leading to gains in performance.

The table of results with the ROC-AUC metric, the secondary metric of the SUMMAC Benchmark, is included in Table A2 in the Appendix, echoing the trends seen with the balanced accuracy metric.

2https://github.com/tagoyal/dae-factuality.
3https://github.com/esdurmus/feqa.
4https://github.com/ThomasScialom/QuestEval.
Model Type   Model Name    CGS     XSF     Polytope  FactCC    SummEval  FRANK   Overall   Doc./min.
Baseline     NER-Overlap   53.0    63.3    52.0      55.0      56.8      60.9    56.8      55,900
Baseline     MNLI-doc      57.6    57.5    61.0      61.3      66.6      63.6    61.3      6,200
Classifier   FactCC-CLS    63.1    57.6    61.0      75.9      60.1      59.4    62.8      13,900
Parsing      DAE           63.4    50.8    62.8      75.9      70.3      61.7    64.2      755
QAG          FEQA          61.0    56.0    57.8      53.6      53.8      69.9    58.7      33.9
QAG          QuestEval     62.6    62.1    70.3*     66.6      72.5      82.1    69.4      22.7
NLI          SUMMACZS      70.4*   58.4    62.0      83.8*     78.7      79.0    72.1*     435
NLI          SUMMACCONV    64.7    66.4*   62.7      89.5**    81.7**    81.6    74.4**    433
Table 2: Performance of summary inconsistency detection models on the test set of the SUMMAC Benchmark. Balanced accuracy is computed for each model on the six datasets in the benchmark, and the average is computed as the overall performance on the benchmark. We obtain confidence intervals comparing the SUMMAC models to prior work: * indicates an improvement with 95% confidence, and ** 99% confidence (details in Section 5.2.1). The results of the throughput analysis of Section 5.2.2 are in column Doc./min. (documents per minute).
5.2.1 Statistical Testing

We aim to determine whether the performance improvements of the SUMMAC models are statistically significant. For each dataset of the benchmark, we perform two tests through bootstrap resampling (Efron, 1982), comparing each of the SUMMAC models to the best-performing model from prior work. We perform interval comparison at two significance levels, p = 0.05 and p = 0.01, and apply the Bonferroni correction (Bonferroni, 1935) as we perform several tests on each dataset. We summarize which improvements are significant in Table 2, and perform a similar testing procedure for the ROC-AUC results in Table A2.

SUMMAC models lead to a statistically significant improvement on CoGenSumm, XSumFaith, FactCC, and SummEval. QuestEval outperforms the SUMMAC models on Polytope at a confidence of 95%. On the FRANK dataset, QuestEval and SUMMACCONV achieve the highest performance with no statistical difference. Overall on the benchmark, both SUMMAC models significantly outperform prior work, SUMMACZS at a p = 0.05 significance level and SUMMACCONV at p = 0.01.
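A minimal sketch of such a bootstrap comparison is shown below; the exact resampling procedure and number of resamples used here are assumptions (item-level resampling with replacement, one-sided test on the balanced-accuracy difference), not a transcript of the authors' script:

import numpy as np
from sklearn.metrics import balanced_accuracy_score

def bootstrap_pvalue(labels, preds_a, preds_b, n_resamples=10_000, seed=0):
    """Estimate how often system A fails to beat system B in balanced accuracy under resampling."""
    rng = np.random.default_rng(seed)
    labels, preds_a, preds_b = map(np.asarray, (labels, preds_a, preds_b))
    n = len(labels)
    worse_or_equal = 0
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # resample test items with replacement
        delta = (balanced_accuracy_score(labels[idx], preds_a[idx])
                 - balanced_accuracy_score(labels[idx], preds_b[idx]))
        worse_or_equal += delta <= 0
    return worse_or_equal / n_resamples  # compared to a Bonferroni-corrected alpha (0.05 or 0.01)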
5.2.2 Computational Cost Comparison

Computational cost is an important practical factor to consider when choosing a model to use, as some applications, such as training a generator with Reinforcement Learning, might require a minimum throughput from the model (i.e., the number of documents processed by the model per unit of time).

A common method to compare algorithms is computational complexity analysis, computing the amount of resources (time, space) needed as the size of the input varies. Computational complexity analysis is impractical in our case, as the units of analysis differ between models and do not allow for a direct comparison. More specifically, some of the models' complexity scales with the number of sub-word units in the document (MNLI-doc, FactCC-CLS), some with the number of entities in a document (NER-Overlap, DAE, QuestEval), and some with the number of sentences (the SUMMAC models).

We instead compare models by measuring throughput on a fixed dataset using a common hardware setup. More precisely, we measured the processing time of each model on the 503 documents in the test set of FactCC (with an average of 33.2 sentences per document), running on a single Quadro RTX 8000 GPU. For prior work, we used the implementations publicly released by the authors, and made a best effort to use each model at an appropriate batch size for a fair comparison.

The result of the throughput analysis is included in Table 2 (column Doc./min.). SUMMAC models are able to process around 430 documents per minute, which is much lower than some of the baselines capable of processing more than 10,000 documents per minute. However, QAG methods are more than 10 times slower than SUMMAC models, processing only 20-40 documents per minute.
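A throughput number of this kind can be reproduced by timing a scorer over a fixed set of documents, for example (score_batch is a hypothetical wrapper around whichever model is being benchmarked):

import time

def documents_per_minute(score_batch, doc_summary_pairs):
    """Time a scoring function over a fixed list of (document, summary) pairs."""
    start = time.perf_counter()
    score_batch(doc_summary_pairs)  # run the model once over the whole set
    elapsed = time.perf_counter() - start
    return 60.0 * len(doc_summary_pairs) / elapsed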
Architecture   NLI Dataset        ZS     Conv
Dec. Attn      SNLI               56.9   56.4
BERT Base      SNLI               66.6   64.0
BERT Base      MNLI               69.5   69.8
BERT Base      MNLI+VitaminC      67.9   71.2
BERT Large     SNLI               66.6   62.4
BERT Large     SNLI+MNLI+ANLI     69.9   71.7
BERT Large     VitaminC           71.1   72.8
BERT Large     MNLI               70.9   73.0
BERT Large     MNLI+VitaminC      72.1   74.4
Table 3: Effect of NLI model choice on SUMMAC model performance. For each NLI model, we include the balanced accuracy scores of SUMMACZS and SUMMACCONV. BERT Base/Large corresponds to a BERT or other pre-trained model of similar size.
5.3 Further Results
We now examine how different components and
design choices affect SUMMAC model performance.
5.3.1 Choice of NLI Model
SUMMAC models rely on an NLI model at their
core, which consists of choosing two main com-
ponents: a model architecture, and a dataset to
train on. We investigate the effect of both of these
choices on the performance of SUMMAC models
on the benchmark.
Regarding model architectures, we experiment with the decomposable attention model
(Parikh et al., 2016), which is a pre-Transformer
architecture model that was shown to achieve high
performance on SNLI, as well as Transformer base
and Transformer Large architectures.
With respect to datasets, we include models trained on standard NLI datasets such as SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018), as well as more recent datasets such as Adversarial NLI (Nie et al., 2019) and Vitamin C (Schuster et al., 2021).

Results are summarized in Table 3, and we emphasize three trends. First, the low performance of the decomposable attention model used in experiments in prior work (Falke et al., 2019) confirms that less recent NLI models did not transfer well to summary inconsistency detection.

Second, NLI models based on pre-trained Transformer architectures all achieve strong performance on the benchmark, with an average
increase of 1.3 percentage points when going
from a base to a large architecture.
Third, the choice of NLI dataset has a strong influence on overall performance. SNLI leads to the lowest performance, which is expected as its textual domain is based on image captions, which are dissimilar to the news domain. MNLI and Vitamin C trained models both achieve close to the best performance, and training on both jointly leads to the best model, which we designate as the default NLI model for the SUMMAC models (i.e., the model included in Table 2).
The latter two trends point to the fact that
improvements in the field of NLI lead to improve-
ments in the SUMMAC models, and we can expect
that future progress in the NLI community will
translate to gains of performance when integrated
into the SUMMAC model.
We relied on trained models available in Hug-
gingFace’s Model Hub (Wolf et al., 2020). Details
in Appendix A.
5.3.2 Choice of NLI Category
The NLI task is a three-way classification task,
yet most prior work has limited usage of the
model to the use of the entailment probability for
inconsistency detection (Kryscinski et al., 2020;
Falke et al., 2019). We run a systematic experiment
by training multiple SUMMACCONV models that have
access to varying subsets of the NLI labels, and measure the impact on overall performance. Results are summarized in Table 4. Using solely the entailment category leads to strong performance for all models. However, explicitly including the
contradiction label as well leads to small boosts in
performance for the ANLI and MNLI models.
With future NLI models being potentially more
nuanced and calibrated, it is possible that incon-
sistency detector models will be able to rely on
scores from several categories.
5.3.3 Choice of Granularity
Hasta ahora, we’ve reported experiments primarily with
a sentence-level granularity, as it matches the
granularity of NLI datasets. One can imagine
cases where sentence-level granularity might be
limiting. Por ejemplo, in the case of a summary
performing a sentence fusion operation, an NLI
model might not be able to correctly predict en-
tailment of the fused sentence, seeing only one
sentence at a time.
Category                 SUMMACCONV Performance
E    C    N              VITC+MNLI   ANLI    MNLI
✓                        74.4        69.2    72.6
     ✓                   71.2        55.8    66.4
          ✓              72.5        69.2    72.6
✓    ✓                   74.0        70.2    73.0
✓         ✓              73.1        69.6    72.6
     ✓    ✓              72.5        69.2    72.6
✓    ✓    ✓              74.0        69.7    73.0
Table 4: Effect of NLI category inclusion on SUMMACCONV performance. Models had access to different subsets of the three category predictions (Entailment, Neutral, Contradiction), with performance measured in terms of balanced accuracy. Experiments were performed with 3 NLI models: Vitamin C+MNLI, ANLI, and MNLI.
Granularity                         MNLI            MNLI+VitC
Document      Summary               ZS     Conv     ZS     Conv
Full          Full                  56.4   −        72.1   −
Full          Sentence              57.4   −        73.1   −
Paragraph     Full                  59.8   61.8     69.8   71.2
Paragraph     Sentence              65.2   64.7     72.6   74.3
Two Sent.     Full                  64.0   63.8     69.7   71.3
Two Sent.     Sentence              71.2   73.5     72.5   74.7
Sentence      Full                  58.7   61.1     68.4   69.4
Sentence      Sentence              70.3   73.0     72.1   74.4

Table 5: Effect of granularity choice on SUMMAC model performance. We tested four granularities on the document side (full, paragraph, two sentence, and sentence) and two granularities on the summary side (full and sentence). Performance of the four models is measured in balanced accuracy on the benchmark test set.
To explore this facet further, we experiment with modifying the granularity of both the document and the summary. With regard to document granularity, we consider four granularities: (1) full text, where the text is treated as a single block, (2) paragraph-level granularity, where the text is separated into paragraph blocks, (3) two-sentence granularity, where the text is separated into blocks of two contiguous sentences (i.e., block 1 contains sentences 1-2, block 2 contains sentences 3-4), and (4) sentence-level granularity, splitting the text into individual sentences. For the summary granularity, we only consider two granularities: (1) full text and (2) sentence, because other granularities are less applicable since summaries usually consist of three sentences or fewer.

We study a total of 8 (document, summary) granularity combinations with the two best-performing NLI models of Table 3: MNLI and MNLI+VitaminC, each included as SUMMACZS and SUMMACCONV models.5
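A minimal sketch of the document-side splitting for these granularities (sentence splitting via NLTK; paragraph boundaries are assumed to be blank lines, which may differ from the exact preprocessing used in the paper):

from nltk.tokenize import sent_tokenize

def split_document(document: str, granularity: str = "sentence"):
    """Split a document into premise blocks for the NLI Pair Matrix."""
    if granularity == "full":
        return [document]
    if granularity == "paragraph":
        return [p.strip() for p in document.split("\n\n") if p.strip()]
    sentences = sent_tokenize(document)
    if granularity == "two_sentence":
        return [" ".join(sentences[i:i + 2]) for i in range(0, len(sentences), 2)]
    return sentences  # "sentence" granularity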
Results for the granularity experiments are summarized in Table 5. Overall, finer granularities lead to better performance, with (sentence, sentence) and (two sent., sentence) achieving the highest performance across all four models.

The MNLI-only trained model achieves its lowest performance when used with full-text granularity on the document side, and performance steadily increases from 56.4% to 73.5% as granularity is made finer on both the document and summary sides. Results for the MNLI+VitaminC model vary less with changing granularity, showing that the model is perhaps more robust to different granularity levels. However, the (two sent., sentence) and (sentence, sentence) settings achieve the highest performance, implying that finer granularity remains valuable.

For all models, performance degrades in cases where granularity on the document side is finer than the summary granularity. For example, the (sentence, full) or (two sent., full) combinations lead to some of the lowest performance. This is expected, as in cases in which summaries have several sentences, it is unlikely that they will fully be entailed by a single document sentence. This implies that granularity on the document side should be coarser than or equal to the summary's granularity.

Overall, we find that finer granularity for the document and summary is beneficial in terms of performance, and we recommend the use of a (sentence, sentence) granularity combination.
6 Discussion and Future Work

Improvements on the Benchmark. The models we introduced in this paper are just a first step towards harnessing NLI models for inconsistency detection. Future work could explore a number of improvements: combining the predictions of multiple NLI models, or combining multiple granularity levels, for example through multi-hop reasoning (Zhao et al., 2019).
5We skip SUMMACCONV experiments involving full text
granularity on the document-side, as that case reduces the
binning process to having a single non-zero value.
Interpretability of Model Output. If a model can pinpoint which portion of a summary is inconsistent, some work has shown that corrector models can effectively re-write the problematic portions and often remove the inconsistency (Dong et al., 2020). Additionally, fine-grained consistency scores can be incorporated into visual analysis tools for summarization such as SummVis (Vig et al., 2021). The SUMMACZS model is directly interpretable, whereas SUMMACCONV is slightly more opaque, due to the inability to trace a low score back to a single sentence in the document being invalidated. Improving the interpretability of the SUMMACCONV model is another open area for future work.

Beyond News Summarization. The six datasets in the SUMMAC Benchmark contain summaries from the news domain, one of the most common applications of summarization technology. Recent efforts to expand the application of summarization to new domains such as legal (Kornilova and Eidelman, 2019) or scholarly (Cachola et al., 2020) text will hopefully lead to the study of inconsistency detection in these novel domains, and perhaps even beyond summarization, on tasks such as text simplification or code generation.
Towards Consistent Summarization. Inconsistency detection is but a first step in eliminating inconsistencies from summarization. Future work can include more powerful inconsistency detectors in the training of next-generation summarizers to reduce the prevalence of inconsistencies in generated text.
7 Conclusion

We introduce SUMMACZS and SUMMACCONV, two NLI-based models for summary inconsistency detection based on the key insight that NLI models require sentence-level input to work best. Both models achieve strong performance on the SUMMAC Benchmark, a new diverse and standardized collection of the six largest datasets for inconsistency detection. SUMMACCONV outperforms all prior work with a balanced accuracy score of 74.4%, an improvement of five absolute percentage points over the best baseline. To the best of our knowledge, this is the first successful attempt at adapting NLI models for inconsistency detection, and we believe that there are many exciting opportunities for further improvements and applications of our methods.
Acknowledgments
We would like to thank Katie Stasaski, Dongyeop
Kang, and the TACL reviewers and editors for
their helpful comments, as well as Artidoro
Pagnoni for helpful pointers during the project.
This work was supported by a Microsoft BAIR
Commons grant as well as a Microsoft Azure
Sponsorship.
References

Kristjan Arumae and Fei Liu. 2019. Guiding extractive summarization with question-answering rewards. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2566–2577. https://doi.org/10.18653/v1/N19-1264

Carlo E. Bonferroni. 1935. Il calcolo delle assicurazioni su gruppi di teste. Studi in onore del professore Salvatore Ortu Carboni, pages 13–60.

Samuel Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642.

Andrew P. Bradley. 1997. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145–1159. https://doi.org/10.1016/S0031-3203(96)00142-2

Kay Henning Brodersen, Cheng Soon Ong, Klaas Enno Stephan, and Joachim M. Buhmann. 2010. The balanced accuracy and its posterior distribution. In 2010 20th International Conference on Pattern Recognition, pages 3121–3124. IEEE. https://doi.org/10.1109/ICPR.2010.764
Isabel Cachola, Kyle Lo, Arman Cohan, and Daniel S. Weld. 2020. TLDR: Extreme summarization of scientific documents. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 4766–4777. https://doi.org/10.18653/v1/2020.findings-emnlp.428

Meng Cao, Yue Dong, Jiapeng Wu, and Jackie Chi Kit Cheung. 2020. Factual error correction for abstractive summarization models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6251–6258. https://doi.org/10.18653/v1/2020.emnlp-main.506

Yue Dong, Shuohang Wang, Zhe Gan, Yu Cheng, Jackie Chi Kit Cheung, and Jingjing Liu. 2020. Multi-fact correction in abstractive text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9320–9331. https://doi.org/10.18653/v1/2020.emnlp-main.749

Esin Durmus, He He, and Mona Diab. 2020. FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5055–5070. https://doi.org/10.18653/v1/2020.acl-main.454

Bradley Efron. 1982. The jackknife, the bootstrap and other resampling plans. In CBMS-NSF Regional Conference Series in Applied Mathematics. https://doi.org/10.1145/1409360.1409378

Oren Etzioni, Michele Banko, Stephen Soderland, and Daniel S. Weld. 2008. Open information extraction from the web. Communications of the ACM, 51(12):68–74.

Alexander R. Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. 2021. SummEval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics, 9:391–409. https://doi.org/10.1162/tacl_a_00373

Tobias Falke, Leonardo F. R. Ribeiro, Prasetya Ajie Utama, Ido Dagan, and Iryna Gurevych. 2019. Ranking generated summaries by correctness: An interesting but challenging application for natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2214–2220.

Joseph L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378. https://doi.org/10.1037/h0031619

Dan Gillick and Yang Liu. 2010. Non-expert evaluation of summarization systems is risky. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 148–151.

Ben Goodrich, Vinay Rao, Peter J. Liu, and Mohammad Saleh. 2019. Assessing the factual accuracy of generated text. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 166–175. https://doi.org/10.1145/3292500.3330955

Tanya Goyal and Greg Durrett. 2020. Evaluating factuality in generation with dependency-level entailment. arXiv preprint arXiv:2010.05478.

Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. 2020. spaCy: Industrial-strength Natural Language Processing in Python.

Dandan Huang, Leyang Cui, Sen Yang, Guangsheng Bao, Kun Wang, Jun Xie, and Yue Zhang. 2020. What have we achieved on text summarization? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 446–469. https://doi.org/10.18653/v1/2020.emnlp-main.33

Anastassia Kornilova and Vladimir Eidelman. 2019. BillSum: A corpus for automatic summarization of US legislation. In Proceedings of the 2nd Workshop on New Frontiers in Summarization, pages 48–56.

Wojciech Kryscinski, Bryan McCann, Caiming Xiong, and Richard Socher. 2020. Evaluating the factual consistency of abstractive text summarization. In Proceedings of
the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9332–9346. https://doi.org/10.18653/v1/2020.emnlp-main.750

Philippe Laban, Tobias Schnabel, Paul Bennett, and Marti A. Hearst. 2021. Keep it simple: Unsupervised simplification of multi-paragraph text. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6365–6378, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-long.498

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Valerie R. Mariana. 2014. The Multidimensional Quality Metric (MQM) framework: A new framework for translation quality assessment. Brigham Young University.

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factuality in abstractive summarization. arXiv preprint arXiv:2005.00661. https://doi.org/10.18653/v1/2020.acl-main.173

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çağlar Gülçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290. https://doi.org/10.18653/v1/K16-1028

Feng Nan, Cicero Nogueira dos Santos, Henghui Zhu, Patrick Ng, Kathleen McKeown, Ramesh Nallapati, Dejiao Zhang, Zhiguo Wang, Andrew O. Arnold, and Bing Xiang. 2021. Improving factual consistency of abstractive summarization via question answering. arXiv preprint arXiv:2105.04623. https://doi.org/10.18653/v1/2021.acl-long.536

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1797–1807. https://doi.org/10.18653/v1/D18-1206

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2019. Adversarial NLI: A new benchmark for natural language understanding. arXiv preprint arXiv:1910.14599. https://doi.org/10.18653/v1/2020.acl-main.441

Artidoro Pagnoni, Vidhisha Balachandran, and Yulia Tsvetkov. 2021. Understanding factuality in abstractive summarization with FRANK: A benchmark for factuality metrics. In NAACL.

Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2249–2255.

Ramakanth Pasunuru and Mohit Bansal. 2018. Multi-reward reinforced summarization with saliency and entailment. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 646–653. https://doi.org/10.18653/v1/N18-2102

Tal Schuster, Adam Fisch, and Regina Barzilay. 2021. Get your Vitamin C! Robust fact verification with contrastive evidence. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 624–643. https://doi.org/10.18653/v1/2021.naacl-main.52

Thomas Scialom, Paul-Alexis Dray, Patrick Gallinari, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano, and Alex Wang. 2021. QuestEval: Summarization asks for fact-based evaluation. arXiv preprint arXiv:2103.12693. https://doi.org/10.18653/v1/2021.emnlp-main.529
Thomas Scialom, Sylvain Lamprier, Benjamin Piwowarski, and Jacopo Staiano. 2019. Answers unite! Unsupervised metrics for reinforced summarization models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3246–3256. https://doi.org/10.18653/v1/D19-1320

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: A large-scale dataset for fact extraction and verification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819. https://doi.org/10.18653/v1/N18-1074

Jesse Vig, Wojciech Kryscinski, Karan Goel, and Nazneen Fatema Rajani. 2021. SummVis: Interactive visual analysis of models, data, and evaluation for text summarization. arXiv preprint arXiv:2104.07605. https://doi.org/10.18653/v1/2021.acl-demo.18

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. https://doi.org/10.18653/v1/N18-1101

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-demos.6

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. 2020a. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In International Conference on Machine Learning, pages 11328–11339. PMLR.

Yuhao Zhang, Derek Merck, Emily Tsai, Christopher D. Manning, and Curtis Langlotz. 2020b. Optimizing the factual correctness of a summary: A study of summarizing radiology reports. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5108–5120. https://doi.org/10.18653/v1/2020.acl-main.458

Chen Zhao, Chenyan Xiong, Corby Rosset, Xia Song, Paul Bennett, and Saurabh Tiwary. 2019. Transformer-XH: Multi-evidence reasoning with extra hop attention. In International Conference on Learning Representations.
Appendix
A NLI Model Origin
We list the NLI models we used throughout the
paper, which can be retrieved on HuggingFace’s
model hub.6 BERT stands for any Pre-trained
bi-directional Transformer of an equivalent size:
• boychaboy/SNLI roberta-base
BERT Base+SNLI
• microsoft/deberta-base-mnli
BERT Base+MNLI
• tals/albert-base-vitaminc-mnli
BERT Base + MNLI + VitaminC
• boychaboy/SNLI roberta-large
BERT Large+SNLI
• tals/albert-xlarge-vitaminc
Bert Large+VitaminC
• roberta-large-mnli
Bert Large+MNLI
• tals/albert-xlarge-vitaminc-mnli
BERT Large+MNLI+VitaminC
6https://huggingface.co/models.
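As a sketch of how one of these checkpoints can be used to produce the (E, C, N) probabilities consumed by the SUMMAC models (the label order varies across checkpoints, so the id2label mapping is read from the model configuration rather than assumed):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "tals/albert-xlarge-vitaminc-mnli"  # one of the checkpoints listed above
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

def nli_probabilities(premise: str, hypothesis: str) -> dict:
    """Return the model's probability for each NLI label of a (premise, hypothesis) pair."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits[0], dim=-1)
    return {model.config.id2label[i]: float(p) for i, p in enumerate(probs)}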
B SUMMACZS Operator Choice
Table A1 measures the effect of the choice of
the two operators in the SUMMACZS model. We explore three options (min, mean, and max)
for each operator. We find that the choice of max
for Operator 1 and mean for Operator 2 achieves
the highest performance and use these choices in
our model.
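A minimal sketch of this operator sweep over an NLI Pair Matrix (each resulting score would then be thresholded and evaluated in balanced accuracy, as in Section 4.3):

import numpy as np

OPERATORS = {"min": np.min, "mean": np.mean, "max": np.max}

def summac_zs_variant(pair_matrix: np.ndarray, op1: str, op2: str) -> float:
    """Operator 1 reduces over document sentences (rows); Operator 2 reduces the resulting vector."""
    per_summary_sentence = OPERATORS[op1](pair_matrix, axis=0)
    return float(OPERATORS[op2](per_summary_sentence))

x_pair = np.array([[0.02, 0.02, 0.04],
                   [0.98, 0.00, 0.00],
                   [0.43, 0.99, 0.00],
                   [0.00, 0.00, 0.01]])  # Figure 1 example
scores = {(o1, o2): summac_zs_variant(x_pair, o1, o2) for o1 in OPERATORS for o2 in OPERATORS}
# The default SUMMACZS corresponds to (op1, op2) = ("max", "mean"), giving 0.67 here.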
C SUMMAC Benchmark ROC-AUC Results
Table A2 details results of models on the
benchmark according to the ROC-AUC metric,
confirming that the SUMMAC models achieve the
two best accuracy results on the benchmark.
                 Operator 2
Op. 1      Min    Mean   Max
Min        53.1   55.7   57.4
Mean       60.5   62.8   62.0
Max        68.8   72.1   69.1

Table A1: Effect of operator choice on the performance of the SUMMACZS model, measured in terms of balanced accuracy. Operator 1 reduces the row dimension of the NLI Pair Matrix, and Operator 2 reduces the column dimension.
Model Type   Model Name    CGS    XSF    Polytope  FactCC   SummEval  FRANK  Overall
Baseline     NER-Overlap   53.0   61.7   51.6      53.1     56.8      60.9   56.2
Baseline     MNLI-doc      59.4   59.4   62.6      62.1     70.0      67.2   63.4
Classifier   FactCC-CLS    65.0   59.2   63.5      79.6     61.4      62.7   65.2
Parsing      DAE           67.8   41.3   64.1      82.7     77.4      64.3   66.3
QAG          FEQA          60.8   53.4   54.6      50.7     52.2      74.8   57.7
QAG          QuestEval     64.4   66.4   72.2      71.5     79.0      87.9   73.6
NLI          SUMMACZS      73.1   58.0   60.3      83.7     85.5      85.3   74.3
NLI          SUMMACCONV    67.6   70.2   62.4      92.2**   86.0*     88.4   77.8**

Table A2: Performance of summary inconsistency detection models on the test portion of the SUMMAC Benchmark in terms of the ROC-AUC metric. The metric is computed for each model on the six datasets in the benchmark, and the average is computed as the overall performance on the benchmark. Confidence intervals comparing the SUMMAC models to prior work: * indicates an improvement with 95% confidence, and ** 99% confidence (details in Section 5.2.1).