Evaluating Document Coherence Modeling

Aili Shen♣, Meladel Mistica♣, Bahar Salehi♣, Hang Li♦, Timothy Baldwin♣, Jianzhong Qi♣
♣ The University of Melbourne, Australia
♦ AI Lab at ByteDance, China
{aili.shen, misticam, tbaldwin, jianzhong.qi}@unimelb.edu.au
baharsalehi@gmail.com, lihang.lh@bytedance.com

Abstract

While pretrained language models (LMs) have driven impressive gains over morpho-syntactic and semantic tasks, their ability to model discourse and pragmatic phenomena is less clear. As a step towards a better understanding of their discourse modeling capabilities, we propose a sentence intrusion detection task. We examine the performance of a broad range of pretrained LMs on this detection task for English. Lacking a dataset for the task, we introduce INSteD, a novel intruder sentence detection dataset, containing 170,000+ documents constructed from English Wikipedia and CNN news articles. Our experiments show that pretrained LMs perform impressively in in-domain evaluation, but experience a substantial drop in the cross-domain setting, indicating limited generalization capacity. Further results over a novel linguistic probe dataset show that there is substantial room for improvement, especially in the cross-domain setting.

1 Introduction

Rhetorical relations refer to the transition of one sentence to the next in a span of text (Mann and Thompson, 1988; Asher and Lascarides, 2003). They are important as a discourse device that contributes to the overall coherence, understanding, and flow of the text. These relations span a tremendous breadth of types, including contrast, elaboration, narration, and justification. These connections allow us to communicate cooperatively in understanding one another (Grice, 2002; Wilson and Sperber, 2004). The ability to understand such coherence (and conversely detect incoherence) is potentially beneficial for downstream tasks, such as storytelling (Fan et al., 2019; Hu et al., 2020b), recipe generation (Chandu et al., 2019), document-level text generation (Park and Kim, 2015; Holtzman et al., 2018), and essay scoring (Tay et al., 2018; Le et al., 2018).

However, there is little work on document coherence understanding, especially examining the capacity of pretrained language models (LMs) to model the coherence of longer documents. To address this gap, we examine the capacity of pretrained language models to capture document coherence, focused around two research questions: (1) do models truly capture the intrinsic properties of document coherence? and (2) what types of document incoherence can/can't these models detect?

We propose the sentence intrusion detection task: (1) to determine whether a document contains an intruder sentence (coarse-grained level); and (2) to identify the span of any intruder sentence (fine-grained level). We restrict the scope of the intruder text to a single sentence, noting that in practice, the incoherent text could span multiple sentences, or alternatively be sub-sentential.

Existing datasets in document coherence measurement (Chen et al., 2019; Clercq et al., 2014; Lai and Tetreault, 2018; Mim et al., 2019; Pitler and Nenkova, 2008; Tien Nguyen and Joty, 2017) are unsuitable for our task: They are either prohibitively small, or do not specify the span of incoherent text. For example, in the dataset of Lai and Tetreault (2018), each document is assigned a coherence score, but the span of incoherent text is not specified. There is thus a need for a large-scale dataset which includes annotation of the position of intruder text. Identifying the span of incoherent text can benefit tasks where explainability and immediate feedback are important, such as essay scoring (Tay et al., 2018; Le et al., 2018).

In this work, we introduce a dataset consisting of English documents from two domains: Wikipedia articles (106k) and CNN news articles



(72k). This dataset fills a gap in research pertaining to document coherence: Our dataset is large in scale, includes both coherent and incoherent documents, and has mark-up of the position of any intruder sentence. Figure 1 is an example document with an intruder sentence. Here, the highlighted sentence reads as though it should be an elaboration of the previous sentence, but clearly exhibits an abrupt change of topic, and the pronoun it cannot be readily resolved.

Figure 1: An excerpt of an incoherent document, with the "intruder" sentence indicated in bold.

This paper makes the following contributions: (1) we propose the sentence intrusion detection task, and examine how pretrained LMs perform over the task and hence at document coherence understanding; (2) we construct a large-scale dataset from two domains—Wikipedia and CNN news articles—that consists of coherent and incoherent documents, and is accompanied with the positions of intruder sentences, to evaluate in both in-domain and cross-domain settings; (3) we examine the behavior of models and humans, to better understand the ability of models to model the intrinsic properties of document coherence; and (4) we further hand-craft adversarial test instances across a variety of linguistic phenomena to better understand the types of incoherence that a given model can detect.

2 Related Work

We first review tasks relevant to our proposed task, then describe existing datasets used in coherence measurement, and finally discuss work on dataset artefacts and linguistic probes.

2.1 Document Coherence Measurement

Coherence measurement has been studied across various tasks, such as the document discrimination task (Barzilay and Lapata, 2005; Elsner et al., 2007; Barzilay and Lapata, 2008; Elsner and Charniak, 2011; Li and Jurafsky, 2017; Putra and Tokunaga, 2017), sentence insertion (Elsner and Charniak, 2011; Putra and Tokunaga, 2017; Xu et al., 2019), paragraph reconstruction (Lapata, 2003; Elsner et al., 2007; Li and Jurafsky, 2017; Xu et al., 2019; Prabhumoye et al., 2020), summary coherence rating (Barzilay and Lapata, 2005; Pitler et al., 2010; Guinaudeau and Strube, 2013; Tien Nguyen and Joty, 2017), readability assessment (Guinaudeau and Strube, 2013; Mesgar and Strube, 2016, 2018), and essay scoring (Mesgar and Strube, 2018; Somasundaran et al., 2014; Tay et al., 2018). These tasks differ from our task of intruder sentence detection as follows. First, the document discrimination task assigns coherence scores to a document and its sentence-permuted versions, where the original document is considered to be well-written and coherent and the permuted versions incoherent. Incoherence is introduced by shuffling sentences, while our intruder sentences are selected from a second document, and there is only ever a single intruder sentence per document. Second, sentence insertion aims to find the correct position to insert a removed sentence back into a document. Paragraph reconstruction aims to recover the original sentence order of a shuffled paragraph given its first sentence. These two tasks do not consider sentences from outside of the document of interest. Third, the aforementioned three tasks are artificial, and have very limited utility in terms of real-world tasks, while our task can provide direct benefit in applications such as essay scoring, in identifying incoherent (intruder) sentences as a means of providing user feedback and explainability of essay scores. Lastly, in summary coherence rating, readability assessment, and essay scoring, coherence is just one dimension of the overall document quality measurement.

Various methods have been proposed to capture local and global coherence, while our work aims to examine the performance of existing pretrained LMs in document coherence understanding. To assess local coherence, traditional studies have used entity matrices, for example, to represent entity transitions across sentences (Barzilay and



Lapata, 2005, 2008). Guinaudeau and Strube (2013) and Mesgar and Strube (2016) use a graph to model entity transition sequences. Sentences in a document are represented by nodes in the graph, and two nodes are connected if they share the same or similar entities. Neural models have also been proposed (Ji and Smith, 2017; Li and Jurafsky, 2017; Le et al., 2018; Mesgar and Strube, 2018; Mim et al., 2019; Tien Nguyen and Joty, 2017). For example, Tay et al. (2018) capture local coherence by computing the similarity of the outputs of two LSTMs (Hochreiter and Schmidhuber, 1997), which they concatenate with essay representations to score essays. Li et al. (2018) use multi-headed self-attention to capture long-distance relationships between words, which are passed to an LSTM layer to estimate essay coherence scores. Xu et al. (2019) use the average of local coherence scores between consecutive pairs of sentences as the document coherence score.

Another relevant task is disfluency detection in spontaneous speech transcription (Johnson and Charniak, 2004; Jamshid Lou et al., 2018). This task detects the reparandum and repair in spontaneous speech transcriptions to make the text fluent by replacing the reparandum with the repair. Also relevant is language identification in code-switched text (Adouane et al., 2018a,b; Mave et al., 2018; Yirmibeşoğlu and Eryiğit, 2018), where disfluency is defined at the language level (e.g., for a monolingual speaker). Lau et al. (2015) and Warstadt et al. (2019) predict sentence-level acceptability (how natural a sentence is). However, none of these tasks are designed to measure document coherence, although sentence-level phenomena can certainly impact on document coherence.

2.2 Document Coherence Datasets

There exist a number of datasets targeted at discourse understanding. For example, Alikhani et al. (2019) construct a multi-modal dataset for understanding discourse relations between text and imagery, such as elaboration and exemplification. In contrast, we focus on discourse relations in a document at the inter-sentential level. The Penn Discourse Treebank (Miltsakaki et al., 2004; Prasad et al., 2008) is a corpus of coherent documents with annotations of discourse connectives and their arguments, noting that inter-sentential discourse relations are not always lexically marked (Webber, 2009).

The most relevant work to ours is the discourse coherence dataset of Chen et al. (2019), which was proposed to evaluate the capabilities of pretrained LMs in capturing discourse context. This dataset contains documents (18K Wikipedia articles and 10K documents from the Ubuntu IRC channel) with fixed sentence length, and labels documents only in terms of whether they are incoherent, without considering the position of the incoherent sentence. In contrast, our dataset: (1) provides more fine-grained information (i.e., the sentence position); (2) is larger in scale (over 170K documents); (3) contains documents of varying length; (4) incorporates adversarial filtering to reduce dataset artefacts (see Section 3); and (5) is accompanied with human annotation over the Wikipedia subset, allowing us to understand behavior patterns of machines and humans.

2.3 Dataset Artefacts

Also relevant to this research is work on removing artefacts in datasets (Zellers et al., 2019; McCoy et al., 2019; Zellers et al., 2018). For example, based on analysis of the SWAG dataset (Zellers et al., 2018), Zellers et al. (2019) find artefacts such as stylistic biases, which correlate with the document labeling and mean that naive models are able to achieve abnormally high results. Similarly, McCoy et al. (2019) examine artefacts in an NLI dataset, and find that naive heuristics that are not directly related to the task can perform remarkably well. We incorporate the findings of such work in the construction of our dataset.

2.4 Linguistic Probes

Adversarial training has been used to craft adversarial examples to obtain more robust models, either by manipulating model parameters (white-box attacks) or minimally editing text at the character/word/phrase level (black-box attacks). For example, Papernot et al. (2018) provide a reference library of adversarial example construction techniques and adversarial training methods.

As we aim to understand the linguistic properties that each model has captured, we focus on black-box attacks (Sato et al., 2018; Cheng et al., 2020; Liang et al., 2018; Yang et al., 2020; Samanta and Mehta, 2017). For example, Samanta and Mehta (2017) construct adversarial examples for sentiment classification and gender detection



by deleting, replacing, or inserting words in the text. For a comprehensive review of such studies, see Belinkov and Glass (2019).

There is also a rich literature on exploring what
kinds of linguistic phenomena a model has learned
(Hu et al., 2020a; Hewitt and Liang, 2019; Hewitt
and Manning, 2019; Chen et al., 2019; McCoy
et al., 2019; Conneau et al., 2018; Gulordava
et al., 2018; Peters et al., 2018; Tang et al., 2018;
Blevins et al., 2018; Wilcox et al., 2018; Kuncoro
et al., 2018; Tran et al., 2018; Belinkov et al.,
2017). The basic idea is to use learned represen-
tations to predict linguistic properties of interest.
Example linguistic properties are subject–verb
agreement or syntactic structure, while represen-
tations can be word or sentence embeddings. Para
ejemplo, Marvin and Linzen (2018) construir
minimal sentence pairs, consisting of a grammat-
ical and ungrammtical sentence, to explore the
capacity of LMs in capturing phenomena such
as subject–verb agreement, reflexive anaphora,
and negative polarity items. In our work, nosotros
hand-construct intruder sentences which result in
incoherent documents, based on a broad range of
linguistic phenomena.

3 Dataset Construction

3.1 Dataset Desiderata

To construct a large-scale, low-noise dataset that truly tests the ability of systems to detect intruder sentences, we posit five desiderata:

1. Multiple sources: The dataset should not
be too homogeneous in terms of genre or
domain, and should ideally test the ability of
models to generalize across domain.

2. Defences against hacking: Human annota-
tors and machines should not be able to hack
the task and reverse-engineer the labels by
sourcing the original documents.

3. Free of artefacts: The dataset should be free of artefacts that allow naive heuristics to perform well.

4. Topic consistency: The intruder sentence,
which is used to replace a sentence from a
coherent document to obtain an incoherent
documento, should be relevant to the topic of
the document, to focus the task on coherence
and not simple topic detection.

5. KB-free: Our goal is NOT to construct a fact-checking dataset; the intruder sentence should be determinable based on the content of the document, without reliance on external knowledge bases or fact-checking.

3.2 Data Sources

We construct a dataset from two sources—Wikipedia and CNN—which differ in style and genre, satisfying the first desideratum. Similar to WikiQA (Yang et al., 2015) and HotpotQA (Yang et al., 2018), we represent a Wikipedia document by its summary section (i.e., the opening paragraph), constraining the length to be between 3 and 8 sentences. For CNN, we adopt the dataset of Hermann et al. (2015) and Nallapati et al. (2016), which consists of over 100,000 news articles. To obtain documents with sentence length similar to those from Wikipedia, we randomly select the first 3–8 sentences from each article.

To defend against dataset hacks1 that could expose the labels of the test data (desideratum 2), the Wikipedia test set is randomly sampled from 37 historical dumps of Wikipedia, where the selected article has a cosine similarity less than the historical average of 0.72 with its online version.2 For the training set, we remove this requirement and randomly select articles from different Wikipedia dumps; that is, the articles in the training set might be the same as their current online version. For CNN, we impose no such limitations.

1 Deliberate or otherwise, e.g., via pre-training on the same version of Wikipedia our dataset was constructed over.

2 This threshold was determined by calculating the average TF-IDF-weighted similarity of the summary section for documents in all 37 dumps with their current online versions.
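The version-similarity check above can be sketched with standard TF-IDF vectorization. The following is a minimal illustration rather than the authors' exact implementation; the function name and inputs are assumptions, with the 0.72 threshold taken from the historical average reported above:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def safe_for_test_set(historical_summary: str, online_summary: str,
                          threshold: float = 0.72) -> bool:
        # Vectorize both versions of the article summary with TF-IDF.
        vectorizer = TfidfVectorizer()
        tfidf = vectorizer.fit_transform([historical_summary, online_summary])
        # Keep an article for the test set only if it has drifted far
        # enough from its current online version to resist label hacking.
        similarity = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
        return similarity < threshold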

3.3 Generating Candidate Positive Samples

We consider the original documents to be coherent. We construct incoherent documents from half of our sampled documents as follows (satisfying desiderata 3–5):

1. Given a document D, use bigram hashing and TF-IDF matching (Chen et al., 2017) to retrieve the top-10 most similar documents from a collection of documents from the same domain, where D is the query text. Let the set of retrieved documents be RD.

2. Randomly choose a non-opening sentence S from document D, to be replaced by a sentence candidate generated later.




We do not replace the opening sentence, as it is needed to establish document context.


3. For each document D′ ∈ RD, randomly select one non-opening sentence S′ ∈ D′ as an intruder sentence candidate.

4. Calculate the TF-IDF-weighted cosine similarity between sentence S and each candidate S′. Remove any candidates with similarity scores ≥ 0.6, to attempt to generate a KB-free incoherence.

5. Replace sentence S with each low-similarity candidate S′, and use a fine-tuned XLNet-Large model (Yang et al., 2019) to check whether it is easy for XLNet-Large to detect (see Section 5). For documents with both easy and difficult sentence candidates, we randomly sample from the difficult sentence candidates; otherwise, we randomly choose from all the sentence candidates.
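Steps 2–5 can be summarized with the following minimal sketch. The retrieval output (step 1), the sentence similarity function, and the is_easy_for_xlnet wrapper around the fine-tuned detector are assumed as inputs; all names here are illustrative rather than the authors' released code:

    import random

    def make_incoherent(doc_sents, retrieved_docs, sent_sim,
                        is_easy_for_xlnet, max_sim=0.6):
        # Step 2: pick a non-opening sentence to replace.
        idx = random.randrange(1, len(doc_sents))
        original = doc_sents[idx]
        # Step 3: one random non-opening sentence per retrieved document.
        candidates = [random.choice(d[1:]) for d in retrieved_docs if len(d) > 1]
        # Step 4: drop candidates too similar to the replaced sentence,
        # to avoid incoherence detectable only via fact-checking.
        candidates = [c for c in candidates if sent_sim(original, c) < max_sim]
        if not candidates:
            return None  # no valid intruder; the document stays coherent
        # Step 5: prefer candidates that the fine-tuned XLNet-Large
        # detector finds difficult (adversarial filtering).
        hard = [c for c in candidates if not is_easy_for_xlnet(doc_sents, idx, c)]
        intruder = random.choice(hard if hard else candidates)
        return doc_sents[:idx] + [intruder] + doc_sents[idx + 1:], idx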

The decision to filter out sentence candidates with similarity ≥ 0.6 was based on the observation that more similar sentences often led to the need for world knowledge to identify the intruder sentence (violating the fifth desideratum). For example, given It is the second novel in the first of three trilogies about Bernard Samson, …, an intruder sentence candidate with high similarity is It is the first novel in the first of three trilogies about Bernard Samson ….

We also trialed other ways of generating incoherent samples, such as using sentence S from document D as the query text to retrieve documents, and adopting a 2-hop process to retrieve relevant documents. We found that these methods resulted in documents that can be identified by the pretrained models easily.

4 Dataset Analysis

4.1 Statistics of the Dataset

The process described in Section 3 resulted in 106,352 Wikipedia documents and 72,670 CNN documents, at an average sentence length of 5 in both cases (see Table 1). The percentages of positive samples (46% and 49%, respectively) are slightly less than 50% due to our data generation constraints (detailed in Section 3.3), which can lead to no candidate intruder sentence S′ being generated for original sentence S. We set aside 8% of Wikipedia (which we manually tag, as detailed in Section 4.5) and 20% of CNN for testing.

Source      #docs            avg. #sents   avg. #tokens
Wikipedia   106,352 (46%)    5±1           126±24
CNN         72,670 (49%)     5±1           134±32

Table 1: Dataset statistics for INSteD. Numbers in parentheses are percentages of incoherent documents.

4.2 Types of Incoherence

To better understand the different types of issues resulting from our automatic method, we sampled 100 (synthesized) incoherent documents from Wikipedia and manually classified the causes of incoherence according to three overlapping categories (ranked in terms of expected ease of detection): (1) information structure inconsistency (a break in information flow); (2) logical inconsistency (a logically inconsistent world state is generated, such as someone attending school before they were born); and (3) factual inconsistency (where the intruder sentence is factually incorrect). See Table 2 for a breakdown across the categories, noting that a single document can be incoherent across multiple categories. Information structure inconsistency is the most common form of incoherence, followed by factual inconsistency. The 35% of documents with factual inconsistencies break down into 8% (overall) that have other types of incoherence, and 27% that only have a factual inconsistency. This is an issue for the fifth desideratum for our dataset (see Section 3.1), motivating the need for manual checking of the dataset to determine how readily the intruder sentence can be detected.3

3 We keep these documents in the dataset, as it is beyond the scope of this work to filter these documents out.

Information structure inconsistency (58%): He is currently the senior pastor at Sovereign Grace Church of Louisville. The Church is led by Senior Pastor Ray Johnston, Senior Pastor Curt Harlow and Senior Pastor Andrew McCourt, and Senior Pastor Lincoln Brewster. Under Mahaney's leadership, Sovereign Grace Church of Louisville is a member of Sovereign Grace Churches.

Logical inconsistency (26%): Michael David, born September 22, 1954, is an American-born American painter. From 1947–1949 he attended the Otis Art Institute, from 1947 to 1950 he also attended the Art Center College of Design in Los Angeles, and in 1950 the Chouinard Art Institute.

Factual inconsistency (35%): The Newport Tower has 37 floors. It is located on the beachfront on the east side of Collins Avenue between 68th and 69th Streets. The building was developed by Melvin Simon & Associates in 1990.

Table 2: Types of document incoherence in Wikipedia. Text in bold indicates the intruder sentence.

4.3 Evaluation Metrics

We base evaluation of intruder sentence detection at both the document and sentence levels:

• document level: Does the document contain an intruder sentence? This is measured based on classification accuracy (Acc), noting that the dataset is relatively balanced at the document level (see Table 1).





A prediction is "correct" if at least one sentence/none of the sentences is predicted to be an intruder.

• sentence level: Is a given (non-opening) sentence an intruder sentence? This is measured based on F1, noting that most (roughly 88%) sentences are non-intruder sentences.
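Both metrics reduce to simple aggregations over per-sentence binary predictions. The following is a minimal sketch under that assumption, taking gold and predicted intruder labels for the non-opening sentences of each document:

    def document_accuracy(gold_docs, pred_docs):
        # A document is predicted to contain an intruder if any of its
        # sentences is flagged; the prediction is correct when the
        # document-level labels (any intruder vs. none) match.
        correct = sum(any(g) == any(p) for g, p in zip(gold_docs, pred_docs))
        return correct / len(gold_docs)

    def sentence_f1(gold_docs, pred_docs):
        # Micro-averaged F1 over all non-opening sentences, with the
        # (rare) intruder sentences as the positive class.
        tp = fp = fn = 0
        for gold, pred in zip(gold_docs, pred_docs):
            for g, p in zip(gold, pred):
                tp += g and p
                fp += (not g) and p
                fn += g and (not p)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall == 0.0:
            return 0.0
        return 2 * precision * recall / (precision + recall)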

4.4 Testing for Dataset Artefacts

To test for artefacts, we use XLNet-Large (Yang et al., 2019) to predict whether each non-opening sentence is an intruder sentence, in complete isolation of its containing document (i.e., as a standalone sentence classification task). We compare the performance of XLNet-Large with a majority-class baseline ("Majority-class") that predicts all sentences to be non-intruder sentences (i.e., from the original document), where XLNet-Large is fine-tuned over the Wikipedia/CNN training set, and tested over the corresponding test set.

For Wikipedia, XLNet-Large obtains an Acc of 55.4% (vs. 55.1% for Majority-class) and F1 of 3.4% (vs. 0.0% for Majority-class). For CNN, the results are 50.8% and 1.2%, respectively (vs. 51.0% and 0.0% resp. for Majority-class). These results suggest that the dataset does not contain obvious artefacts, at least for XLNet-Large. We also experiment with a TF-IDF weighted bag-of-words logistic regression model, achieving slightly worse results than XLNet-Large (Acc = 55.1%, F1 = 0.05% for Wikipedia, and Acc = 50.6%, F1 = 0.3% for CNN).4

4 For RoBERTa-Large (Section 6.1), there were also no obvious artefacts observed in the standalone sentence setting: Acc = 55.7% and F1 = 5.3% over Wikipedia, and Acc = 51.3% and F1 = 4.3% over CNN.

4.5 Human Verification

We performed crowdsourcing via Amazon Mechanical Turk over the Wikipedia test data to examine how humans perform over this task. Each Human Intelligence Task (HIT) contained 5 documents and was assigned to 5 workers. For each document, the task was to identify a single sentence that "creates an incoherence or break in the content flow", or in the case of no such sentence, "None of the above", indicating a coherent document. In the task instructions, workers were informed that there is at most one intruder sentence per document, and were not able to select the opening sentence. Among the 5 documents for each HIT, there was one incoherent document from the training set, which was pre-identified as being easily detectable by an author of the paper, and acts as a quality control item. We include documents where at least 3 humans assign the same label in our test dataset (90.3% of the Wikipedia test dataset), and all results are reported over these documents unless otherwise specified.5 Payment was calibrated to be above the Australian minimum wage.
Figure 2 shows the distribution of instances where different numbers of workers produced the correct answer (the red bar). For example, for 6.2% of instances, 2 out of 5 workers annotated correctly. The blue bars indicate the proportion of incoherent documents where the intruder sentence was correctly detected by the given number of annotators (e.g., for 9.3% of incoherent documents, only 2 out of 5 workers were able to identify the intruder sentence correctly). Humans tend to agree with each other over coherent documents, as indicated by the increasing percentages for red bars but decreasing percentages for blue bars across the x-axis.
5 Different people may have different thresholds in considering a document to be incoherent, but this is beyond the scope of our work.



Intruder sentences in incoherent documents, however, are harder to detect. One possible explanation is that the identification of intruder sentences requires fact-checking, which workers were instructed not to do (and to base their judgment only on the information in the provided document); another reason is that intruder sentences disrupt local coherence with neighboring sentences, creating confusion as to which is the intruder sentence (with many of the sentence-level mis-annotations being off-by-one errors).

Figure 2: Distribution of instances where different numbers of humans produce correct answers. Note that the red bars indicate distributions over all documents and the blue bars indicate distributions over incoherent documents.

5 Models

We model intruder sentence detection as a binary classification task: Each non-opening sentence in a document is concatenated with the document, and a model is asked to predict whether the sentence is an intruder sentence to the document. Our focus is on the task, dataset, and how existing models perform at document coherence prediction rather than modeling novelty, and we thus experiment with pre-existing pretrained models. The models are as follows, each of which is fed into an MLP layer with a softmax output.

BoW: Average the word embeddings for the combined document (sentence + sequence of sentences in the document), based on pretrained 300D GloVe embeddings trained on an 840B-token corpus (Pennington et al., 2014).

Bi-LSTM: Feed the sequence of words in the combined document into a single-layer 512D Bi-LSTM with average-pooling; word embeddings are initialized as with BoW.

InferSent: Generate representations for the sentence and document with InferSent (Conneau et al., 2017), and concatenate the two; InferSent is based on a Bi-LSTM with a max-pooling layer, trained on SNLI (Bowman et al., 2015).

Skip-Thought: Generate representations for the sentence and document with Skip-Thought (Kiros et al., 2015), and concatenate the two; Skip-Thought is an encoder–decoder model where the encoder extracts generic sentence embeddings and the decoder reconstructs surrounding sentences of the encoded sentence.

BERT: Generate representations for the concatenated sentence and document with BERT (Devlin et al., 2019), which was pretrained on the tasks of masked language modeling and next sentence prediction over Wikipedia and BooksCorpus (Zhu et al., 2015); we experiment with both BERT-Large and BERT-Base (the cased versions).

RoBERTa: Generate representations for the concatenated sentence and document with RoBERTa (Liu et al., 2019), which was pretrained on the task of masked language modeling (with dynamic masking), with each input consisting of continuous sentences from the same document or multiple documents (providing broader context), over CC-News, OpenWebTextCorpus, and STORIES (Trinh and Le, 2018), in addition to the same data BERT was pretrained on; we experiment with both RoBERTa-Large and RoBERTa-Base.

ALBERT: Generate representations for the concatenated sentence and document with ALBERT (Lan et al., 2020), which was pretrained over the same dataset as BERT but replaces the next sentence prediction objective with a sentence-order prediction objective, to model document coherence; we experiment with both ALBERT-Large and ALBERT-xxLarge.

XLNet: Generate representations for the concatenated sentence and document with XLNet (Yang et al., 2019), which was pretrained using a permutation language modeling objective over datasets including Wikipedia, BooksCorpus, Giga5 (Parker et al., 2011), ClueWeb 2012-B (Callan et al., 2009), and Common Crawl; we experiment with both XLNet-Large and XLNet-Base (the cased versions).
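With a HuggingFace-style interface, the sentence–document pairing described above corresponds to a standard two-segment sequence classification input. The following is an illustrative sketch only (model checkpoint, fine-tuning loop, and hyperparameters are assumptions, not the authors' released code):

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("albert-xxlarge-v2")
    model = AutoModelForSequenceClassification.from_pretrained(
        "albert-xxlarge-v2", num_labels=2)  # intruder vs. non-intruder

    def intruder_probability(sentence: str, document: str) -> float:
        # Encode the candidate sentence and the full document as a
        # two-segment input, mirroring the concatenation described above.
        inputs = tokenizer(sentence, document, truncation=True,
                           return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        # Probability that the sentence is an intruder (class 1).
        return torch.softmax(logits, dim=-1)[0, 1].item()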


                        Wiki→Wiki        CNN→Wiki         CNN→CNN          Wiki→CNN
                        Acc (%)  F1 (%)  Acc (%)  F1 (%)  Acc (%)  F1 (%)  Acc (%)  F1 (%)
Majority-class          57.3     0.0     57.3     0.0     50.6     0.0     50.6     0.0
BoW                     57.3     0.0     57.3     0.0     50.6     0.0     50.6     0.0
Bi-LSTM                 56.2     12.7    57.3     0.0     51.7     25.1    50.2     3.0
InferSent               57.3     0.0     57.3     0.0     50.6     0.0     50.6     0.0
Skip-Thought            57.3     0.0     57.3     0.0     50.6     0.0     50.6     0.0
BERT-Base               65.3     35.7    61.2     21.1    80.8     71.6    57.0     23.5
BERT-Large              67.0     39.6    64.0     29.1    82.4     74.8    61.5     35.9
XLNet-Base              67.8     45.0    62.2     22.4    91.2     86.6    64.0     43.3
XLNet-Large             72.9     55.4    62.8     22.2    96.9     95.0    80.7     73.8
RoBERTa-Base            69.5     47.0    63.2     26.1    92.5     88.8    77.6     68.1
RoBERTa-Large           76.1     59.8    63.7     24.6    96.0     94.5    88.3     83.5
ALBERT-Large            70.7     49.6    63.8     24.9    93.4     90.8    72.6     61.5
ALBERT-xxLarge          81.7     71.5    66.6     33.2    96.9     95.9    89.1     86.7
ALBERT-xxLarge-freeze   57.3     0.0     N/A      N/A     50.6     0.3     N/A      N/A
Human                   66.6     35.9    66.6     35.9    74.0     57.8    74.0     57.8

Table 3: Experimental results over Wikipedia and CNN, in both in-domain and cross-domain settings. Acc is at the document level and F1 is at the sentence level.

Although XLNet-Large is used in removing data artefacts when selecting the intruder sentences, our experiments suggest that the comparative results across models (with or without artefact filtering) are robust.

6 Experiments

6.1 Preliminary Results

In our first experiments, we train the various models across both Wikipedia and CNN, and evaluate them in-domain and cross-domain. We are particularly interested in the cross-domain setting, to test the true ability of the model to detect document incoherence, as distinct from overfitting to domain-specific idiosyncrasies. It is also worth mentioning that BERT, RoBERTa, ALBERT, and XLNet are pretrained on multi-sentence Wikipedia data, and have potentially memorised sentence pairs, making in-domain experiments problematic for Wikipedia in particular. Also of concern in applying models to the automatically generated data is that it is entirely possible that an intruder sentence is undetectable to a human, because no incoherence results from the sentence substitution (bearing in mind that only 58% of documents in Table 2 contained information structure inconsistencies).

From Table 3, we can see that the simpler models (BoW, Bi-LSTM, InferSent, and Skip-Thought) perform only at the level of Majority-class at the document level, for both Wikipedia and CNN. At the sentence level (F1), once again the models perform largely at the level of Majority-class (F1 = 0.0), other than Bi-LSTM in-domain for Wikipedia and CNN. In the final row of the table, we also see that humans are much better at detecting whether documents are incoherent (at the document level) than identifying the position of intruder sentences (at the sentence level), and that in general, human performance is low. This is likely the result of the fact that only 58% of documents in Table 2 contain information structure inconsistencies. We only conducted crowdsourcing over Wikipedia due to budget limitations and the fact that the CNN documents are available online, making dataset hacks possible.6
Among the pretrained LMs, ALBERT-xxLarge achieves the best performance over Wikipedia and CNN, at both the document and sentence levels. Looking closer at the Wikipedia results, we find that BERT-Large achieves a higher precision than XLNet-Large (71.0% vs. 60.3%), while XLNet-Large achieves a higher recall (51.3% vs. 27.4%). ALBERT-xxLarge achieves a precision higher than BERT-Large (79.7%) and a recall higher than XLNet-Large (64.9%), leading to the overall best performance.
6 To get a general idea of the difficulty of the CNN dataset, one of the authors annotated 100 documents (50 coherent and 50 incoherent documents), randomly sampled from the test set.



Over CNN, ALBERT-xxLarge, RoBERTa-Large, and XLNet-Large achieve high precision and recall (roughly 93.0% to 97%).7 The competitive results for ALBERT-xxLarge over Wikipedia and CNN result from its pretraining strategies, especially the sentence-order prediction loss capturing document coherence in isolation, different from the next sentence prediction loss, which conflates topic prediction and coherence prediction in a lower-difficulty single task. The performance gaps for ALBERT, RoBERTa, and XLNet between the base and large models are bigger than that of BERT, suggesting that they benefit from greater model capacity.8

We also examine how pretrained LMs perform with only the classifier parameters being updated during training. Here, we focus exclusively on ALBERT-xxLarge, given its superiority. As shown in Table 3 (ALBERT-xxLarge-freeze), the pretrained ALBERT-xxLarge is unable to differentiate coherent documents from incoherent ones, resulting in random-guess performance, even though it considers document coherence during pretraining. This indicates the necessity of fine-tuning LMs for document coherence understanding.
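The frozen setting can be reproduced by disabling gradients on the encoder so that only the classification head is updated; a minimal PyTorch sketch, assuming a HuggingFace-style model as in the earlier example (the learning rate is an illustrative value):

    # Freeze the pretrained encoder; only the classifier head receives
    # gradient updates during fine-tuning.
    for param in model.base_model.parameters():
        param.requires_grad = False

    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=1e-4)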
Looking to the cross-domain results, again, ALBERT-xxLarge achieves the best performance over both Wikipedia and CNN. The lower results for RoBERTa-Large and XLNet-Large over Wikipedia may be due to both RoBERTa and XLNet being pretrained over newswire documents, with fine-tuning over CNN reducing the capacity of the model to generalize. ALBERT and BERT do not suffer from this as they are not pretrained over newswire documents. The substantial drop between the in- and cross-domain settings for ALBERT, RoBERTa, XLNet, and BERT indicates that the models have limited capacity to learn a generalized representation of document coherence, in addition to the style differences between Wikipedia and CNN.

7 The higher performance for all models/humans over the CNN dataset indicates that it is easier for models/humans to identify the presence of intruder sentences. This can be explained by the fact that a large proportion of documents include named entities, making it easier to detect the intruder sentences. Moreover, the database used to retrieve candidate intruder sentences is smaller compared to that of Wikipedia.

8 We also performed experiments where the models were allowed to predict the first sentence as the intruder sentence. As expected, model performance drops, e.g., the F1 of XLNet-Large drops from 55.4% to 47.9%, reflecting both the increased complexity of the task and the lack of (at least) one previous sentence to provide document context.

                  Wiki→Wiki       Ubuntu→Wiki
Majority-class    50.0            50.0
ALBERT-xxLarge    96.8            53.1
Human             98.0            98.0

                  Ubuntu→Ubuntu   Wiki→Ubuntu
Majority-class    50.0            50.0
ALBERT-xxLarge    58.1            58.7
Human             74.0            74.0

Table 4: Acc for the dataset of Chen et al. (2019).

6.2 Results over the Existing Dataset

We also examine how ALBERT-xxLarge performs over the coarse-grained dataset of Chen et al. (2019), where 50 documents from each domain were annotated by a native English speaker. Performance is measured at the document level only, as the dataset does not include indication of which sentence is the intruder sentence. As shown in Table 4, ALBERT-xxLarge achieves an Acc of 96.8% over the Wikipedia subset, demonstrating that our Wikipedia dataset is more challenging (Acc of 81.7%) and also underlining the utility of adversarial filtering in dataset construction. Given the considerably lower results, one could conclude that Ubuntu is a good source for a dataset. However, when one of the authors attempted to perform the task manually, they found the document-level task to be extremely difficult, as it relied heavily on expert knowledge of Ubuntu packages, much more so than document coherence understanding.

In the cross-domain setting, there is a substantial drop over the Wikipedia dataset, which can be explained by ALBERT-xxLarge failing to generate a representation of document coherence from the Ubuntu dataset, due to the high dependence on domain knowledge as described above, resulting in near-random results. The cross-domain results for ALBERT-xxLarge over Ubuntu are actually marginally higher than the in-domain results but still close to random, suggesting that the in-domain model isn't able to capture either document coherence or domain knowledge, and underlining the relatively minor role of coherence for the Ubuntu dataset.

6.3 Performance on Documents of Different Difficulty Levels

One concern with our preliminary experiments
was whether the intruder sentences generate gen-
uine incoherence in the information structure
of the documents. We investigate this question







by breaking down the results over the best-performing model (ALBERT-xxLarge) based on the level of agreement between the human annotations and the generated gold-standard, for Wikipedia. The results are in Figure 3, where the x-axis denotes the number of annotators who agree with the gold-standard: For example, "2" indicates that 2 out of 5 annotators were able to assign the gold-standard labels to the documents.

Figure 3: ALBERT-xxLarge vs. humans.

Our assumption is that the incoherent documents which humans fail to detect are actually not perceptibly incoherent,9 and that any advantage for the models over humans for documents with low agreement (with respect to the gold-standard) is actually due to dataset artefacts. At the document level (Acc), there is reasonable correlation between model and human performance (i.e., the model struggles on the same documents as the humans). At the sentence level (F1), there is less discernible difference in model performance over documents of varying human difficulty.

9 Although the intruder sentence may lead to factual errors, the annotators were instructed not to do fact-checking.

6.4 Analysis over Documents with High Human Agreement

To understand the relationship between human-assigned labels and the gold-standard, we further examine documents where all 5 annotators agree, noting that human-assigned labels can potentially be different from the gold-standard here. Table 5 shows the statistics of human judgments over these documents, with regard to whether there is an intruder sentence in the documents. Encouragingly, we can see that humans tend to agree more over coherent documents (documents without any intruder sentences) than incoherent documents (documents with an intruder sentence).

Humans       # −intruder docs   # +intruder docs
Coherent     1385               177
Incoherent   11                 404

Table 5: Statistics over documents where all 5 humans agree, where −intruder/+intruder indicates the documents without/with an intruder sentence.



Examining the 11 original coherent documents which were annotated as incoherent by all annotators, we find that there is a break in information flow due to references or URLs, even though there is no intruder sentence. For documents with an intruder sentence (+intruder) where humans disagree with the gold-standard (humans perceive the documents as coherent, or perceive the position of the intruder sentence to be other than the actual intruder sentence), we find that 98% of the documents are considered to be coherent. We randomly sampled 100 of these documents and examined whether the intruder sentence results in a break in information flow. We find that fact-checking is needed to identify the intruder sentence for 93% of the documents.10

                  Wiki→Wiki           CNN→Wiki
                  Acc (%)   F1 (%)    Acc (%)   F1 (%)
Majority-class    70.6      0.0       70.6      0.0
BERT-Large        76.7      42.0      75.4      36.9
XLNet-Large       79.1      57.0      76.6      35.4
RoBERTa-Large     82.0      59.6      77.3      37.4
ALBERT-xxLarge    85.9      68.8      78.8      42.9
Human             79.5      45.4      79.5      45.4

Table 6: Results over documents annotated consistently by all 5 annotators, where annotations can be the same as or different from the gold-standard.

Table 6 shows the performance over the Wikipedia documents that are annotated consistently by all 5 annotators (from Table 5). Consistent with the results from Table 3, ALBERT-xxLarge achieves the best performance both in- and cross-domain. To understand the different behaviors of humans and ALBERT-xxLarge, we analyze documents which only humans got correct, only ALBERT-xxLarge got correct, or neither humans nor ALBERT-xxLarge got correct, as follows:

1. Humans only: 7 incoherent (+intruder) and 73 coherent (−intruder) documents

2. ALBERT-xxLarge only: 181 incoherent (+intruder) documents (of which we found 97% to require fact-checking11) and 9 coherent (−intruder) documents (of which 8 contain URLs/references, which confused humans)

3. Neither humans nor models: 223 incoherent (+intruder) documents (of which 98.2% and 77.1% were predicted to be coherent by humans and ALBERT-xxLarge, respectively, and for the remainder, the wrong intruder sentence was identified) and 2 coherent (−intruder) documents (both of which were poorly organised, confusing allcomers)

10 Here, the high percentage of incoherent documents with factual inconsistencies does not necessarily point to a high percentage of factual inconsistency in the overall dataset, as humans are more likely to agree with the gold-standard for coherent documents.

11 There are 4 documents that humans identify as incoherent based on the wrong intruder sentence, due to the intruder sentence leading to a misleading factual inconsistency.

Looking over the incoherent documents that
require fact-checking, no obvious differences are
discernible between the documents that ALBERT-
xxLarge predicts correctly and those it misses.
Our assumption here is that ALBERT-xxLarge is
biased by the pretraining dataset, and that many
of the cases where it makes the wrong prediction
are attributable to mismatches between the text
in our dataset and the Wikipedia version used in
pretraining the model.

6.5 Questions Revisited

Q1: Do models truly capture the intrinsic
properties of document coherence?

A: It is certainly true that models that incorporate a more explicit notion of document coherence into pretraining (e.g., ALBERT) tend to perform better. Moreover, larger-context models (RoBERTa) and robust training strategies (XLNet) during pretraining are also beneficial for document coherence understanding. This suggests a tentative yes, but there were equally instances of strong disagreement between human intuitions and model predictions for the better-performing models, and evidence to suggest that the models were performing fact-checking at the same time as coherence modeling.

Q2: What types of document incoherence

can/can’t these models detect?

A: Over incoherent documents resulting from factual inconsistencies, where humans tend to fail, the better-performing models can often make correct predictions; over incoherent documents with information structure or logical inconsistencies, which humans can easily detect, ALBERT-Large, RoBERTa-Large, and XLNet-Large achieve an Acc ≥ 87%, showing that they can certainly capture information structure and logical inconsistencies to a high degree. That said, the fact that they misclassify clearly coherent documents as incoherent suggests that they are in part lacking in their ability to capture document coherence. We thus can conclude that they can reliably identify intruder sentences which result in a break in information structure or logical flow, but are imperfect models of document coherence.

7 Linguistic Probes

To further examine the models, we constructed a
language probe dataset.

7.1 Linguistic Probe Dataset Construction

We handcrafted adversarial instances based on a range of linguistic phenomena that generate information structure inconsistencies. In constructing such a dataset, minimal modifications were made to the original sentences, to isolate the effect of the linguistic probe. For each phenomenon, we hand-constructed roughly 100 adversarial instances by modifying intruder sentences in incoherent Wikipedia test documents that were manually pre-filtered for ease of detection/lack of confounding effects in the original text. That is, the linguistic probes for the different phenomena were manually added to incoherent test documents, within intruder sentences; our interest here is whether the addition of the linguistic probes makes it easier for the models to detect the incoherence. Importantly, we do not provide any additional training data, meaning there is no supervision signal specific to the phenomena. There are roughly 8×100 instances in total,12 with the eight phenomena being:

1. gender pronoun flip (Gender), converting a pronoun to its opposite gender (e.g., she → he);

12 There are 100 instances for each phenomenon except for Demonstrative, where there were only 95 instances in the Wikipedia test data with singular demonstratives.



2. animacy downgrade (Animacy↓), downgrading pronouns and possessive determiners to their inanimate versions (e.g., she/he/her/him → it, and her/his → its);

3. animacy upgrade (Animacy↑), upgrading pronouns and possessive determiners to their third person versions (e.g., it → she/he/her/him, and its → her/his);

4. singular demonstrative flip (Demonstrative), converting singular demonstratives to plural ones (e.g., this → these and that → those);

5. conjunction flip (Conjunction), converting conjunctions to their opposites (e.g., but → and therefore, and → but, although → therefore, and vice versa);

6. past tense flip (Past to Future), converting past to future tense (e.g., was → will be and led → will lead);

7. sentence negation (Negation), negating the sentence (e.g., He has [a] . . . warrant … → He doesn't have [a] . . . warrant …);

8. number manipulation (Number), changing numbers to implausible values (e.g., He served as Chief Operating Officer . . . from 2002 to 2005 → He served as Chief Operating Officer . . . from 200 BCE to 201 BCE, and Line 11 has a length of 51.7 km and a total of 18 stations. → Line 11 has a length of 51.7 m and a total of 1.8 stations.).

All the probes generate syntactically correct sentences, and the first four generally lead to sentences that are also semantically felicitous, with the incoherence being at the document level. For example, in He was never convicted and was out on parole within a few years, if we replace he with she, the sentence is felicitous, but if the focus entity in the preceding and subsequent sentences is a male, the information flow will be disrupted.

The last four language probes are crafted to explore the capacity of a model to capture commonsense reasoning, in terms of discourse relationships, tense and polarity awareness, and understanding of numbers. For Conjunction, we only focus on explicit connectives within a sentence. For Past to Future, there can be intra-sentence inconsistency if there are time-specific signals, failing which broader document context is needed to pick up on the tense flip. Similarly for Negation and Number, the change can lead to inconsistency either intra- or inter-sententially. For example, He did not appear in more than 400 films between 1914 and 1941 … is intra-sententially incoherent.
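Each probe is a small deterministic rewrite of the intruder sentence. As an illustration, the gender pronoun flip might be sketched as follows (the word list is abbreviated and the tokenization simplified relative to the manual construction described above):

    import re

    # Abbreviated mapping; the manually applied probe also handles
    # possessives and other case distinctions.
    GENDER_FLIP = {"she": "he", "he": "she", "her": "his",
                   "his": "her", "him": "her"}

    def flip_gender_pronouns(sentence: str) -> str:
        def repl(match):
            word = match.group(0)
            flipped = GENDER_FLIP[word.lower()]
            # Preserve sentence-initial capitalization.
            return flipped.capitalize() if word[0].isupper() else flipped
        pattern = r"\b(" + "|".join(GENDER_FLIP) + r")\b"
        return re.sub(pattern, repl, sentence, flags=re.IGNORECASE)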

7.2 Experimental Results

Table 7 lists the performance of pretrained LMs at recognising intruder sentences within incoherent documents, with and without the addition of the respective linguistic probes.13 For a given model, we break down the results across probes into two columns: The first column ("F1") shows the sentence-level performance over the original intruder sentence (without the addition of the linguistic probe), and the second column ("ΔF1") shows the absolute difference in performance with the addition of the linguistic probe. Our expectation is that results should improve on average with the inclusion of the linguistic probe (i.e., ΔF1 values should be positive), given that we have reinforced the incoherence generated by the intruder sentence.

All models achieve near-perfect results with Gender linguistic probes (i.e., the sum of F1 and ΔF1 is close to 100), and are also highly successful at detecting Animacy mismatches and Past to Future (the top half of Table 7). For the probes in the bottom half of the table, none of the models other than ALBERT-xxLarge performs particularly well, especially for Demonstrative. For each linguistic probe, we observe that the pretrained LMs can more easily detect incoherent text with the addition of these lexical/grammatical inconsistencies (except for XLNet-Large and ALBERT-xxLarge over Demonstrative, and ALBERT-xxLarge over Conjunction).

In the cross-domain setting, the overall performance of XLNet-LargeCNN and ALBERT-xxLargeCNN drops across all linguistic probes, but the absolute gain through the inclusion of the linguistic probe is almost universally larger, suggesting that while domain differences hurt the models, they are attuned to the impact of linguistic probes on document coherence and are thus learning some more general properties of document (in)coherence. On the other hand, BERT-LargeCNN (over Gender, Animacy↓, and Animacy↑) and RoBERTa-LargeCNN (Gender and Animacy↑) actually perform better than in-domain.
13 Results for coherent documents are omitted due to space.



                     Gender         Animacy↓       Animacy↑       Past to Future
                     F1     ΔF1     F1     ΔF1     F1     ΔF1     F1     ΔF1
BERT-Large           26.5   +65.3   26.3   +53.2   33.6   +45.1   35.6   +42.1
XLNet-Large          55.8   +41.6   50.0   +45.2   64.0   +23.5   64.9   +16.9
RoBERTa-Large        64.9   +32.5   50.7   +38.3   59.7   +21.7   69.2   +19.9
ALBERT-xxLarge       74.0   +25.4   71.8   +8.5    81.0   +2.9    79.8   +4.3
BERT-LargeCNN        23.9   +70.0   22.2   +60.2   27.6   +51.4   30.6   +14.7
XLNet-LargeCNN       13.6   +83.1   10.0   +71.3   8.0    +71.8   23.2   +27.6
RoBERTa-LargeCNN     15.4   +82.4   7.9    +64.4   9.8    +73.3   23.4   +40.0
ALBERT-xxLargeCNN    21.6   +72.8   20.2   +51.8   27.6   +33.4   38.0   +30.4
Human                35.8   +53.4   36.6   +45.3   29.8   +53.9   40.9   +34.4

                     Conjunction    Demonstrative  Negation       Number
                     F1     ΔF1     F1     ΔF1     F1     ΔF1     F1     ΔF1
BERT-Large           51.9   +17.3   34.8   +15.6   34.5   +32.2   32.5   +31.2
XLNet-Large          68.6   +3.6    55.4   0.0     57.7   +8.9    50.7   +11.3
RoBERTa-Large        73.0   +0.7    57.9   0.0     68.4   +10.9   54.2   +20.0
ALBERT-xxLarge       83.5   −1.6    75.2   +1.3    79.5   +2.9    63.9   +10.4
BERT-LargeCNN        38.2   −1.4    35.6   −5.7    28.8   +4.2    19.6   +11.7
XLNet-LargeCNN       31.0   0.0     14.1   0.0     15.7   +11.8   15.2   +13.1
RoBERTa-LargeCNN     33.9   +1.4    17.8   0.0     21.0   +12.4   18.3   +23.6
ALBERT-xxLargeCNN    41.6   +1.3    30.9   0.0     28.1   +19.2   23.0   +16.0
Human                40.5   +8.7    38.0   +1.0    40.4   +36.8   37.3   +24.2

Table 7: Results over language probes in incoherent Wikipedia test documents. BERT-LargeCNN, XLNet-LargeCNN, RoBERTa-LargeCNN, and ALBERT-xxLargeCNN are trained over CNN, while BERT-Large, XLNet-Large, RoBERTa-Large, and ALBERT-xxLarge are trained over Wikipedia. Here, F1 is over the original incoherent documents (excluding linguistic probes), and ΔF1 indicates the absolute performance difference resulting from incorporating linguistic probes.

RoBERTa-LargeCNN achieves the best overall performance over Gender, Animacy↑, and Number, while ALBERT-xxLargeCNN achieves the best overall performance over Past to Future, Conjunction, Demonstrative, and Negation. The reason that the models tend to struggle with Demonstrative and Conjunction is not immediately clear, and will be explored in future work.

We also conducted human evaluations on this dataset via Amazon Mechanical Turk, based on the same methodology as described in Section 4.5 (without explicit instruction to look out for linguistic artefacts, and with a mixture of coherent and incoherent documents, as per the original annotation task). As detailed in Table 7, humans generally benefit from the inclusion of the linguistic probes. Largely consistent with the results for the models, humans are highly sensitised to the effects of Gender, Animacy, Past to Future, and Negation, but largely oblivious to the effects of Demonstrative and Conjunction. Remarkably, the best models (ALBERT-xxLarge and RoBERTa-Large) perform on par with humans in the in-domain setting, but are generally well below humans in the cross-domain setting.

8 Conclusion

We propose the new task of detecting whether there is an intruder sentence in a document, generated by replacing an original sentence with a similar sentence from a second document. To benchmark model performance over this task, we construct a large-scale dataset consisting of documents from English Wikipedia and CNN news articles. Experimental results show that pretrained LMs that incorporate larger document contexts in pretraining perform remarkably well in-domain, but experience a substantial drop cross-domain.
In follow-up analysis based on human annotations, substantial divergences from human intuitions were observed, pointing to limitations in the models' ability to capture document coherence. Further results over a linguistic probe dataset show that pretrained models fail to identify some linguistic characteristics that affect document coherence, suggesting that there is substantial room for improvement before they truly capture document coherence, and motivating the construction of a dataset with intruder text at the intra-sentential level.
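
As a concrete illustration of the dataset construction described above, the sketch below (our own, under stated assumptions; the paper's exact similarity measure and filtering criteria are not reproduced here) replaces one sentence of a source document with the most similar sentence drawn from a second document. The sentence encoder and cosine-similarity criterion are illustrative stand-ins:

    # Hypothetical sketch of intruder-document construction: swap one sentence
    # of doc_a for the most similar sentence from doc_b. The embedding model
    # and similarity measure are our assumptions, not the paper's procedure.
    import random
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

    def make_intruder_document(doc_a, doc_b, rng=random):
        """doc_a, doc_b: lists of sentences. Returns (new_doc, intruder_idx)."""
        idx = rng.randrange(len(doc_a))          # position of replaced sentence
        target = model.encode([doc_a[idx]])[0]
        candidates = model.encode(doc_b)
        sims = candidates @ target / (
            np.linalg.norm(candidates, axis=1) * np.linalg.norm(target))
        intruder = doc_b[int(np.argmax(sims))]   # most similar foreign sentence
        new_doc = doc_a[:idx] + [intruder] + doc_a[idx + 1:]
        return new_doc, idx

Choosing the most similar foreign sentence, rather than a random one, keeps the task from reducing to trivial topic mismatch, which is in keeping with the construction as described.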

References

Wafia Adouane, Jean-Philippe Bernardy, and Simon Dobnik. 2018a. Improving neural network performance by injecting background knowledge: Detecting code-switching and borrowing in Algerian texts. In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, pages 20–28. DOI: https://doi.org/10.18653/v1/W18-3203

Wafia Adouane, Simon Dobnik, Jean-Philippe Bernardy, and Nasredine Semmar. 2018b. A comparison of character neural language model and bootstrapping for language identification in multilingual noisy texts. In Proceedings of the Second Workshop on Subword/Character LEvel Models, pages 22–31. DOI: https://doi.org/10.18653/v1/W18-1203

Malihe Alikhani, Sreyasi Nag Chowdhury, Gerard de Melo, and Matthew Stone. 2019. CITE: A corpus of image-text discourse relations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 570–575.

Nicholas M. Asher and Alex Lascarides. 2003. Logics of Conversation, Studies in Natural Language Processing. Cambridge University Press.

Regina Barzilay and Mirella Lapata. 2005. Modeling local coherence: An entity-based approach. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 141–148. DOI: https://doi.org/10.3115/1219840.1219858

Regina Barzilay and Mirella Lapata. 2008. Modeling local coherence: An entity-based approach. Computational Linguistics, 34(1):1–34. DOI: https://doi.org/10.1162/coli.2008.34.1.1

Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017. What do neural machine translation models learn about morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 861–872. DOI: https://doi.org/10.18653/v1/P17-1080

Yonatan Belinkov and James Glass. 2019. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7:49–72. DOI: https://doi.org/10.1162/tacl_a_00254

Terra Blevins, Omer Levy, and Luke Zettlemoyer. 2018. Deep RNNs encode soft hierarchical syntax. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 14–19. DOI: https://doi.org/10.18653/v1/P18-2003

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642. DOI: https://doi.org/10.18653/v1/D15-1075

Jamie Callan, Mark Hoy, Changkuk Yoo, and Le Zhao. 2009. ClueWeb09 data set. https://lemurproject.org/clueweb09/. Accessed: 15.12.2019.

Khyathi Chandu, Eric Nyberg, and Alan W Black. 2019. Storyboarding of recipes: Grounded contextual generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6040–6046. DOI: https://doi.org/10.18653/v1/P19-1606

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to
answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879. DOI: https://doi.org/10.18653/v1/P17-1171

Mingda Chen, Zewei Chu, and Kevin Gimpel. 2019. Evaluation benchmarks and learning criteria for discourse-aware sentence representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 649–662. DOI: https://doi.org/10.18653/v1/D19-1060

Minhao Cheng, Jinfeng Yi, Pin-Yu Chen, Huan Zhang, and Cho-Jui Hsieh. 2020. Seq2Sick: Evaluating the robustness of sequence-to-sequence models with adversarial examples. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, pages 3601–3608. DOI: https://doi.org/10.1609/aaai.v34i04.5767

Orphée De Clercq, Véronique Hoste, Bart Desmet, Philip van Oosten, Martine De Cock, and Lieve Macken. 2014. Using the crowd for readability prediction. Natural Language Engineering, 20(3):293–325. DOI: https://doi.org/10.1017/S1351324912000344

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680. DOI: https://doi.org/10.18653/v1/D17-1070

Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2126–2136. DOI: https://doi.org/10.18653/v1/P18-1198

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Micha Elsner, Joseph Austerweil, and Eugene Charniak. 2007. A unified local and global model for discourse coherence. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 436–443.

Micha Elsner and Eugene Charniak. 2011. Extending the entity grid with entity-specific features. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 125–129.

Angela Fan, Mike Lewis, and Yann Dauphin. 2019. Strategies for structuring story generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2650–2660.

Herbert Paul Grice. 2002. Logic and conversation. Foundations of Cognitive Psychology, pages 719–732.

Camille Guinaudeau and Michael Strube. 2013. Graph-based local coherence modeling. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 93–103.

Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. 2018. Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1195–1205. DOI: https://doi.org/10.18653/v1/N18-1108

Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015.
Teaching machines to read and comprehend. In Proceedings of the 28th International Conference on Neural Information Processing Systems – Volume 1, pages 1693–1701.

John Hewitt and Percy Liang. 2019. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 2733–2743. DOI: https://doi.org/10.18653/v1/D19-1275

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780. DOI: https://doi.org/10.1162/neco.1997.9.8.1735, PMID: 9377276

Ari Holtzman, Jan Buys, Maxwell Forbes, Antoine Bosselut, David Golub, and Yejin Choi. 2018. Learning to write with cooperative discriminators. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1638–1649. DOI: https://doi.org/10.18653/v1/P18-1152

Jennifer Hu, Sherry Y. Chen, and Roger P. Levy. 2020a. A closer look at the performance of neural language models on reflexive anaphor licensing. Proceedings of the Society for Computation in Linguistics, 3(1):382–392.

Junjie Hu, Yu Cheng, Zhe Gan, Jingjing Liu, Jianfeng Gao, and Graham Neubig. 2020b. What makes a good story? Designing composite rewards for visual storytelling. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, pages 7969–7976. DOI: https://doi.org/10.1609/aaai.v34i05.6305

Paria Jamshid Lou, Peter Anderson, and Mark Johnson. 2018. Disfluency detection using auto-correlational neural networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4610–4619. DOI: https://doi.org/10.18653/v1/D18-1490

Yangfeng Ji and Noah A. Smith. 2017. Neural discourse structure for text categorization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 996–1005.

Mark Johnson and Eugene Charniak. 2004. A TAG-based noisy-channel model of speech repairs. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pages 33–39. DOI: https://doi.org/10.3115/1218955.1218960

Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2015. Skip-thought vectors. In Proceedings of the 28th International Conference on Neural Information Processing Systems – Volume 2, pages 3294–3302.

Adhiguna Kuncoro, Chris Dyer, John Hale, Dani Yogatama, Stephen Clark, and Phil Blunsom. 2018. LSTMs can learn syntax-sensitive dependencies well, but modeling structure makes them better. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1426–1436. DOI: https://doi.org/10.18653/v1/P18-1132

Alice Lai and Joel Tetreault. 2018. Discourse coherence in the wild: A dataset, evaluation and methods. In Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue, pages 214–223. DOI: https://doi.org/10.18653/v1/W18-5023

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In Proceedings of the 8th International Conference on Learning Representations.

Mirella Lapata. 2003. Probabilistic text structuring: Experiments with sentence ordering. In Proceedings of the 41st Annual Meeting
yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
3
8
8
1
9
2
9
6
9
9

/

/
t

yo

a
C
_
a
_
0
0
3
8
8
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

of the Association for Computational Linguistics, pages 545–552. DOI: https://doi.org/10.3115/1075096.1075165

Jey Han Lau, Alexander Clark, and Shalom Lappin. 2015. Unsupervised prediction of acceptability judgements. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1618–1628.

Jiwei Li and Dan Jurafsky. 2017. Neural net models of open-domain discourse coherence. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 198–209.

Xia Li, Minping Chen, Jianyun Nie, Zhenxing Liu, Ziheng Feng, and Yingdan Cai. 2018. Coherence-based automated essay scoring using self-attention. In Proceedings of the 17th China National Conference on Computational Linguistics, CCL 2018, and the 6th International Symposium on Natural Language Processing Based on Naturally Annotated Big Data, pages 386–397. DOI: https://doi.org/10.1007/978-3-030-01716-3_32

Bin Liang, Hongcheng Li, Miaoqiang Su, Pan Bian, Xirong Li, and Wenchang Shi. 2018. Deep text classification can be fooled. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, pages 4208–4215. DOI: https://doi.org/10.24963/ijcai.2018/585

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, cs.CL/1907.11692v1.

William C. Mann and Sandra A. Thompson. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text, 8(3):243–281. DOI: https://doi.org/10.1515/text.1.1988.8.3.243

Rebecca Marvin and Tal Linzen. 2018. Targeted syntactic evaluation of language models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202. DOI: https://doi.org/10.18653/v1/D18-1151

Deepthi Mave, Suraj Maharjan, and Thamar Solorio. 2018. Language identification and analysis of code-switched social media text. In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, pages 51–61. DOI: https://doi.org/10.18653/v1/W18-3206

Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448. DOI: https://doi.org/10.18653/v1/P19-1334

Mohsen Mesgar and Michael Strube. 2016. Lexical coherence graph modeling using word embeddings. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1414–1423. DOI: https://doi.org/10.18653/v1/N16-1167

Mohsen Mesgar and Michael Strube. 2018. A neural local coherence model for text quality assessment. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4328–4339. DOI: https://doi.org/10.18653/v1/D18-1464

Eleni Miltsakaki, Rashmi Prasad, Aravind Joshi, and Bonnie Webber. 2004. The Penn discourse treebank. In Proceedings of the Fourth International Conference on Language Resources and Evaluation.

Farjana Sultana Mim, Naoya Inoue, Paul Reisert, Hiroki Ouchi, and Kentaro Inui. 2019. Unsupervised learning of discourse-aware text representation. In Proceedings of the 2019 Student Research Workshop, pages 93–104.

Ramesh Nallapati, Bowen Zhou, Cícero Nogueira dos Santos, Çağlar Gülçehre, and Bing Xiang. 2016. Abstractive text summarization using
sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290. DOI: https://doi.org/10.18653/v1/K16-1028

Nicolas Papernot, Fartash Faghri, Nicholas Carlini, Ian Goodfellow, Reuben Feinman, Alexey Kurakin, Cihang Xie, Yash Sharma, Tom Brown, Aurko Roy, Alexander Matyasko, Vahid Behzadan, Karen Hambardzumyan, Zhishuai Zhang, Yi-Lin Juang, Zhi Li, Ryan Sheatsley, Abhibhav Garg, Jonathan Uesato, Willi Gierke, Yinpeng Dong, David Berthelot, Paul Hendricks, Jonas Rauber, and Rujun Long. 2018. Technical report on the CleverHans v2.1.0 adversarial examples library. CoRR, cs.LG/1610.00768v6.

Cesc C. Park and Gunhee Kim. 2015. Expressing an image stream with a sequence of natural sentences. In Proceedings of Advances in Neural Information Processing Systems 28, pages 73–81.

Robert Parker, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2011. English Gigaword fifth edition. Linguistic Data Consortium.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532–1543. DOI: https://doi.org/10.3115/v1/D14-1162

Matthew Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. 2018. Dissecting contextual word embeddings: Architecture and representation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1499–1509. DOI: https://doi.org/10.18653/v1/D18-1179

Emily Pitler, Annie Louis, and Ani Nenkova. 2010. Automatic evaluation of linguistic quality in multi-document summarization. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 544–554.

Emily Pitler and Ani Nenkova. 2008. Revisiting readability: A unified framework for predicting text quality. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 186–195. DOI: https://doi.org/10.3115/1613715.1613742

Shrimai Prabhumoye, Ruslan Salakhutdinov, and Alan W Black. 2020. Topological sort for sentence ordering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2783–2792. DOI: https://doi.org/10.18653/v1/2020.acl-main.248

Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind Joshi, and Bonnie Webber. 2008. The Penn Discourse TreeBank 2.0. In Proceedings of the Sixth International Conference on Language Resources and Evaluation.

Jan Wira Gotama Putra and Takenobu Tokunaga. 2017. Evaluating text coherence based on semantic similarity graph. In Proceedings of TextGraphs-11: the Workshop on Graph-based Methods for Natural Language Processing, pages 76–85.

Suranjana Samanta and Sameep Mehta. 2017. Towards crafting text adversarial samples. CoRR, cs.LG/1707.02812v1.

Motoki Sato, Jun Suzuki, Hiroyuki Shindo, and Yuji Matsumoto. 2018. Interpretable adversarial perturbation in input embedding space for text. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, pages 4323–4330. DOI: https://doi.org/10.24963/ijcai.2018/601

Swapna Somasundaran, Jill Burstein, and Martin Chodorow. 2014. Lexical chaining for measuring discourse coherence quality in test-taker essays. In Proceedings of the 25th International Conference on Computational Linguistics: Technical Papers, pages 950–961.

Gongbo Tang, Mathias Müller, Annette Rios, and Rico Sennrich. 2018. Why self-attention? A targeted evaluation of neural machine translation architectures. In Proceedings of the 2018 Conference on Empirical Methods in Natural
Procesamiento del lenguaje,
4263–4272.
DOI: https://doi.org/10.18653/v1
/D18-1458

paginas

Deirdre Wilson and Dan Sperber. 2004. Rele-
vance theory, The Handbook of Pragmatics,
Blackwell, pages 607–632.

Yi Tay, Minh C. Phan, Luu Anh Tuan, y
Siu Cheung Hui. 2018. SkipFlow: Incorporating
neural
end-to-end
automatic text scoring. En Actas de la
Thirty-Second AAAI Conference on Artificial
Inteligencia, pages 5948–5955.

coherencia

características

para

Dat Tien Nguyen and Shafiq Joty. 2017. A neural
En procedimientos de
local coherence model.
the 55th Annual Meeting of the Association
para Lingüística Computacional (Volumen 1:
Artículos largos),
1320–1330. DOI:
https://doi.org/10.18653/v1/P17
-1121

paginas

Ke Tran, Arianna Bisazza, and Christof Monz.
2018. The importance of being recurrent
for modeling hierarchical structure. En profesional-
cesiones de la 2018 Conferencia sobre Empirismo
Métodos en el procesamiento del lenguaje natural,
pages 4731–4736. DOI: https://doi.org
/10.18653/v1/D18-1503

Trieu H. Trinh and Quoc V. Le. 2018. A simple
method for commonsense reasoning. CORR,
cs.AI/1806.02847v2.

Alex Warstadt, Amanpreet Singh, and Samuel
Bowman. 2019. Neural network acceptability
la Asociación
judgments. Transactions of
para Lingüística Computacional, 7(0). DOI:
https://doi.org/10.1162/tacl a
00290

Bonnie Webber. 2009. Genre distinctions for dis-
course in the Penn TreeBank. En procedimientos
of the Joint Conference of the 47th Annual
Meeting of the ACL and the 4th International
Conferencia conjunta sobre lenguaje natural Pro-
cessing of the AFNLP, pages 674–682. DOI:
https://doi.org/10.3115/1690219
.1690240

Ethan Wilcox, Roger Levy, Takashi Morita, y
Richard Futrell. 2018. What do RNN language
models learn about filler–gap dependencies?
En Actas de la 2018 EMNLP Workshop
BlackboxNLP: Analyzing and Interpreting Neu-
ral Networks for NLP, pages 211–221. DOI:
https://doi.org/10.18653/v1/W18
-5423

Peng Xu, Hamidreza Saghir, Jin Sung Kang,
Teng Long, Avishek Joey Bose, Yanshuai Cao,
and Jackie Chi Kit Cheung. 2019. A cross-
domain transferable neural coherence model.
In Proceedings of the 57th Annual Meeting of
la Asociación de Lingüística Computacional,
pages 678–687.

Puyudi Yang, Jianbo Chen, Cho-Jui Hsieh,
Jane-Ling Wang, and Michael I. Jordán. 2020.
Greedy attack and gumbel attack: Generating
adversarial examples for discrete data. Diario
de Investigación sobre Aprendizaje Automático, 21:43:1–43:36.

Yi Yang, Wen-tau Yih, and Christopher Meek.
2015. WikiQA: A challenge dataset for open-
domain question answering. En procedimientos
on Empirical
de
Métodos en el procesamiento del lenguaje natural,
pages 2013–2018. DOI: https://doi.org
/10.18653/v1/D15-1237

2015 Conferencia

el

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime
Carbonell, Russ R Salakhutdinov, and Quoc V
Le. 2019. Xlnet: Generalized autoregressive
En
language understanding.
pretraining for
Actas de
the Thirty-third Conference
on Neural Information Processing Systems,
pages 5754–5764.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua
bengio, William Cohen, Ruslan Salakhutdinov,
and Christopher D. Manning. 2018. HotpotQA:
A dataset for diverse, explainable multi-hop
question answering. En Actas de la 2018
Jornada sobre Métodos Empíricos en Natural
Procesamiento del lenguaje, pages 2369–2380. DOI:
https://doi.org/10.18653/v1/D18
-1259

idioma

code-switching
pair.

and G¨uls¸en Eryi˘git.
Zeynep Yirmibes¸o˘glu
entre
2018. Detecting
En profesional-
Turkish-English
el 2018 EMNLP Workshop
cesiones de
on Noisy
W-NUT: El
User-generated Text, pages 110–115. DOI:
https://doi.org/10.18653/v1/W18
-6115

4th Workshop

Rowan Zellers, Yonatan Bisk, Roy Schwartz,
and Yejin Choi. 2018. SWAG: A large-scale

adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 93–104. DOI: https://doi.org/10.18653/v1/D18-1009

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800. DOI: https://doi.org/10.18653/v1/P19-1472

Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the 2015 IEEE International Conference on Computer Vision, pages 19–27. DOI: https://doi.org/10.1109/ICCV.2015.11
yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
3
8
8
1
9
2
9
6
9
9

/

/
t

yo

a
C
_
a
_
0
0
3
8
8
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

640
Descargar PDF