How Furiously Can Colorless Green Ideas Sleep?
Sentence Acceptability in Context
Jey Han Lau1,7 Carlos Armendariz2 Shalom Lappin2,3,4
Matthew Purver2,5 Chang Shu6,7
1The University of Melbourne
2Queen Mary University of London
3University of Gothenburg
4King’s College London
5Jožef Stefan Institute
6University of Nottingham Ningbo China
7DeepBrain
jeyhan.lau@gmail.com, c.santosarmendariz@qmul.ac.uk
shalom.lappin@gu.se, m.purver@qmul.ac.uk, scxcs1@nottingham.edu.cn
Abstract
We study the influence of context on sentence
acceptability. First we compare the acceptabil-
ity ratings of sentences judged in isolation,
with a relevant context, and with an irrelevant
context. Our results show that context induces
a cognitive load for humans, which com-
presses the distribution of ratings. Moreover,
in relevant contexts we observe a discourse
coherence effect that uniformly raises ac-
ceptability. Next, we test unidirectional and
bidirectional language models in their ability to
predict acceptability ratings. The bidirectional
models show very promising results, with the
best model achieving a new state-of-the-art for
unsupervised acceptability prediction. The two
sets of experiments provide insights into the
cognitive aspects of sentence processing and
central issues in the computational modeling
of text and discourse.
1 Introduction
Sentence acceptability is the extent to which a
sentence appears natural to native speakers of a
language. Linguists have often used this property
to motivate grammatical theories. Computational
language processing has traditionally been more
concerned with likelihood—the probability of a
sentence being produced or encountered. The
question of whether and how these properties
are related is a fundamental one. Lau et al.
(2017b) experimented with unsupervised language
models to predict acceptability, and obtained
an encouraging correlation with human ratings.
This raises foundational questions about the nature
of linguistic knowledge: If probabilistic models
can acquire knowledge of sentence acceptability
from raw texts, we have prima facie support for
an alternative view of language acquisition that
does not rely on a categorical grammaticality
component.
It is generally assumed that our perception of
sentence acceptability is influenced by context.
Sentences that may appear odd in isolation can
become natural in some environments, and sen-
tences that seem perfectly well formed in some
contexts are odd in others. On the computational
side, much recent progress in language modeling
has been achieved through the ability to incor-
porate more document context, using broader
and deeper models (e.g., Devlin et al., 2019;
Yang et al., 2019). While most language modeling
is restricted to individual sentences, models can
benefit from using additional context (Khandelwal
et al., 2018). However, despite the importance of
context, few psycholinguistic or computational
studies systematically investigate how context
affects acceptability, or the ability of language
models to predict human acceptability judgments.
Two recent studies that explore the impact of doc-
ument context on acceptability judgments both
identify a compression effect (Bernardy et al.,
2018; Bizzoni and Lappin, 2019). Sentences per-
ceived to be low in acceptability when judged
without context receive a boost in acceptability
when judged within context. In contrast, those
with high out-of-context acceptability see a reduc-
tion in acceptability when context is presented. It
is unclear what causes this compression effect. Is
it a result of cognitive load, imposed by additional
Transactions of the Association for Computational Linguistics, vol. 8, pp. 296–310, 2020. https://doi.org/10.1162/tacl_a_00315
Action Editor: George Foster. Submission batch: 10/2019; Revision batch: 1/2020; Published 6/2020.
© 2020 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
processing demands, or is it the consequence of
an attempt to identify a discourse relation between
context and sentence?
We address these questions in this paper. To
understand the influence of context on human
perceptions, we ran three crowdsourced experi-
ments to collect acceptability ratings from human
annotators. We develop a methodology to ensure
comparable ratings for each target sentence in
isolation (without any context), in a relevant three-
sentence context, and in the context of sentences
randomly sampled from another document. Our
results replicate the compression effect, and
careful analyses reveal that both cognitive load
and discourse coherence are involved.
To understand the relationship between sen-
tence acceptability and probability, we conduct
experiments with unsupervised language models
to predict acceptability. We explore traditional
unidirectional (left-to-right) recurrent neural
network models, and modern bidirectional
transformer models (e.g., BERT). We found that
bidirectional models consistently outperform
unidirectional models by a wide margin, calling
into question the suitability of left-to-right bias for
sentence processing. Our best bidirectional model
achieves simulated human performance on the
prediction task, establishing a new state-of-the-art.
2 Acceptability in Context
2.1 Data Collection
To understand how humans interpret acceptability,
we require a set of sentences with varying degrees
of well-formedness. Following previous studies
(Lau et al., 2017b; Bernardy et al., 2018), we
use round-trip machine translation to introduce a
wide range of infelicities into naturally occurring
sentences.
We sample 50 English (target) sentences and
their contexts (three preceding sentences) from the
English Wikipedia.1 We use Moses to translate
the target sentences into four languages (Czech,
Spanish, German, and French) and then back to
1We preprocess the raw dump with WikiExtractor
(https://github.com/attardi/wikiextractor),
and collect paragraphs that have ≥ 4 sentences with each
sentence having ≥ 5 words. Sentences and words are tok-
enized with spaCy (https://spacy.io/) to check for
these constraints.
English.2 This produces 250 sentences in total
(5 languages including English) for our test set.
Note that we only do round-trip translation for the
target sentences; the contexts are not modified.
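The sampling constraints in footnote 1 can be illustrated with a short sketch; this is our own illustration rather than the released preprocessing script, and it assumes a standard English spaCy pipeline.

```python
import spacy

# Any English spaCy pipeline with sentence segmentation will do here.
nlp = spacy.load("en_core_web_sm")

def keep_paragraph(paragraph, min_sents=4, min_words=5):
    """Keep a Wikipedia paragraph only if it has >= min_sents sentences,
    each containing >= min_words word tokens (punctuation excluded)."""
    sents = list(nlp(paragraph).sents)
    if len(sents) < min_sents:
        return False
    return all(
        sum(not tok.is_punct and not tok.is_space for tok in sent) >= min_words
        for sent in sents
    )
```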
We use Amazon Mechanical Turk (AMT) to
collect acceptability ratings for the target sen-
tences.3 We run three experiments where we
expose users to different types of context. For the
experiments, we split the test set into 25 HITs of
10 sentences. Each HIT contains 2 original English
sentences and 8 round-trip translated sentences,
which are different from each other and not de-
rived from either of the originals. Users are asked
to rate the sentences for naturalness on a 4-point
ordinal scale: bad (1.0), not very good (2.0),
mostly good (3.0), and good (4.0). We recruit 20
annotators for each HIT.
In the first experiment we present only the tar-
get sentences, without any context. In the second
experiment, we first show the context paragraph
(three preceding sentences of the target sentence),
and ask users to select
the most appropriate
description of its topic from a list of four candi-
date topics. Each candidate topic is represented by
three words produced by a topic model.4 Note that
the context paragraph consists of original English
sentences which did not undergo translation. Once
the users have selected the topic, they move to the
next screen where they rate the target sentence for
naturalness.5 The third experiment has the same
format as the second, except that the three sen-
tences presented prior to rating are randomly sam-
pled from another Wikipedia article.6 We require
annotators to perform a topic identification task
prior to rating the target sentence to ensure that
they read the context before making acceptability
judgments.
For each sentence, we aggregate the ratings
from multiple annotators by taking the mean.
Henceforth we refer to the mean ratings collected
from the first (no context), second (real context),
and third (random context) experiments as H∅,
2We use the pre-trained Moses models from http://
www.statmt.org/moses/RELEASE-4.0/models/
for translation.
3https://www.mturk.com/.
4We train a topic model with 50 topics on 15 K Wikipedia
documents with Mallet (McCallum, 2002) and infer topics
for the context paragraphs based on the trained model.
5Note that we do not ask the users to judge the naturalness
of the sentence in context; the instructions they see for the
naturalness rating task are the same as in the first experiment.
6Sampled sentences are sequential, running sentences.
H+, and H−, respectively. We rolled out the
experiments on AMT over several weeks and pre-
vented users from doing more than one exper-
iment. Therefore a disjoint group of annotators
performed each experiment.
To control for quality, we check that users are
rating the English sentences ≥ 3.0 consistently.
For the second and third experiments, we also
check that users are selecting the topics appro-
priately. In each HIT one context paragraph has
one real topic (from the topic model), and three
fake topics with randomly sampled words as the
candidate topics. Users who fail to identify the
real topic above a confidence level are filtered out.
Across the three experiments, over three quarters
of workers passed our filtering conditions.
To calibrate for the differences in rating scale
between users, we follow the postprocessing
procedure of Hill et al. (2015), where we calculate
the average rating for each user and the overall
average (by taking the mean of all average ratings),
and decrease (increase) the ratings of a user by 1.0
if their average rating is greater (smaller) than the
overall average by 1.0.7 To reduce the impact of
outliers, for each sentence we also remove ratings
that are more than 2 standard deviations away
from the mean.8
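A minimal sketch of this calibration and outlier-filtering procedure, assuming the raw judgments are available as (user, sentence, rating) triples; the function and variable names are ours, and the 1.0 threshold is read off the description above.

```python
import numpy as np
from collections import defaultdict

def aggregate_ratings(triples):
    """triples: iterable of (user_id, sentence_id, rating) on the 1.0-4.0 scale.
    Returns the calibrated, outlier-filtered mean rating per sentence."""
    by_user = defaultdict(list)
    for user, _, r in triples:
        by_user[user].append(r)
    user_mean = {u: np.mean(rs) for u, rs in by_user.items()}
    overall = np.mean(list(user_mean.values()))

    def shift(user):
        # Decrease (increase) a user's ratings by 1.0 if their average is more
        # than 1.0 above (below) the overall average.
        if user_mean[user] - overall > 1.0:
            return -1.0
        if overall - user_mean[user] > 1.0:
            return 1.0
        return 0.0

    per_sentence = defaultdict(list)
    for user, sent, r in triples:
        per_sentence[sent].append(r + shift(user))

    means = {}
    for sent, rs in per_sentence.items():
        rs = np.array(rs)
        # Drop ratings more than 2 standard deviations from the sentence mean.
        keep = np.abs(rs - rs.mean()) <= 2 * rs.std()
        means[sent] = rs[keep].mean()
    return means
```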
2.2 Results and Discussion
We present scatter plots to compare the mean
ratings for the three different contexts (H∅, H+,
and H−) in Figure 1. The black line represents the
diagonal, and the red line represents the regression
line. In general, the mean ratings correlate strongly
with each other. Pearson's r for H+ vs. H∅ = 0.940,
H− vs. H∅ = 0.911, and H− vs. H+ = 0.891.
The regression (red) and diagonal (black) lines
in H+ vs. H∅ (Figure 1a) show a compression
effect. Bad sentences appear a little more natural,
and perfectly good sentences become slightly
less natural, when context is introduced.9 This
is the same compression effect observed by
7No worker has an average rating that is greater or smaller
than the overall average by 2.0.
8This postprocessing procedure discarded a total of 504
annotations/ratings (approximately 3.9%) over the 3 experi-
ments. The final average number of annotations for a sentence
in the first, second, and third experiments is 16.4, 17.8, and
15.3, respectively.
9On average, good sentences (ratings ≥ 3.5) observe a
rating reduction of 0.08 and bad sentences (ratings ≤ 1.5) an
increase of 0.45.
Bernardy et al. (2018). It is also present in the
graph for H− vs. H∅ (Figure 1b).
Two explanations of the compression effect
seem plausible to us. The first is a discourse
coherence hypothesis that takes this effect to be
caused by a general tendency to find infelicitous
sentences more natural in context. This hypothesis,
however, does not explain why perfectly natural
sentences appear less acceptable in context. The
second hypothesis is a variant of a cognitive load
account. In this view, interpreting context imposes
a significant burden on a subject’s processing
resources, and this reduces their focus on the
sentence presented for acceptability judgments. At
the extreme ends of the rating scale, as they require
all subjects to be consistent in order to achieve the
minimum/maximum mean rating, the increased
cognitive load increases the likelihood of a subject
making a mistake. This increases/lowers the mean
rating, and creates a compression effect.
The discourse coherence hypothesis would
imply that the compression effect should appear
with real contexts, but not with random ones,
as there is little connection between the target
sentence and a random context. In contrast, the
cognitive load account predicts that the effect
should be present in both types of context, as it
depends only on the processing burden imposed
by interpreting the context. We see compression
in both types of contexts, which suggests that
the cognitive load hypothesis is the more likely
account.
However, these two hypotheses are not
mutually exclusive. It is, in principle, possible that
both effects—discourse coherence and cognitive
load—are exhibited when context is introduced.
To better understand the impact of discourse
coherence, consider Figure 1c, where we compare
H− vs. H+. Here the regression line is parallel to
and below the diagonal, implying that there is a
consistent decrease in acceptability ratings from
H+ to H−. As both ratings are collected with some
form of context, the cognitive load confound is
removed. What remains is a discourse coherence
effect. Sentences presented in relevant contexts
undergo a consistent increase in acceptability
rating.
To analyze the significance of this effect, we
use the non-parametric Wilcoxon signed-rank test
(one-tailed) to compare the difference between
H+ and H−. This gives a p-value of 1.9 × 10−8,
Figure 1: Scatter plots comparing human acceptability ratings.
indicating that the discourse coherence effect is
significant.
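A sketch of this significance test with SciPy (our own illustration); h_plus and h_minus are assumed to be arrays of mean ratings aligned by sentence.

```python
import numpy as np
from scipy.stats import wilcoxon

def discourse_coherence_test(h_plus, h_minus):
    """One-tailed paired Wilcoxon signed-rank test of whether ratings in
    relevant contexts (H+) are higher than in random contexts (H-)."""
    h_plus, h_minus = np.asarray(h_plus), np.asarray(h_minus)
    return wilcoxon(h_plus, h_minus, alternative="greater")
```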
Returning to Figures 1a and 1b, we can see
that (1) the offset of the regression line, and (2)
the intersection point of the diagonal and the
regression line, is higher in Figure 1a than in
Figura 1b. This suggests that there is an increase
of ratings, and so, in addition to the cognitive load
effect, a discourse coherence effect is also at work
in the real context setting.
We performed hypothesis tests to compare the
regression lines in Figures 1a and 1b to see if
their offsets (constants) and slopes (coefficients)
are statistically different.10 The p-value for the
offset is 1.7 × 10−2, confirming our qualitative
observation that there is a significant discourse
coherence effect. The p-value for the slope,
however, is 3.6 × 10−1, suggesting that cognitive
load compresses the ratings in a consistent way
for both H+ and H−, relative to H∅.
To conclude, our experiments reveal that con-
text induces a cognitive load for human process-
ing, and this has the effect of compressing the
acceptability distribution. It moderates the ex-
tremes by making very unnatural sentences appear
more acceptable, and perfectly natural sentences
slightly less acceptable. If the context is relevant to
the target sentence, then we also have a discourse
coherence effect, where sentences are perceived
to be generally more acceptable.
10We follow the procedure detailed in https://
statisticsbyjim.com/regression/comparing-
regression-lines/ where we collate the data points
in Figures 1a and 1b and treat the in-context ratings (H+
and H−) as the dependent variable, the out-of-context ratings
(H∅) as the first independent variable, and the type of the
context (real or random) as the second independent variable,
to perform regression analyses. The significance of the offset
and slope can be measured by interpreting the p-values of
the second independent variable, and the interaction between
the first and second independent variables, respectively.
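The regression comparison in footnote 10 amounts to testing the offset and interaction terms of a single pooled regression; a sketch with statsmodels (our choice of library), where df is assumed to hold one row per sentence and context type, with columns in_context, out_of_context, and context_type.

```python
import statsmodels.formula.api as smf

def compare_regression_lines(df):
    """df: pandas DataFrame with columns
       in_context      - mean rating in context (H+ or H-)
       out_of_context  - mean rating without context (H0)
       context_type    - 'real' or 'random'
    The p-value of C(context_type) tests whether the offsets differ; the
    p-value of the interaction term tests whether the slopes differ."""
    model = smf.ols("in_context ~ out_of_context * C(context_type)", data=df).fit()
    return model.summary()
```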
3 Modeling Acceptability
In this section, we explore computational models
to predict human acceptability ratings. We are
interested in models that do not rely on explicit
supervision (i.e., we do not want to use the
acceptability ratings as labels in the training data).
Our motivation here is to understand the extent
to which sentence probability, estimated by an
unsupervised model, can provide the basis for
predicting sentence acceptability.
To this end, we train language models
(Section 3.1) using unsupervised objectives (e.g.,
next word prediction), and use these models
to infer the probabilities of our test sentences.
To accommodate sentence length and lexical
frequency we experiment with several simple
normalization methods, converting probabilities
to acceptability measures (Section 3.2). The
acceptability measures are the final output of our
models; they are what we use to compare to human
acceptability ratings.
3.1 Language Models
Our first model is an LSTM language model (LSTM:
Hochreiter and Schmidhuber, 1997; Mikolov
et al., 2010). Recurrent neural network models
(RNNs) have been shown to be competitive in this
task (Lau et al., 2015; Bernardy et al., 2018), and
they serve as our baseline.
Our second model is a joint topic and language
model (TDLM: Lau et al., 2017a). TDLM combines
a topic model with a language model in a single
model, drawing on the idea that the topical con-
text of a sentence can help word prediction in
the language model. The topic model is fashioned
as an auto-encoder, where the input is the docu-
ment’s word sequence and it is processed by
convolutional layers to produce a topic vector
to predict the input words. The language model
functions like a standard LSTM model, but
incorporates the topic vector (generated by its
document context) into the current hidden state to
predict the next word.
We train LSTM and TDLM on 100K uncased
English Wikipedia articles containing approxi-
mately 40M tokens with a vocabulary of 66K
words.11
Next we explore transformer-based models, as
they have become the benchmark for many NLP
tasks in recent years (Vaswani et al., 2017; Devlin
et al., 2019; Yang et al., 2019). The transformer
models that we use are trained on a much larger
corpus, and they are four to five times larger with
respect to their model parameters.
Our first transformer is GPT2 (Radford et al.,
2019). Given a target word, the input is a sequence
of previously seen words, which are then mapped
to embeddings (along with their positions) y
fed to multiple layers of ‘‘transformer blocks’’
before the target word is predicted. Much of its
power resides in these transformer blocks: Each
provides a multi-headed self-attention unit over
all input words, allowing it to capture multiple
dependencies between words, while avoiding the
need for recurrence. With no need to process a
sentence in sequence, the model parallelizes more
efficiently, and scales in a way that RNNs cannot.
GPT2 is trained on WebText, which consists of
encima 8 million web documents, and uses Byte
Pair Encoding (BPE: Sennrich et al., 2016) for
tokenization (casing preserved). BPE produces
sub-word units, a middle ground between word
and character, and it provides better coverage for
unseen words. We use the released medium-sized
modelo (‘‘Medium’’) for our experiments.12
Our second transformer is BERT (Devlin et al.,
2019). Unlike GPT2, BERT is not a typical language
modelo, in the sense that it has access to both
left and right context words when predicting the
target word.13 Hence, it encodes context in a
bidirectional manner.
To train BERT, Devlin et al. (2019) propose
a masked language model objective, where a
random proportion of input words are masked
11We use Stanford CoreNLP (Manning et al., 2014) a
tokenize words and sentences. Rare words are replaced by a
special UNK symbol.
12https://github.com/openai/gpt-2.
13Note that context is burdened with two senses in the
paper. It can mean the preceding sentences of a target sen-
tence, or the neighbouring words of a target word. El
intended sense should be apparent from the usage.
and the model is tasked to predict them based on
non-masked words. In addition to this objective,
BERT is trained with a next sentence prediction
objetivo, where the input is a pair of sentences,
and the model’s goal is to predict whether the
latter sentence follows the former. This objective
is added to provide pre-training for downstream
tasks that involve understanding the relationship
between a pair of sentences (e.g., machine com-
prehension and textual entailment).
The bidirectionality of BERT is the core feature
that produces its state-of-the-art performance on
a number of tasks. The flipside of this encoding
style, sin embargo, is that BERT lacks the ability to
generate left-to-right and compute sentence prob-
ability. We discuss how we use BERT to produce
a probability estimate for sentences in the next
section (Section 3.2).
In our experiments, we use the largest pre-
trained model (''BERT-Large''),14 which has a
similar number of parameters (340M) to GPT2. It is
trained on Wikipedia and BookCorpus (Zhu et al.,
2015), where the latter is a collection of fiction
books. Like GPT2, BERT also uses sub-word token-
ization (WordPiece). We experiment with two
variants of BERT: one trained on cased data (BERTCS),
and another on uncased data (BERTUCS). As our
test sentences are uncased, a comparison between
these two models allows us to gauge the impact of
casing in the training data.
Our last transformer model is XLNET (Yang et al.,
2019). XLNET is unique in that it applies a novel
permutation language model objective, allowing it
to capture bidirectional context while preserving
key aspects of unidirectional language models
(e.g., left-to-right generation).
The permutation language model objective
works by first generating a possible permutation
(also called ‘‘factorization order’’) of a sequence.
When predicting a target word in the sequence,
the context words that the model has access to are
determined by the factorization order. To illustrate
este, imagine we have the sequence x = [x1, x2,
x3, x4]. One possible factorization order is: x3 →
x2 → x4 → x1. Given this order, if predicting
target word x4, the model only has access to
context words {x3, x2}; if the target word is x2,
it sees only {x3}. In practice, the target word is
set to be the last few words in the factorization
14https://github.com/google-research/bert.
Model     Architecture  Encoding  #Param.  Casing   Training Data Size  Tokenization   Training Corpus
LSTM      RNN           Unidir.   60M      Uncased  0.2GB               Word           Wikipedia
TDLM      RNN           Unidir.   80M      Uncased  0.2GB               Word           Wikipedia
GPT2      Transformer   Unidir.   340M     Cased    40GB                BPE            WebText
BERTCS    Transformer   Bidir.    340M     Cased    13GB                WordPiece      Wikipedia, BookCorpus
BERTUCS   Transformer   Bidir.    340M     Uncased  13GB                WordPiece      Wikipedia, BookCorpus
XLNET     Transformer   Hybrid    340M     Cased    126GB               SentencePiece  Wikipedia, BookCorpus, Giga5, ClueWeb, Common Crawl

Table 1: Language models and their configurations.
order (e.g., x4 and x1), and so the model always
sees some context words for prediction.
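A small sketch (ours, not XLNet code) of the visibility pattern induced by a factorization order: each target position may only attend to positions that come earlier in the sampled permutation, mirroring the x3 → x2 → x4 → x1 example above.

```python
import numpy as np

def permutation_visibility(order):
    """order: a permutation of 0-indexed positions, e.g. [2, 1, 3, 0] encodes
    x3 -> x2 -> x4 -> x1. Returns a boolean matrix M where M[i, j] is True
    if the prediction at position i may attend to position j."""
    rank = {pos: k for k, pos in enumerate(order)}
    n = len(order)
    return np.array([[rank[j] < rank[i] for j in range(n)] for i in range(n)])

# Predicting x4 (index 3) sees {x3, x2}; predicting x2 (index 1) sees only {x3}.
print(permutation_visibility([2, 1, 3, 0]))
```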
As XLNET is trained to work with different
factorization orders during training, it has expe-
rienced both full/bidirectional context and partial/
unidirectional context, allowing it to adapt to tasks
that have access to full context (e.g., most language
understanding tasks), as well as those that do not
(e.g., left-to-right generation).
Another innovation of XLNET is that it in-
corporates the segment recurrence mechanism of
Dai et al. (2019). This mechanism is inspired by
truncated backpropagation through time used for
training RNNs, where the initial state of a sequence
is initialized with the final state from the previous
secuencia. The segment recurrence mechanism
works in a similar way, by caching the hidden
states of the transformer blocks from the previous
secuencia, and allowing the current sequence to
attend to them during training. This permits XLNET
to model long-range dependencies beyond its
maximum sequence length.
We use the largest pre-trained model (‘‘XLNet-
Large’’),15 which has a similar number of param-
eters to our BERT and GPT2 models (340M). XLNET
is trained on a much larger corpus combining
Wikipedia, BookCorpus, news and web articles.
For tokenization, XLNET uses SentencePiece
(Kudo and Richardson, 2018), another sub-word
tokenization technique. Like GPT2, XLNET is trained
on cased data.
Table 1 summarizes the language models. In
general, the RNN models are orders of magnitude
smaller than the transformers in both model
parameters and training data, although they are
trained on the same domain (Wikipedia), and use
uncased data, like the test sentences. The RNN
models also operate on a word level, whereas the
transformers use sub-word units.
15https://github.com/zihangdai/xlnet.
3.2 Probability and Acceptability Measure
Given a unidirectional language model, we can
infer the probability of a sentence by multiplying
the estimated probabilities of each token using
previously seen (left) words as context (Bengio
et al., 2003):

$\overrightarrow{P}(s) = \prod_{i=0}^{|s|} P(w_i \mid w_{<i})$   (1)

For bidirectional models we instead condition each
target word on all of the other words in the sentence,
to its left and to its right:16

$\overleftrightarrow{P}(s) = \prod_{i=0}^{|s|} P(w_i \mid w_{\neq i})$   (2)
With this formulation, we allow BERT to have
access to both left and right context words
when predicting each target word, since this
is consistent with the way in which it was
trained. It is important to note, however, that
sentence probability computed this way is not
a true probability value: These probabilities do
not sum to 1.0 over all sentences. Equation (1),
in contrast, does guarantee true probabilities.
Intuitively, the sentence probability computed
with this bidirectional formulation is a measure
16Technically we can mask all right context words and
predict the target words one at a time, but because the model
is never trained in this way, we found that it performs poorly
in preliminary experiments.
of the model’s confidence in the likelihood of the
oración.
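The two formulations can be sketched as follows; this is an illustrative sketch using the current Hugging Face transformers API rather than the pytorch-transformers package used in the paper, and it omits the length and frequency normalizations introduced below.

```python
import torch
from transformers import (GPT2LMHeadModel, GPT2TokenizerFast,
                          BertForMaskedLM, BertTokenizerFast)

def unidirectional_logprob(sentence, model_name="gpt2-medium"):
    """log P(s) under Equation (1): sum of log P(w_i | w_<i); the first
    token is treated as given."""
    tok = GPT2TokenizerFast.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name).eval()
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    return logprobs.gather(1, ids[0, 1:].unsqueeze(1)).sum().item()

def bidirectional_logprob(sentence, model_name="bert-large-uncased"):
    """Non-normalized score under Equation (2): sum of log P(w_i | w_!=i),
    masking one position at a time."""
    tok = BertTokenizerFast.from_pretrained(model_name)
    model = BertForMaskedLM.from_pretrained(model_name).eval()
    ids = tok(sentence, return_tensors="pt").input_ids[0]
    total = 0.0
    for i in range(1, len(ids) - 1):  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tok.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits
        total += torch.log_softmax(logits[0, i], dim=-1)[ids[i]].item()
    return total
```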
To compute the true probability, Wang and
Cho (2019) show that we need to sum the
pre-softmax weights for each token to score a
oración, and then divide the score by the total
score of all sentences. As it is impractical to
compute the total score of all sentences (un
infinite set), the true sentence probabilities for
these bidirectional models are intractable. We use
our non-normalized confidence scores as stand-ins
for these probabilities.
For XLNET, we also compute sentence probab-
ility this way, applying bidirectional context, and
we denote it as XLNETBI. Note that XLNETUNI and
XLNETBI are based on the same trained model.
They differ only in how they estimate sentence
probability at test time.
Sentence probability (estimated either using
unidirectional or bidirectional context) is affected
by its length (p.ej., longer sentences have lower
probabilities), and word frequency (e.g., the cat is
big vs. the yak is big). To modulate for these
factors we introduce simple normalization tech-
niques. Table 2 presents five methods to map
sentence probabilities to acceptability measures:
LP, MeanLP, PenLP, NormLP, and SLOR.
LP is the unnormalized log probability. Both
MeanLP and PenLP are normalized on sentence
length, but PenLP scales length with an exponent
(α) to dampen the impact of large values (Wu et al.,
2016; Vaswani et al., 2017). We set α = 0.8 in our
experiments. NormLP normalizes using unigram
sentence probability (i.e., $P_U(s) = \prod_{i=0}^{|s|} P(w_i)$),
while SLOR utilizes both length and unigram
probability (Pauls and Klein, 2012).
When computing sentence probability we have
the option of including the context paragraph that
the human annotators see (Section 2). We use the
superscripts ∅, +, − to denote a model using no
context, real context, and random context, respect-
ively (e.g., LSTM∅, LSTM+, and LSTM−). Note that
these variants are created at test time, and are all
based on the same trained model (e.g., LSTM).
For all models except TDLM, incorporating the
context paragraph is trivial. We simply prepend it
to the target sentence before computing the latter’s
probability. For TDLM+ or TDLM−,
the context
paragraph is treated as the document context,
from which a topic vector is inferred and fed to
Acc. Measure   Equation
LP             log P(s)
MeanLP         log P(s) / |s|
PenLP          log P(s) / ((5 + |s|) / (5 + 1))^α
NormLP         −log P(s) / log P_U(s)
SLOR           (log P(s) − log P_U(s)) / |s|

Table 2: Acceptability measures for predicting
the acceptability of a sentence; P(s) is the sen-
tence probability, computed using Equation (1)
or Equation (2) depending on the model; P_U(s)
is the sentence probability estimated by a unigram
language model; and α = 0.8.
the language model for next-word prediction. For
TDLM∅, we set the topic vector to zeros.
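The five measures in Table 2 reduce to a few lines of arithmetic; in this sketch, logp is log P(s) from the language model, logp_u is log P_U(s) from the unigram model, and n is the sentence length |s|.

```python
def acceptability_measures(logp, logp_u, n, alpha=0.8):
    """Map a sentence log probability to the acceptability measures of Table 2."""
    return {
        "LP": logp,
        "MeanLP": logp / n,
        "PenLP": logp / (((5 + n) / (5 + 1)) ** alpha),
        "NormLP": -logp / logp_u,
        "SLOR": (logp - logp_u) / n,
    }
```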
3.3 Implementation
For the transformer models (GPT2, BERT, and
XLNET), we use the implementation of pytorch-
transformers.17
XLNET requires a long dummy context prepended
to the target sentence for it to compute the sentence
probability properly.18 Other researchers have
found a similar problem when using XLNET for
generation.19 We think that this is likely due
to XLNET’s recurrence mechanism (Sección 3.1),
where it has access to context from the previous
sequence during training.
For TDLM, we use the implementation provided
by Lau et al. (2017a),20 following their optimal
hyper-parameter configuration without tuning.
We implement LSTM based on Tensorflow’s
Penn Treebank language model.21 In terms of
17https://github.com/huggingface/pytorch-
transformers. Specifically, we employ the following
pre-trained models: gpt2-medium for GPT2, bert-large-
cased for BERTCS, bert-large-uncased for BERTUCS,
and xlnet-large-cased for XLNETUNI/XLNETBI.
18In the scenario where we include the context paragraph
(e.g., XLNET+UNI), the dummy context is added before it.
19https://medium.com/@amanrusia/xlnet-speaks-
comparison-to-gpt-2-ea1a4e9ba39e.
20https://github.com/jhlau/topically-driven-
language-model.
21https://github.com/tensorflow/models/
blob/master/tutorials/rnn/ptb/ptb word lm.py.
hyper-parameters, we follow the configuration of
TDLM where applicable. TDLM uses Adam as the
optimizer (Kingma and Ba, 2014), but for LSTM
we use Adagrad (Duchi et al., 2011), as it produces
better development perplexity.
For NormLP and SLOR, we need to compute
PU(s), the sentence probability based on a unigram
language model. As the language models are
trained on different corpora, we collect unigram
counts based on their original training corpus. That
is, for LSTM and TDLM, we use the 100K English
Wikipedia corpus. For GPT2, we use an open
source implementation that reproduces the origi-
nal WebText data.22 For BERT we use the full
Wikipedia collection and crawl smashwords.com
to reproduce BookCorpus.23 Finally, for
XLNET we use the combined set of Wikipedia,
WebText, and BookCorpus.24
Source code for our experiments is publicly
available at: https://github.com/jhlau/
acceptability-prediction-in-context.
3.4 Results and Discussion
We use Pearson’s r to assess how well the models’
acceptability measures predict mean human ac-
ceptability ratings, following previous studies
(Lau et al., 2017b; Bernardy et al., 2018).
Recall that for each model (e.g., LSTM), there are
three variants with which we infer the sentence
probability at test time. These are distinguished
by whether we include no context (LSTM∅), real
context (LSTM+), or random context (LSTM−). There
are also three types of human acceptability ratings
(ground truth), where sentences are judged with
no context (H∅), real context (H+), and random
context (H−). We present the full results in Table 3.
22https://skylion007.github.io/OpenWebTextCorpus/.
23We use the scripts in https://github.com/soskek/bookcorpus to reproduce BookCorpus.
24XLNET also uses Giga5 and ClueWeb as part of its training data, but we think that our combined collection is sufficiently large to be representative of the original training data.
To get a sense of what the correlation figures
indicate for these models, we compute two human
performance estimates to serve as upper bounds
on the accuracy of a model. The first upper bound
(UB1) is the one-vs-rest annotator correlation,
where we select a random annotator's rating and
compare it to the mean rating of the rest, using
Pearson's r. We repeat this for a large number of
trials (1,000) to get a robust estimate of the mean
correlation. UB1 can be interpreted as the average
human performance working in isolation. The
second upper bound (UB2) is the half-vs.-half
annotator correlation. For each sentence we ran-
domly split the annotators into two groups, and
compare the mean rating between groups, again
using Pearson's r and repeating it (1,000 times)
to get a robust estimate. UB2 can be taken as
the average human performance working collab-
oratively. In general, the simulated human perfor-
mance is fairly consistent over context types
(Table 3); for example, UB1 = 0.75, 0.73, and 0.75
for H∅, H+, and H−, respectively.
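A sketch of the two simulated upper bounds (our own illustration); ratings_per_sentence is assumed to be a list in which each element holds the individual annotator ratings for one sentence.

```python
import numpy as np
from scipy.stats import pearsonr

def upper_bounds(ratings_per_sentence, trials=1000, seed=0):
    """UB1: one-vs-rest annotator correlation; UB2: half-vs-half correlation.
    Both are averaged over random splits of the annotators."""
    rng = np.random.default_rng(seed)
    ub1, ub2 = [], []
    for _ in range(trials):
        one, rest, half_a, half_b = [], [], [], []
        for ratings in ratings_per_sentence:
            ratings = rng.permutation(ratings)
            one.append(ratings[0])
            rest.append(ratings[1:].mean())
            mid = len(ratings) // 2
            half_a.append(ratings[:mid].mean())
            half_b.append(ratings[mid:].mean())
        ub1.append(pearsonr(one, rest)[0])
        ub2.append(pearsonr(half_a, half_b)[0])
    return float(np.mean(ub1)), float(np.mean(ub2))
```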
When we postprocess the user ratings, re-
member that we remove the outlier ratings
(≥ 2 standard deviations) for each sentence
(Section 2.1). Although this produces a cleaner set
of annotations, this filtering step does (artificially)
increase the human agreement or upper bound
correlations. For completeness we also present
upper bound variations where we do not remove
the outlier ratings, and denote them as UB∅1 and
UB∅2. In this setup, the one-vs.-rest correlations
drop to 0.62–0.66 (Table 3). Note that all model
performances are reported based on the outlier-
filtered ratings, although there are almost no
perceivable changes to the performances when
they are evaluated on the outlier-preserved ground
truth.
Looking at Table 3, the models’ performances
are fairly consistent over different types of ground
truths (H∅, H+, and H−). This is perhaps not
very surprising, as the correlations among the
human ratings for these context types are very
high (Section 2).
We now focus on the results with H∅ as ground
truth (''Rtg'' = H∅). SLOR is generally the best
acceptability measure for unidirectional models,
with NormLP not far behind (the only exception
is GPT2∅). The recurrent models (LSTM and TDLM)
are very strong compared with the much larger
transformer models (GPT2 and XLNETUNI). In fact
TDLM has the best performance when context is
not considered (TDLM∅, SLOR = 0.61), suggesting
that model architecture may be more important
than the number of parameters and the amount of
training data.
For bidirectional models, the unnormalized LP
works very well. The clear winner here, however,
[Table 3: full results grid. Rows: each language model (LSTM, TDLM, GPT2, XLNETUNI, BERTCS, BERTUCS, XLNETBI) in its ∅ and + context variants for the H∅ and H+ ground truths, and in its ∅ and − variants for the H− ground truth, together with the upper bounds UB1, UB∅1, UB2, and UB∅2. Columns: the acceptability measures LP, MeanLP, PenLP, NormLP, and SLOR (Pearson's r against the mean human ratings).]
Table 3: Modeling results. Boldface indicates optimal performance in each row.
is PenLP. It substantially and consistently out-
performs all other acceptability measures. The
strong performance of PenLP that we see here
illuminates its popularity in machine translation
for beam search decoding (Vaswani et al., 2017).
With the exception of PenLP, the gain from
normalization for the bidirectional models is
small, but we don't think this can be attributed
to the size of models or training corpora, as the
large unidirectional models (GPT2 and XLNETUNI)
still benefit from normalization. The best model
without considering context is BERT∅UCS with a
correlation of 0.70 (PenLP), which is very close
to the idealized single-annotator performance UB1
(0.75) and surpasses the unfiltered performance
UB∅1 (0.66), creating a new state-of-the-art for
unsupervised acceptability prediction (Lau et al.,
2015, 2017b; Bernardy et al., 2018). There is still
room to improve, however, relative to the collab-
orative UB2 (0.92) or UB∅2 (0.88) upper bounds.
We next look at the impact of incorporating
context at test time for the models (e.g., LSTM∅ vs.
LSTM+ or BERT∅UCS vs. BERT+UCS). To ease interpret-
ability we will focus on SLOR for unidirectional
models, and PenLP for bidirectional models.
Generally, we see that incorporating context
always improves correlation, for both cases where
we use H∅ and H+ as ground truths, suggesting that
context is beneficial when it comes to sentence
modeling. The only exception is TDLM, where
TDLM∅ and TDLM+ perform very similarly. Note,
however, that context is only beneficial when it
is relevant. Incorporating random contexts (e.g.,
LSTM∅ vs. LSTM− or BERT∅UCS vs. BERT−UCS with H− as
ground truth) reduces the performance for all
models.25
Recall that our test sentences are uncased
(an artefact of Moses, the machine translation
system that we use). Whereas the recurrent models
are all trained on uncased data, most of the
transformer models are trained with cased data.
BERT is the only transformer that is pre-trained
on both cased (BERTCS) and uncased data (BERTUCS).
To understand the impact of casing, we look
at the performance of BERTCS and BERTUCS with
H∅ as ground truth. We see an improvement
of 5–7 points (depending on whether context is
incorporated), which suggests that casing has a
significant impact on performance. Given that
XLNET+BI already outperforms BERT+UCS (0.73 vs.
0.72), even though XLNET+BI is trained with cased
data, we conjecture that an uncased XLNET is
likely to outperform BERT∅UCS when context is not
considered.
25There is one exception: XLNET∅BI (0.62) vs. XLNET−BI (0.64).
As we saw previously in Section 3.3, XLNET requires a long
dummy context to work, and so this observation is perhaps
unsurprising, because it appears that context—whether it is
relevant or not—seems to always benefit XLNET.
To summarize, our first important result is the
exceptional performance of bidirectional models.
It raises the question of whether left-to-right bias is
an appropriate assumption for predicting sentence
acceptability. One could argue that this result
may be due to our experimental setup. Users
are presented with the sentence in text, and they
have the opportunity to read it multiple times,
thereby creating an environment that may simulate
bidirectional context. We could test this conjecture
by changing the presentation of the sentence,
displaying it one word at a time (with older
words fading off), or playing an audio version
(e.g., via a text-to-speech system). However, these
changes will likely introduce other confounds
(e.g., prosody), but we believe it is an interesting
avenue for future work.
Our second result is more tentative. Our experi-
ments seem to indicate that model architecture is
more important than training or model size. We
see that TDLM, which is trained on data orders
of magnitude smaller and has model parameters
four times smaller in size (Mesa 1), outperforms
the large unidirectional transformer models. A
establish this conclusion more firmly we will need
to rule out the possibility that the relatively good
performance of LSTM and TDLM is not due to a
cleaner (e.g., lowercased) or more relevant (e.g.,
Wikipedia) training corpus. With that said, we
contend that our findings motivate the construc-
tion of better language models, instead of increas-
ing the number of parameters, or the amount of
training data. It would be interesting to examine
the effect of extending TDLM with a bidirectional
objetivo.
Our final result is that our best model, BERTUCS,
attains a human-level performance and achieves
a new state-of-the-art performance in the task of
unsupervised acceptability prediction. Given this
level of accuracy, we expect it would be suitable
for tasks like assessing student essays and the
quality of machine translations.
4 Linguists’ Examples
One may argue that our dataset is potentially
biased, as round-trip machine translation may in-
troduce particular types of infelicities or unusual
features to the sentences (Graham et al., 2019).
Lau et al. (2017b) addressed this by creating a
dataset where they sample 50 grammatical and
50 ungrammatical sentences from Adger (2003)'s
syntax textbook, and run a crowdsourced ex-
periment to collect their user ratings. Lau et al.
(2017b) found that their unsupervised
language models (e.g., simple recurrent networks)
predict the acceptability of these sentences with
similar performances, providing evidence that
their modeling results are robust.
We test our pre-trained models using this
linguist-constructed dataset, and found similar
observations: GPT2, BERTCS, and XLNETBI produce a
PenLP correlation of 0.45, 0.53, and 0.58, respect-
ively. These results indicate that these language
models are able to predict the acceptability of
these sentences reliably, consistent with our mod-
eling results with round-trip translated sentences
(Section 3.4). Although the correlations are gen-
erally lower, we want to highlight that these
linguists' examples are artificially constructed to
illustrate specific syntactic phenomena, and so
this constitutes a particularly strong case of out-
of-domain prediction. These texts are substantially
different in nature from the natural text that the
pre-trained language models are trained on (e.g.,
the linguists' examples are much shorter—less
than 7 words on average—than the natural texts).
5 Related Work
Acceptability is closely related to the concept
of grammaticality. The latter is a theoretical
construction corresponding to syntactic well-
formedness, and it is typically interpreted as a
binary property (i.e., a sentence is either gram-
matical or ungrammatical). Acceptability, on the
other hand, includes syntactic, semantic, prag-
matic, and non-linguistic factors, such as sentence
length. It is gradient, rather than binary, in nature
(Denison, 2004; Sorace and Keller, 2005; Sprouse,
2007).
Linguists and other theorists of language have
traditionally assumed that context affects our per-
ception of both grammaticality (Bolinger, 1968)
and acceptability (Bever, 1970), but surprisingly
little work investigates this effect systematically,
or on a large scale. Most formal linguists rely
heavily on the analysis of sentences taken in
isolation. However, many linguistic frameworks
seek to incorporate aspects of context-dependence.
Dynamic theories of semantics (Heim, 1982;
Kamp and Reyle, 1993; Groenendijk and Stokhof,
1990) attempt to capture intersentential corefer-
ence, binding, and scope phenomena. Dynamic
Syntax (Cann et al., 2007) uses incremental
tree construction and semantic type projection to
render parsing and interpretation discourse depen-
dent. Theories of discourse structure characterize
sentence coherence in context through rhetori-
cal relations (Mann y Thompson, 1988; Asher
and Lascarides, 2003), or by identifying open
questions and common ground (Ginzburg, 2012).
While these studies offer valuable insights into a
variety of context related linguistic phenomena,
much of it takes grammaticality and acceptabil-
ity to be binary properties. Moreover, it is not
formulated in a way that permits fine-grained
psychological experiments, or wide coverage
computational modeling.
Psycholinguistic work can provide more ex-
perimentally grounded approaches. Greenbaum
(1976) found that combinations of particular syn-
tactic constructions in context affect human judg-
ments of acceptability, although the small scale
of the experiments makes it difficult to draw
general conclusions. More recent work investi-
gates related effects, but it tends to focus on very
restricted aspects of the phenomenon. For exam-
ple, Zlogar and Davidson (2018) investigate the
influence of context on the acceptability of ges-
tures with speech, focussing on interaction with
semantic content and presupposition. The prim-
ing literature shows that exposure to lexical and
syntactic items leads to higher likelihood of their
repetition in production (Reitter et al., 2011), y
to quicker processing in parsing under certain cir-
cumstances (Giavazzi et al., 2018). Frameworks
such as ACT-R (Anderson, 1996) explain these
effects through the impact of cognitive activation
on subsequent processing. Most of these studies
suggest that coherent or natural contexts should
increase acceptability ratings, given that the lin-
guistic expressions used in processing become
more activated. Warner and Glass (1987) show
that such syntactic contexts can indeed affect
grammaticality judgments in the expected way for
garden path sentences. Cowart (1994) uses com-
parison between positive and negative contexts,
investigating the effect of contexts containing
alternative more or less acceptable sentences. But
he restricts the test cases to specific pronoun
binding phenomena. None of the psycholinguistic
work investigates acceptability judgments in real
textual contexts, over large numbers of test cases
and human subjects.
Some recent computational work explores the
relation of acceptability judgments to sentence
probabilities. Lau et al. (2015, 2017b) show that
the output of unsupervised language models
can correlate with human acceptability ratings.
Warstadt et al. (2018) treat this as a semi-
supervised problem, training a binary classifier
on top of a pre-trained sentence encoder to
predict acceptability ratings with greater accuracy.
Bernardy et al. (2018) explore incorporating
context into such models, eliciting human
judgments of sentence acceptability when the
sentences were presented both in isolation and
within a document context. They find a compres-
sion effect in the distribution of the human
acceptability ratings. Bizzoni and Lappin (2019)
observe a similar effect in a paraphrase accept-
ability task.
One possible explanation for this compression
effect is to take it as the expression of cognitive
load. Psychological research on the cognitive load
effect (Sweller, 1988; Ito et al., 2018; Causse et al.,
2016; Park et al., 2013) indicates that performing
a secondary task can degrade or distort subjects’
performance on a primary task. This could cause
judgments to regress towards the mean. However,
the experiments of Bernardy et al. (2018) and
Bizzoni and Lappin (2019) do not allow us to
distinguish this possibility from a coherence or
priming effect, as only coherent contexts were
considered. Our experimental setup improves on
this by introducing a topic identification task and
incoherent (random) contexts in order to tease the
effects apart.
6 Conclusions and Future Work
We found that processing context
induces a
cognitive load for humans, which creates a
compression effect on the distribution of accept-
ability ratings. We also showed that if the context
is relevant to the sentence, a discourse coherence
effect uniformly boosts sentence acceptability.
Our language model experiments indicate that
bidirectional models achieve better results than
unidirectional models. The best bidirectional
model performs at a human level, defining a new
state-of-the art for this task.
In future work we will explore alternative ways
to present sentences for acceptability judgments.
We plan to extend TDLM, incorporating a bidi-
rectional objective, as it shows significant
promise. It will also be interesting to see if our
observations generalize to other languages, and
to different sorts of contexts, both linguistic and
non-linguistic.
Acknowledgments
We are grateful to three anonymous reviewers for
helpful comments on earlier drafts of this paper.
Some of the work described here was presented
in talks in the seminar of the Centre for Linguistic
Theory and Studies in Probability (CLASP),
University of Gothenburg, December 2019, and in
the Cambridge University Language Technology
Seminar, Febrero 2020. We thank the participants
of both events for useful discussion.
Lappin’s work on the project was supported
by grant 2014-39 from the Swedish Research
Council, which funds CLASP. Armendariz and
Purver were partially supported by the European
Union's Horizon 2020 research and innovation
programme under grant agreement no. 825153,
project EMBEDDIA (Cross-Lingual Embeddings
for Less-Represented Languages in European
News Media). The results of this publication
reflect only the authors’ views and the Com-
mission is not responsible for any use that may be
made of the information it contains.
References
David Adger. 2003. Core Syntax: A Minimalist
Approach, Oxford University Press, United
Kingdom.
John R. Anderson. 1996. ACT: A simple theory
of complex cognition. American Psychologist,
51:355–365.
Nicholas Asher and Alex Lascarides. 2003. Logics
of Conversation, Cambridge University Press.
Yoshua Bengio, Réjean Ducharme, Pascal
Vincent, and Christian Janvin. 2003. A neural
probabilistic language model. The Journal of
Machine Learning Research, 3:1137–1155.
Jean-Philippe Bernardy, Shalom Lappin, and
Jey Han Lau. 2018. The influence of context on
sentence acceptability judgements. In Proceed-
ings of the 56th Annual Meeting of the Asso-
ciation for Computational Linguistics (ACL
2018), pages 456–461. Melbourne, Australia.
Thomas G. Bever. 1970. The cognitive basis
for linguistic structures. In J. R. Hayes, editor,
Cognition and the Development of Language,
Wiley, New York, pages 279–362.
Yuri Bizzoni and Shalom Lappin. 2019. The
effect of context on metaphor paraphrase
aptness judgments. In Proceedings of the 13th
International Conference on Computational
Semantics – Long Papers, pages 165–175.
Gothenburg, Sweden.
Dwight Bolinger. 1968. Judgments of grammati-
cality. Lingua, 21:34–40.
Ronnie Cann, Ruth Kempson, and Matthew
Purver. 2007. Context and well-formedness:
the dynamics of ellipsis. Research on Language
and Computation, 5(3):333–358.
Mickaël Causse, Vsevolod Peysakhovich, and
Eve F. Fabre. 2016. High working memory load
impairs language processing during a simulated
piloting task: An ERP and pupillometry study.
Frontiers in Human Neuroscience, 10:240.
Wayne Cowart. 1994. Anchoring and grammar
effects in judgments of sentence acceptability.
Perceptual and Motor Skills, 79(3):1171–1182.
Zihang Dai, Zhilin Yang, Yiming Yang,
Jaime G. Carbonell, Quoc V. Le, and
Ruslan Salakhutdinov. 2019. Transformer-
XL: Attentive language models beyond a
fixed-length context. CoRR, abs/1901.02860.
David Denison. 2004. Fuzzy Grammar: A Reader,
Oxford University Press, United Kingdom.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In Proceedings of the 2019
Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies, Volume 1
(Long and Short Papers), pages 4171–4186.
Minneapolis, Minnesota.
John Duchi, Elad Hazan, and Yoram Singer.
2011. Adaptive subgradient methods for online
learning and stochastic optimization. Journal of
Machine Learning Research, 12:2121–2159.
Maria Giavazzi, Sara Sambin, Ruth de Diego-
Balaguer, Lorna Le Stanc, Anne-Catherine
Bachoud-L´evi, and Charlotte Jacquemot. 2018.
Structural priming in sentence comprehen-
sión: A single prime is enough. PLoS ONE,
13(4):e0194959.
Jonathan Ginzburg. 2012. The Interactive Stance:
Meaning for Conversation, Universidad de Oxford
Prensa.
Yvette Graham, Barry Haddow, and Philipp
Koehn. 2019. Translationese in machine trans-
lation evaluation. CORR, abs/1906.09833.
Sidney Greenbaum. 1976. Contextual
influ-
ence on acceptability judgements. Lingüística,
15(187):5–12.
Jeroen Groenendijk and Martin Stokhof. 1990.
Dynamic Montague grammar. l. Kalman and
el
l. Polos, editores,
2nd Symposium on Logic and Language,
pages 3–48. Budapest.
En procedimientos de
Irene Heim. 1982. The Semantics of Definite and
Indefinite Noun Phrases. Doctor. tesis, universidad-
sity of Massachusetts at Amherst.
Felix Hill, Roi Reichart, and Anna Korhonen.
2015. SimLex-999: Evaluating semantic mod-
els with (genuine) similarity estimation. Com-
Lingüística putacional, 41:665–695.
Sepp Hochreiter y Jürgen Schmidhuber. 1997.
Memoria larga a corto plazo. Computación neuronal,
9:1735–1780.
Aine Ito, Martin Corley, and Martin J. Pickering.
2018. A cognitive load delays predictive eye
movements similarly during L1 and L2 compre-
hension. Bilingualism: Language and Cognition, 21(2):251–264.
Hans Kamp and Uwe Reyle. 1993. From Dis-
course To Logic, Kluwer Academic Publishers.
Urvashi Khandelwal, He He, Peng Qi, and Dan
Jurafsky. 2018. Sharp nearby, fuzzy far away:
How neural language models use context. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 284–294. Association for Computational Linguistics, Melbourne, Australia.
Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR,
abs/1412.6980.
Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71. Brussels, Belgium.
Jey Han Lau, Timothy Baldwin, and Trevor Cohn.
2017a. Topically driven neural language model.
In Proceedings of the 55th Annual Meeting of
the Association for Computational Linguistics (Volume 1: Long Papers), pages 355–365. Vancouver, Canada.
Jey Han Lau, Alexander Clark, and Shalom Lappin. 2015. Unsupervised prediction of acceptability judgements. In Proceedings of the Joint Conference of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL-IJCNLP 2015), pages 1618–1628. Beijing, China.
Jey Han Lau, Alexander Clark, and Shalom Lappin. 2017b. Grammaticality, Acceptability, and Probability: A Probabilistic View of Linguistic Knowledge. Cognitive Science, 41:1202–1241.
William Mann and Sandra Thompson. 1988. Rhetorical Structure Theory: Toward a functional theory of text organization. Text, 8(3):243–281.
Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.
Andrew Kachites McCallum. 2002. Mallet: A
machine learning for language toolkit. http://
mallet.cs.umass.edu.
Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, pages 1045–1048. Makuhari, Japan.
Hyangsook Park, Jun-Su Kang, Sungmook Choi,
and Minho Lee. 2013. Analysis of cognitive
load for language processing based on brain
activities. In Neural Information Processing,
pages 561–568. Springer Berlin Heidelberg,
Berlin, Heidelberg.
Adam Pauls and Dan Klein. 2012. Large-scale
syntactic language modeling with treelets. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 959–968. Jeju Island, Korea.
Alec Radford, Jeff Wu, Rewon Child, David
Luan, Dario Amodei, and Ilya Sutskever. 2019.
Language models are unsupervised multitask
learners.
David Reitter, Frank Keller, and Johanna D. Moore. 2011. A computational cognitive model of syntactic priming. Cognitive Science,
35(4):587–637.
Rico Sennrich, Barry Haddow, and Alexandra
Birch. 2016. Neural machine translation of rare
words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725. Berlin, Germany.
Antonella Sorace and Frank Keller. 2005. Gradience in linguistic data. Lingua, 115:1497–1524.
Jon Sprouse. 2007. Continuous acceptability,
categorical grammaticality, and experimental
syntax. Biolinguistics, 1:123–134.
John Sweller. 1988. Cognitive load during
problem solving: Effects on learning. Cognitive Science, 12(2):257–285.
Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017.
Attention is all you need. In Advances in
Neural Information Processing Systems 30,
pages 5998–6008.
Alex Wang and Kyunghyun Cho. 2019. BERT has
a mouth, and it must speak: BERT as a Markov
random field language model. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, pages 30–36. Association for Computational Linguistics, Minneapolis, Minnesota.
John Warner and Arnold L. Glass. 1987. Context
and distance-to-disambiguation effects in ambi-
guity resolution: Evidence from grammaticality
judgments of garden path sentences. Journal of Memory and Language, 26(6):714–738.
Alex Warstadt, Amanpreet Singh, and Samuel R.
Bowman. 2018. Neural network acceptability
judgments. CoRR, abs/1805.12471.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G.
Carbonell, Ruslan Salakhutdinov, and Quoc V.
Le. 2019. XLNet: Generalized autoregressive
pretraining for language understanding. CoRR,
abs/1906.08237.
Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan
Salakhutdinov, Raquel Urtasun, Antonio
Torralba, and Sanja Fidler. 2015. Aligning
books and movies: Towards story-like visual
explanations by watching movies and reading
books. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), pages 19–27. Washington, DC, USA.
Christina Zlogar and Kathryn Davidson. 2018.
Effects of linguistic context on the acceptability
of co-speech gestures. Glossa, 3(1):73.