How Furiously Can Colorless Green Ideas Sleep?
Sentence Acceptability in Context
Jey Han Lau1,7 Carlos Armendariz2 Shalom Lappin2,3,4
Matthew Purver2,5 Chang Shu6,7
1The University of Melbourne
2Queen Mary University of London
3University of Gothenburg
4King’s College London
5Joˇzef Stefan Institute
6University of Nottingham Ningbo China
7DeepBrain
jeyhan.lau@gmail.com, c.santosarmendariz@qmul.ac.uk
shalom.lappin@gu.se, m.purver@qmul.ac.uk, scxcs1@nottingham.edu.cn
Abstract
We study the influence of context on sentence
acceptability. First we compare the acceptability
ratings of sentences judged in isolation, with a
relevant context, and with an irrelevant context.
Our results show that context induces a cognitive
load for humans, which compresses the distribution
of ratings. Moreover, in relevant contexts we
observe a discourse coherence effect that uniformly
raises acceptability. Next, we test unidirectional
and bidirectional language models in their ability
to predict acceptability ratings. The bidirectional
models show very promising results, with the
best model achieving a new state-of-the-art for
unsupervised acceptability prediction. The two
sets of experiments provide insights into the
cognitive aspects of sentence processing and
central issues in the computational modeling
of text and discourse.
1 Introduction
Sentence acceptability is the extent to which a
sentence appears natural to native speakers of a
language. Linguists have often used this property
to motivate grammatical theories. Computational
language processing has traditionally been more
concerned with likelihood, the probability of a
sentence being produced or encountered. The
question of whether and how these properties
are related is a fundamental one. Lau et al.
(2017b) experimented with unsupervised language
models to predict acceptability, and obtained
an encouraging correlation with human ratings.
This raises foundational questions about the nature
of linguistic knowledge: If probabilistic models
can acquire knowledge of sentence acceptability
from raw texts, we have prima facie support for
an alternative view of language acquisition that
does not rely on a categorical grammaticality
component.
It is generally assumed that our perception of
sentence acceptability is influenced by context.
Sentences that may appear odd in isolation can
become natural in some environments, and sen-
tences that seem perfectly well formed in some
contexts are odd in others. On the computational
side, much recent progress in language modeling
has been achieved through the ability to incor-
porate more document context, using broader
and deeper models (e.g., Devlin et al., 2019;
Yang et al., 2019). While most language modeling
is restricted to individual sentences, models can
benefit from using additional context (Khandelwal
et al., 2018). However, despite the importance of
context, few psycholinguistic or computational
studies systematically investigate how context
affects acceptability, or the ability of language
models to predict human acceptability judgments.
Two recent studies that explore the impact of doc-
ument context on acceptability judgments both
identify a compression effect (Bernardy et al.,
2018; Bizzoni and Lappin, 2019). Sentences per-
ceived to be low in acceptability when judged
without context receive a boost in acceptability
when judged within context. Conversely, those
with high out-of-context acceptability see a reduc-
tion in acceptability when context is presented. It
is unclear what causes this compression effect. Is
it a result of cognitive load, imposed by additional
Transactions of the Association for Computational Linguistics, vol. 8, pp. 296–310, 2020. https://doi.org/10.1162/tacl_a_00315
Action Editor: George Foster. Submission batch: 10/2019; Revision batch: 1/2020; Published 6/2020.
© 2020 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
processing demands, or is it the consequence of
an attempt to identify a discourse relation between
context and sentence?
We address these questions in this paper. To
understand the influence of context on human
perceptions, we ran three crowdsourced experi-
ments to collect acceptability ratings from human
annotators. We develop a methodology to ensure
comparable ratings for each target sentence in
isolation (without any context), in a relevant three-
sentence context, and in the context of sentences
randomly sampled from another document. 我们的
results replicate the compression effect, 和
careful analyses reveal that both cognitive load
and discourse coherence are involved.
To understand the relationship between sen-
tence acceptability and probability, we conduct
experiments with unsupervised language models
to predict acceptability. We explore traditional
unidirectional (left-to-right) recurrent neural
network models, and modern bidirectional
transformer models (e.g., BERT). We found that
bidirectional models consistently outperform
unidirectional models by a wide margin, calling
into question the suitability of left-to-right bias for
sentence processing. Our best bidirectional model
achieves simulated human performance on the
prediction task, establishing a new state-of-the-art.
2 Acceptability in Context
2.1 Data Collection
To understand how humans interpret acceptability,
we require a set of sentences with varying degrees
of well-formedness. Following previous studies
(Lau et al., 2017b; Bernardy et al., 2018), we
use round-trip machine translation to introduce a
wide range of infelicities into naturally occurring
sentences.
We sample 50 English (target) sentences and
their contexts (three preceding sentences) from
English Wikipedia.1 We use Moses to translate
the target sentences into four languages (Czech,
Spanish, German, and French) and then back to
1We preprocess the raw dump with WikiExtractor
(https://github.com/attardi/wikiextractor),
and collect paragraphs that have ≥ 4 sentences with each
sentence having ≥ 5 words. Sentences and words are tok-
enized with spaCy (https://spacy.io/) to check for
these constraints.
English.2 This produces 250 sentences in total
(5 languages including English) for our test set.
Note that we only do round-trip translation for the
target sentences; the contexts are not modified.
We use Amazon Mechanical Turk (AMT) to
collect acceptability ratings for the target sen-
tences.3 We run three experiments where we
expose users to different types of context. For
each experiment, we split the test set into 25 HITs of
10 sentences. Each HIT contains 2 original English
sentences and 8 round-trip translated sentences,
which are different from each other and not de-
rived from either of the originals. Users are asked
to rate the sentences for naturalness on a 4-point
ordinal scale: bad (1.0), not very good (2.0),
mostly good (3.0), and good (4.0). We recruit 20
annotators for each HIT.
In the first experiment we present only the tar-
get sentences, without any context. In the second
experiment, we first show the context paragraph
(the three preceding sentences of the target sentence),
and ask users to select the most appropriate
description of its topic from a list of four candi-
date topics. Each candidate topic is represented by
three words produced by a topic model.4 Note that
the context paragraph consists of original English
sentences which did not undergo translation. Once
the users have selected the topic, they move to the
next screen where they rate the target sentence for
naturalness.5 The third experiment has the same
format as the second, except that the three sen-
tences presented prior to rating are randomly sam-
pled from another Wikipedia article.6 We require
annotators to perform a topic identification task
prior to rating the target sentence to ensure that
they read the context before making acceptability
judgments.
For each sentence, we aggregate the ratings
from multiple annotators by taking the mean.
Henceforth we refer to the mean ratings collected
from the first (no context), second (real context),
and third (random context) experiments as H∅,
2We use the pre-trained Moses models from http://
www.statmt.org/moses/RELEASE-4.0/models/
for translation.
3https://www.mturk.com/.
4We train a topic model with 50 topics on 15 K Wikipedia
documents with Mallet (McCallum, 2002) and infer topics
for the context paragraphs based on the trained model.
5Note that we do not ask the users to judge the naturalness
of the sentence in context; the instructions they see for the
naturalness rating task are the same as in the first experiment.
6Sampled sentences are sequential, running sentences.
H+, and H−, respectively. We rolled out the
experiments on AMT over several weeks and pre-
vented users from doing more than one exper-
iment. Therefore a disjoint group of annotators
performed each experiment.
To control for quality, we check that users are
rating the English sentences ≥ 3.0 consistently.
For the second and third experiments, we also
check that users are selecting the topics appro-
priately. In each HIT one context paragraph has
one real topic (from the topic model), 和三个
fake topics with randomly sampled words as the
candidate topics. Users who fail to identify the
real topic above a confidence level are filtered out.
Across the three experiments, over three quarters
of workers passed our filtering conditions.
To calibrate for the differences in rating scale
between users, we follow the postprocessing
procedure of Hill et al. (2015), where we calculate
the average rating for each user and the overall
average (by taking the mean of all average ratings),
and decrease (increase) the ratings of a user by 1.0
if their average rating is greater (smaller) than the
overall average by 1.0.7 To reduce the impact of
outliers, for each sentence we also remove ratings
that are more than 2 standard deviations away
from the mean.8
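The calibration and outlier-filtering steps can be sketched in a few lines (a minimal re-implementation in plain Python; the function and variable names are ours, not those of the released code):

```python
from statistics import mean, stdev

def calibrate(ratings_by_user):
    """Shift a user's ratings down (up) by 1.0 if their average rating is
    more than 1.0 above (below) the mean of all users' average ratings."""
    user_means = {u: mean(rs) for u, rs in ratings_by_user.items()}
    overall = mean(user_means.values())
    adjusted = {}
    for u, rs in ratings_by_user.items():
        if user_means[u] - overall > 1.0:
            shift = -1.0
        elif overall - user_means[u] > 1.0:
            shift = 1.0
        else:
            shift = 0.0
        adjusted[u] = [r + shift for r in rs]
    return adjusted

def filter_outliers(sentence_ratings):
    """Drop ratings more than 2 standard deviations from the sentence mean."""
    m, s = mean(sentence_ratings), stdev(sentence_ratings)
    return [r for r in sentence_ratings if abs(r - m) <= 2 * s]
```

Whether the 1.0 threshold is strict or inclusive is not specified in the text, so the comparisons above are an assumption.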
2.2 Results and Discussion
We present scatter plots comparing the mean
ratings for the three different contexts (H∅, H+,
and H−) in Figure 1. The black line represents the
diagonal, and the red line represents the regression
line. In general, the mean ratings correlate strongly
with each other: Pearson's r for H+ vs. H∅ = 0.940,
H− vs. H∅ = 0.911, and H− vs. H+ = 0.891.
The regression (red) and diagonal (black) lines
in H+ vs. H∅ (Figure 1a) show a compression
effect. Bad sentences appear a little more natural,
and perfectly good sentences become slightly
less natural, when context is introduced.9 This
is the same compression effect observed by
7No worker has an average rating that is greater or smaller
than the overall average by 2.0.
8This postprocessing procedure discarded a total of 504
annotations/ratings (approximately 3.9%) across the 3 experi-
ments. The final average number of annotations for a sentence
in the first, second, and third experiments is 16.4, 17.8, and
15.3, respectively.
9On average, good sentences (ratings ≥ 3.5) see a
rating reduction of 0.08 and bad sentences (ratings ≤ 1.5) an
increase of 0.45.
Bernardy et al. (2018). It is also present in the
graph for H− vs. H∅ (Figure 1b).
Two explanations of the compression effect
seem plausible to us. The first is a discourse
coherence hypothesis that takes this effect to be
caused by a general tendency to find infelicitous
sentences more natural in context. This hypothesis,
however, does not explain why perfectly natural
sentences appear less acceptable in context. The
second hypothesis is a variant of a cognitive load
account. In this view, interpreting context imposes
a significant burden on a subject's processing
resources, and this reduces their focus on the
sentence presented for acceptability judgment. At
the extreme ends of the rating scale, which require
all subjects to be consistent in order to achieve the
minimum/maximum mean rating, the increased
cognitive load increases the likelihood of a subject
making a mistake. This raises/lowers the mean
rating, and creates a compression effect.
The discourse coherence hypothesis would
imply that the compression effect should appear
with real contexts, but not with random ones,
as there is little connection between the target
sentence and a random context. In contrast, the
cognitive load account predicts that the effect
should be present in both types of context, as it
depends only on the processing burden imposed
by interpreting the context. We see compression
in both types of contexts, which suggests that
the cognitive load hypothesis is the more likely
account.
However, these two hypotheses are not
mutually exclusive. It is possible, in principle, that
both effects (discourse coherence and cognitive
load) are exhibited when context is introduced.
To better understand the impact of discourse
coherence, consider Figure 1c, where we compare
H− vs. H+. Here the regression line is parallel to
and below the diagonal, implying that there is a
consistent decrease in acceptability ratings from
H+ to H−. As both ratings are collected with some
form of context, the cognitive load confound is
removed. What remains is a discourse coherence
effect: Sentences presented in relevant contexts
undergo a consistent increase in acceptability
rating.
To analyze the significance of this effect, we
use the non-parametric Wilcoxon signed-rank test
(one-tailed) to compare the difference between
H+ and H−. This gives a p-value of 1.9 × 10−8,
Figure 1: Scatter plots comparing human acceptability ratings.
indicating that the discourse coherence effect is
significant.
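In practice this test is available as scipy.stats.wilcoxon; the sketch below (our own, using the normal approximation rather than the exact distribution) shows how the statistic is formed from the paired ratings:

```python
import math

def wilcoxon_signed_rank(x, y):
    """One-tailed Wilcoxon signed-rank test (normal approximation).
    Returns (W+, z): the sum of ranks of positive differences and its
    z-score. A real analysis would use scipy.stats.wilcoxon."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # discard zero differences
    n = len(diffs)
    # rank absolute differences, averaging ranks for ties
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return w_plus, (w_plus - mu) / sigma
```

The one-tailed p-value is then the upper-tail probability of the z-score under a standard normal.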
Returning to Figures 1a and 1b, we can see
that (1) the offset of the regression line and (2)
the intersection point of the diagonal and the
regression line are both higher in Figure 1a than in
Figure 1b. This suggests that there is an increase
in ratings, so, in addition to the cognitive load
effect, a discourse coherence effect is also at work
in the real context setting.
We performed hypothesis tests to compare the
regression lines in Figures 1a and 1b to see if
their offsets (constants) and slopes (coefficients)
are statistically different.10 The p-value for the
offset is 1.7 × 10−2, confirming our qualitative
observation that there is a significant discourse
coherence effect. The p-value for the slope,
however, is 3.6 × 10−1, suggesting that cognitive
load compresses the ratings in a consistent way
for both H+ and H−, relative to H∅.
To conclude, our experiments reveal that con-
text induces a cognitive load for human process-
ing, and this has the effect of compressing the
acceptability distribution. It moderates the ex-
tremes by making very unnatural sentences appear
more acceptable, and perfectly natural sentences
slightly less acceptable. If the context is relevant to
the target sentence, then we also have a discourse
coherence effect, where sentences are perceived
to be generally more acceptable.
10We follow the procedure detailed in https://
statisticsbyjim.com/regression/comparing-
regression-lines/ where we collate the data points
in Figures 1a and 1b and treat the in-context ratings (H+
and H−) as the dependent variable, the out-of-context ratings
(H∅) as the first independent variable, and the type of the
context (real or random) as the second independent variable,
to perform regression analyses. The significance of the offset
and slope can be measured by interpreting the p-values of
the second independent variable, and the interaction between
the first and second independent variables, respectively.
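The procedure in footnote 10 amounts to fitting one regression with a context-type indicator and an interaction term. A minimal pure-Python sketch of the coefficient estimation is below (our own illustration; in practice a package such as statsmodels would also supply the p-values, which require standard errors):

```python
def ols(X, y):
    """Ordinary least squares via the normal equations (X'X)b = X'y,
    solved by Gauss-Jordan elimination. Returns the coefficient list."""
    k = len(X[0])
    A = [[sum(X[r][i] * X[r][j] for r in range(len(X))) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(len(X)))] for i in range(k)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(k):
            if r != col and A[r][col]:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def compare_regressions(x, c, y):
    """x: out-of-context ratings; c: context type (0=real, 1=random);
    y: in-context ratings. Design: intercept, x, c, and x*c interaction."""
    X = [[1.0, xi, ci, xi * ci] for xi, ci in zip(x, c)]
    return ols(X, y)
```

The four coefficients correspond to the shared offset and slope, the shift in offset between context types, and the difference in slope (the interaction).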
3 Modeling Acceptability
In this section, we explore computational models
to predict human acceptability ratings. We are
interested in models that do not rely on explicit
supervision (i.e., we do not want to use the
acceptability ratings as labels in the training data).
Our motivation here is to understand the extent
to which sentence probability, estimated by an
unsupervised model, can provide the basis for
predicting sentence acceptability.
To this end, we train language models
(Section 3.1) using unsupervised objectives (e.g.,
next word prediction), and use these models
to infer the probabilities of our test sentences.
To accommodate sentence length and lexical
frequency we experiment with several simple
normalization methods, converting probabilities
to acceptability measures (Section 3.2). The
acceptability measures are the final output of our
models; they are what we compare to human
acceptability ratings.
3.1 Language Models
Our first model is an LSTM language model (LSTM:
Hochreiter and Schmidhuber, 1997; Mikolov
et al., 2010). Recurrent neural network models
(RNNs) have been shown to be competitive in this
task (Lau et al., 2015; Bernardy et al., 2018), and
they serve as our baseline.
Our second model is a joint topic and language
model (TDLM: Lau et al., 2017a). TDLM combines
a topic model with a language model in a single
model, drawing on the idea that the topical con-
text of a sentence can help word prediction in
the language model. The topic model is fashioned
as an auto-encoder, where the input is the docu-
ment's word sequence and it is processed by
convolutional layers to produce a topic vector
to predict the input words. The language model
functions like a standard LSTM model, but it
incorporates the topic vector (generated by its
document context) into the current hidden state to
predict the next word.
We train LSTM and TDLM on 100K uncased
English Wikipedia articles containing approxi-
mately 40M tokens with a vocabulary of 66K
words.11
Next we explore transformer-based models, as
they have become the benchmark for many NLP
tasks in recent years (Vaswani et al., 2017; Devlin
et al., 2019; Yang et al., 2019). The transformer
models that we use are trained on a much larger
corpus, and they are four to five times larger with
respect to their model parameters.
Our first transformer is GPT2 (Radford et al.,
2019). Given a target word, the input is a sequence
of previously seen words, which are then mapped
to embeddings (along with their positions) and
fed to multiple layers of ‘‘transformer blocks’’
before the target word is predicted. Much of its
power resides in these transformer blocks: each
provides a multi-headed self-attention unit over
all input words, allowing it to capture multiple
dependencies between words, while avoiding the
need for recurrence. With no need to process a
sentence in sequence, the model parallelizes more
efficiently, and scales in a way that RNNs cannot.
GPT2 is trained on WebText, which consists of
over 8 million web documents, and uses Byte
Pair Encoding (BPE: Sennrich et al., 2016) for
tokenization (casing preserved). BPE produces
sub-word units, a middle ground between word
and character, and it provides better coverage for
unseen words. We use the released medium-sized
模型 (‘‘Medium’’) for our experiments.12
Our second transformer is BERT (Devlin et al.,
2019). Unlike GPT2, BERT is not a typical language
模型, in the sense that it has access to both
left and right context words when predicting the
target word.13 Hence, it encodes context in a
bidirectional manner.
To train BERT, Devlin et al. (2019) propose
a masked language model objective, where a
random proportion of input words are masked
11We use Stanford CoreNLP (Manning et al., 2014) to
tokenize words and sentences. Rare words are replaced by a
special UNK symbol.
12https://github.com/openai/gpt-2.
13Note that context is burdened with two senses in this
paper. It can mean the preceding sentences of a target sen-
tence, or the neighbouring words of a target word. The
intended sense should be apparent from the usage.
and the model is tasked to predict them based on
non-masked words. In addition to this objective,
BERT is trained with a next sentence prediction
objective, where the input is a pair of sentences,
and the model’s goal is to predict whether the
latter sentence follows the former. This objective
is added to provide pre-training for downstream
tasks that involve understanding the relationship
between a pair of sentences (e.g., machine com-
prehension and textual entailment).
The bidirectionality of BERT is the core feature
that produces its state-of-the-art performance on
a number of tasks. The flipside of this encoding
style, however, is that BERT lacks the ability to
generate text left-to-right or to compute sentence
probabilities. We discuss how we use BERT to produce
a probability estimate for sentences in the next
section (Section 3.2).
In our experiments, we use the largest pre-
trained model (‘‘BERT-Large’’),14 which has a
similar number of parameters (340M) to GPT2. It is
trained on Wikipedia and BookCorpus (Zhu et al.,
2015), where the latter is a collection of fiction
books. Like GPT2, BERT also uses sub-word token-
ization (WordPiece). We experiment with two
variants of BERT: one trained on cased data (BERTCS),
and another on uncased data (BERTUCS). As our
test sentences are uncased, a comparison between
these two models allows us to gauge the impact of
casing in the training data.
Our last transformer model is XLNET (Yang et al.,
2019). XLNET is unique in that it applies a novel
permutation language model objective, allowing it
to capture bidirectional context while preserving
key aspects of unidirectional language models
(e.g., left-to-right generation).
The permutation language model objective
works by first generating a possible permutation
(also called ‘‘factorization order’’) of a sequence.
When predicting a target word in the sequence,
the context words that the model has access to are
determined by the factorization order. To illustrate
this, imagine we have the sequence x = [x1, x2,
x3, x4]. One possible factorization order is: x3 →
x2 → x4 → x1. Given this order, if predicting
target word x4, the model only has access to
context words {x3, x2}; if the target word is x2,
it sees only {x3}. In practice, the target word is
set to be the last few words in the factorization
14https://github.com/google-research/bert.
Model | Architecture | Encoding | #Param. | Casing | Data Size | Tokenization | Corpus
LSTM | RNN | Unidir. | 60M | Uncased | 0.2GB | Word | Wikipedia
TDLM | RNN | Unidir. | 80M | Uncased | 0.2GB | Word | Wikipedia
GPT2 | Transformer | Unidir. | 340M | Cased | 40GB | BPE | WebText
BERTCS | Transformer | Bidir. | 340M | Cased | 13GB | WordPiece | Wikipedia, BookCorpus
BERTUCS | Transformer | Bidir. | 340M | Uncased | 13GB | WordPiece | Wikipedia, BookCorpus
XLNET | Transformer | Hybrid | 340M | Cased | 126GB | SentencePiece | Wikipedia, BookCorpus, Giga5, ClueWeb, Common Crawl

Table 1: Language models and their configurations.
order (e.g., x4 and x1), and so the model always
sees some context words for prediction.
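The context sets induced by a factorization order can be made concrete with a few lines (our own illustration, not XLNET code):

```python
def context_words(factorization_order, target):
    """Under a permutation language model, the words visible when
    predicting `target` are exactly those earlier in the order."""
    idx = factorization_order.index(target)
    return set(factorization_order[:idx])

# the example order from the text: x3 -> x2 -> x4 -> x1
order = ["x3", "x2", "x4", "x1"]
assert context_words(order, "x4") == {"x3", "x2"}
assert context_words(order, "x2") == {"x3"}
```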
As XLNET is trained to work with different
factorization orders during training, it has expe-
rienced both full/bidirectional context and partial/
unidirectional context, allowing it to adapt to tasks
that have access to full context (e.g., most language
understanding tasks), as well as those that do not
(e.g., left-to-right generation).
Another innovation of XLNET is that it in-
corporates the segment recurrence mechanism of
Dai et al. (2019). This mechanism is inspired by
truncated backpropagation through time used for
training RNNs, where the initial state of a sequence
is initialized with the final state from the previous
sequence. The segment recurrence mechanism
works in a similar way, by caching the hidden
states of the transformer blocks from the previous
sequence, and allowing the current sequence to
attend to them during training. This permits XLNET
to model long-range dependencies beyond its
maximum sequence length.
We use the largest pre-trained model (‘‘XLNet-
Large’’),15 which has a similar number of param-
eters to our BERT and GPT2 models (340M). XLNET
is trained on a much larger corpus combining
Wikipedia, BookCorpus, news and web articles.
For tokenization, XLNET uses SentencePiece
(Kudo and Richardson, 2018), another sub-word
tokenization technique. Like GPT2, XLNET is trained
on cased data.
Table 1 summarizes the language models. In
general, the RNN models are orders of magnitude
smaller than the transformers in both model
parameters and training data, although they are
trained on the same domain (Wikipedia), and use
uncased data like the test sentences. The RNN
models also operate on a word level, whereas the
transformers use sub-word units.
15https://github.com/zihangdai/xlnet.
3.2 Probability and Acceptability Measure
Given a unidirectional language model, we can
infer the probability of a sentence by multiplying
the estimated probabilities of each token, using
previously seen (left) words as context (Bengio
et al., 2003):

→P(s) = ∏_{i=0}^{|s|} P(wi | w<i)    (1)

For bidirectional models, we compute an analogous
quantity by conditioning each token on its full
context:

↔P(s) = ∏_{i=0}^{|s|} P(wi | w<i, w>i)    (2)
With this formulation, we allow BERT to have
access to both left and right context words
when predicting each target word, since this
is consistent with the way in which it was
trained. It is important to note, however, that
sentence probability computed this way is not
a true probability value: These probabilities do
not sum to 1.0 over all sentences. Equation (1),
相比之下, does guarantee true probabilities.
Intuitively, the sentence probability computed
with this bidirectional formulation is a measure
16Technically we can mask all right context words and
predict the target words one at a time, but because the model
is never trained in this way, we found that it performs poorly
in preliminary experiments.
of the model’s confidence in the likelihood of the
sentence.
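Both formulations reduce to summing per-token log probabilities; the sketch below makes this concrete with a stub scorer (the stub and function names are ours; a real model's conditional distribution would take its place):

```python
import math

def sentence_logprob(tokens, token_logprob, bidirectional=False):
    """Sum per-token log probabilities.
    Unidirectional (Equation 1): condition on left context only.
    Bidirectional (Equation 2): condition on both sides, yielding a
    confidence score rather than a true probability."""
    total = 0.0
    for i, w in enumerate(tokens):
        left = tokens[:i]
        right = tokens[i + 1:] if bidirectional else None
        total += token_logprob(w, left, right)
    return total

# stub scorer: uniform distribution over a 10-word vocabulary (illustration only)
uniform = lambda w, left, right: math.log(1 / 10)
sent = ["the", "cat", "is", "big"]
```

With the uniform stub the two settings coincide; with a real model the bidirectional score is the pseudo-log-likelihood described above.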
To compute the true probability, Wang and
Cho (2019) show that we need to sum the
pre-softmax weights for each token to score a
sentence, and then divide the score by the total
score of all sentences. As it is impractical to
compute the total score of all sentences (an
infinite set), the true sentence probabilities for
these bidirectional models are intractable. We use
our non-normalized confidence scores as stand-ins
for these probabilities.
For XLNET, we also compute sentence probab-
ility this way, applying bidirectional context, and
we denote it as XLNETBI. Note that XLNETUNI and
XLNETBI are based on the same trained model.
They differ only in how they estimate sentence
probability at test time.
Sentence probability (estimated using either
unidirectional or bidirectional context) is affected
by sentence length (e.g., longer sentences have lower
probabilities) and word frequency (e.g., the cat is
big vs. the yak is big). To modulate these
factors we introduce simple normalization tech-
niques. Table 2 presents five methods to map
sentence probabilities to acceptability measures:
LP, MeanLP, PenLP, NormLP, and SLOR.
LP is the unnormalized log probability. Both
MeanLP and PenLP are normalized by sentence
length, but PenLP scales length with an exponent
(α) to dampen the impact of large values (Wu et al.,
2016; Vaswani et al., 2017). We set α = 0.8 in our
experiments. NormLP normalizes using the unigram
sentence probability (i.e., Pu(s) = ∏_{i=0}^{|s|} P(wi)),
while SLOR utilizes both length and unigram
probability (Pauls and Klein, 2012).
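Given a sentence log probability, a unigram log probability, and a length, the five measures are one-liners (a direct transcription of the formulas in Table 2, with `alpha=0.8` as in the paper):

```python
def acceptability_measures(logp, logp_unigram, length, alpha=0.8):
    """Map log P(s) and log Pu(s) to the five acceptability measures.
    logp: log P(s); logp_unigram: log Pu(s); length: |s| in tokens."""
    return {
        "LP": logp,
        "MeanLP": logp / length,
        "PenLP": logp / (((5 + length) / (5 + 1)) ** alpha),
        "NormLP": -logp / logp_unigram,
        "SLOR": (logp - logp_unigram) / length,
    }
```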
When computing sentence probability we have
the option of including the context paragraph that
the human annotators see (Section 2). We use the
superscripts ∅, +, − to denote a model using no
context, real context, and random context, respect-
ively (e.g., LSTM∅, LSTM+, and LSTM−). Note that
these variants are created at test time, and are all
based on the same trained model (e.g., LSTM).
For all models except TDLM, incorporating the
context paragraph is trivial. We simply prepend it
to the target sentence before computing the latter's
probability. For TDLM+ or TDLM−, the context
paragraph is treated as the document context,
from which a topic vector is inferred and fed to
Acc. Measure | Equation
LP | log P(s)
MeanLP | log P(s) / |s|
PenLP | log P(s) / (((5 + |s|) / (5 + 1))^α)
NormLP | − log P(s) / log Pu(s)
SLOR | (log P(s) − log Pu(s)) / |s|

Table 2: Acceptability measures for predicting
the acceptability of a sentence; P(s) is the sen-
tence probability, computed using Equation (1)
or Equation (2) depending on the model; Pu(s)
is the sentence probability estimated by a unigram
language model; and α = 0.8.
the language model for next-word prediction. For
TDLM∅, we set the topic vector to zeros.
3.3 Implementation
For the transformer models (GPT2, BERT, 和
XLNET), we use the implementation of pytorch-
transformers.17
XLNET requires a long dummy context prepended
to the target sentence for it to compute the sentence
probability properly.18 Other researchers have
found a similar problem when using XLNET for
generation.19 We think that this is likely due
to XLNET’s recurrence mechanism (部分 3.1),
where it has access to context from the previous
sequence during training.
For TDLM, we use the implementation provided
by Lau et al. (2017a),20 following their optimal
hyper-parameter configuration without tuning.
We implement LSTM based on Tensorflow’s
Penn Treebank language model.21 In terms of
17https://github.com/huggingface/pytorch-
transformers. Specifically, we employ the following
pre-trained models: gpt2-medium for GPT2, bert-large-
cased for BERTCS, bert-large-uncased for BERTUCS,
and xlnet-large-cased for XLNETUNI/XLNETBI.
18In the scenario where we include the context paragraph
(e.g., XLNET+UNI), the dummy context is added before it.
19https://medium.com/@amanrusia/xlnet-speaks-
comparison-to-gpt-2-ea1a4e9ba39e.
20https://github.com/jhlau/topically-driven-
language-model.
21https://github.com/tensorflow/models/
blob/master/tutorials/rnn/ptb/ptb_word_lm.py.
hyper-parameters, we follow the configuration of
TDLM where applicable. TDLM uses Adam as the
optimizer (Kingma and Ba, 2014), but for LSTM
we use Adagrad (Duchi et al., 2011), as it produces
better development perplexity.
For NormLP and SLOR, we need to compute
Pu(s), the sentence probability based on a unigram
language model. As the language models are
trained on different corpora, we collect unigram
counts based on their original training corpus. That
is, for LSTM and TDLM, we use the 100K English
Wikipedia corpus. For GPT2, we use an open
source implementation that reproduces the origi-
nal WebText data.22 For BERT we use the full
Wikipedia collection and crawl smashwords.
com to reproduce BookCorpus.23 Finally, for
XLNET we use the combined set of Wikipedia,
WebText, and BookCorpus.24
Source code for our experiments is publicly
available at: https://github.com/jhlau/
acceptability-prediction-in-context.
3.4 Results and Discussion
We use Pearson’s r to assess how well the models’
acceptability measures predict mean human ac-
ceptability ratings, following previous studies
(Lau et al., 2017b; Bernardy et al., 2018).
Recall that for each model (e.g., LSTM), there are
three variants with which we infer the sentence
probability at test time. These are distinguished
by whether we include no context (LSTM∅), real
context (LSTM+), or random context (LSTM−). There
are also three types of human acceptability ratings
(ground truth), where sentences are judged with
no context (H∅), real context (H+), and random
context (H−). We present the full results in Table 3.
To get a sense of what the correlation
figures indicate for these models, we compute
two human performance estimates to serve as
upper bounds on the accuracy of a model. The
first upper bound (UB1) is the one-vs-rest
annotator correlation, where we choose a
random annotator's rating and compare it to
the mean rating of the rest, using Pearson's
r. We repeat this for a large number of trials
22https://skylion007.github.io/OpenWebTextCorpus/.
23We use the scripts in https://github.com/soskek/bookcorpus to reproduce BookCorpus.
24XLNET also uses Giga5 and ClueWeb as part of its training data, but we think that our combined collection is sufficiently large to be representative of the original training data.
(1,000) to get a robust estimate of the mean
correlation. UB1 can be interpreted as the average
human performance working in isolation. 这
second upper bound (UB2) is the half-vs.-half
annotator correlation. For each sentence we ran-
domly split the annotators into two groups, 和
compare the mean rating between groups, 再次
using Pearson’s r and repeating it (1,000 次)
to get a robust estimate. UB2 can be taken as
the average human performance working collab-
oratively. Overall, the simulated human performance is quite consistent over context types (Table 3), e.g., UB1 = 0.75, 0.73, and 0.75 for H∅, H+, and H−, respectively.
When we postprocess the user ratings, remember that we remove the outlier ratings (≥ 2 standard deviations) for each sentence (Section 2.1). Although this produces a cleaner set of annotations, this filtering step does (artificially) increase the human agreement or upper bound correlations. For completeness we also present upper bound variations where we do not remove the outlier ratings, and denote them as UB∅1 and UB∅2. In this setup, the one-vs.-rest correlations
drop to 0.62–0.66 (桌子 3). Note that all model
performances are reported based on the outlier-
filtered ratings, although there are almost no
perceivable changes to the performances when
they are evaluated on the outlier-preserved ground
truth.
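The outlier-filtering step from Section 2.1 can be sketched as follows; whether the filtering uses the population or the sample standard deviation is not stated here, so the use of pstdev below is our assumption:

```python
import statistics

def filter_outliers(ratings, num_sd=2.0):
    """Remove ratings at least num_sd standard deviations from the mean.

    Uses the population standard deviation (pstdev); this is an
    assumed detail, not one taken from the paper.
    """
    if len(ratings) < 2:
        return list(ratings)
    mean = statistics.mean(ratings)
    sd = statistics.pstdev(ratings)
    if sd == 0:
        return list(ratings)
    return [r for r in ratings if abs(r - mean) < num_sd * sd]

kept = filter_outliers([4, 4, 3, 4, 4, 1])  # the rating 1 is an outlier here
```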
Looking at Table 3, the models’ performances
are fairly consistent over different types of ground
truth (H∅, H+, and H−). This is perhaps not
very surprising, as the correlations among the
human ratings for these context types are very
高的 (部分 2).
We now focus on the results with H∅ as ground truth (''Rtg'' = H∅). SLOR is generally the best acceptability measure for unidirectional models, with NormLP not far behind (the only exception is GPT2∅). The recurrent models (LSTM and TDLM) are very strong compared with the much larger transformer models (GPT2 and XLNETUNI). In fact, TDLM has the best performance when context is not considered (TDLM∅, SLOR = 0.61), suggesting that model architecture may be more important than the number of parameters and the amount of training data.
For bidirectional models, the unnormalized LP
works very well. The clear winner here, 然而,
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
3
1
5
1
9
2
3
6
1
0
/
/
t
我
A
C
_
A
_
0
0
3
1
5
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
8
S
e
p
e
米
乙
e
r
2
0
2
3
Table 3: Modeling results, reported as Pearson's r against the mean human ratings. Rows: each model (LSTM, TDLM, GPT2, XLNETUNI, BERTCS, BERTUCS, and XLNETBI) with no context (∅), real context (+), or random context (−), plus the human upper bounds UB1 and UB2 and their outlier-preserved variants UB∅1 and UB∅2. Columns: the acceptability measures LP, MeanLP, PenLP, NormLP, and SLOR, in one block per ground truth (H∅, H+, and H−). Boldface indicates optimal performance in each row.
is PenLP. It substantially and consistently out-
performs all other acceptability measures. 这
strong performance of PenLP that we see here
helps explain its popularity in machine translation
for beam search decoding (Vaswani et al., 2017).
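For reference, all five acceptability measures in Table 3 can be derived from a sentence's log probability lp, its unigram log probability lp_u, and its length n. The sketch below follows the definitions in Lau et al. (2017b); the GNMT-style length penalty for PenLP with exponent alpha = 0.8, and the sign convention for NormLP, are our assumptions rather than values taken from this paper.

```python
def acceptability_measures(lp, lp_u, n, alpha=0.8):
    """Acceptability scores from a sentence log prob lp, its unigram
    log prob lp_u, and its length n (higher = more acceptable).

    PenLP divides lp by a GNMT-style length penalty; alpha=0.8 is an
    assumed value, not one reported in this paper.
    """
    penalty = ((5 + n) ** alpha) / ((5 + 1) ** alpha)
    return {
        "LP": lp,
        "MeanLP": lp / n,
        "PenLP": lp / penalty,
        "NormLP": -lp / lp_u,
        "SLOR": (lp - lp_u) / n,
    }

scores = acceptability_measures(lp=-40.0, lp_u=-55.0, n=10)
```

Since lp is negative, dividing by a penalty greater than 1 (for sentences longer than one token) pulls PenLP toward zero, softening the bias against long sentences without normalizing it away entirely.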
With the exception of PenLP,
the gain from
normalization for the bidirectional models is
小的, but we don’t think this can be attributed
to the size of models or training corpora, 作为
large unidirectional models (GPT2 and XLNETUNI)
still benefit from normalization. The best model
without considering context is BERT∅UCS, with a correlation of 0.70 (PenLP), which is very close to the idealized single-annotator performance UB1 (0.75) and surpasses the unfiltered performance UB∅1 (0.66), creating a new state-of-the-art for unsupervised acceptability prediction (Lau et al., 2015, 2017b; Bernardy et al., 2018). There is still room to improve, however, relative to the collaborative UB2 (0.92) or UB∅2 (0.88) upper bounds.
We next look at the impact of incorporating context at test time for the models (e.g., LSTM∅ vs. LSTM+ or BERT∅UCS vs. BERT+UCS). To ease interpretability, we will focus on SLOR for unidirectional models, and PenLP for bidirectional models.
一般来说, we see that
incorporating context
always improves correlation, for both cases where
∅ and H+ as ground truths, suggesting that
we use H
context is beneficial when it comes to sentence
造型. The only exception is TDLM, 在哪里
∅ and TDLM+ perform very similarly. 笔记,
TDLM
然而, that context is only beneficial when it
is relevant. Incorporating random contexts (例如,
∅
∅ vs. LSTM− or BERT
UCS with H− as
UCS vs. BERT−
LSTM
ground truth) reduces the performance for all
models.25
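Operationally, incorporating context for a unidirectional model means conditioning on the context tokens while summing log probabilities over the sentence tokens only. A sketch with a toy bigram table standing in for the language model (the table and tokens are invented for illustration):

```python
import math

def score_with_context(logprob_fn, context_tokens, sentence_tokens):
    """Sum log P(w_i | preceding tokens) over the *sentence* tokens only,
    with the context prepended to the conditioning history."""
    history = list(context_tokens)
    total = 0.0
    for tok in sentence_tokens:
        total += logprob_fn(history, tok)
        history.append(tok)
    return total

# Toy "language model": a bigram table with a uniform fallback probability.
BIGRAMS = {("the", "cat"): 0.5, ("cat", "sat"): 0.6, ("<s>", "the"): 0.4}
def toy_logprob(history, tok):
    prev = history[-1] if history else "<s>"
    return math.log(BIGRAMS.get((prev, tok), 0.01))

lp_no_ctx = score_with_context(toy_logprob, [], ["the", "cat", "sat"])
lp_ctx = score_with_context(toy_logprob, ["we", "saw"], ["the", "cat", "sat"])
```

Only the conditioning history changes between the two calls; the set of tokens being scored is identical, which is what makes the with- and without-context scores comparable.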
Recall that our test sentences are uncased (an artefact of Moses, the machine translation system that we use). Whereas the recurrent models are all trained on uncased data, most of the transformer models are trained with cased data. BERT is the only transformer that is pre-trained on both cased (BERTCS) and uncased (BERTUCS) data. To understand the impact of casing, we look at the performance of BERTCS and BERTUCS with H∅ as ground truth. We see an improvement
of 5–7 points (depending on whether context is incorporated), which suggests that casing has a significant impact on performance. Given that XLNET+BI already outperforms BERT+UCS (0.73 vs. 0.72), even though XLNET+BI is trained with cased data, we conjecture that an uncased XLNET is likely to outperform BERT∅UCS when context is not considered.

25There is one exception: XLNET∅BI (0.62) vs. XLNET−BI (0.64). As we saw previously in Section 3.3, XLNET requires a long dummy context to work, and so this observation is perhaps unsurprising: it appears that context, whether relevant or not, always benefits XLNET.
总结一下, our first important result is the
exceptional performance of bidirectional models.
It raises the question of whether left-to-right bias is
an appropriate assumption for predicting sentence
acceptability. One could argue that this result
may be due to our experimental setup. Users
are presented with the sentence in text, 和他们
have the opportunity to read it multiple times,
thereby creating an environment that may simulate
bidirectional context. We could test this conjecture
by changing the presentation of the sentence,
displaying it one word at a time (with older
words fading off), or playing an audio version
(例如, via a text-to-speech system). 然而, 这些
likely introduce other confounds
changes will
(例如, prosody), but we believe it is an interesting
avenue for future work.
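For the bidirectional models, the sentence score is not a left-to-right product; one common scheme, in the spirit of BERT's masked-token training (cf. Wang and Cho, 2019), sums the log probability of each token given the rest of the sentence. A generic sketch with the masked-token scorer passed in as a callback (the toy scorer is invented for illustration):

```python
import math

def pseudo_logprob(masked_logprob_fn, tokens):
    """Sum, over positions i, of log P(tokens[i] | tokens with i masked).

    masked_logprob_fn(masked_tokens, position, target) should return the
    model's log probability of `target` at `position`.
    """
    total = 0.0
    for i, tok in enumerate(tokens):
        masked = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
        total += masked_logprob_fn(masked, i, tok)
    return total

# Toy scorer: pretends the model assigns probability 0.25 everywhere.
def toy_scorer(masked_tokens, position, target):
    return math.log(0.25)

pll = pseudo_logprob(toy_scorer, ["the", "cat", "sat"])
```

Each position sees both its left and its right neighbors, which is exactly the bidirectional conditioning that the left-to-right models lack.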
Our second result is more tentative. Our experi-
ments seem to indicate that model architecture is
more important than training or model size. 我们
see that TDLM, which is trained on data orders
of magnitude smaller and has model parameters
four times smaller in size (桌子 1), outperforms
the large unidirectional transformer models. 到
establish this conclusion more firmly we will need
to rule out the possibility that the relatively good performance of LSTM and TDLM is due to a
cleaner (例如, lowercased) or more relevant (例如,
维基百科) training corpus. With that said, 我们
contend that our findings motivate the construc-
tion of better language models, instead of increas-
ing the number of parameters, or the amount of
training data. It would be interesting to examine
the effect of extending TDLM with a bidirectional
客观的.
Our final result is that our best model, BERTUCS,
attains human-level performance and achieves a new state of the art in the task of
unsupervised acceptability prediction. 鉴于这种
level of accuracy, we expect it would be suitable
for tasks like assessing student essays and the
quality of machine translations.
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
3
1
5
1
9
2
3
6
1
0
/
/
t
我
A
C
_
A
_
0
0
3
1
5
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
8
S
e
p
e
米
乙
e
r
2
0
2
3
4 Linguists’ Examples
One may argue that our dataset is potentially
biased, as round-trip machine translation may in-
troduce particular types of infelicities or unusual
features to the sentences (Graham et al., 2019).
Lau et al. (2017乙) addressed this by creating a
dataset where they sample 50 grammatical and
50 ungrammatical sentences from Adger (2003)’s
syntax textbook, and run a crowdsourced ex-
ratings. Lau
periment
等人.
their unsupervised
language models (例如, simple recurrent networks)
predict the acceptability of these sentences with
similar performances, providing evidence that
their modeling results are robust.
(2017乙) found that
their user
to collect
We test our pre-trained models using this linguist-constructed dataset, and find similar results: GPT2, BERTCS, and XLNETBI produce PenLP correlations of 0.45, 0.53, and 0.58, respectively. These results indicate that these language models are able to predict the acceptability of these sentences reliably, consistent with our modeling results on round-trip translated sentences (Section 3.4). Although the correlations are generally lower, we want to highlight that these linguists' examples are artificially constructed to illustrate specific syntactic phenomena, and so this constitutes a particularly strong case of out-of-domain prediction. These texts are substantially different in nature from the natural text that the pre-trained language models are trained on (e.g., the linguists' examples are much shorter, at less than 7 words on average, than the natural texts).
5 相关工作
Acceptability is closely related to the concept
语法性. The latter is a theoretical
construction corresponding to syntactic well-
formedness, and it is typically interpreted as a
binary property (IE。, a sentence is either gram-
matical or ungrammatical). Acceptability, 在
另一方面, includes syntactic, semantic, prag-
matic, and non-linguistic factors, such as sentence
length. It is gradient, rather than binary, in nature
(Denison, 2004; Sorace and Keller, 2005; Sprouse,
2007).
Linguists and other theorists of language have
traditionally assumed that context affects our per-
ception of both grammaticality (Bolinger, 1968)
and acceptability (贝弗, 1970), but surprisingly
little work investigates this effect systematically,
or on a large scale. Most formal linguists rely
heavily on the analysis of sentences taken in
isolation. 然而, many linguistic frameworks
seek to incorporate aspects of context-dependence.
Dynamic theories of semantics (Heim, 1982;
Kamp and Reyle, 1993; Groenendijk and Stokhof,
1990) attempt to capture intersentential corefer-
恩斯, binding, and scope phenomena. 动态的
Syntax (Cann et al., 2007) uses incremental
tree construction and semantic type projection to
render parsing and interpretation discourse depen-
凹痕. Theories of discourse structure characterize
sentence coherence in context through rhetori-
cal relations (曼和汤普森, 1988; 亚瑟
and Lascarides, 2003), or by identifying open
questions and common ground (Ginzburg, 2012).
While these studies offer valuable insights into a variety of context-related linguistic phenomena, much of this work takes grammaticality and acceptability to be binary properties. Moreover, it is not formulated in a way that permits fine-grained psychological experiments or wide-coverage computational modeling.
Psycholinguistic work can provide more ex-
perimentally grounded approaches. Greenbaum
(1976) found that combinations of particular syn-
tactic constructions in context affect human judg-
ments of acceptability, although the small scale
of the experiments makes it difficult to draw
general conclusions. More recent work investi-
gates related effects, but it tends to focus on very
restricted aspects of the phenomenon. 考试用-
普莱, Zlogar and Davidson (2018) investigate the
influence of context on the acceptability of ges-
tures with speech, focussing on interaction with
semantic content and presupposition. The prim-
ing literature shows that exposure to lexical and
syntactic items leads to higher likelihood of their
repetition in production (Reitter et al., 2011), 和
to quicker processing in parsing under certain cir-
情况 (Giavazzi et al., 2018). Frameworks
such as ACT-R (安德森, 1996) explain these
effects through the impact of cognitive activation
on subsequent processing. Most of these studies
suggest that coherent or natural contexts should
increase acceptability ratings, given that the lin-
guistic expressions used in processing become
more activated. Warner and Glass (1987) 展示
that such syntactic contexts can indeed affect
grammaticality judgments in the expected way for
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
3
1
5
1
9
2
3
6
1
0
/
/
t
我
A
C
_
A
_
0
0
3
1
5
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
8
S
e
p
e
米
乙
e
r
2
0
2
3
garden path sentences. Cowart (1994) uses com-
parison between positive and negative contexts,
investigating the effect of contexts containing
alternative more or less acceptable sentences. 但
he restricts the test cases to specific pronoun
binding phenomena. None of the psycholinguistic
work investigates acceptability judgments in real
textual contexts, over large numbers of test cases
and human subjects.
Some recent computational work explores the
relation of acceptability judgments to sentence
probabilities. Lau et al. (2015, 2017乙) 显示
the output of unsupervised language models
can correlate with human acceptability ratings.
Warstadt et al. (2018) treat
this as a semi-
supervised problem, training a binary classifier
on top of a pre-trained sentence encoder to
predict acceptability ratings with greater accuracy.
Bernardy et al. (2018) explore incorporating context into such models, eliciting human judgments of sentence acceptability when the sentences were presented both in isolation and within a document context. They find a compression effect in the distribution of the human acceptability ratings. Bizzoni and Lappin (2019)
observe a similar effect in a paraphrase accept-
ability task.
One possible explanation for this compression
effect is to take it as the expression of cognitive
load. Psychological research on the cognitive load
影响 (Sweller, 1988; Ito et al., 2018; Causse et al.,
2016; Park et al., 2013) indicates that performing
a secondary task can degrade or distort subjects’
performance on a primary task. This could cause
judgments to regress towards the mean. 然而,
the experiments of Bernardy et al. (2018) 和
Bizzoni and Lappin (2019) do not allow us to
distinguish this possibility from a coherence or
priming effect, as only coherent contexts were
经过考虑的. Our experimental setup improves on
this by introducing a topic identification task and
incoherent (random) contexts in order to tease the
effects apart.
6 Conclusions and Future Work
We found that processing context
induces a
cognitive load for humans, which creates a
compression effect on the distribution of accept-
ability ratings. We also showed that if the context
is relevant to the sentence, a discourse coherence
effect uniformly boosts sentence acceptability.
Our language model experiments indicate that
bidirectional models achieve better results than
unidirectional models. The best bidirectional
model performs at a human level, defining a new
state-of-the art for this task.
In future work we will explore alternative ways
to present sentences for acceptability judgments.
We plan to extend TDLM, incorporating a bidirectional objective, as it shows important promise. It will also be interesting to see if our observations generalize to other languages, and to different sorts of contexts, both linguistic and non-linguistic.
致谢
We are grateful to three anonymous reviewers for
helpful comments on earlier drafts of this paper.
Some of the work described here was presented
in talks in the seminar of the Centre for Linguistic
Theory and Studies in Probability (CLASP),
University of Gothenburg, 十二月 2019, 并在
the Cambridge University Language Technology
Seminar, 二月 2020. We thank the participants
of both events for useful discussion.
Lappin’s work on the project was supported
by grant 2014-39 from the Swedish Research
理事会, which funds CLASP. Armendariz and
Purver were partially supported by the European
Union’s Horizon 2020 research and innovation
programme under grant agreement no. 825153,
project EMBEDDIA (Cross-Lingual Embeddings
for Less-Represented Languages in European
News Media). The results of this publication
reflect only the authors’ views and the Com-
mission is not responsible for any use that may be
made of the information it contains.
参考
David Adger. 2003. Core Syntax: A Minimalist
Approach, 牛津大学出版社, 团结的
王国.
约翰·R. 安德森. 1996. ACT: A simple theory
of complex cognition. American Psychologist,
51:355–365.
Nicholas Asher and Alex Lascarides. 2003. Logics
of Conversation, 剑桥大学出版社.
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural
probabilistic language model. 杂志
Machine Learning Research, 3:1137–1155.
Jean-Philippe Bernardy, Shalom Lappin, 和
Jey Han Lau. 2018. The influence of context on
sentence acceptability judgements. In Proceed-
ings of the 56th Annual Meeting of the Asso-
ciation for Computational Linguistics (前交叉韧带
2018), pages 456–461. 墨尔本, 澳大利亚.
Thomas G. 贝弗. 1970, The cognitive basis
for linguistic structures, J. 右. 海耶斯, 编辑,
Cognition and the Development of Language,
威利, 纽约, pages 279–362.
Yuri Bizzoni and Shalom Lappin. 2019. 这
effect of context on metaphor paraphrase
aptness judgments. 在第 13 届会议记录中
国际计算会议
语义学 – Long Papers, pages 165–175.
Gothenburg, 瑞典.
Dwight Bolinger. 1968. Judgments of grammati-
cality. Lingua, 21:34–40.
Ronnie Cann, Ruth Kempson, and Matthew
Purver. 2007. Context and well-formedness:
the dynamics of ellipsis. Research on Language
and Computation, 5(3):333–358.
Mickaël Causse, Vsevolod Peysakhovich, and
Eve F. 法布尔. 2016. High working memory load
impairs language processing during a simulated
piloting task: An ERP and pupillometry study.
Frontiers in Human Neuroscience, 10:240.
Wayne Cowart. 1994. Anchoring and grammar
effects in judgments of sentence acceptability.
Perceptual and Motor Skills, 79(3):1171–1182.
Zihang Dai, Zhilin Yang, Yiming Yang,
Jaime G. Carbonell, Quoc V. Le,
和
Ruslan Salakhutdinov. 2019. Transformer-
XL: Attentive language models beyond a
fixed-length context. CoRR, abs/1901.02860.
David Denison. 2004. Fuzzy Grammar: A Reader,
牛津大学出版社, 英国.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, 和
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
这 2019
理解. 在诉讼程序中
Conference of the North American Chapter of
the Association for Computational Linguistics:
人类语言技术, 体积 1
(Long and Short Papers), pages 4171–4186.
明尼阿波利斯, Minnesota.
John Duchi, Elad Hazan, and Yoram Singer.
2011. Adaptive subgradient methods for online
learning and stochastic optimization. 杂志
Machine Learning Research, 12:2121–2159.
Maria Giavazzi, Sara Sambin, Ruth de Diego-
Balaguer, Lorna Le Stanc, Anne-Catherine
Bachoud-Lévi, and Charlotte Jacquemot. 2018.
Structural priming in sentence comprehen-
锡安: A single prime is enough. PLoS ONE,
13(4):e0194959.
Jonathan Ginzburg. 2012. The Interactive Stance:
Meaning for Conversation, 牛津大学
按.
Yvette Graham, Barry Haddow, and Philipp
科恩. 2019. Translationese in machine trans-
lation evaluation. CoRR, abs/1906.09833.
Sidney Greenbaum. 1976. Contextual influence on acceptability judgements. Linguistics, 15(187):5–12.
Jeroen Groenendijk and Martin Stokhof. 1990. Dynamic Montague grammar. In L. Kalman and L. Polos, editors, Proceedings of the 2nd Symposium on Logic and Language, pages 3–48. Budapest.
Irene Heim. 1982. The Semantics of Definite and
Indefinite Noun Phrases. 博士. 论文, 大学-
sity of Massachusetts at Amherst.
Felix Hill, Roi Reichart, and Anna Korhonen.
2015. SimLex-999: Evaluating semantic mod-
els with (genuine) similarity estimation. 康姆-
putational Linguistics, 41:665–695.
Sepp Hochreiter and Jürgen Schmidhuber. 1997.
Long short-term memory. 神经计算,
9:1735–1780.
Aine Ito, Martin Corley, and Martin J. 皮克林.
2018. A cognitive load delays predictive eye
movements similarly during L1 and L2 compre-
hension. Bilingualism: Language and Cogni-
的, 21(2):251–264.
Hans Kamp and Uwe Reyle. 1993. From Dis-
course To Logic, Kluwer Academic Publishers.
Urvashi Khandelwal, He He, Peng Qi, and Dan
Jurafsky. 2018. Sharp nearby, fuzzy far away:
How neural language models use context. 在
Proceedings of the 56th Annual Meeting of
the Association for Computational Linguistics
(体积 1: Long Papers), pages 284–294.
计算语言学协会,
墨尔本, 澳大利亚.
Diederik P. Kingma and Jimmy Ba. 2014. 亚当:
A method for stochastic optimization. CoRR,
abs/1412.6980.
Taku Kudo and John Richardson. 2018.
SentencePiece: A simple and language inde-
pendent subword tokenizer and detokenizer
for neural text processing. 在诉讼程序中
这 2018 实证方法会议
自然语言处理: System Dem-
onstrations, pages 66–71. 布鲁塞尔, 比利时.
Jey Han Lau, Timothy Baldwin, and Trevor Cohn.
2017A. Topically driven neural language model.
In Proceedings of the 55th Annual Meeting of
the Association for Computational Linguistics
(体积 1: Long Papers), pages 355–365.
Vancouver, 加拿大.
Jey Han Lau, Alexander Clark, and Shalom Lappin. 2015. Unsupervised prediction of acceptability judgements. In Proceedings of the Joint Conference of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL-IJCNLP 2015), pages 1618–1628. Beijing, China.
Jey Han Lau, Alexander Clark, and Shalom Lappin. 2017b. Grammaticality, acceptability, and probability: A probabilistic view of linguistic knowledge. Cognitive Science, 41:1202–1241.
William Mann and Sandra Thompson. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text, 8(3):243–281.
Christopher D. 曼宁, Mihai Surdeanu, 约翰
Bauer, Jenny Finkel, Steven J. Bethard, 和
David McClosky. 2014. The Stanford CoreNLP
natural language processing toolkit. In Asso-
ciation for Computational Linguistics (前交叉韧带)
系统演示, pages 55–60.
Andrew Kachites McCallum. 2002. Mallet: A
machine learning for language toolkit. http://
mallet.cs.umass.edu.
Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010.
Recurrent neural network based language
模型. In INTERSPEECH 2010, 11th Annual
Conference of the International Speech Com-
munication Association, pages 1045–1048.
Makuhari, 日本.
Hyangsook Park, Jun-Su Kang, Sungmook Choi,
and Minho Lee. 2013. Analysis of cognitive
load for language processing based on brain
活动. In Neural Information Processing,
pages 561–568. 施普林格柏林海德堡,
柏林, Heidelberg.
Adam Pauls and Dan Klein. 2012. Large-scale
syntactic language modeling with treelets. 在
Proceedings of the 50th Annual Meeting of
the Association for Computational Linguistics
(体积 1: Long Papers), pages 959–968. Jeju
Island, 韩国.
Alec Radford, Jeff Wu, Rewon Child, 大卫
Luan, Dario Amodei, and Ilya Sutskever. 2019.
Language models are unsupervised multitask
learners.
David Reitter, Frank Keller, and Johanna D.
摩尔. 2011. A computational cognitive
model of syntactic priming. 认知科学,
35(4):587–637.
Rico Sennrich, Barry Haddow, and Alexandra
Birch. 2016. Neural machine translation of rare
words with subword units. 在诉讼程序中
the 54th Annual Meeting of the Association
for Computational Linguistics (体积 1: 长的
文件), pages 1715–1725. 柏林, 德国.
Antonella Sorace and Frank Keller. 2005. Gradience in linguistic data. Lingua, 115:1497–1524.
Jon Sprouse. 2007. Continuous acceptability,
categorical grammaticality, and experimental
syntax. Biolinguistics, 1:123–134.
John Sweller. 1988. Cognitive load during
problem solving: Effects on learning. 认知的
科学, 12(2):257–285.
Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Łukasz Kaiser, and Illia Polosukhin. 2017.
Attention is all you need. In Advances in
Neural Information Processing Systems 30,
pages 5998–6008.
Alex Wang and Kyunghyun Cho. 2019. BERT has
a mouth, and it must speak: BERT as a Markov
random field language model. In Proceedings
of the Workshop on Methods for Optimizing
and Evaluating Neural Language Generation,
pages 30–36. Association for Computational
语言学, 明尼阿波利斯, Minnesota.
John Warner and Arnold L. Glass. 1987. Context
and distance-to-disambiguation effects in ambi-
guity resolution: Evidence from grammaticality
judgments of garden path sentences. 杂志
记忆与语言, 26(6):714 – 738.
Alex Warstadt, Amanpreet Singh, and Samuel R.
Bowman. 2018. Neural network acceptability
判断. CoRR, abs/1805.12471.
Yonghui Wu, Mike Schuster, Zhifeng Chen,
Quoc V. Le, Mohammad Norouzi, Wolfgang
Macherey, Maxim Krikun, Yuan Cao, Qin
高, Klaus Macherey, Jeff Klingner, Apurva
Shah, Melvin Johnson, Xiaobing Liu, Lukasz
Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku
Kudo, Hideto Kazawa, Keith Stevens, 乔治
Kurian, Nishant Patil, Wei Wang, Cliff Young,
Jason Smith, Jason Riesa, Alex Rudnick,
Oriol Vinyals, Greg Corrado, Macduff Hughes,
and Jeffrey Dean. 2016. Google’s neural
machine translation system: Bridging the gap
between human and machine translation. CoRR,
abs/1609.08144.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G.
Carbonell, Ruslan Salakhutdinov, and Quoc V.
Le. 2019. XLNet: Generalized autoregressive
pretraining for language understanding. CoRR,
abs/1906.08237.
Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan
Salakhutdinov, Raquel Urtasun, Antonio
Torralba, and Sanja Fidler. 2015. Aligning
books and movies: Towards story-like visual
explanations by watching movies and reading
图书. 在诉讼程序中 2015 IEEE Inter-
national Conference on Computer Vision
(ICCV), pages 19–27. 华盛顿, 直流, 美国.
Christina Zlogar and Kathryn Davidson. 2018.
Effects of linguistic context on the acceptability
of co-speech gestures. Glossa, 3(1):73.
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
3
1
5
1
9
2
3
6
1
0
/
/
t
我
A
C
_
A
_
0
0
3
1
5
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
8
S
e
p
e
米
乙
e
r
2
0
2
3
310
下载pdf