Supervised and Unsupervised Neural
Approaches to Text Readability

Matej Martinc
Jožef Stefan Institute, Ljubljana, Slovenia
Jožef Stefan International Postgraduate
School, Ljubljana, Slovenia
matej.martinc@ijs.si

Senja Pollak
Jožef Stefan Institute, Ljubljana, Slovenia
senja.pollak@ijs.si

Marko Robnik-Šikonja
University of Ljubljana, Faculty of
Computer and Information Science,
Ljubljana, Slovenia
marko.robnik@fri.uni-lj.si

We present a set of novel neural supervised and unsupervised approaches for determining the
readability of documents. In the unsupervised setting, we leverage neural language models,
whereas in the supervised setting, three different neural classification architectures are tested. We
show that the proposed neural unsupervised approach is robust, transferable across languages,
and allows adaptation to a specific readability task and data set. By systematic comparison of
several neural architectures on a number of benchmark and new labeled readability data sets in
two languages, this study also offers a comprehensive analysis of different neural approaches to
readability classification. We expose their strengths and weaknesses, compare their performance
to current state-of-the-art classification approaches to readability, which in most cases still rely
on extensive feature engineering, and propose possibilities for improvements.

1. Introduction

Readability is concerned with the relation between a given text and the cognitive load
of a reader to comprehend it. This complex relation is influenced by many factors, such
as a degree of lexical and syntactic sophistication, discourse cohesion, and background
knowledge (Crossley et al. 2017). In order to simplify the problem of measuring readability, traditional readability formulas focused only on lexical and syntactic features

Submission received: 26 July 2019; revised version received: 22 November 2020; accepted for publication:
18 December 2020.

https://doi.org/10.1162/coli_a_00398

© 2021 Association for Computational Linguistics
Published under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International
(CC BY-NC-ND 4.0) license

Computational Linguistics

Volume 47, Number 1

expressed with statistical measurements, such as word length, sentence length, and
word difficulty (Davison and Kantor 1982). These approaches have been criticized
because of their reductionism and weak statistical bases (Crossley et al. 2017). Another
problem is their objectivity and cultural transferability, since children from different environments master different concepts at different ages. For example, the word television is
quite long and contains many syllables but is well-known to most young children who
live in families with a television.

With the development of novel natural language processing (NLP) techniques,
several studies attempted to eliminate deficiencies of traditional readability formulas.
These attempts include leveraging high-level textual features for readability modeling,
such as semantic and discursive properties of texts. Among them, cohesion and co-
herence received the most attention, and several readability predictors based on these
text features have been proposed (see Section 2). Nevertheless, none of them seems to
predict the readability of the text as well as the much simpler readability formulas mentioned above (Todirascu et al. 2016).

With the improvements in machine learning, the focus shifted once again, and most
newer approaches consider readability as a classification, regression, or ranking task. Machine learning approaches build models that predict human-assigned
readability scores based on several attributes and manually built features that cover
as many text dimensions as possible (Schwarm and Ostendorf 2005; Petersen and
Ostendorf 2009; Vajjala and Meurers 2012). They generally yield better results than the
traditional readability formulas and text cohesion–based methods but require addi-
tional external resources, such as labeled readability data sets, which are scarce. Another
problem is the transferability of these approaches between different corpora and lan-
guages, because the resulting feature sets do not generalize well to different types of
texts (Xia, Kochmar, and Briscoe 2016; Filighera, Steuer, and Rensing 2019).

Recently, deep neural networks (Goodfellow, Bengio, and Courville 2016) have shown impressive performance on many language-related tasks. Indeed, they have achieved state-of-the-art performance in all semantic tasks where sufficient amounts of data were available (Collobert et al. 2011; Zhang, Zhao, and LeCun 2015). Even though
very recently some neural approaches toward readability prediction have been proposed (Nadeem and Ostendorf 2018; Filighera, Steuer, and Rensing 2019), these types
of studies are still relatively scarce, and further research is required in order to establish
what type of neural architectures are the most appropriate for distinct readability tasks
and data sets. Moreover, language model features designed to measure lexical and
semantic properties of text, which can be found in many of the readability studies
(Schwarm and Ostendorf 2005; Petersen and Ostendorf 2009; Xia, Kochmar, and Briscoe
2016), are generated with traditional n-gram language models, even though language
modeling has been drastically improved with the introduction of neural language models (Mikolov et al. 2011).

The aim of the present study is two-fold. First, we propose a novel approach to read-
ability measurement that takes into account neural language model statistics. This ap-
proach is unsupervised and requires no labeled training set but only a collection of
texts from the given domain. We demonstrate that the proposed approach is capable
of contextualizing the readability because of the trainable nature of neural networks
and that it is transferable across different languages. In this scope, we propose a new
measure of readability, RSRS (ranked sentence readability score), with good correlation
with true readability scores.

Second, we experiment to find how different neural architectures with automated feature generation can be used for readability classification and compare their
performance to state-of-the-art classification approaches. Three distinct branches of neu-
ral architectures—recurrent neural networks (RNN), hierarchical attention networks
(HAN), and transfer learning techniques—are tested on four gold standard readability
corpora with good results.

The article is structured as follows. Section 2 addresses the related work on readability prediction. Section 3 offers a thorough analysis of the data sets used in our experiments, and in Section 4, we present the methodology and results for the proposed unsupervised approach to readability prediction. The methodology and experimental results for the supervised approach are presented in Section 5. We present conclusions and directions for further work in Section 6.

2. Related Work

Approaches to the automated measuring of readability try to find and assess factors that
correlate well with human perception of readability. Several indicators, which measure
different aspects of readability, have been proposed in the past and are presented in
Section 2.1. These measures are used as features in newer approaches, which train machine learning models on texts with human-annotated readability levels so that they can
predict readability levels on new unlabeled texts. Approaches that rely on an extensive
set of manually engineered features are described in Section 2.2. Finally, Section 2.3
covers the approaches that tackle readability prediction with neural classifiers. Besides
tackling the readability as a classification problem, several other supervised statistical
approaches for readability prediction have been proposed in the past. They include
regression (Sheehan et al. 2010), Support Vector Machine (SVM) ranking (Ma, Fosler-Lussier, and Lofthus 2012), and graph-based methods (Jiang, Xun, and Qi 2015), among many others. We do not cover these methods in the related work because they are not
directly related to the proposed approach.

2.1 Readability Features

Classical readability indicators can be roughly divided into five distinct groups: traditional, discourse cohesion, lexico-semantic, syntactic, and language model features. We
describe them below.

2.1.1 Traditional Features. Traditionally, readability in texts was measured by statistical
readability formulas, which try to construct a simple human-comprehensible formula
with a good correlation to what humans perceive as the degree of readability. The sim-
plest of them is average sentence length (ASL), though they take into account various
other statistical factors, such as word length and word difficulty. Most of these formulas
were originally developed for the English language but are also applicable to other
languages with some modifications (Škvorc et al. 2019).

The Gunning fog index (Gunning 1952) (GFI) estimates the years of formal educa-
tion a person needs to understand the text on the first reading. It is calculated with the
following expression:

GFI = 0.4 × (totalWords / totalSentences + 100 × longWords / totalWords)

where longWords are words longer than 7 characters. Higher values of the index indicate
lower readability.

Flesch reading ease (Kincaid et al. 1975) (FRE) assigns higher values to more readable texts. It is calculated in the following way:

FRE = 206.835 − 1.015 × (totalWords / totalSentences) − 84.6 × (totalSyllables / totalWords)

The values returned by the Flesch-Kincaid grade level (Kincaid et al. 1975) (FKGL)
correspond to the number of years of education generally required to understand the
text for which the formula was calculated. The formula is defined as follows:

FKGL = 0.39 × (totalWords / totalSentences) + 11.8 × (totalSyllables / totalWords) − 15.59

Another readability formula that returns values corresponding to the years of edu-
cation required to understand the text is the Automated Readability Index (Smith and
Senter 1967) (ARI):

ARI = 4.71 × (totalCharacters / totalWords) + 0.5 × (totalWords / totalSentences) − 21.43
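FRE, FKGL, and ARI all derive from the same three aggregate statistics. A minimal sketch computing all three (the function name and argument interface are ours; syllable and character counting is left to the caller):

```python
def grade_formulas(total_words, total_sentences, total_syllables,
                   total_characters):
    """FRE, FKGL, and ARI from aggregate text statistics."""
    asl = total_words / total_sentences   # average sentence length
    spw = total_syllables / total_words   # syllables per word
    cpw = total_characters / total_words  # characters per word
    fre = 206.835 - 1.015 * asl - 84.6 * spw
    fkgl = 0.39 * asl + 11.8 * spw - 15.59
    ari = 4.71 * cpw + 0.5 * asl - 21.43
    return fre, fkgl, ari

# A hypothetical 100-word text: 8 sentences, 140 syllables,
# 470 characters.
fre, fkgl, ari = grade_formulas(100, 8, 140, 470)
```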

The Dale-Chall readability formula (Dale and Chall 1948) (DCRF) requires a list of
3,000 words that fourth-grade US students could reliably understand. Words that do
not appear in this list are considered difficult. If the list of words is not available, it is
possible to use the GFI approach and consider all the words longer than 7 characters as
difficult. The following expression is used in calculation:

DCRF = 0.1579 × (difficultWords / totalWords × 100) + 0.0496 × (totalWords / totalSentences)
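A sketch of the computation, with a toy easy-word set standing in for the 3,000-word Dale-Chall list (names and the toy list are ours):

```python
def dale_chall(words, num_sentences, easy_words):
    """Dale-Chall readability formula; words not found in the
    easy-word list are counted as difficult."""
    difficult = sum(1 for w in words if w.lower() not in easy_words)
    return (0.1579 * (difficult / len(words)) * 100
            + 0.0496 * (len(words) / num_sentences))

# Toy example: every word is on the easy list, so only the
# sentence-length term contributes.
easy = {"the", "cat", "sat", "on", "mat"}
score = dale_chall(["The", "cat", "sat", "on", "the", "mat"], 1, easy)
```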

The SMOG grade (Simple Measure of Gobbledygook) (McLaughlin 1969) is a read-
ability formula originally used for checking health messages. Similar to FKGL and ARI,
it roughly corresponds to the years of education needed to understand the text. It is
calculated with the following expression:

SMOG = 1.0430 × √(numberOfPolysyllables × 30 / totalSentences) + 3.1291

where the numberOfPolysyllables is the number of words with three or more syllables.
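A minimal sketch of the computation; the vowel-group syllable counter is a rough approximation of ours, not part of the original formula:

```python
import math

def smog_grade(words, total_sentences):
    """SMOG grade over a tokenized text; a word counts as
    polysyllabic when it has three or more syllables."""
    def syllables(word):
        # Approximate syllables as maximal vowel groups (assumption).
        count, prev_vowel = 0, False
        for ch in word.lower():
            is_vowel = ch in "aeiouy"
            if is_vowel and not prev_vowel:
                count += 1
            prev_vowel = is_vowel
        return max(count, 1)

    poly = sum(1 for w in words if syllables(w) >= 3)
    return 1.0430 * math.sqrt(poly * 30 / total_sentences) + 3.1291
```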

We are aware of one study that explored the transferability of these formulas across
genres (Sheehan, Flor, and Napolitano 2013), and one study that explored transferability
across languages (Madrazo Azpiazu and Pera 2020). The study by Sheehan, Flor, and
Napolitano (2013) concludes that, mostly due to vocabulary specifics of different genres,
traditional readability measures are not appropriate for cross-genre prediction, because they underestimate the complexity levels of literary texts and overestimate those of
educational texts. The study by Madrazo Azpiazu and Pera (2020), on the other hand,
concludes that the readability level predictions for translations of the same text are
rarely consistent when using these formulas.

All of the above-mentioned readability measures were designed for the specific use
on English texts. There are some rare attempts to adapt these formulas to other lan-
guages (Kandel and Moles 1958) or to create new formulas that could be used on lan-
guages other than English (Anderson 1981).

To show the multilingual potential of our approach, we address two languages in this
学习, English and Slovenian, a Slavic language with rich morphology and orders of
magnitude fewer resources compared to English. For Slovenian, readability studies are
scarce. Škvorc et al. (2019) researched how well the above statistical readability formulas
work on Slovenian text by trying to categorize text from three distinct sources: children’s
magazines, newspapers and magazines for adults, and transcriptions of sessions of
the National Assembly of Slovenia. Results of this study indicate that formulas that
consider the length of words and/or sentences work better than formulas that rely
on word lists. They also noticed that simple indicators of readability, such as percent-
age of adjectives and average sentence length, work quite well for Slovenian. To our
knowledge, the only other study that employed readability formulas on Slovenian texts
was done by Zwitter Vitez (2014). In that study, the readability formulas were used as
features in the author recognition task.

2.1.2 Discourse Cohesion Features. In the literature, we can find at least two distinct
notions of discourse cohesion (Todirascu et al. 2016). First is the notion of coherence,
defined as the “semantic property of discourse, based on the interpretation of each
sentence relative to the interpretation of other sentences" (Van Dijk 1977). Previous
research that investigates this notion tries to determine whether a text can be interpreted
as a coherent message and not just as a collection of unrelated sentences. This can
be done by measuring certain observable features of the text, such as the repetition
of content words, or by analysis of explicit connectives (because, finally, therefore, etc.) (Sheehan et al. 2014). A somewhat more investigated notion,
due to its easier operationalization, is the notion of cohesion, defined as “a property of
text represented by explicit formal grammatical ties (discourse connectives) and lexical
ties that signal how utterances or larger text parts are related to each other.”

According to Todirascu et al. (2016), we can divide cohesion features into five
distinct classes, outlined below: co-reference and anaphoric chain properties, entity density and entity cohesion features, lexical cohesion measures, and part-of-speech (POS) tag-based cohesion features. Co-reference and anaphoric chain properties were
first proposed by Bormuth (1969), who measured various characteristics of anaphora.
These features include statistics, such as the average length of reference chains or the
proportion of various types of mention (noun phrases, proper names, etc.) in the chain.
Entity density features include statistics such as the total number of all/unique entities
per document, the average number of all/unique entities per sentence, and so on.
These features were first proposed in Feng, Elhadad, and Huenerfauth (2009) and Feng
et al. (2010), who followed the theoretical line from Halliday and Hasan (1976) and Williams (2006). Entity cohesion features assess the relative frequency of possible transitions between syntactic functions played by the same entity in adjacent sentences
(Pitler and Nenkova 2008). Lexical cohesion measures include features such as the
frequency of content word repetition across adjacent sentences (Sheehan et al. 2014),
a Latent Semantic Analysis (LSA)-based feature for measuring the similarity of words
and passages to each other proposed by Landauer (2011), or a measure called Lexical
Tightness (LT), suggested by Flor, Klebanov, and Sheehan (2013), defined as the mean
value of the Positive Normalized Pointwise Mutual Information (PMI) for all pairs of
content-word tokens in a text. The last category is POS tag-based cohesion features,
which measure the ratio of pronoun and article parts-of-speech, two crucial elements of
cohesion (Todirascu et al. 2016).
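For illustration, the POS tag-based ratios reduce to simple counting over tagged tokens. A minimal sketch (Penn Treebank-style tag names are an assumption for illustration, as is the function interface):

```python
def pos_cohesion_ratios(tagged_tokens):
    """Pronoun and article ratios over (token, POS tag) pairs;
    Penn Treebank-style tags (PRP, PRP$, DT) are assumed."""
    total = len(tagged_tokens)
    pronouns = sum(1 for _, tag in tagged_tokens
                   if tag in {"PRP", "PRP$"})
    articles = sum(1 for tok, tag in tagged_tokens
                   if tag == "DT" and tok.lower() in {"a", "an", "the"})
    return pronouns / total, articles / total

tagged = [("She", "PRP"), ("saw", "VBD"), ("the", "DT"), ("dog", "NN")]
pronoun_ratio, article_ratio = pos_cohesion_ratios(tagged)
```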

Todirascu et al. (2016), who analyzed 65 discourse features found in the readability
literature, concluded that they generally do not contribute much to the predictive power
of text readability classifiers when compared with the traditional readability formulas
or simple statistics such as sentence length.

2.1.3 Lexico-semantic Features. According to Collins-Thompson (2014), vocabulary knowl-
edge is an important aspect of reading comprehension, and lexico-semantic features
measure the difficulty of vocabulary in the text. A common feature is Type-token ratio
(TTR), which measures the ratio between the number of unique words and the total
number of words in a text. The length of the text influences TTR; therefore, several
corrections, which produce a more unbiased representation, such as Root TTR and
Corrected TTR, are also used for readability prediction.
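These variants are simple to compute; a sketch (helper name is ours) following the usual definitions of Root TTR (types over the square root of tokens) and Corrected TTR (types over the square root of twice the tokens):

```python
import math

def ttr_variants(tokens):
    """Type-token ratio plus two length-corrected variants that
    dampen the dependence on text length."""
    types, n = len(set(tokens)), len(tokens)
    ttr = types / n
    root_ttr = types / math.sqrt(n)           # Root TTR
    corrected_ttr = types / math.sqrt(2 * n)  # Corrected TTR
    return ttr, root_ttr, corrected_ttr

# 4 tokens, 3 types ("a" repeats).
ttr, root_ttr, cttr = ttr_variants(["a", "b", "a", "c"])
```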

Other frequently used features in classification approaches to readability are n-gram
lexical features, such as word and character n-grams (Vajjala and Meurers 2012; Xia,
Kochmar, and Briscoe 2016). While POS-based lexical features measure lexical variation (i.e., TTR of lexical items such as nouns, adjectives, verbs, adverbs, and prepositions) and density (e.g., the percentage of content words and function words), word list-based features use external psycholinguistic and Second Language Acquisition (SLA)
资源, which contain information about which words and phrases are acquired at
the specific age or English learning class.

2.1.4 Syntactic Features. Syntactic features measure the grammatical complexity of the
text and can be divided into several categories. Parse tree features include features
such as an average parse tree height or an average number of noun- or verb-phrases per
句子. Grammatical relations features include measures of grammatical relations
between constituents in a sentence, such as the longest/average distance in the gram-
matical relation sets generated by the parser. Complexity of syntactic unit features
measure the length of a syntactic unit at the sentence, clause (any structure with a
subject and a finite verb), and T-unit level (one main clause plus any subordinate clause).
最后, coordination and subordination features measure the amount of coordination
and subordination in the sentence and include features such as a number of clauses per
T-unit or the number of coordinate phrases per clause, and so on.
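As a small illustration of parse tree features, the height of a tree given in Penn-style bracketed notation can be read off as the maximum parenthesis nesting depth. This is a simplified sketch of ours; real feature extractors typically work on parser output objects rather than strings:

```python
def parse_tree_height(bracketed):
    """Maximum nesting depth of a bracketed (Penn-style) parse."""
    depth = max_depth = 0
    for ch in bracketed:
        if ch == "(":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == ")":
            depth -= 1
    return max_depth

def average_parse_tree_height(parses):
    """Average height over the sentence parses of a document."""
    return sum(parse_tree_height(p) for p in parses) / len(parses)
```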

2.1.5 Language Model Features. The standard task of language modeling can be formally
defined as predicting a probability distribution of words from the fixed size vocabulary
V, for word wt+1, given the historical sequence w1:t = [w1, . . . , wt]. To measure its performance, traditionally a metric called perplexity is used. A language model m is evaluated
according to how well it predicts a separate test sequence of words w1:N= [w1, . . . , wN].
For this case, the perplexity (PPL) of the language model m() is defined as:

PPL = 2^(−(1/N) Σ_{i=1}^{N} log2 m(wi))    (1)

where m(wi) is the probability assigned to word wi by the language model m, and N is
the length of the sequence. The lower the perplexity score, the better the language model
predicts the words in a document—that is, the more predictable and aligned with the
training set the text is.
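Equation (1) is straightforward to compute once the model's per-token probabilities are available; a minimal sketch (the list-of-probabilities interface is ours):

```python
import math

def perplexity(token_probs):
    """Perplexity of a sequence from the probabilities m(w_i) that
    a language model assigned to each of its tokens."""
    n = len(token_probs)
    avg_log2 = sum(math.log2(p) for p in token_probs) / n
    return 2 ** (-avg_log2)

# A model assigning uniform probability 1/4 to each of 4 tokens
# has perplexity 4.
ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```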

All past approaches for readability detection that use language modeling leverage
older n-gram language models rather than the newer neural language models. Schwarm
and Ostendorf (2005) train one n-gram language model for each readability class c in the
training data set. For each text document d, they calculate the likelihood ratio according
to the following formula:

LR(d, c) = P(d|c) P(c) / Σ_{c̄≠c} P(d|c̄) P(c̄)

where P(d|c) denotes the probability returned by the language model trained on texts labeled with class c, and P(d|c̄) denotes the probability of d returned by the language model trained on class c̄. Uniform prior probabilities of classes are assumed. The likelihood
ratios are used as features in the classification model, along with perplexities achieved
by all the models.
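With uniform priors, the P(c) terms cancel, so the ratio reduces to the class-conditional probabilities. A sketch (the dictionary interface and toy class names are ours):

```python
def likelihood_ratio(doc_probs, target_class):
    """Likelihood ratio of a document for one readability class.

    doc_probs maps each class c to P(d|c) from that class's language
    model; uniform class priors are assumed, so they cancel out."""
    other = sum(p for c, p in doc_probs.items() if c != target_class)
    return doc_probs[target_class] / other

# Toy per-class probabilities for one document.
probs = {"easy": 0.6, "medium": 0.3, "hard": 0.1}
lr_easy = likelihood_ratio(probs, "easy")
```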

In Petersen and Ostendorf (2009), three statistical language models (unigram, bigram, and trigram) are trained on four external data resources: Britannica (adult), Britannica Elementary, CNN (adult), and CNN abridged. The resulting 12 n-gram language
models are used to calculate perplexities of each target document. It is assumed that low
perplexity scores calculated by language models trained on the adult level texts and
high perplexity scores of language models trained on the elementary/abridged levels
would indicate a high reading level, and high perplexity scores of language models
trained on the adult level texts and low perplexity scores of language models trained on
the elementary/abridged levels would indicate a low reading level.

Xia, Kochmar, and Briscoe (2016) train 1- to 5-gram word-based language models on
the British National Corpus, and 25 POS-based 1- to 5-gram models on the five classes
of the WeeBit corpus. Language models’ log-likelihood and perplexity scores are used
as features for the classifier.

2.2 Classification Approaches Based on Feature Engineering

The above approaches measure readability in an unsupervised way, using the described
features. Alternatively, we can predict the level of readability in a supervised way. These
approaches usually require extensive feature engineering and also leverage many of the
features described earlier.

One of the first classification approaches to readability was proposed by Schwarm
and Ostendorf (2005). It relies on an SVM classifier trained on the WeeklyReader corpus,1
containing articles grouped into four classes according to the age of the target audience. Traditional, syntactic, and language model features are used in the model. The
approach was extended and improved upon in Petersen and Ostendorf (2009).

Altogether, 155 traditional, discourse cohesion, lexico-semantic, and syntactic fea-
tures were used in an approach proposed by Vajjala and Lučić (2018), tested on a
recently published OneStopEnglish corpus. Sequential Minimal Optimization (SMO)
classifier with the linear kernel achieved the classification accuracy of 78.13% for three
readability classes (elementary, intermediate, and advanced reading level).

1 http://www.weeklyreader.com.

A successful classification approach to readability was proposed by Vajjala and
Meurers (2012). Their multilayer perceptron classifier is trained on the WeeBit corpus
(Vajjala and Meurers 2012) (see Section 3 for more information on WeeBit and other mentioned corpora). The texts were classified into five classes according to the age group
they are targeting. For classification, the authors use 46 manually crafted traditional,
lexico-semantic, and syntactic features. For the evaluation, they trained the classifier
on a train set consisting of 500 documents from each class and tested it on a balanced
test set of 625 documents (containing 125 documents per class). They report 93.3%
accuracy on the test set.2

Another set of experiments on the WeeBit corpus was conducted by Xia, Kochmar,
and Briscoe (2016), who conducted additional cleaning of the corpus because it con-
tained some texts with broken sentences and additional meta-information about the
source of the text, such as copyright declaration and links, strongly correlated with the
target labels. They use similar lexical, 句法的, and traditional features as Vajjala and
Meurers (2012) but add language modeling (参见章节 2.1.5 欲了解详情) and discourse
cohesion-based features. Their SVM classifier achieves 80.3% accuracy using 5-fold crossvalidation. This is one of the studies where the transferability of classification models is tested. The authors used an additional CEFR (Common European
Framework of Reference for Languages) 语料库. This small data set of CEFR-graded
texts is tailored for learners of English (Council of Europe 2001) and also contains 5
readability classes. The SVM classifier trained on the WeeBit corpus and tested on the
CEFR corpus achieved the classification accuracy of 23.3%, hardly beating the majority
classifier baseline. This low result was attributed to the differences in readability classes
in both corpora, since WeeBit classes are targeting children of different age groups, 和
CEFR corpus classes are targeting mostly adult foreigners with different levels of English comprehension. However, this result is a strong indication that the transferability of
readability classification models across different types of texts is questionable.

Two other studies that deal with the multi-genre prospects of readability prediction
were conducted by Sheehan, Flor, and Napolitano (2013) and Napolitano, Sheehan, and
Mundkowsky (2015). Both studies describe the problem in the context of the TextEvalu-
ator Tool (Sheehan et al. 2010), an online system for text complexity analysis. The system
supports multi-genre readability prediction with the help of a two-stage prediction
workflow, in which first the genre of the text is determined (as being informational, literary, or mixed) and after that its readability level is predicted with an appropriate
genre-specific readability prediction model. Similarly to the study above, this work also
indicates that using classification models for cross-genre prediction is not feasible.

When it comes to multi- and crosslingual classification, Madrazo Azpiazu and Pera
(2020) explore the possibility of a crosslingual readability assessment and show that
their methodology called CRAS (Crosslingual Readability Assessment Strategy), 哪个
includes building a classifier that uses a set of traditional, lexico-semantic, syntactic, and discourse cohesion-based features, works well in a multilingual setting. They also
show that classification for some low resource languages can be improved by including
documents from a different language into the train set for a specific language.

2 Later research by Xia, Kochmar, and Briscoe (2016) called the validity of the published experimental

results into question; therefore, the reported 93.3% accuracy might not be the objective state-of-the-art
result for readability classification.

2.3 Neural Classification Approaches

Recently, several neural approaches for readability prediction have been proposed.
Nadeem and Ostendorf (2018) tested two different architectures on the WeeBit cor-
pus regression task, namely, a sequential Gated Recurrent Unit (GRU) (Cho et al. 2014)
based RNN with the attention mechanism and hierarchical RNNs (Yang et al. 2016)
with two distinct attention types: a more classic attention mechanism proposed by
Bahdanau, Cho, and Bengio (2014), and multi-head attention proposed by Vaswani
等人. (2017). The results of the study indicate that hierarchical RNNs generally perform
better than sequential ones. Nadeem and Ostendorf (2018) also show that neural networks
can be a good alternative to more traditional feature-based models for readability pre-
diction on texts shorter than 100 words, but do not perform as competitively on longer
文本.

Another version of a hierarchical RNN with the attention mechanism was proposed
by Azpiazu and Pera (2019). Their system, named Vec2Read, is a multi-attentive RNN
capable of leveraging hierarchical text structures with the help of word and sentence
level attention mechanisms and a custom-built aggregation mechanism. They used
the network in a multilingual setting (on corpora containing Basque, Catalan, Dutch,
英语, 法语, Italian, and Spanish texts). Their conclusion was that although the
number of instances used for training has a strong effect on the overall performance of
系统, no language-specific patterns emerged that would indicate that prediction
of readability in some languages is harder than in others.

An even more recent neural approach for readability classification on the cleaned
WeeBit corpus (Xia, Kochmar, and Briscoe 2016) was proposed by Filighera, Steuer, 和
Rensing (2019), who tested a set of different embedding models: word2vec (Mikolov et al. 2013), the uncased Common Crawl GloVe (Pennington, Socher, and Manning
2014), ELMo (Peters et al. 2018), and BERT (Devlin et al. 2019). The embeddings were
fed to either a recurrent or a convolutional neural network. The BERT-based approach
from their work is somewhat similar to the BERT-based supervised classification approach proposed in this work. However, one main distinction is that no fine-tuning is conducted on the BERT model in their experiments (i.e., the extraction of embeddings
is conducted on the pretrained BERT language model). Their best ELMo-based model
with a bidirectional LSTM achieved an accuracy of 79.2% on the development set,
slightly lower than the accuracy of 80.3% achieved by Xia, Kochmar, and Briscoe (2016)
in the 5-fold crossvalidation scenario. However, they did manage to improve on the
state of the art by an ensemble of all their models, achieving the accuracy of 81.3%,
and the macro averaged F1-score of 80.6%.

A somewhat different neural approach to readability classification was proposed
by Mohammadi and Khasteh (2019), who tackled the problem with deep reinforcement
学习, or more specifically, with a deep convolutional recurrent double dueling Q
网络 (Wang et al. 2016) using a limited window of 5 adjacent words. GloVe embed-
dings and statistical language models were used to represent the input text in order to
eliminate the need for sophisticated NLP features. The model was used in a multilingual
setting (on English and Persian data sets) and achieved performance comparable to the
state of the art on all of the data sets, among them the WeeBit corpus (accuracy of 91%).

Finally, a recent study by Deutsch, Jasbi, and Shieber (2020) used predictions of
HAN and BERT models as additional features in their SVM model that also utilized a
set of syntactic and lexico-semantic features. Although they did manage to improve the
performance of their SVM classifiers with the additional neural features, they concluded

Computational Linguistics

Volume 47, Number 1

that additional syntactic and lexico-semantic features did not generally improve the
predictions of the neural models.

3. Data Sets

In this section, we first present the data sets used in the experiments (Section 3.1) and then conduct a preliminary analysis (Section 3.2) in order to assess the feasibility of the proposed experiments. Data set statistics are presented in Table 1.

3.1 Data Set Presentation

All experiments are conducted on four corpora labeled with readability scores:

The WeeBit corpus: The articles from WeeklyReader3 and BBC-Bitesize4
are classified into five classes according to the age group they are
aimed at. The classes correspond to age groups 7–8, 8–9, 9–10, 10–14, and
14–16 years. Three classes targeting younger audiences consist of articles
from WeeklyReader, an educational newspaper that covers a wide range of
nonfiction topics, from science to current affairs. Two classes targeting
older audiences consist of material from the BBC-Bitesize Web site,
containing educational material categorized into topics that roughly match
school subjects in the UK. In the original corpus of Vajjala and Meurers
(2012), the classes are balanced and the corpus contains altogether
3,125 documents, 625 per class. In our experiments, we followed the
recommendations of Xia, Kochmar, and Briscoe (2016) to fix broken
sentences and remove additional meta information, such as copyright
declarations and links, strongly correlated with the target labels. We
reextracted the corpus from the HTML files according to the procedure
described in Xia, Kochmar, and Briscoe (2016) and discarded some
documents because of the lack of content after the extraction and cleaning
process. The final corpus used in our experiments contains altogether
3,000 documents, 600 per class.

The OneStopEnglish corpus (Vajjala and Luˇci´c 2018) contains aligned
texts of three distinct reading levels (beginner, intermediate, and
advanced) that were written specifically for English as Second Language
(ESL) learners. The corpus was compiled over the period 2013–2016 from
the weekly news lessons section of the language learning resources
onestopenglish.com. The section contains articles sourced from the
Guardian newspaper, which were rewritten by English teachers to target
three levels of adult ESL learners (elementary, intermediate, and
advanced). Overall, the document-aligned parallel corpus consists of 189
texts, each written in three versions (567 in total). The corpus is freely
available.5

3 http://www.weeklyreader.com.
4 http://www.bbc.co.uk/bitesize.
5 https://zenodo.org/record/1219041.

The Newsela corpus (Xu, Callison-Burch, and Napoles 2015): We use
the version of the corpus from 29 January 2016 consisting of altogether
10,786 documents, out of which we only used the 9,565 English documents.
The corpus contains 1,911 original English news articles and up to four
simplified versions for every original article, that is, each original news
article has been manually rewritten up to 4 times by editors at Newsela, a
company that produces reading materials for pre-college classroom use, in
order to target children at different grade levels and help teachers prepare
curricula that match the English language skills required at each grade
level. The data set is a document-aligned parallel corpus of original and
simplified versions corresponding to altogether eleven different
imbalanced grade levels (from 2nd to 12th grade).

Corpus of Slovenian school books (Slovenian SB): In order to test the
transferability of the proposed approaches to other languages, a corpus of
Slovenian school books was compiled. The corpus contains 3,639,665
words in 125 school books for nine grades of primary school and four
grades of secondary school. It was created with several aims, such as
studying different quality aspects of school books, extraction of
terminology, and linguistic analysis. The corpus contains school books for
16 distinct subjects with very different topics, ranging from literature,
music, and history to math, biology, and chemistry, but not in equal
proportions, with readers being the largest type of school books included.

Whereas some texts were extracted from the Gigafida reference corpus
of written Slovene (Logar et al. 2012), most of the texts were extracted from
PDF files. After the extraction, we first conduct some light manual
cleaning of the extracted texts (removal of indices, copyright statements,
references, etc.). Next, in order to remove additional noise (tips, equations,
etc.), we apply a filtering script that relies on manually written rules for
sentence extraction (e.g., a text is a sentence if it starts with an uppercase
letter and ends with end-of-sentence punctuation) to obtain only passages
containing sentences. The final extracted texts come without structural
information (where a specific chapter ends or starts, which sentences
constitute a paragraph, where the questions are, etc.), since labeling the
document structure would require a large amount of manual effort;
therefore we did not attempt it for this research.
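The sentence-extraction rule described above can be sketched as a simple filter. The regular expression and the examples below are illustrative; the actual hand-written rules used for the Slovenian SB corpus were more extensive:

```python
import re

# Illustrative version of the cleaning heuristic: a line is kept only if it
# looks like a sentence, i.e., it starts with an uppercase letter and ends
# with end-of-sentence punctuation. The character class includes the
# Slovenian uppercase letters C-caron, S-caron, and Z-caron.
SENTENCE_RE = re.compile(r"^[A-ZČŠŽ].*[.!?]$")

def filter_sentences(lines):
    """Keep only the lines that match the sentence heuristic."""
    return [ln.strip() for ln in lines if SENTENCE_RE.match(ln.strip())]

noisy = [
    "To je poved.",   # a proper sentence: kept
    "glej stran 12",  # lowercase start (a tip): dropped
    "x + y = 7",      # equation residue: dropped
]
print(filter_sentences(noisy))  # ['To je poved.']
```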

For supervised classification experiments, we split the school books
into chunks of 25 sentences, in order to build a train and test set with a
sufficient number of documents.6 The length of 25 sentences was chosen
due to the size limitation of the BERT classifier, which can be fed documents
that contain up to 512 byte-pair tokens (Kudo and Richardson 2018),7
which on average translates to slightly less than 25 sentences.
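The chunking step can be sketched as a plain consecutive split. Whether a trailing chunk shorter than 25 sentences was kept or dropped is not stated in the text, so the sketch simply keeps it:

```python
def chunk_sentences(sentences, size=25):
    """Split a book (a list of sentences) into consecutive chunks of `size`
    sentences, as done when preparing the Slovenian SB corpus for
    supervised classification. A trailing chunk may be shorter than `size`."""
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]

book = [f"Sentence {i}." for i in range(60)]
chunks = chunk_sentences(book)
print([len(c) for c in chunks])  # [25, 25, 10]
```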

6 Note that this chunking procedure might break the text cohesion and that topical similarities between
chunks from the same chapter (or paragraph) might have a positive effect on the performance of the
classifiers. However, because the corpus does not contain any high-level structural information
(e.g., information about the paragraph or chapter structure of a specific school book), no other more
refined chunking method is possible.

7 Note that the BERT tokenizer uses byte-pair tokenization (Kudo and Richardson 2018), which in some
cases generates tokens that correspond to sub-parts of words rather than entire words. In the case of
Slovenian SB, 512 byte-pair tokens correspond to 306 word tokens on average.


Table 1
Readability classes, number of documents, number of tokens per readability class, and average
tokens per document in each readability corpus.

Readability class    #documents      #tokens    #tokens per doc.

Wikipedia
  simple                130,000   10,933,710     84.11
  balanced              130,000   10,847,108     83.44
  normal                130,000   10,719,878     82.46

OneStopEnglish
  beginner                  189      100,800    533.33
  intermediate              189      127,934    676.90
  advanced                  189      155,253    820.49
  all                       567      383,987    677.23

WeeBit
  age 7–8                   600       77,613    129.35
  age 8–9                   600      100,491    167.49
  age 9–10                  600      159,719    266.20
  age 10–14                 600       89,548    149.25
  age 14–16                 600      152,402    254.00
  all                     3,000      579,773    193.26

Newsela
  2nd grade                 224       74,428    332.27
  3rd grade                 500      197,992    395.98
  4th grade               1,569      923,828    588.80
  5th grade               1,342      912,411    679.89
  6th grade               1,058      802,057    758.09
  7th grade               1,210      979,471    809.48
  8th grade               1,037      890,358    858.59
  9th grade                 750      637,784    850.38
  10th grade                 20       19,012    950.60
  11th grade                  2        1,130    565.00
  12th grade              1,853    1,833,781    989.63
  all                     9,565    7,272,252    760.30

KRES-balanced
  balanced                    /    2,402,263         /

Slovenian SB
  1st-ps                     69       12,921    187.26
  2nd-ps                    146       30,296    207.51
  3rd-ps                    268       62,241    232.24
  4th-ps                  1,007      265,242    263.40
  5th-ps                  1,186      330,039    278.28
  6th-ps                    959      279,461    291.41
  7th-ps                  1,470      462,551    314.66
  8th-ps                  1,844      540,944    293.35
  9th-ps                  2,154      688,149    319.47
  1st-hs                  1,663      578,694    347.98
  2nd-hs                    590      206,147    349.40
  3rd-hs                    529      165,845    313.51
  4th-hs                     45       14,313    318.07
  all                    11,930    3,636,843    304.85


Language models are trained on large corpora of texts. For this purpose, we used

the following corpora.

Corpus of English Wikipedia and Corpus of Simple Wikipedia articles:
We created three corpora for use in our unsupervised English
experiments:8

– Wiki-normal contains 130,000 randomly selected articles from the

Wikipedia dump, which comprise 489,976 sentences and 10,719,878
tokens.

– Wiki-simple contains 130,000 randomly selected articles from the

Simple Wikipedia dump, which comprise 654,593 sentences and
10,933,710 tokens.

– Wiki-balanced contains 65,000 randomly selected articles from the
Wikipedia dump (dated 26 January 2018) and 65,000 randomly
selected articles from the Simple Wikipedia dump. Altogether the
corpus comprises 571,964 sentences and 10,847,108 tokens.

KRES-balanced: The KRES corpus (Logar et al. 2012) is a 100 million word
balanced reference corpus of the Slovenian language: 35% of its content is
books, 40% periodicals, and 20% Internet texts. From this corpus we took
all the available documents from two children’s magazines (Ciciban and
Cicido), all documents from four teenager magazines (Cool, Frka, PIL
Plus, and Smrklja), and documents from three magazines targeting adult
audiences ( ˇZivljenje in tehnika, Radar, City magazine). With these texts,
we built a corpus of approximately 2.4 million words. The corpus is
balanced in the sense that about one-third of the sentences come from
documents targeting children, one-third from documents targeting
teenagers, and the last third from documents targeting adults.

3.2 Data Set Analysis

Overall, there are several differences between our data sets:

Language: As already mentioned, we have three English (Newsela,
OneStopEnglish, and WeeBit) and one Slovenian (Slovenian SB) test data
sets.

Parallel corpora vs. unaligned corpora: Newsela and OneStopEnglish
data sets are parallel corpora, which means that articles from different
readability classes are semantically similar to each other. On the other
hand, the WeeBit and Slovenian SB data sets contain completely different
articles in each readability class. Although this might not affect traditional
readability measures, which do not take semantic information into
account, it might prove substantial for the performance of classifiers and
the proposed language model-based readability measures.

8 English Wikipedia and Simple Wikipedia dumps from 26 January 2018 were used for the corpus
construction.


Length of documents: Another difference between the Newsela and
OneStopEnglish data sets on one side, and the WeeBit and Slovenian SB
data sets on the other, is the length of their documents. The Newsela and
OneStopEnglish data sets contain longer documents, on average about 760
and 677 words long, whereas documents in the WeeBit and Slovenian SB
corpora are on average about 193 and 305 words long, respectively.

Genre: The OneStopEnglish and Newsela data sets contain news articles,
WeeBit is made of educational articles, and the Slovenian SB data set is
composed of school books. For training the English language models,
we use Wikipedia and Simple Wikipedia, which contain encyclopedia
articles, and for Slovene language model training, we use the
KRES-balanced corpus, which contains magazine articles.

Target audience: OneStopEnglish is the only test data set that specifically
targets adult ESL learners rather than children, as the other test data sets
do. When it comes to the data sets used for language model training, the
KRES-balanced corpus is made of articles that target both adults and
children. The problem with Wikipedia and Simple Wikipedia is that no
specific target audience is addressed, because the articles are written by
volunteers. In fact, using Simple Wikipedia as a data set for the training of
simplification algorithms has been criticized in the past because of its lack
of specific simplification guidelines, which are based only on the
declarative statement that Simple Wikipedia was created for “children and
adults who are learning the English language” (Xu, Callison-Burch, and
Napoles 2015). This lack of guidelines also contributes to a decrease in
the quality of simplification according to Xu, Callison-Burch, and Napoles
(2015), who found that the corpus can be noisy and that half of its
sentences are not actual simplifications but rather copied from the original
Wikipedia.

This diversity of the data sets limits the ambitions of the study to offer general conclusions that hold across genres, languages, or data sets. On the other hand, it offers an opportunity to determine how the specifics of each data set affect each of the proposed readability predictors, and also to determine the overall robustness of the applied methods.

Although many aspects differ from one data set to another, there are also some
common characteristics across all the data sets, which allow using the same prediction
methods on all of them. These are mostly connected to the common techniques used
in the construction of readability data sets, no matter the language, genre, or target
audience of the specific data set. The creation of parallel simplification corpora (i.e.,
Newsela, OneStopEnglish, and Simple Wikipedia) generally involves three techniques:
splitting (breaking a long sentence into shorter ones), deletion (removing unimportant
parts of a sentence), and paraphrasing (rewriting a text into a simpler version via
reordering, substitution, and occasionally expansion) (Feng 2008). Even though there
might be some subtleties involved (because what constitutes simplification for one
type of user may not be appropriate for another), how these techniques are applied is
rather general. Also, although there is no simplification used in the non-parallel corpora
(WeeBit, Slovenian SB), the contributing authors were nevertheless instructed to write
the text for a specific target group and adapt the writing style accordingly. In most


cases, this leads to the same result (e.g., shorter, less complex sentences and simpler
vocabulary in texts intended for younger or less fluently speaking audiences).

The claim of commonality between the data sets can be backed up by the fact that even
traditional readability indicators correlate quite well with human-assigned readability,
no matter the specific genre, language, or purpose of each data set. The results in Table 2
demonstrate this point by showcasing the scores of the traditional readability formulas
from Section 2.1.1. We can see a general pattern of increasing difficulty
on all data sets and for all indicators: larger readability scores (or, in the case of FRE,
smaller ones) are assigned to those classes of the data set that contain texts written for older
children or more advanced ESL learners. This suggests that multi-data set, multi-genre,
and even multilingual readability prediction is feasible on the set of chosen data sets,
even if only the shallow traditional readability indicators are used.
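As a concrete illustration of how two of these shallow formulas work, FRE and FKGL (defined in Section 2.1.1) combine average sentence length and average syllables per word; the constants below are the standard published ones:

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Flesch Reading Ease (FRE): higher values mean more readable text."""
    asl = n_words / n_sentences      # average sentence length
    asw = n_syllables / n_words      # average syllables per word
    return 206.835 - 1.015 * asl - 84.6 * asw

def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid Grade Level (FKGL): an approximate US school grade."""
    asl = n_words / n_sentences
    asw = n_syllables / n_words
    return 0.39 * asl + 11.8 * asw - 15.59

# Shorter sentences and fewer syllables per word give a higher FRE
# (more readable) and a lower FKGL (earlier school grade):
print(round(flesch_kincaid_grade(100, 10, 130), 2))  # 3.65
print(flesch_reading_ease(100, 10, 130) > flesch_reading_ease(100, 4, 190))  # True
```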

然而, the results do indicate that cross-genre or even cross-data set readability
prediction might be problematic because the data sets do not cover the same readability
range according to the shallow prediction formulas (and also according to the ground
truth readability labels). For example, documents in the WeeBit 14–16 age group have
scores very similar to the Newsela 6th grade documents, which means that a classifier
trained on the WeeBit corpus might have a hard time classifying documents belonging
to higher Newsela grades, since the readability of these documents is lower than that of
the most complex documents in the WeeBit corpus according to all of the shallow
readability indicators. For this reason, we opted not to perform any supervised
cross-data set or cross-genre experiments. Nevertheless, the problem of cross-genre
prediction is important in the context of the proposed unsupervised experiments,
because the genre discrepancy between the data sets used for training the language
models and the data sets on which the models are used might influence the performance
of the proposed language model-based measures. A more detailed discussion of this
topic is presented in Section 4.2.

The analysis in Table 2 also confirms the findings of Madrazo Azpiazu and Pera
(2020), who have shown that crosslingual readability prediction with shallow
readability indicators is problematic. For example, if we compare the Newsela corpus
and the Slovenian SB corpus, which both cover roughly the same age group, we can see
that for some readability indicators (FRE, FKGL, DCRF, and ASL) the values are on
entirely different scales.

4. Unsupervised Neural Approach

In this section, we explore how neural language models can be used for determining
the readability of a text in an unsupervised way. In Section 4.1, we present the neural
architectures used in our experiments; in Section 4.2, we describe the methodology of
the proposed approach; and in Section 4.3, we present the conducted experiments.

4.1 Neural Language Model Architectures

Mikolov et al. (2011) have shown that neural language models outperform n-gram
language models by a large margin on large as well as relatively small (less than 1 million
tokens) data sets. The achieved differences in perplexity (see Equation (1)) are attributed
to the richer historical contextual information available to neural networks, which are not
limited to a small contextual window (usually of up to 5 previous words) as is the case
for n-gram language models. In Section 2.1.5, we mentioned some approaches that use


Table 2
Scores of traditional readability indicators from Section 2.1.1 for specific classes in the readability
data sets.

Class            GFI     FRE    FKGL     ARI    DCRF    SMOG     ASL

Wikipedia
  simple        11.80   62.20    8.27   14.08   11.40   11.40   16.90
  balanced      13.49   56.17    9.70   15.86   12.53   12.53   19.54
  normal        15.53   49.16   11.47   18.06   13.89   13.89   23.10

WeeBit
  age 7–8        6.91   83.41    3.82    8.83    7.83    7.83   10.23
  age 8–9        8.45   76.68    5.34   10.33    8.87    8.87   12.89
  age 9–10      10.30   69.88    6.93   12.29   10.01   10.01   15.69
  age 10–14      9.94   75.35    6.34   11.20    9.67    9.67   16.64
  age 14–16     11.76   66.61    8.09   13.56   10.81   10.81   18.86

OneStopEnglish
  beginner      11.79   66.69    8.48   13.93   11.05   11.05   20.74
  intermediate  13.83   59.68   10.19   15.98   12.30   12.30   23.98
  advanced      15.35   54.84   11.54   17.65   13.22   13.22   26.90

Newsela
  2nd grade      6.11   85.69    3.27    8.09    7.26    7.26    9.26
  3rd grade      7.24   80.92    4.27    9.30    7.94    7.94   10.72
  4th grade      8.58   76.05    5.40   10.50    8.88    8.88   12.72
  5th grade      9.79   71.76    6.47   11.73    9.68    9.68   14.81
  6th grade     11.00   67.46    7.53   12.99   10.47   10.47   16.92
  7th grade     12.11   62.71    8.54   14.12   11.26   11.26   18.46
  8th grade     13.05   60.37    9.38   15.19   11.83   11.83   20.81
  9th grade     14.20   55.00   10.46   16.37   12.70   12.70   22.17
  10th grade    14.15   55.70   10.60   16.50   12.83   12.83   23.33
  11th grade    15.70   56.41   11.05   16.96   12.77   12.77   24.75
  12th grade    14.52   55.58   10.71   16.70   12.79   12.79   23.69

KRES-balanced
  balanced      12.72   29.20   12.43   14.88   14.08   14.08   15.81

Slovenian SB
  1st-ps         9.54   31.70   10.38   11.72   11.12   11.12    7.63
  2nd-ps         9.49   34.90   10.11   11.34   11.26   11.26    8.37
  3rd-ps        10.02   32.89   10.61   11.78   11.80   11.80    9.31
  4th-ps        10.96   30.29   11.18   12.84   12.39   12.39   10.40
  5th-ps        11.49   28.13   11.62   13.33   12.79   12.79   11.02
  6th-ps        13.20   20.10   12.84   14.57   13.61   13.61   11.45
  7th-ps        12.94   22.97   12.61   14.52   13.64   13.64   12.24
  8th-ps        13.48   18.12   13.09   14.78   13.71   13.71   11.32
  9th-ps        13.69   19.26   13.13   15.07   13.94   13.94   12.27
  1st-hs        15.12   12.66   14.33   16.22   14.96   14.96   13.62
  2nd-hs        15.13   15.13   13.90   15.83   14.67   14.67   13.49
  3rd-hs        14.76   13.09   14.00   15.62   14.44   14.44   12.57
  4th-hs        14.66   14.39   13.64   15.54   14.03   14.03   11.62


n-gram language models for readability prediction. However, we are unaware of any
approach that would use deep neural network language models for determining the
readability of a text.

In this research, we utilize three neural architectures for language modeling. The first
are RNNs, which are suitable for modeling sequential data. At each time step t, the
input vector xt and hidden state vector ht−1 are fed into the network, producing the
next hidden state vector ht with the following recursive equation:

ht = f (W xt + U ht−1 + b)

where f is a nonlinear activation function, W and U are matrices representing the weights
of the input layer and hidden layer, and b is the bias vector. Learning long-range input
dependencies with plain RNNs is problematic because of vanishing gradients (Bengio,
Simard, and Frasconi 1994); therefore, in practice, modified recurrent networks, such as
Long Short-Term Memory networks (LSTMs), are used. In our experiments, we use the
LSTM-based language model proposed by Kim et al. (2016). This architecture is adapted
to language modeling of morphologically rich languages, such as Slovenian, by utilizing
an additional character-level convolutional neural network (CNN). The convolutional
level learns the character structure of words and is connected to the LSTM-based layer,
which produces predictions at the word level.
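A minimal sketch of the plain RNN recursion above, with tanh as the nonlinearity f (a common choice; the LSTM actually used in the experiments replaces this update with gated equations):

```python
import math

def rnn_step(x_t, h_prev, W, U, b):
    """One step of h_t = f(W x_t + U h_{t-1} + b) with f = tanh.
    W is d_h x d_in, U is d_h x d_h, and b has length d_h."""
    d_h = len(b)
    h_t = []
    for j in range(d_h):
        s = b[j]
        s += sum(W[j][k] * x_t[k] for k in range(len(x_t)))   # W x_t
        s += sum(U[j][k] * h_prev[k] for k in range(d_h))      # U h_{t-1}
        h_t.append(math.tanh(s))
    return h_t

# Toy dimensions: 2-dimensional input, 2-dimensional hidden state.
W = [[0.5, -0.3], [0.1, 0.8]]
U = [[0.2, 0.0], [0.0, 0.2]]
b = [0.0, 0.0]

h = [0.0, 0.0]                                       # initial hidden state
for x_t in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:     # a 3-step sequence
    h = rnn_step(x_t, h, W, U, b)
print(len(h), all(-1.0 < v < 1.0 for v in h))        # 2 True
```

Because tanh is bounded, every component of the hidden state stays in (−1, 1) no matter how long the sequence is.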

Bai, Kolter, and Koltun (2018) introduced a new sequence modeling architecture
based on convolution, called the temporal convolutional network (TCN), which is also used
in our experiments. TCN uses causal convolution operations, which ensure that
there is no information leakage from future time steps to the past. This, and the fact
that a TCN takes a sequence as input and maps it into an output sequence of the same
size, makes the architecture appropriate for language modeling. TCNs are capable of
leveraging long contexts by using a very deep network architecture and a hierarchy
of dilated convolutions. A single dilated convolution operation F on element s of the
1-dimensional sequence x can be defined with the following equation:

F(s) = (x ∗_d f )(s) = Σ_{i=0}^{k−1} f (i) · x_{s−d·i}

where f : {0, . . . , k − 1} → R is a filter of size k, d is the dilation factor, and s − d · i accounts for the
direction of the past. In this way, the context taken into account during prediction
can be increased by using larger filter sizes and by increasing the dilation factor. The
most common practice is to increase the dilation factor exponentially with the depth of
the network.
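A minimal sketch of a single dilated causal convolution; positions before the start of the sequence contribute zero (causal zero-padding), which is what keeps the output the same length as the input:

```python
def dilated_causal_conv(x, f, d):
    """Compute F(s) = sum_{i=0}^{k-1} f(i) * x[s - d*i] for every position s.
    Out-of-range positions (s - d*i < 0) contribute zero, so no information
    leaks from future time steps and len(output) == len(x)."""
    out = []
    for s in range(len(x)):
        acc = 0.0
        for i, coeff in enumerate(f):
            j = s - d * i
            if j >= 0:
                acc += coeff * x[j]
        out.append(acc)
    return out

# Filter of size k=2 with dilation d=2: each output mixes x[s] and x[s-2].
print(dilated_causal_conv([1.0, 2.0, 3.0, 4.0, 5.0], f=[1.0, 1.0], d=2))
# [1.0, 2.0, 4.0, 6.0, 8.0]
```

Stacking such layers with exponentially growing d (1, 2, 4, ...) is what gives a deep TCN its large receptive field.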

Recently, Devlin et al. (2019) proposed a novel approach to language modeling.
Their BERT uses both the left and the right context, which means that a word wt in a
sequence is determined not just from its left word sequence w1:t−1 = [w1, . . . , wt−1] but
also from its right word sequence wt+1:n = [wt+1, . . . , wt+n]. This approach introduces a
new learning objective, the masked language model, where a predefined percentage of
randomly chosen words from the input word sequence is masked, and the objective is
to predict these masked words from the unmasked context. BERT uses a transformer
neural network architecture (Vaswani et al. 2017), which relies on the self-attention
mechanism. The distinguishing feature of this approach is the use of several parallel
attention layers, the so-called attention heads, which reduce the computational cost and
allow the system to attend to several dependencies at once.
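The masked-language-model input preparation can be sketched as follows. This is simplified: BERT additionally sometimes keeps the chosen token unchanged or swaps in a random one, which is omitted here, and the 15% default below is the masking rate reported by Devlin et al. (2019):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Replace a random subset of positions with [MASK] and record the
    original tokens; the training objective is then to predict these
    originals from the full bidirectional (unmasked) context."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = tok   # position -> word the model must predict
        else:
            masked.append(tok)
    return masked, targets

masked, targets = mask_tokens("the cat sat on the mat".split(), mask_prob=0.3)
print(masked, targets)
```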


All types of neural network language models, TCN, LSTM, and BERT, output a
softmax probability distribution calculated over the entire vocabulary, giving the
probability of each word given its historical (and, in the case of BERT, also future)
sequence. Training of these networks usually minimizes the negative log-likelihood (NLL)
of the training corpus word sequence w1:n = [w1, . . . , wn] by backpropagation through
time:

NLL = − Σ_{i=1}^{n} log P(wi|w1:i−1)     (2)
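The relationship between this NLL and the perplexity of Equation (1), PPL = e^(NLL/N), can be sketched directly from the per-token probabilities:

```python
import math

def nll_and_perplexity(token_probs):
    """Given the probability the model assigned to each actual next token,
    return the sequence NLL (Equation (2)) and the derived perplexity
    PPL = exp(NLL / N), where N is the number of tokens."""
    nll = -sum(math.log(p) for p in token_probs)
    ppl = math.exp(nll / len(token_probs))
    return nll, ppl

# A model that assigns probability 0.5 to every token has perplexity 2,
# i.e., it is as uncertain as a fair coin at each step:
nll, ppl = nll_and_perplexity([0.5] * 8)
print(round(ppl, 6))  # 2.0
```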

In the case of BERT, the formula for minimizing NLL also uses the right-hand word
sequence:

NLL = − Σ_{i=1}^{n} log P(wi|w1:i−1, wi+1:n)

where wi are the masked words.

The following equation defines the relationship between the perplexity of a neural
language model (PPL, see Equation (1)) and its NLL (Equation (2)):

PPL = e^(NLL/N)

4.2 Unsupervised Methodology

Two main questions we wish to investigate in the unsupervised approach are the
following:

Can standalone neural language models be used for unsupervised
readability prediction?

Can we develop a robust new readability formula that will outperform
traditional readability formulas by relying not only on shallow lexical
sophistication indicators but also on neural language model statistics?

4.2.1 Language Models for Unsupervised Readability Assessment. The findings of the related
research suggest that a separate language model should be trained for each readability
class in order to extract features for successful readability prediction (Petersen and
Ostendorf 2009; Xia, Kochmar, and Briscoe 2016). On the other hand, we test the
possibility of using a neural language model as a standalone unsupervised readability
predictor.

Two points that support this kind of usage are based on the fact that neural language
models tend to capture much more information compared to traditional n-gram
models. First, because the n-gram language models used in previous work on
readability detection were in most cases limited to a small contextual window of up to
five words, their learning potential was limited to lexico-semantic information (e.g.,
information about the difficulty of vocabulary and word n-gram structures in the text)
and information about the text syntax. We argue that due to the much larger contextual
window of the neural models (e.g., BERT leverages sequences of up to 512 byte-pair
tokens), which spans across sentences, the neural language models also learn
high-level textual properties, such as long-distance dependencies (Jawahar, Sagot, and
Seddah 2019), in order to minimize NLL during training. Second, the n-gram models in


past readability research have only been trained on the corpora (or, more specifically, on
parts of the corpora) on which they were later used. In contrast, by training the neural
models on large general corpora, the model also learns semantic information, which
can be transferred when the model is used on a smaller test corpus. The success of this
knowledge transfer is, to some extent, dependent on the genre compatibility of the
training and test corpora.

A third point favoring the greater flexibility of neural language models relies on the fact
that no corpus is a monolithic block of text made out of units (i.e., sentences, paragraphs,
and articles) of exactly the same readability level. This means that a language model
trained on a large corpus will be exposed to chunks of text with different levels of
complexity. We hypothesize that, due to this fact, the model will to some extent be able
to distinguish between these levels and return a lower perplexity for more standard,
predictable (i.e., readable) text. Vice versa, the complex and rare language structures
and vocabulary of less readable texts would negatively affect the performance of the
language model, expressed via a larger perplexity score. If this hypothesis is correct,
then ideally, the average readability of the training corpus should fit somewhere in the
middle of the readability spectrum of the testing corpus.

To test these statements, we train language models on Wiki-normal, Wiki-simple,
and Wiki-balanced corpora described in Section 3. All three Wiki corpora contain
roughly the same amount of text, in order to make sure that the training set size does
not influence the results of the experiments. We expect the following results:

Hypothesis 1: Training the language models on a corpus with a readability
that fits somewhere in the middle of the readability spectrum of the testing
corpus will yield the best correlation between the language model’s
performance and readability. According to the preliminary analysis of
our corpora conducted in Section 3.2 and the results of the analysis in Table 2,
this ideal scenario can be achieved in three cases: (i) if a language model
trained on Wiki-simple is used on the Newsela corpus, (ii) if a
language model trained on the Wiki-balanced corpus is used on
the OneStopEnglish corpus, and (iii) if the model trained on the
KRES-balanced corpus is used on the Slovenian SB corpus, despite the
mismatch of genres in these corpora.

Hypothesis 2: The language models trained only on texts for adults
(Wiki-normal) will show higher perplexity on texts for children (WeeBit
and Newsela) because their training set did not contain such texts;
this will negatively affect the correlation between the language model’s
performance and readability.

Hypothesis 3: Training the language models only on texts for children
(Wiki-simple corpus) will result in a higher perplexity score of the
language model when applied to adult texts (OneStopEnglish). 这
will positively affect the correlation between the language model’s
performance and readability. 然而, this language model will not be
able to reliably distinguish between texts for different levels of adult ESL
learners, which will have a negative effect on the correlation.

To further test the viability of the unsupervised language models as readability pre-
dictors and to test the limits of using a single language model, we also explore the pos-
sibility of using a language model trained on a large general corpus. The English BERT


language model was trained on large corpora (the Google Books Corpus [Goldberg and
Orwant 2013] and Wikipedia) of about 3,300M words, containing mostly texts for adult
English speakers. According to Hypothesis 2, this will have a negative effect on the
correlation between the performance of the model and readability.

Because of the large size of the BERT model and its huge training corpus, the se-
mantic information acquired during training is much larger than the information ac-
quired by the models we train on our much smaller corpora, which means that there is
a greater possibility that the BERT model was trained on some text semantically similar
to the content in the test corpora and that this information can be successfully trans-
ferred. 然而, the question remains, exactly what type of semantic content does the
BERT’s training corpus contain? One hypothesis is that its training corpus contains more
content specific for adult audiences and less content found in the corpora for children.
This would have a negative effect on the correlation between the performance of the
model and readability on the WeeBit corpus. Contrarily, because the two highest read-
ability classes in the WeeBit corpus contain articles from different scientific fields used
for the education of high school students, which can contain rather specific and tech-
nical content that is unlikely to be common in the general training corpus, this might
induce a positive correlation between the performance of the model and readability.
Newsela and OneStopEnglish, on the other hand, are parallel corpora, meaning
that the semantic content in all classes is very similar; therefore the success or failure of
semantic transfer will most likely not affect these two corpora.

4.2.2 Ranked Sentence Readability Score. Based on the two considerations below, we pro-
pose a new Ranked Sentence Readability Score (RSRS) for measuring the readability
with language models.

•   The shallow lexical sophistication indicators, such as the length of a
sentence, correlate well with the readability of a text. Using them besides
statistics derived from language models could improve the unsupervised
readability prediction.

•   The perplexity score used for measuring the performance of a language
model is an unweighted sum of perplexities of words in the predicted
sequence. In fact, a small number of unreadable words might drastically
reduce the readability of the entire text. Assigning larger weights to such
words might improve the correlation of language model scores with the
readability.

The proposed readability score is calculated with the following procedure. First, a
given text is split into sentences with the default sentence tokenizer from the NLTK
library (Bird and Loper 2004). In order to obtain a readability estimation for each word
in a specific context, we compute, for each word in the sentence, the word negative
log-likelihood (WNLL) according to the following formula:

WNLL = −(yt log yp + (1 − yt) 日志 (1 − yp))

where yp denotes the probability (from the softmax distribution) predicted by the lan-
guage model according to the historical sequence, and yt denotes the empirical distri-
bution for a specific position in the sentence, that is, yt has the value 1 for the word in
the vocabulary that actually appears next in the sequence and the value 0 for all the
other words in the vocabulary. Next, we sort all the words in the sentence in ascending
order according to their WNLL score, and the ranked sentence readability score (RSRS)
is calculated with the following expression:

RSRS = (Σ_{i=1}^{S} √i · WNLL(i)) / S                (3)

where S denotes the sentence length and i represents the rank of a word in a sentence ac-
cording to its WNLL value. The square root of the word rank is used for proportionally
weighting words according to their readability because initial experiments suggested
that the use of a square root of a rank represents the best balance between allowing all
words to contribute equally to the overall readability of the sentence and allowing only
the least readable words to affect the overall readability of the sentence. For out-of-
vocabulary words, square root rank weights are doubled, because these rare words are,
in our opinion, good indicators of non-standard text. Finally, in order to obtain the
readability score for the entire text, we calculate the average of all the RSRS scores in the
text. An example of how RSRS is calculated for a specific sentence is shown in Figure 1.
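The procedure above can be sketched in a few lines of Python. This is an illustrative implementation, assuming the per-word probabilities from the language model are already available; `word_probs` and `oov_flags` are hypothetical inputs, not part of the original article.

```python
import math

def wnll(p_true):
    """Word negative log-likelihood. With a one-hot empirical distribution yt,
    the WNLL formula reduces to -log of the probability the language model
    assigned to the word that actually appears next."""
    return -math.log(p_true)

def rsrs(word_probs, oov_flags=None):
    """Ranked Sentence Readability Score for a single sentence.

    word_probs: model probability of each observed word in the sentence.
    oov_flags: marks out-of-vocabulary words, whose square-root rank
    weights are doubled."""
    if oov_flags is None:
        oov_flags = [False] * len(word_probs)
    # Sort words by WNLL in ascending order; the 1-based rank i of a word
    # determines its weight sqrt(i), so the least readable words count most.
    ranked = sorted(zip((wnll(p) for p in word_probs), oov_flags))
    s = len(ranked)
    total = 0.0
    for i, (score, oov) in enumerate(ranked, start=1):
        weight = math.sqrt(i) * (2.0 if oov else 1.0)
        total += weight * score
    return total / s

def document_rsrs(sentence_probs):
    """Average the per-sentence RSRS scores to score a whole text."""
    return sum(rsrs(p) for p in sentence_probs) / len(sentence_probs)
```

A sentence whose words the model finds highly predictable thus receives a score near zero, while improbable or out-of-vocabulary words push the score up.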
The main idea behind the RSRS score is to avoid the reductionism of traditional
readability formulas. We aim to achieve this by including high-level structural and
semantic information through neural language model–based statistics. The first as-
sumption is that complex grammatical and lexical structures harm the performance of
the language model. Since the WNLL score, which we compute for each word, depends
on the context in which the word appears, words appearing in more complex
grammatical and lexical contexts will have a higher WNLL. The second assumption
is that the semantic information is included in the readability calculation: Tested docu-
ments with semantics dissimilar to the documents in the language model training set
will negatively affect the performance of the language model, resulting in the higher
WNLL score for words with unknown semantics. The trainable nature of language
models allows for customization and personalization of the RSRS for specific tasks,
topics, and languages. This means that RSRS will alleviate the problem of cultural non-
transferability of traditional readability formulas.

Figure 1
The RSRS calculation for the sentence This could make social interactions easier for them.

On the other hand, the RSRS also leverages shallow lexical sophistication indicators
through the index weighting scheme, which ensures that less readable words contribute
more to the overall readability score. This is somewhat similar to the counts of long and
difficult words in the traditional readability formulas, such as GFI and DCRF. The value
of RSRS also increases for texts containing longer sentences, since the square roots of the
word rank weights become larger with increased sentence length. This is similar to the
behavior of traditional formulas such as GFI, FRE, FKGL, ARI, and DCRF, where this
effect is achieved by incorporating the ratio between the total number of words and the
total number of sentences into the equation.
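For concreteness, the traditional formulas referenced here combine exactly these shallow counts. A standard formulation of the Flesch-Kincaid Grade Level (FKGL), given here for illustration and not taken from this article:

```python
def fkgl(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid Grade Level: grows with the average sentence length
    (words per sentence) and with lexical difficulty, measured as the
    average number of syllables per word."""
    return (0.39 * (total_words / total_sentences)
            + 11.8 * (total_syllables / total_words)
            - 15.59)
```

A 100-word text in 5 sentences with 130 syllables, for instance, maps to roughly grade 7.5.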

4.3 Unsupervised Experiments

For the presented unsupervised readability assessment methodology based on neural
language models, we first present the experimental design followed by the results.

4.3.1 Experimental Design. Three different architectures of language models (described
in Section 4.1) are used for experiments: a temporal convolutional network (TCN)
proposed by Bai, Kolter, and Koltun (2018), a recurrent language model (RLM) using
character-level CNN and LSTM proposed by Kim et al. (2016), and an attention-based
language model, BERT (Devlin et al. 2019). For the experiments on the English language,
we train TCN and RLM on three Wiki corpora.

To explore the possibility of using a language model trained on a general corpus
for the unsupervised readability prediction, we use the BERT-base-uncased English lan-
guage model, a pretrained uncased language model trained on BooksCorpus (0.8G
words) (Zhu et al. 2015) and English Wikipedia (2.5G words). For the experiments on
Slovenian, the corpus containing just school books is too small for efficient training of
language models; therefore TCN and RLM were only trained on the KRES-balanced
corpus described in Section 3. For exploring the possibility of using a general language
model for the unsupervised readability prediction, a pretrained CroSloEngual BERT
model trained on corpora from three languages, Slovenian (1.26G words), Croatian
(1.95G words), and English (2.69G words) (Ulˇcar and Robnik-ˇSikonja 2020), is used.
The corpora used in training the model are a mix of news articles and a general Web
crawl.

The performance of language models is typically measured with the perplexity
(see Equation (1)). To answer the research question of whether standalone language
models can be used for unsupervised readability prediction, we investigate how the
measured perplexity of language models correlates with the readability labels in the
gold-standard WeeBit, OneStopEnglish, Newsela, and Slovenian SB corpora described
in Section 3. The correlation to these ground truth readability labels is also used to eval-
uate the performance of the RSRS measure. For performance comparison, we calculate
the traditional readability formula values (described in Section 2) for each document
in the gold-standard corpora and measure the correlation between these values and
manually assigned labels. As a baseline, we use the average sentence length (ASL) in
each document.
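The perplexity referred to here can be recovered from the same per-word probabilities used above. A minimal sketch, assuming the standard exponential-of-average-negative-log-likelihood definition (the article's own Equation (1) appears earlier in the paper):

```python
import math

def perplexity(word_probs):
    """Perplexity of a text under a language model: the exponential of the
    average negative log-likelihood of the observed words. Lower values
    mean the model found the text more predictable."""
    nll = [-math.log(p) for p in word_probs]
    return math.exp(sum(nll) / len(nll))
```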

The correlation is measured with the Pearson correlation coefficient (ρ). Given a pair
of distributions X and Y, the covariance cov, and the standard deviation σ, the formula
for ρ is:

ρX,Y = cov(X, Y) / (σX σY)
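In code, the coefficient follows directly from its definition; a small self-contained sketch with made-up document scores (the numbers below are illustrative, not from the experiments):

```python
import math

def pearson(x, y):
    """Pearson correlation: cov(X, Y) divided by the product of the
    standard deviations of X and Y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

# Hypothetical readability scores for five documents and their gold labels.
scores = [3.1, 4.5, 5.0, 6.2, 7.8]
labels = [1, 2, 2, 3, 4]
rho = pearson(scores, labels)
```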

Martinc, Pollak, and Robnik-ˇSikonja

Neural Approaches to Text Readability

A larger positive correlation signifies a better performance for all measures except
the FRE readability measure. As this formula assigns higher scores to better-readable
文本, a larger negative correlation suggests a better performance of the FRE measure.

4.3.2 Experimental Results. The results of the experiments are presented in Table 3. 这
ranking of measures on the English and Slovenian data sets is presented in Table 4.

The correlation coefficients of all measures vary drastically between different cor-
pora. The highest ρ values are obtained on the Newsela corpus, where the best perform-
ing measure (surprisingly this is our baseline—the average sentence length) achieves
the ρ of 0.906. The highest ρ values on the other two English corpora are much lower. On
the WeeBit corpus, the best performance is achieved by GFI and FKGL measures (ρ of
0.544), and on the OneStopEnglish corpus, the best performance is achieved with the
proposed TCN RSRS-simple (ρ of 0.615). On the Slovenian SB, the ρ values are higher,
and the best performing measure is TCN RSRS-balanced with a ρ of 0.789.

The perplexity-based measures show a much lower correlation with the ground
truth readability scores. Overall, they perform the worst of all the measures for both
languages (see Table 4), but we can observe large differences in their performance
across different corpora. Although there is either no correlation or low negative corre-
lation between perplexities of all three language models and readability on the WeeBit
corpus, there is some correlation between perplexities achieved by RLM and TCN on
OneStopEnglish and Newsela corpora (the highest being the ρ of 0.566 achieved by
TCN perplexity-simple on the Newsela corpus). The correlation between RLM and


Table 3
Pearson correlation coefficients between manually assigned readability labels and the readability
scores assigned by different readability measures in the unsupervised setting. The highest
correlation for each corpus is marked with an asterisk.

Measure / Data set           WeeBit   OneStopEnglish   Newsela   Slovenian SB

RLM perplexity-balanced      −0.082        0.405         0.512       0.303
RLM perplexity-simple        −0.115        0.420         0.470         /
RLM perplexity-normal        −0.127        0.283         0.341         /
TCN perplexity-balanced       0.034        0.476         0.537       0.173
TCN perplexity-simple         0.025        0.518         0.566         /
TCN perplexity-normal        −0.015        0.303         0.250         /
BERT perplexity              −0.123       −0.162        −0.673      −0.563

RLM RSRS-balanced             0.497        0.551         0.890       0.732
RLM RSRS-simple               0.506        0.569         0.893         /
RLM RSRS-normal               0.490        0.536         0.886         /
TCN RSRS-balanced             0.393        0.601         0.894       0.789*
TCN RSRS-simple               0.385        0.615*        0.894         /
TCN RSRS-normal               0.348        0.582         0.886         /
BERT RSRS                     0.279        0.384         0.674       0.126

GFI                           0.544*       0.550         0.849       0.730
FRE                          −0.433       −0.485        −0.775      −0.614
FKGL                          0.544*       0.533         0.865       0.697
ARI                           0.488        0.520         0.875       0.658
DCRF                          0.420        0.496         0.735       0.686
SMOG                          0.456        0.498         0.813       0.770
ASL                           0.508        0.498         0.906*      0.683

Computational Linguistics

Volume 47, Number 1

Table 4
Ranking (lower is better) of measures on English and Slovenian data sets sorted by the average
rank on all data sets for which the measure is available.

Measure                      WeeBit   OneStopEnglish   Newsela   Slovenian SB

RLM RSRS-simple                 4            4             4           /
TCN RSRS-balanced              11            2             2           1
RLM RSRS-balanced               5            5             5           3
GFI                             1            6            10           4
TCN RSRS-simple                12            1             3           /
ASL                             3           12             1           7
FKGL                            2            8             9           5
RLM RSRS-normal                 6            7             6           /
TCN RSRS-normal                13            3             7           /
ARI                             7            9             8           8
SMOG                            8           11            11           2
DCRF                           10           13            13           6
FRE                             9           14            12           9
TCN perplexity-simple          16           10            15           /
TCN perplexity-balanced        15           15            16          11
BERT RSRS                      14           18            14          12
RLM perplexity-balanced        18           17            17          10
RLM perplexity-simple          19           16            18           /
TCN perplexity-normal          17           19            20           /
BERT perplexity                20           21            21          13
RLM perplexity-normal          21           20            19           /

TCN perplexity measures and readability classes on the Slovenian SB corpus is low,
with RLM perplexity-balanced showing the ρ of 0.303 and TCN perplexity-balanced
achieving ρ of 0.173.

BERT perplexities are negatively correlated with readability, and the negative corre-
lation is relatively strong on Newsela and Slovenian school books corpora (ρ of −0.673
and −0.563, respectively), and weak on WeeBit and OneStopEnglish corpora. As BERT
was trained on corpora that are mostly aimed at adults, the strong negative correlation
on the Newsela and Slovenian SB corpora seems to suggest that BERT language models
might actually be less perplexed by the articles aimed at adults than the documents
aimed at younger audiences. This is supported by the fact that the negative correlation
is weaker on the OneStopEnglish corpus, which is meant for adult audiences, 并为
which our analysis (see Section 3.2) has shown that it contains more complex texts
according to the shallow readability indicators.

Nevertheless, the weak negative correlation on the WeeBit corpus is difficult to
explain as one would expect a stronger negative correlation because the same analysis
showed that WeeBit contains the least complex texts out of all the tested corpora.
If this result is connected with the successful transfer of the semantic knowledge, 它
supports the hypothesis that the two classes containing the most complex texts in the
WeeBit corpus contain articles with rather technical content that perplex the BERT
模型. 然而, the role of the semantic transfer should also dampen the negative
correlation on the Slovenian SB, which is a non-parallel corpus and also contains rather
technical educational content meant for high-school children. Perhaps the transfer is
less successful for Slovenian since the Slovenian corpus on which the CroSloEngual
BERT was trained is smaller than the English corpora used for training of English BERT.

Although further experiments and data are needed to pinpoint the exact causes for the
discrepancies in the results, we can still conclude that using a single language model
trained on general corpora for unsupervised readability prediction of texts for younger
audiences or English learners is, at least according to our results, not a viable option.

Regarding our expectations that performance of the language model trained on a
corpus with average readability that fits somewhere in the middle of the readability
spectrum of the testing corpus would yield the best correlation with manually labeled
readability scores, it is interesting to look at the differences in performance between
TCN and RLM perplexity measures trained on Wiki-normal, Wiki-simple, and Wiki-
balanced corpora. As expected, the correlation scores are worse on the WeeBit corpus,
since all classes in this corpus contain texts that are less complex than texts in any of the
training corpora. On the OneStopEnglish corpus, both Wiki-simple perplexity measures
perform the best, which is unexpected, since we would expect the balanced measure to
perform better. On the Newsela corpus, RLM perplexity-balanced outperforms RLM
perplexity-simple by 0.042 (which is unexpected), and TCN perplexity-simple outper-
forms TCN perplexity-balanced by 0.029, which is in line with expectations. Also in line
with expectations, both Wiki-normal perplexity measures are
outperformed by a large margin by Wiki-simple and Wiki-balanced perplexity measures
on the OneStopEnglish and the Newsela corpora. Similar observations can be made
with regard to RSRS, which also leverages language model statistics. On all corpora,
the performance of Wiki-simple RSRS measures and Wiki-balanced RSRS measures is
comparable, and these measures consistently outperform Wiki-normal RSRS measures.
These results are not entirely compatible with hypothesis 1 in Section 4.2.1 那
Wiki-balanced measures would be most correlated with readability on the OneStop-
English corpus and that Wiki-simple measures would be most correlated with readabil-
ity on the Newsela corpus. 尽管如此, training the language models on the corpora
with readability in the middle of the readability spectrum of the test corpus seems to
be an effective strategy, because the differences in performance between Wiki-balanced
and Wiki-simple measures are not large. On the other hand, the good performance of
the Wiki-simple measures supports our hypothesis 3 in Section 4.2.1, that training the
language models on texts with the readability closer to the bottom of the readability
spectrum of the test corpus for children will result in a higher perplexity score of the
language model when applied to adult texts, which will have a positive effect on the
correlation with readability.

The fact that the positive correlation between readability and both Wiki-simple and
Wiki-balanced perplexity measures on the Newsela and OneStopEnglish corpora is
quite strong supports the hypothesis that more complex language structures and vo-
cabularies of less readable texts would result in a higher perplexity on these texts.
Interestingly, strong correlations also indicate that the genre discrepancies between the
language model train and test sets do not appear to have a strong influence on the
表现. Whereas the choice of a neural architecture for language modeling does
not appear to be that crucial, the readability of the language model training set is of
utmost importance. If the training set on average contains more complex texts than
the majority of texts in the test set, as in the case of language models trained just
on the Wiki-normal corpus (and also BERTs), the correlation between readability and
perplexity disappears or is even reversed, since language models trained on more
complex language structures learn how to handle these difficulties.

The low performance of perplexity measures suggests that neural language model
statistics are not good indicators of readability and should therefore not be used
alone for readability prediction. Nevertheless, the results of TCN RSRS and RLM RSRS
suggest that language models contain quite useful information when combined with
other shallow lexical sophistication indicators, especially when readability analysis
needs to be conducted on a variety of different data sets.

As seen in Table 4, shallow readability predictors can give inconsistent results on
data sets from different genres and languages. For example, the simplest readability
措施, the average sentence length, ranked first on Newsela and twelfth on One-
StopEnglish. It also did not do well on the Slovenian SB corpus, where it ranked sev-
enth. SMOG, on the other hand, ranked very well on the Slovenian SB corpus (rank 2)
but ranked twice as eleventh and once as eighth on the English corpora. Among the
traditional measures, GFI presents the best balance in performance and consistency,
ranking first on WeeBit, sixth on OneStopEnglish, tenth on Newsela, and fourth on
Slovenian SB.

On the other hand, RSRS-simple and RSRS-balanced measures offer more robust
performance across data sets from different genres and languages according to ranks in
桌子 4. 例如, the RLM RSRS-simple measure ranked fourth on all English cor-
pora. The TCN RSRS-balanced measure, which was also used on Slovenian SB, ranked
first on Slovenian SB and second on OneStopEnglish and Newsela. However, it did not
do well on WeeBit, where the discrepancy in readability between the language model
train and test sets was too large. RLM RSRS-balanced was more consistent, ranking fifth
on all English corpora and third on Slovenian SB. These results suggest that language
model statistics can improve the consistency of predictions on a variety of different data
sets. The robustness of the measure is achieved by training the language model on a
specific train set, with which one can optimize the RSRS measure for a specific task and
语言.

5. Supervised Neural Approach

As mentioned in Section 1, recent trends in text classification show the domination of
deep learning approaches that internally use automatic feature construction. Existing
neural approaches to readability prediction (see Section 2.3) tend to generalize better
across data sets and genres (Filighera, Steuer, and Rensing 2019), and therefore solve
the problem of classical machine learning approaches relying on extensive feature
工程 (Xia, Kochmar, and Briscoe 2016).

In this section, we analyze how different types of neural classifiers can predict text
readability. In Section 5.1, we describe the methodology, and in Section 5.2 we present
experimental scenarios and results of conducted experiments.

5.1 Supervised Methodology

We tested three distinct neural network approaches to text classification:

Bidirectional long short-term memory network (BiLSTM). We use the
RNN approach proposed by Conneau et al. (2017) for classification. The
BiLSTM layer is a concatenation of forward and backward LSTM layers
that read documents in two opposite directions. The max and mean
pooling are applied to the LSTM output feature matrix in order to get the
maximum and average values of the matrix. The resulting vectors are
concatenated and fed to the final linear layer responsible for predictions.
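A minimal PyTorch sketch of this pooling scheme; the layer sizes below are illustrative defaults, not the tuned values reported later in this section:

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """Bidirectional LSTM whose outputs are max- and mean-pooled over time;
    the two pooled vectors are concatenated and passed to a linear layer."""

    def __init__(self, vocab_size, emb_dim=100, hidden=256, num_classes=5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        # max pooling and mean pooling each yield a 2*hidden vector
        self.out = nn.Linear(4 * hidden, num_classes)

    def forward(self, token_ids):
        h, _ = self.lstm(self.embedding(token_ids))  # (batch, time, 2*hidden)
        pooled = torch.cat([h.max(dim=1).values, h.mean(dim=1)], dim=1)
        return self.out(pooled)
```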


Hierarchical attention networks (HAN). We use the architecture of Yang
等人. (2016) that takes hierarchical structure of text into account with the
two-level attention mechanism (Bahdanau, Cho, and Bengio 2014; Xu et al.
2015) applied to word and sentence representations encoded by BiLSTMs.

Transfer learning. We use the pretrained BERT transformer architecture
with 12 layers of size 768 and 12 self-attention heads. A linear classification
head was added on top of the pretrained language model, and the whole
classification model was fine-tuned on every data set for three epochs. For
English data sets, the BERT-base-uncased English language model is used,
while for the Slovenian SB corpus, we use the CroSloEngual BERT model
trained on Slovenian, Croatian, and English (Ulˇcar and Robnik-ˇSikonja
2020).9

We randomly shuffle all the corpora; the Newsela and Slovenian SB corpora are then
split into train (80% of the corpus), validation (10% of the corpus), and test (10%
of the corpus) sets. Because of the small number of documents in OneStopEnglish and
WeeBit corpora (see description in Section 3), we used five-fold stratified crossvalidation
on these corpora to get more reliable results. For every fold, the corpora were split
into the train (80% of the corpus), validation (10% of the corpus), and test (10% of the
corpus) sets. We employ Scikit StratifiedKFold,10 both for train-test splits and five-fold
crossvalidation splits, in order to preserve the percentage of samples from each class.
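The splitting scheme can be reproduced roughly as follows; the labels and random seeds below are synthetic and illustrative:

```python
from collections import Counter
from sklearn.model_selection import StratifiedKFold, train_test_split

X = list(range(100))               # document indices (synthetic)
y = [i % 5 for i in range(100)]    # five readability classes, 20 docs each

# Stratified 80/10/10 train/validation/test split (larger corpora).
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

# Five-fold stratified cross-validation (smaller corpora).
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
folds = list(skf.split(X, y))
```

Stratification keeps the class proportions identical in every split, which matters when some readability classes are small.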

BiLSTM and HAN classifiers were trained on the train set and tested on the val-
idation set after every epoch (for a maximum of 100 epochs). The best performing
model on the validation set was selected as the final model and produced predictions
on the test sets. BERT models are fine-tuned on the train set for three epochs, and the
resulting model is tested on the test set. The validation sets were used in a grid search
to find the best hyperparameters of the models. For BiLSTM, all combinations of the
following hyperparameter values were tested before choosing the best-performing
combination on the validation set:

•   Batch size: 8, 16, 32

•   Learning rates: 0.00005, 0.0001, 0.0002, 0.0004, 0.0008

•   Word embedding size: 100, 200, 400

•   LSTM layer size: 128, 256

•   Number of LSTM layers: 1, 2, 3, 4

•   Dropout after every LSTM layer: 0.2, 0.3, 0.4

For HAN, we tested all combinations of the following hyperparameter values:

•   Batch size: 8, 16, 32

•   Learning rates: 0.00005, 0.0001, 0.0002, 0.0004, 0.0008

9 Both models are available through the Transformers library https://huggingface.co/transformers/.
10 https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html


•   Word embedding size: 100, 200, 400

•   Sentence embedding size: 100, 200, 400
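Exhaustive search over such a grid is a simple loop over the Cartesian product of the listed values; `evaluate` below is a hypothetical callback that would train one model and return its validation score:

```python
from itertools import product

# BiLSTM grid from the list above.
BILSTM_GRID = {
    "batch_size": [8, 16, 32],
    "learning_rate": [0.00005, 0.0001, 0.0002, 0.0004, 0.0008],
    "word_embedding_size": [100, 200, 400],
    "lstm_layer_size": [128, 256],
    "num_lstm_layers": [1, 2, 3, 4],
    "dropout": [0.2, 0.3, 0.4],
}

def grid_search(grid, evaluate):
    """Evaluate one model per combination and keep the best-scoring one."""
    names = list(grid)
    best_score, best_params = float("-inf"), None
    for values in product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score
```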

For BERT fine-tuning, we use the default learning rate of 0.00002. The input se-
quence length is limited to 512 byte-pair tokens, which is the maximum supported input
sequence length.

We used the same configuration for all the corpora and performed no corpus-
specific tweaking of classifier parameters. We measured the performance of all the clas-
sifiers in terms of accuracy (in order to compare their performance to the performance
of the classifiers from the related work), weighted average precision, weighted average
recall, and weighted average F1-score.11 Since readability classes are ordinal variables
(in our case ranging from 0 to n = number of classes−1), not all mistakes of classifiers are
equal; therefore we also utilize the Quadratic Weighted Kappa (QWK) measure, which
allows mispredictions to be weighted differently, according to the cost of a specific
mistake. Calculation of the QWK involves three matrices: the observed scores, the
ground truth scores, and the weights, which in our case correspond to the
distance d between the classes ci and cj, defined as d = |ci − cj|. QWK is therefore
calculated as:

QWK = 1 − (Σ_{i=1}^{c} Σ_{j=1}^{c} wij xij) / (Σ_{i=1}^{c} Σ_{j=1}^{c} wij mij)                (4)

where c is the number of readability classes and wij, xij, and mij are elements in the
重量, observed, and ground truth matrices, 分别.

5.2 Supervised Experimental Results

The results of supervised readability assessment using different architectures of deep
neural networks are presented in Table 5, together with the state-of-the-art baseline
results from the related work (Xia, Kochmar, and Briscoe 2016; Filighera, Steuer, and
Rensing 2019; Deutsch, Jasbi, and Shieber 2020). We only present the best result reported
by each of the baseline studies; the only exception is Deutsch, Jasbi, and Shieber (2020),
for which we present two results, SVM-BF (SVM with BERT features) and SVM-HF
(SVM with HAN features) that proved the best on the WeeBit and Newsela corpora,
respectively.

On the WeeBit corpus, by far the best performance according to all measures was
achieved by BERT. In terms of accuracy, BERT outperforms the second-best BiLSTM
by about 8 percentage points, achieving the accuracy of 85.73%. HAN performs the
worst on the WeeBit corpus according to all measures. BERT also outperforms the
accuracy result reported by Xia, Kochmar, and Briscoe (2016), who used the five-fold
crossvalidation setting and the accuracy result on the development set reported by
Filighera, Steuer, and Rensing (2019).12 In terms of weighted F1-score, both strategies

11 We use the Scikit implementation of the metrics (https://scikit-learn.org/stable/modules/classes

.html#module-sklearn.metrics) and set the “average” parameter to “weighted.”

12 For the study by Filighera, Steuer, and Rensing (2019), we report accuracy on the development set instead
of accuracy on the test set, as the authors claim that this result is more comparable to the results achieved
in the crossvalidation setting. On the test set, Filighera, Steuer, and Rensing (2019) report the best
accuracy of 74.4%.


Table 5
The results of the supervised approach to readability in terms of accuracy, weighted precision,
weighted recall, and weighted F1-score for the three neural network classifiers and methods
from the literature.

Measure / Data set                     WeeBit   OneStopEnglish   Newsela   Slovenian SB

Filighera et al. (2019) accuracy       0.8130         –             –           –
Xia et al. (2016) accuracy             0.8030         –             –           –
SVM-BF (Deutsch et al., 2020) F1       0.8381         –           0.7627        –
SVM-HF (Deutsch et al., 2020) F1         –            –           0.8014        –
Vajjala et al. (2018) accuracy           –          0.7813          –           –

BERT accuracy                          0.8573       0.6738        0.7573      0.4545
BERT precision                         0.8658       0.7395        0.7510      0.4736
BERT recall                            0.8573       0.6738        0.7573      0.4545
BERT F1                                0.8581       0.6772        0.7514      0.4157
BERT QWK                               0.9527       0.7077        0.9789      0.8855

HAN accuracy                           0.7520       0.7872        0.8138      0.4887
HAN precision                          0.7534       0.7977        0.8147      0.4866
HAN recall                             0.7520       0.7872        0.8138      0.4887
HAN F1                                 0.7520       0.7888        0.8101      0.4847
HAN QWK                                0.8860       0.8245        0.9835      0.8070

BiLSTM accuracy                        0.7743       0.6875        0.7111      0.5277
BiLSTM precision                       0.7802       0.7177        0.6910      0.5239
BiLSTM recall                          0.7743       0.6875        0.7111      0.5277
BiLSTM F1                              0.7750       0.6920        0.6985      0.5219
BiLSTM QWK                             0.9060       0.7230        0.9628      0.7980


that use BERT (utilizing the BERT classifier directly or feeding BERT features to the
SVM classifier as in Deutsch, Jasbi, and Shieber [2020]) seem to return similar results.
Finally, in terms of QWK, BERT achieves a very high score of 95.27% and the other two
tested classifiers obtain a good QWK score close to 90%.

The best result on Newsela is achieved by HAN, with an F1-score of 81.01%
and an accuracy of 81.38%. This is similar to the baseline SVM-HF result achieved by
Deutsch, Jasbi, and Shieber (2020), who fed HAN features to the SVM classifier. BERT
performs less competitively on the OneStopEnglish and Newsela corpora. On One-
StopEnglish, it is outperformed by the best performing classifier (HAN) by about 10
percentage points, and on Newsela, it is outperformed by about 6 percentage points
according to accuracy and F1 criteria. The most likely reason for the bad performance
of BERT on these two corpora is the length of documents in these two data sets. On
average, documents in the OneStopEnglish and Newsela corpora are 677 and 760 words
long. On the other hand, BERT only allows input documents of up to 512 byte-pair
代币, which means that documents longer than that need to be truncated. This results
in the substantial loss of information on the OneStopEnglish and Newsela corpora but
not on the WeeBit and Slovenian SB corpora, which contain shorter documents, 193 和
305 words long.

The results show that BiLSTM also has problems when dealing with longer texts,
even though it does not require input truncation. This suggests that the loss of con-
text is not the only reason for the non-competitive performance of BERT and BiLSTM,
and that the key to the successful classification of long documents is the leveraging of


计算语言学

体积 47, 数字 1

hierarchical information in the documents, which HAN was built to exploit. The
assumption is that this is particularly important in parallel corpora, where the simplified
versions of the original texts contain the same message as the original texts, which forces
the classifiers not to rely as much on semantic differences but rather focus on structural
差异.

While F1-scores and accuracies suggest large discrepancies in performance between
HAN and two other classifiers on the OneStopEnglish and Newsela corpora, QWK
scores draw a different picture. Although the discrepancy is still large on OneStop-
英语, all classifiers achieve almost perfect QWK scores on the Newsela data set.
This suggests that even though BERT and BiLSTM make more classification mistakes
than HAN, these mistakes are seldom costly on the Newsela corpus (i.e., documents are
classified into neighboring classes of the correct readability class). QWK scores achieved
on the Newsela corpus by all classifiers are also much higher than the scores achieved
on other corpora (except for the QWK score achieved by BERT on the WeeBit corpus).
This is in line with the results in the unsupervised setting, where the ρ values on the
Newsela corpus were substantially larger than on other corpora.
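The tolerance of QWK for near-miss errors can be verified directly. The sketch below implements quadratic weighted kappa in plain Python (the toy labels are ours, chosen so that every prediction misses the correct class by exactly one):

```python
from collections import Counter

def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    # Observed vs. chance-expected disagreement, weighted by the squared
    # distance between class indices, so neighboring-class errors cost little.
    obs = Counter(zip(y_true, y_pred))
    hist_t, hist_p = Counter(y_true), Counter(y_pred)
    n = len(y_true)
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2
            num += w * obs.get((i, j), 0) / n
            den += w * (hist_t.get(i, 0) / n) * (hist_p.get(j, 0) / n)
    return 1.0 - num / den

# Misclassifications into neighboring classes barely hurt QWK:
truth = [0, 1, 2, 3, 4] * 4
near  = [min(4, t + 1) for t in truth]   # always one class off
print(round(quadratic_weighted_kappa(truth, near, 5), 2))  # → 0.8
```

Accuracy on the same predictions would be only 20%, which illustrates how a classifier can score poorly on accuracy and F1 yet almost perfectly on QWK.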

The HAN classifier achieves the best performance on the OneStopEnglish corpus
with an accuracy of 78.72% in the five-fold cross-validation setting. This is comparable
to the state-of-the-art accuracy of 78.13% achieved by Vajjala and Luˇci´c (2018) with their
SMO classifier using 155 hand-crafted features. BiLSTM and BERT classifiers perform
similarly on this corpus, by about 10 percentage points worse than HAN, 根据
准确性, F1-score, and QWK.

The results on the Slovenian SB corpus are also interesting. In general, the perfor-
mance of the classifiers is the worst on this corpus, with the F1-score of 52.19% achieved
by BiLSTM being the best result. BiLSTM performs by about 4 percentage points better
than HAN according to F1-score and accuracy, while both classifiers achieve roughly the
same QWK score of about 80%. On the other hand, BERT achieves a lower F1-score (about
45.45%) and accuracy (41.57%), but performs much better than the other two classifiers
according to QWK, achieving QWK of almost 90%.

Confusion matrices for classifiers give us a better insight into what kind of mistakes
are specific for different classifiers. For the WeeBit corpus, confusion matrices show
(Figure 2) that all the tested classifiers have the most problems distinguishing between
texts for children 8–9 years old and 9–10 years old. The mistakes where the text is
falsely classified into an age group that is not neighboring the correct age group are rare.
For example, the best performing BERT classifier misclassified only 16 documents into
non-neighboring classes. When it comes to distinguishing between neighboring classes,
the easiest distinction for the classifiers was the distinction between texts for children
9–10 years old and 10–14 years old. Besides fitting into two distinct age groups, the
documents in these two classes also belong to two different sources (texts for children
9–10 years old consist of articles from WeeklyReader and texts for children 10–14 years
old consist of articles from BBC-Bitesize), which suggests that the semantic and writing
style dissimilarities between these two neighboring classes might be larger than for
other neighboring classes, and that might have a positive effect on the performance
of the classifiers.
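This kind of error analysis is straightforward to reproduce. A minimal sketch (plain Python; the toy labels are ours, not the paper's data) builds the confusion matrix and counts how many mistakes land outside the neighboring classes:

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    m = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m

def non_neighboring_errors(y_true, y_pred):
    # Mistakes more than one readability class away from the truth
    return sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) > 1)

truth = [0, 0, 1, 1, 2, 2, 3, 3]
pred  = [0, 1, 1, 2, 2, 2, 3, 0]   # one costly mistake: 3 -> 0
print(non_neighboring_errors(truth, pred))  # → 1
```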

On the OneStopEnglish corpus (Figure 3), the BERT classifier, which performs the
worst on this corpus according to all criteria but precision, had the most problems
correctly classifying documents from the advanced class, misclassifying about half of
the documents. HAN had the most problems with distinguishing documents from the
advanced and intermediate class, while the BiLSTM classifier classified a disproportion-
ate amount of intermediate documents into the beginner class.


Martinc, Pollak, and Robnik-ˇSikonja

Neural Approaches to Text Readability

A) BERT

乙) HAN

C) BiLSTM

数字 2
Confusion matrices for BERT, HAN, and BiLSTM on the WeeBit corpus.


A) BERT

乙) HAN

C) BiLSTM

数字 3
Confusion matrices for BERT, HAN, and BiLSTM on the OneStopEnglish corpus.

A) BERT

乙) HAN

C) BiLSTM

数字 4
Confusion matrices for BERT, HAN, and BiLSTM on the Newsela corpus.

Confusion matrices of all classifiers for the Newsela corpus (Figure 4) follow a simi-
lar pattern. As expected, no classifier predicted any documents to be in the two minority
classes (10th and 11th grade) with minimal training examples. As the QWK score has
already shown, all classifiers classified a large majority of misclassified instances into



A) BERT

乙) HAN

C) BiLSTM

数字 5
Confusion matrices for BERT, HAN, and BiLSTM on the Slovenian school books corpus.

neighboring classes, and costlier mistakes are rare. For example, the best performing
HAN classifier misclassified only 13 examples into non-neighboring classes altogether.
Confusion matrices for the Slovenian SB corpus (Figure 5) are similar for all clas-
sifiers. The biggest spread of misclassified documents is visible for the classes in the
middle of the readability range (from the 4th grade of primary school to the 1st grade
of high school). The mistakes that cause BERT to have lower F1-score and accuracy
scores than the other two classifiers are most likely connected to the misclassification
of all but two documents belonging to the school books for the 6th grade of primary
school. Nevertheless, a large majority of these documents were misclassified into two
neighboring classes, which explains the high QWK score achieved by the classifier.
What negatively affected the QWK scores for HAN and BiLSTM is that they make the
costlier mistakes of classifying documents several grades above or below the correct
grade slightly more frequently than BERT does. Nevertheless, although the F1-score
results are relatively low on this data set for all classifiers (BiLSTM achieved the best
F1-score of 52.19%), the QWK scores around or above 80% and the confusion matrices
clearly show that a large majority of misclassified examples were put into classes close
to the correct one, suggesting that classification approaches to readability prediction
can also be reliably used for Slovenian.

Overall, the classification results suggest that neural networks are a viable option
for supervised readability prediction. Some of the proposed neural approaches man-
aged to outperform state-of-the-art machine learning classifiers that leverage feature
engineering (Xia, Kochmar, and Briscoe 2016; Vajjala and Luˇci´c 2018; Deutsch, Jasbi, and
Shieber 2020) on all corpora where comparisons are available. However, the gains are
not substantial, and the choice of an appropriate architecture depends on the properties
of the specific data set.

6. Conclusion

We presented a set of novel unsupervised and supervised approaches for determining
the readability of documents using deep neural networks. We tested them on several
manually labeled English and Slovenian corpora. We argue that deep neural networks
are a viable option both for supervised and unsupervised readability prediction and
show that the suitability of a specific architecture for the readability task depends on
the data set specifics.




We demonstrate that neural language models can be successfully used in the un-
supervised setting, since they, in contrast to n-gram models, capture high-level textual
properties and can successfully leverage rich semantic information obtained from the
training data set. 然而, the results of this study suggest that unsupervised ap-
proaches to readability prediction that only take these properties of text into account
cannot compete with the shallow lexical sophistication indicators. This is somewhat in
line with the findings of the study by Todirascu et al. (2016), who also acknowledged
the supremacy of shallow lexical indicators when compared with higher-level discourse
特征. 尽管如此, combining the components of both neural and traditional read-
ability indicators into the new RSRS (ranked sentence readability score) measure does
improve the correlation with human readability scores.

We argue that the RSRS measure is adaptable, robust, and transferable across lan-
guages. The results of the unsupervised experiments show the influence of the language
model training set on the performance of the measure. While the results indicate that
an exact match between the genres of the training and test sets is not necessary, the text
complexity of the training set (i.e., its readability) should be in the lower or middle part of
the readability spectrum of the test set for the optimal performance of the measure.
This indicates that out of the two high-level text properties that the RSRS measure
uses for determining readability, semantic information and long-distance structural
信息, the latter seems to have more effect on the performance. This is further
confirmed by the results of using the general BERT language model for the readability
prediction, which show a negative correlation between the language model perplexity
and readability, even though the semantic information the model possesses is extensive
due to the large training set.

The functioning of the proposed RSRS measure can be customized and influenced
by the choice of the training set. This is a desirable property because it enables personal-
ization and localization of the readability measure according to the educational needs,
language, and topic. The usability of this feature might be limited for under-resourced
languages because a sufficient number of documents needed to train a language model
that can be used for the task of readability prediction in a specific customized setting
might not be available. On the other hand, our experiments on the Slovenian language
show that a relatively small 2.4 million word training corpus for language models is
sufficient to outperform traditional readability measures.

The results of the unsupervised approach to readability prediction on the corpus
of Slovenian school books are not entirely consistent with the results reported by the
previous Slovenian readability study (ˇSkvorc et al. 2019), where the authors reported
that simple indicators of readability, such as average sentence length, performed quite
well. Our results show that the average sentence length performs very competitively on
English but ranks badly on Slovenian. This inconsistency in results might be explained
by the difference in corpora used for the evaluation of our approaches. Whereas ˇSkvorc
et al. (2019) conducted experiments on a corpus of magazines for different age groups
(which we used for language model training), our experiments were conducted on a
corpus of school books, which contains items for sixteen distinct school subjects with
very different topics ranging from literature, music, and history to math, biology, and
chemistry. As was already shown in Sheehan, Flor, and Napolitano (2013), the variance
in genres and covered topics has an important effect on the ranking and performance
of different readability measures. Further experiments on other Slovenian data sets are
required to confirm this hypothesis.

In the supervised approach to determining readability, we show that the pro-
posed neural classifiers can either outperform or at least compare with state-of-the-art




approaches leveraging extensive feature engineering as well as previously used neural
models on all corpora where comparison data is available. While the improved perfor-
mance and elimination of work required for manual feature engineering are desirable,
on the downside, neural approaches tend to decrease the interpretability and explain-
ability of the readability prediction. Interpretability and explainability are especially
important for educational applications (Sheehan et al. 2014; Madnani and Cahill 2018),
where the users of such technology (educators, teachers, researchers, etc.) often need
to understand what causes one text to be judged as more readable than the other and
according to which dimensions. Therefore, in the future, we will explore the possibilities
of explaining the readability predictions of the proposed neural classifier with the
help of general explanation techniques such as SHAP (Lundberg and Lee 2017) or the
attention mechanism (Vaswani et al. 2017), which can be analyzed and visualized and
can offer valuable insights into the inner workings of the system.

Another issue worth discussing is the trade-off between the performance gains we can
achieve by employing computationally demanding neural networks on the one side and
the elimination of manual work on the other. For example, on the OneStopEnglish corpus,
we report an accuracy of 78.72% when HAN is used, while Vajjala and Luˇci´c (2018) report
an accuracy of 78.13% with their classifier employing 155 hand-crafted features. While it
might be worth opting for a neural network in order to avoid extensive manual feature
engineering, the same study by Vajjala and Luˇci´c (2018) also reports
that just by employing generic text classification features, 2–5 character n-grams, they
obtained an accuracy of 77.25%. Considering this, one might argue that, depending
on the use case, it might not be worth dedicating significantly more time, work, or
computational resources for an improvement of slightly more than 1%, especially if this
also decreases the overall interpretability of the prediction.
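The generic character n-gram features mentioned above are cheap to compute; the sketch below extracts them in plain Python (the helper name is ours; in the cited study such counts feed an SMO classifier):

```python
from collections import Counter

def char_ngrams(text, n_min=2, n_max=5):
    """Count all character n-grams of length 2-5, the generic feature
    set reported to reach 77.25% accuracy on OneStopEnglish."""
    feats = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            feats[text[i:i + n]] += 1
    return feats

feats = char_ngrams("readability")
print(feats["ab"], feats["read"])  # each distinct n-gram is one feature
```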

The performance of different classifiers varies across different corpora. The major
factor proved to be the length of the documents in the data sets. The HAN architecture,
which is well equipped to handle long-distance hierarchical text structures, performs
best on the data sets with longer documents. On the other hand, in terms of the QWK
measure, BERT offers significantly better performance on data sets that contain shorter
documents, such as WeeBit and Slovenian SB. As was already explained in Section 5.2,
a large majority of OneStopEnglish and Newsela documents need to be truncated in
order to satisfy BERT's limitation of 512 byte-pair tokens. Although it is reasonable to assume
that the truncation and the consequential loss of information do have a detrimental
effect on the performance of the classifier, the extent of this effect is still unclear. 这
problem of truncation also raises the question of what the minimum text length required
for a reliable assessment of readability is, and whether there exists a length threshold
above which additional text does not significantly influence the performance of a
classifier. We plan to assess this thoroughly in future work. Another related
line of research we plan to pursue is the use of novel algorithms, such
as Longformer (Beltagy, Peters, and Cohan 2020) and Linformer (Wang et al. 2020),
in which the attention mechanism scales linearly with the sequence length, making it
feasible to process documents of thousands of tokens. We will check whether applying these
two algorithms on the readability data sets with longer documents can further improve
the state of the art.
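The appeal of these architectures can be quantified with a back-of-the-envelope sketch: full self-attention scores every token pair, while windowed or low-rank variants score each token against a fixed budget k (the value of k here is illustrative, not taken from either paper):

```python
def attention_entries(n, k=512):
    """Number of attention-score entries for a sequence of n tokens:
    full self-attention is O(n^2), linear variants are O(n * k)."""
    return n * n, n * k

full, linear = attention_entries(4096)
print(full // linear)  # → 8: full attention needs 8x the entries at n=4096
```

The gap widens linearly with sequence length, which is what makes documents of thousands of tokens feasible for these models.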

The other main difference between the WeeBit and Slovenian SB data sets on the one
hand, and the Newsela and OneStopEnglish data sets on the other, is that the former are
not parallel corpora, which means that there can be substantial semantic differences
between their readability classes. It seems that pretraining BERT as a language
model allows for better exploitation of these differences, which leads to better




performance. However, this reliance on semantic information might negatively affect the
performance of transfer learning based models on parallel corpora, since the semantic
differences between classes in these corpora are much more subtle. We plan to assess
the influence of available semantic information on the performance of different classifi-
cation models in the future.

The differences in performance between classifiers on different corpora suggest that
the tested classifiers take different types of information into account. Provided that this
hypothesis is correct, some gains in performance might be achieved if the classifiers are
combined. We plan to test a neural ensemble approach for the task of predicting read-
ability in the future.
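One simple way to combine the classifiers would be soft voting over their predicted class probabilities (a sketch with made-up probability vectors; the actual ensemble design is left to future work, as stated above):

```python
def soft_vote(prob_lists):
    """Average class-probability vectors from several classifiers and
    pick the argmax; disagreements resolve toward overall confidence."""
    n = len(prob_lists)
    avg = [sum(p[i] for p in prob_lists) / n for i in range(len(prob_lists[0]))]
    return max(range(len(avg)), key=avg.__getitem__)

# Hypothetical per-class probabilities from three classifiers:
bert   = [0.6, 0.3, 0.1]
han    = [0.2, 0.5, 0.3]
bilstm = [0.1, 0.7, 0.2]
print(soft_vote([bert, han, bilstm]))  # → 1 (class 1 wins on average)
```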

While this study mostly focused on multilingual and multi-genre readability predic-
的, in the future, we also plan to test the cross-corpus, cross-genre, and cross-language
transferability of the proposed supervised and unsupervised approaches. This requires
new readability data sets for different languages and genres that are currently rare
or not publicly available. On the other hand, this type of research will be capable of
further determining the role of genre in the readability prediction and might open an
opportunity to improve the proposed unsupervised readability score further.

致谢
The research was financially supported by
the European Social Fund and Republic of
Slovenia, Ministry of Education, Science, and
Sport through project Quality of Slovene
Textbooks (KaU ˇC). The work was also
supported by the Slovenian Research Agency
(ARRS) through core research programs
P6-0411 and P2-0103, and the projects
Terminology and Knowledge Frames Across
Languages (J6-9372) and Quantitative and
Qualitative Analysis of the Unregulated
Corporate Financial Reporting (J5-2554). 这
work has also received funding from the
European Union’s Horizon 2020 Research
and Innovation program under grant
agreement no. 825153 (EMBEDDIA). 这
results of this publication reflect only the
authors’ views, and the EC is not responsible
for any use that may be made of the
information it contains.

References
Anderson, Jonathan. 1981. Analysing the
readability of English and non-English
texts in the classroom with LIX. In Seventh
Australian Reading Association Conference,
pages 1–12, Darwin.

Azpiazu, Ion Madrazo and Maria Soledad

Pera. 2019. Multiattentive recurrent neural
network architecture for multilingual
readability assessment. Transactions of the
Association for Computational Linguistics,
7:421–436. DOI: https://doi.org/10
.1162/tacl_a_00278

Bahdanau, Dzmitry, Kyunghyun Cho, and
Yoshua Bengio. 2014. Neural machine
translation by jointly learning to align
and translate. arXiv preprint arXiv:1409
.0473.

Bai, Shaojie, J. Zico Kolter, and Vladlen

Koltun. 2018. An empirical evaluation of
generic convolutional and recurrent
networks for sequence modeling. arXiv
preprint arXiv:1803.01271.

Beltagy, Iz, Matthew E. Peters, and Arman

Cohan. 2020. Longformer: The
long-document transformer. arXiv preprint
arXiv:2004.05150.

Bengio, Yoshua, Patrice Simard, and Paolo

Frasconi. 1994. Learning long-term
dependencies with gradient descent is
difficult. IEEE Transactions on Neural
Networks, 5(2):157–166. DOI: https://
doi.org/10.1109/72.279181, PMID:
18267787

Bird, Steven and Edward Loper. 2004. NLTK:

The natural language toolkit. In
Proceedings of the ACL 2004 on Interactive
Poster and Demonstration Sessions, page 31,
Barcelona. DOI: https://doi.org/10
.3115/1219044.1219075

Bormuth, John R. 1969. Development of

Readability Analysis. ERIC Clearinghouse.

Cho, Kyunghyun, Bart van Merri¨enboer,
Caglar Gulcehre, Dzmitry Bahdanau,
Fethi Bougares, Holger Schwenk, and
Yoshua Bengio. 2014. Learning phrase
representations using RNN
encoder–decoder for statistical machine
translation. In Proceedings of the 2014
Conference on Empirical Methods in Natural
Language Processing (EMNLP),
pages 1724–1734, Doha. DOI: https://doi
.org/10.3115/v1/D14-1179




Collins-Thompson, Kevyn. 2014.

Computational assessment of text
readability: A survey of current and future
research. ITL-International Journal of Applied
Linguistics, 165(2):97–135. DOI: https://
doi.org/10.1075/itl.165.2.01col

Collobert, Ronan, Jason Weston, L´eon
Bottou, Michael Karlen, Koray
Kavukcuoglu, and Pavel Kuksa. 2011.
Natural language processing (almost) from
scratch. Journal of Machine Learning
Research, 12(August):2493–2537.

Conneau, Alexis, Douwe Kiela, Holger
Schwenk, Lo¨ıc Barrault, and Antoine
Bordes. 2017. Supervised learning of
universal sentence representations from
natural language inference data. In
Proceedings of the 2017 Conference on
Empirical Methods in Natural Language
Processing, pages 670–680, Copenhagen.
DOI: https://doi.org/10.18653/v1/D17
-1070

Council of Europe, Council for Cultural
Co-operation. Education Committee.
Modern Languages Division. 2001.
Common European Framework of Reference for
Languages: Learning, Teaching, Assessment.
Cambridge University Press.

Crossley, Scott A., Stephen Skalicky, Mihai
Dascalu, Danielle S. McNamara, 和
Kristopher Kyle. 2017. Predicting text
comprehension, processing, and
familiarity in adult readers: New
approaches to readability formulas.
Discourse Processes, 54(5-6):340–359.
DOI: https://doi.org/10.1080
/0163853X.2017.1296264

Dale, Edgar and Jeanne S. Chall. 1948. A
formula for predicting readability:
Instructions. Educational Research Bulletin,
pages 37–54.

Davison, Alice and Robert N. Kantor. 1982.
On the failure of readability formulas to
define readable texts: A case study from
adaptations. Reading Research Quarterly,
pages 187–209. DOI: https://doi.org
/10.2307/747483

Deutsch, Tovly, Masoud Jasbi, and Stuart
Shieber. 2020. Linguistic features for
readability assessment. arXiv preprint
arXiv:2006.00377. DOI: https://doi.org
/10.18653/v1/2020.bea-1.1

Devlin, Jacob, Ming-Wei Chang, Kenton Lee,

and Kristina Toutanova. 2019. BERT:
Pre-training of deep bidirectional
transformers for language understanding.
In Proceedings of the 2019 Conference of the
North American Chapter of the Association for
Computational Linguistics: Human Language


Technologies, 体积 1 (Long and Short
Papers), pages 4171–4186, Minneapolis, MN.

Feng, Lijun. 2008. Text simplification: A

survey, The City University of New York.

Feng, Lijun, No´emie Elhadad, and Matt

Huenerfauth. 2009. Cognitively motivated
features for readability assessment. In
Proceedings of the 12th Conference of the
European Chapter of the ACL (EACL 2009),
pages 229–237, Athens. DOI: https://doi
.org/10.3115/1609067.1609092

Feng, Lijun, Martin Jansche, Matt

Huenerfauth, and No´emie Elhadad. 2010.
A comparison of features for automatic
readability assessment. In COLING 2010:
Posters, pages 276–284, Beijing.

Filighera, Anna, Tim Steuer, and Christoph
Rensing. 2019. Automatic text difficulty
estimation using embeddings and neural
networks. In European Conference on
Technology Enhanced Learning,
pages 335–348, Delft. DOI: https://doi
.org/10.1007/978-3-030-29736-7 25
Flor, Michael, Beata Beigman Klebanov, and
Kathleen M. Sheehan. 2013. Lexical
tightness and text complexity. In
Proceedings of the Workshop on Natural
Language Processing for Improving Textual
Accessibility, pages 29–38, Atlanta, GA.
Goldberg, Yoav and Jon Orwant. 2013. A
data set of syntactic-ngrams over time
from a very large corpus of English books.
In Second Joint Conference on Lexical and
Computational Semantics, pages 241–247,
Atlanta, GA.

Goodfellow, Ian, Yoshua Bengio, and Aaron
Courville. 2016. Deep Learning. MIT Press.
Gunning, Robert. 1952. The Technique of Clear
Writing. McGraw-Hill, New York.

Halliday, Michael Alexander Kirkwood and
Ruqaiya Hasan. 1976. Cohesion in English.
Routledge.

Jawahar, Ganesh, Benoˆıt Sagot, and Djam´e

Seddah. 2019. What does BERT learn about
the structure of language? In Proceedings of
the 57th Annual Meeting of the Association for
Computational Linguistics, pages 3651–3657,
Florence, Italy. DOI: https://doi.org/10
.18653/v1/P19-1356

Jiang, Birong, Endong Xun, and Jianzhong

Qi. 2015. A domain independent approach
for extracting terms from research papers.
In Australasian Database Conference,
pages 155–166, Melbourne. DOI: https://
doi.org/10.1007/978-3-319-19548-3 13

Kandel, Lilian and Abraham Moles. 1958.
Application de l’indice de flesch `a la
langue franc¸aise. Cahiers Etudes de
Radio-T´el´evision, 19(1958):253–274.



Kim, Yoon, Yacine Jernite, David Sontag, and
Alexander M. Rush. 2016. Character-aware
neural language models. In AAAI,
pages 2741–2749, Phoenix, AZ.

Kincaid, J. Peter, Robert P. Fishburne Jr,
Richard L. Rogers, and Brad S. Chissom.
1975. Derivation of New Readability Formulas
(Automated Readability Index, Fog Count and
Flesch Reading Ease Formula) for Navy
Enlisted Personnel. Institute for Simulation
and Training, University of Central
Florida.

Kudo, Taku and John Richardson. 2018.

Sentencepiece: A simple and language
independent subword tokenizer and
detokenizer for neural text processing.
arXiv preprint arXiv:1808.06226. DOI:
https://doi.org/10.18653/v1/D18
-2012, PMID: 29382465

Landauer, Thomas K. 2011. Pearson’s text

complexity measure, Pearson.

Logar, Nataˇsa, Miha Grˇcar, Marko Brakus,

Tomaˇz Erjavec, ˇSpela Arhar Holdt, Simon
Krek, and Iztok Kosem. 2012. Korpusi
slovenskega jezika Gigafida, KRES, ccGigafida
in ccKRES: gradnja, vsebina, uporaba.
Trojina, zavod za uporabno slovenistiko.
Lundberg, Scott M. and Su-In Lee. 2017. A
unified approach to interpreting model
predictions. In Advances in Neural
Information Processing Systems,
pages 4768–4777, Long Beach, CA.
Ma, Yi, Eric Fosler-Lussier, and Robert

Lofthus. 2012. Ranking-based readability
assessment for early primary children’s
literature. In Proceedings of the 2012
Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies,
pages 548–552, Montreal.

Madnani, Nitin and Aoife Cahill. 2018.
Automated scoring: Beyond natural
language processing. In Proceedings
of the 27th International Conference on
Computational Linguistics, pages 1099–1109,
Santa Fe, NM.

Madrazo Azpiazu, Ion and Maria Soledad
Pera. 2020. Is crosslingual readability
assessment possible? Journal of the
Association for Information Science and
Technology, 71(6):644–656. DOI: https://
doi.org/10.1002/asi.24293
McLaughlin, G. Harry. 1969. SMOG

grading—a new readability formula.
Journal of Reading, 12(8):639–646.
Mikolov, Tom´aˇs, Anoop Deoras, Stefan
Kombrink, Luk´aˇs Burget, and Jan
ˇCernock `y. 2011. Empirical evaluation and
combination of advanced language

modeling techniques. In Twelfth Annual
Conference of the International Speech
Communication Association, pages 605–608.

Mikolov, Tomas, Ilya Sutskever, Kai Chen,
Greg S. Corrado, and Jeff Dean. 2013.
Distributed representations of words and
phrases and their compositionality. In
Advances in Neural Information Processing
Systems, pages 3111–3119, Florence.
Mohammadi, Hamid and Seyed Hossein
Khasteh. 2019. Text as environment: A
deep reinforcement learning text
readability assessment model. arXiv
preprint arXiv:1912.05957.

Nadeem, Farah and Mari Ostendorf. 2018.
Estimating linguistic complexity for
science texts. In Proceedings of the Thirteenth
Workshop on Innovative Use of NLP for
Building Educational Applications,
pages 45–55, New Orleans, LA. DOI:
https://doi.org/10.18653/v1/W18
-0505

Napolitano, Diane, Kathleen M. Sheehan,
and Robert Mundkowsky. 2015. Online
readability and text complexity analysis
with TextEvaluator. In Proceedings of the
2015 Conference of the North American
Chapter of the Association for Computational
Linguistics: Demonstrations, pages 96–100,
Denver, CO. DOI: https://doi.org/10
.3115/v1/N15-3020

Pennington, Jeffrey, Richard Socher, and
Christopher Manning. 2014. GloVe: Global
vectors for word representation. In
Proceedings of the 2014 Conference on
Empirical Methods in Natural Language
Processing (EMNLP), pages 1532–1543,
Doha. DOI: https://doi.org/10.3115
/v1/D14-1162

Peters, Matthew E., Mark Neumann, Mohit
Iyyer, Matt Gardner, Christopher Clark,
Kenton Lee, and Luke Zettlemoyer. 2018.
Deep contextualized word representations.
In Proceedings of NAACL-HLT,
pages 2227–2237, New Orleans, LA. DOI:
https://doi.org/10.18653/v1/N18-1202
Petersen, Sarah E. and Mari Ostendorf. 2009.
A machine learning approach to reading
level assessment. Computer Speech &
Language, 23(1):89–106. DOI: https://
doi.org/10.1016/j.csl.2008.04.003

Pitler, Emily and Ani Nenkova. 2008.
Revisiting readability: A unified
framework for predicting text quality. In
Proceedings of the 2008 Conference on
Empirical Methods in Natural Language
Processing, pages 186–195, Honolulu, HI.
DOI: https://doi.org/10.3115
/1613715.1613742




Schwarm, Sarah E. and Mari Ostendorf.
2005. Reading level assessment using
support vector machines and statistical
language models. In Proceedings of the 43rd
Annual Meeting of the Association for
计算语言学, pages 523–530,
Ann Arbor, MI. DOI: https://doi.org
/10.3115/1219840.1219905

Sheehan, Kathleen M., Michael Flor, 和
Diane Napolitano. 2013. A two-stage
approach for generating unbiased
estimates of text complexity. In Proceedings
of the Workshop on Natural Language
Processing for Improving Textual
Accessibility, pages 49–58, Atlanta, GA.
Sheehan, Kathleen M., Irene Kostin, Yoko

Futagi, and Michael Flor. 2010. Generating
automated text complexity classifications
that are aligned with targeted text
complexity standards. ETS Research Report
Series, 2010(2):i–44. DOI: https://doi
.org/10.1002/j.2333-8504.2010
.tb02235.x

Sheehan, Kathleen M., Irene Kostin, Diane
Napolitano, and Michael Flor. 2014. 这
TextEvaluator tool: Helping teachers and
test developers select texts for use in
instruction and assessment. The Elementary
School Journal, 115(2):184–209. DOI:
https://doi.org/10.1086/678294
ˇSkvorc, Tadej, Simon Krek, Senja Pollak,

ˇSpela Arhar Holdt, and Marko
Robnik-ˇSikonja. 2019. Predicting Slovene
text complexity using readability
measures. Contributions to Contemporary
History (Spec. Issue on Digital Humanities
and Language Technologies), 59(1):198–220.
DOI: https://doi.org/10.51663/pnz
.59.1.10

史密斯, Edgar A. 和R. J. Senter. 1967.

Automated readability index. AMRL-TR.
Aerospace Medical Research Laboratories
(US), pages 1–14.

Todirascu, Amalia, Thomas Franc¸ois,

Delphine Bernhard, N ´uria Gala, 和
Anne-Laure Ligozat. 2016. Are cohesive
features relevant for text readability
assessment? In Proceedings of COLING 2016,
the 26th International Conference on
Computational Linguistics: Technical Papers,
pages 987–997, Osaka.

Ulˇcar, Matej and Marko Robnik-ˇSikonja.
2020. FinEst BERT and CroSloEngual
BERT. In International Conference on Text,
Speech, and Dialogue, pages 104–111, Brno.

Vajjala, Sowmya and Ivana Luˇci´c. 2018.

OneStopEnglish corpus: A new corpus for
automatic readability assessment and text
simplification. In Proceedings of the


Thirteenth Workshop on Innovative Use of
NLP for Building Educational Applications,
pages 297–304, New Orleans, 这.

Vajjala, Sowmya and Detmar Meurers. 2012.
On improving the accuracy of readability
classification using insights from second
language acquisition. In Proceedings of the
Seventh Workshop on Building Educational
Applications Using NLP, pages 163–173,
Montreal.

Van Dijk, Teun Adrianus. 1977. Text and

Context: Explorations in the Semantics and
Pragmatics of Discourse. Longman, London.

Vaswani, Ashish, Noam Shazeer, Niki

Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N. Gomez, Łukasz Kaiser, and Illia
Polosukhin. 2017. Attention is all you
need. In Advances in Neural Information
Processing Systems, pages 5998–6008,
Long Beach, CA.

Wang, Sinong, Belinda Li, Madian Khabsa,
Han Fang, and Hao Ma. 2020. Linformer:
Self-attention with linear complexity.
arXiv preprint arXiv:2006.04768.

Wang, Ziyu, Tom Schaul, Matteo Hessel,
Hado Hasselt, Marc Lanctot, and
Nando Freitas. 2016. Dueling network
architectures for deep reinforcement
learning. In International Conference on
Machine Learning, pages 1995–2003,
New York.

Williams, Geoffrey. 2006. Michael Hoey.
Lexical priming: A new theory of words
and language. International Journal of
Lexicography, 19(3):327–335. DOI:
https://doi.org/10.1093/ijl/ecl017

Xia, Menglin, Ekaterina Kochmar, and
Ted Briscoe. 2016. Text readability
assessment for second language learners.
In Proceedings of the 11th Workshop on
Innovative Use of NLP for Building
Educational Applications, pages 12–22,
San Diego, CA. DOI: https://doi.org
/10.18653/v1/W16-0502, PMCID:
PMC4879617

Xu, Kelvin, Jimmy Ba, Ryan Kiros,
Kyunghyun Cho, Aaron Courville, Ruslan
Salakhudinov, Richard Zemel, and Yoshua
Bengio. 2015. Show, attend and tell: Neural
image caption generation with visual
attention. In International Conference on
Machine Learning, pages 2048–2057, Lille.
Xu, Wei, Chris Callison-Burch, and Courtney
Napoles. 2015. Problems in current text
simplification research: New data can help.
Transactions of the Association for
Computational Linguistics, 3(1):283–297.
DOI: https://doi.org/10.1162/tacl_a_00139


Martinc, Pollak, and Robnik-ˇSikonja

Neural Approaches to Text Readability

Yang, Zichao, Diyi Yang, Chris Dyer,
Xiaodong He, Alex Smola, and Eduard
Hovy. 2016. Hierarchical attention
networks for document classification. In
Proceedings of the 2016 Conference of the
North American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, pages 1480–1489, San Diego,
CA. DOI: https://doi.org/10.18653/v1
/N16-1174

Zhang, Xiang, Junbo Zhao, and Yann LeCun.
2015. Character-level convolutional
networks for text classification. In
Advances in Neural Information Processing
Systems, pages 649–657, Montreal.

Zhu, Yukun, Ryan Kiros, Rich Zemel, Ruslan
Salakhutdinov, Raquel Urtasun, Antonio
Torralba, and Sanja Fidler. 2015. Aligning
books and movies: Towards story-like
visual explanations by watching movies
and reading books. In Proceedings of the
IEEE International Conference on Computer
Vision, pages 19–27, Santiago. DOI:
https://doi.org/10.1109/ICCV.2015.11

Zwitter Vitez, Ana. 2014. Ugotavljanje
avtorstva besedil: primer "trenirkarjev."
In Language Technologies: Proceedings of the
17th International Multiconference
Information Society IS 2014, pages 131–134,
Ljubljana.

