Grammar Error Correction in Morphologically Rich Languages:
The Case of Russian
Alla Rozovskaya
Queens College, City University of New York
arozovskaya@qc.cuny.edu
Dan Roth
University of Pennsylvania
danroth@seas.upenn.edu
Abstract
To date, most of the research in grammar error correction has focused on English, and the problem has hardly been explored for other languages. We address the task of correcting writing mistakes in morphologically rich languages, with a focus on Russian. We present a corrected and error-tagged corpus of Russian learner writing and develop models that make use of existing state-of-the-art methods that have been well studied for English. Although impressive results have recently been achieved for grammar error correction of non-native English writing, these results are limited to domains where plentiful training data are available. Because annotation is extremely costly, these approaches are not suitable for the majority of domains and languages. We thus focus on methods that use ‘‘minimal supervision’’; that is, those that do not rely on large amounts of annotated training data, and show how existing minimal-supervision approaches extend to a highly inflectional language such as Russian. The results demonstrate that these methods are particularly useful for correcting mistakes in grammatical phenomena that involve rich morphology.
1 Introduction
This paper addresses the task of correcting errors in text. Most of the research in the area of grammar error correction (GEC) has focused on correcting mistakes made by English language learners. One standard approach to dealing with these errors, which proved highly successful in text correction competitions (Dale and Kilgarriff, 2011; Dale et al., 2012; Ng et al., 2013, 2014; Rozovskaya et al., 2017), makes use of a machine-learning classifier paradigm and is based on the methodology for correcting context-sensitive spelling mistakes (Golding and Roth, 1996, 1999; Banko and Brill, 2001). In this approach, classifiers are trained for a particular mistake type: for example, preposition, article, or noun number (Tetreault et al., 2010; Gamon, 2010; Rozovskaya and Roth, 2010c,b; Dahlmeier and Ng, 2012). Initially, classifiers were trained on native English data. As several annotated learner datasets became available, models were also trained on annotated learner data.
Recently, the statistical machine translation (MT) approach, including neural MT, has gained considerable popularity thanks to the availability of large annotated corpora of learner writing (e.g., Yuan and Briscoe, 2016; Junczys-Dowmunt and Grundkiewicz, 2016; Chollampatt and Ng, 2018). Classification methods work very well on well-defined types of errors, whereas MT is good at correcting interacting and complex types of mistakes, which makes these approaches complementary in some respects (Rozovskaya and Roth, 2016).
Thanks to the availability of large (in-domain) datasets, substantial gains in performance have been made in English grammar correction. Unfortunately, research on other languages has been scarce. Previous work includes efforts to create annotated learner corpora for Arabic (Zaghouani et al., 2014), Japanese (Mizumoto et al., 2011), and Chinese (Yu et al., 2014), and shared tasks on Arabic (Mohit et al., 2014; Rozovskaya et al., 2015) and Chinese error detection (Lee et al., 2016; Rao et al., 2017). However, building robust models in other languages has been a challenge, since an approach that relies on heavy supervision is not viable across languages, genres, and learner backgrounds. Moreover, for languages that are complex morphologically, we may need more data to address the lexical sparsity.
Transactions of the Association for Computational Linguistics, vol. 7, pp. 1–17, 2019. Action Editor: Jianfeng Gao.
Submission batch: 4/2018; Revision batch: 8/2018; Published 3/2019.
© 2019 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
This work focuses on Russian, a highly inflectional language from the Slavic group. Russian has over 260M speakers, for 47% of whom Russian is not their native language.1 We corrected and error-tagged over 200K words of non-native Russian texts. We use this dataset to build several grammar correction systems that draw on and extend the methods that showed state-of-the-art performance on English grammar correction. Because the size of our annotation is limited, compared with what is used for English, one of the goals of our work is to quantify the effect of having limited annotation on existing approaches. We evaluate both the MT paradigm, which requires large amounts of annotated learner data, and classification approaches that can work with any amount of supervision.
Overall, the results obtained for Russian are much lower than those reported for English. We further find that the minimal-supervision classification methods that can combine large amounts of native data with a small annotated learner sample give the best results on a language with rich morphology and with limited annotation. The system that uses classifiers with minimal supervision achieves an F0.5 score of 21.0,2 whereas the MT system trained on the same data achieves a score of only 10.6.
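The F0.5 scores above follow the standard weighted F-measure with β = 0.5, which counts precision twice as heavily as recall. A minimal sketch of the metric (an illustrative helper, not the paper's evaluation code):

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Weighted harmonic mean of precision and recall.

    beta < 1 rewards precision over recall, matching the F0.5
    convention used in grammar correction since the CoNLL shared tasks.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)
```

With equal precision and recall the score reduces to that common value; under this metric a precision-heavy system (say P = 0.8, R = 0.2) scores markedly higher than its recall-heavy mirror image (P = 0.2, R = 0.8).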
This paper makes the following contributions: (1) We describe an error classification schema for Russian learner errors, and present an error-tagged Russian learner corpus. The dataset is available for research3 and can serve as a benchmark dataset for Russian, which should facilitate progress on grammar correction research, especially for languages other than English. (2) We present an analysis of the annotated data, in terms of error rates, error distributions by learner type (foreign and heritage), as well as a comparison to learner corpora in other languages. (3) We extend state-of-the-art grammar correction methods to a morphologically rich language and, in particular, identify the classifiers needed to address mistakes
1https://en.wikipedia.org/wiki/Russian_language.
2This is a standard metric used in grammar correction since the CoNLL shared tasks. Because precision is more important than recall in grammar correction, it is weighted twice as high, and is denoted as F0.5. Other metrics have been proposed recently (Felice and Briscoe, 2015; Napoles et al., 2015; Choshen and Abend, 2018a).
3https://github.com/arozovskaya/RULEC-GEC.
that are specific to these languages. (4) We demonstrate that the classification framework with minimal supervision is particularly useful for morphologically rich languages; they can benefit from large amounts of native data, due to a large variability of word forms, and small amounts of annotation provide good estimates of typical learner errors. (5) We present an error analysis that provides further insight into the behavior of the models on a morphologically rich language.
Section 2 presents related work. Section 3 describes the corpus. Experiments are described in Section 4, and the results are presented in Section 5. We present an error analysis in Section 6 and conclude in Section 7.
2 Background and Related Work
We first discuss related work in text correction
on languages other than English. We then intro-
duce the two frameworks for grammar correction
(evaluated primarily on English learner datasets)
and discuss the ‘‘minimal supervision’’ approach.
2.1 Grammar Correction in Other
Languages
The two most prominent attempts at grammar error correction in other languages are shared tasks on Arabic and Chinese text correction. For Arabic, a large-scale corpus (2M words) was collected and annotated as part of the QALB project (Zaghouani et al., 2014). The corpus is fairly diverse: it contains machine translation outputs, news commentaries, and essays authored by native speakers and learners of Arabic. The learner portion of the corpus contains 90K words (Rozovskaya et al., 2015), including 43K words for training. This corpus was used in two editions of the QALB shared task (Mohit et al., 2014; Rozovskaya et al., 2015). There have also been three shared tasks on Chinese grammatical error diagnosis (Lee et al., 2016; Rao et al., 2017, 2018). A corpus of learner Chinese used in the competition includes 4K units for training (each unit consists of one to five sentences).
Mizumoto et al. (2011) present an attempt to extract a Japanese learners' corpus from the revision log of a language learning Web site (Lang-8). They collected 900K sentences produced by learners of Japanese and implemented a character-based MT approach to correct the
errors. The English learner data from the Lang-8 Web site is commonly used as parallel data in English grammar correction. One problem with the Lang-8 data is the large number of remaining unannotated errors.
In other languages, attempts at automatic grammar detection and correction have been limited to identifying specific types of misuse (grammar or spelling). Imamura et al. (2012) address the problem of particle error correction for Japanese, and Israel et al. (2013) develop a small corpus of Korean particle errors and build a classifier to perform error detection. De Ilarraza et al. (2008) address errors in postpositions in Basque, and Vincze et al. (2014) study definite and indefinite conjugation usage in Hungarian. Several studies focus on developing spell checkers (Ramasamy et al., 2015; Sorokin et al., 2016; Sorokin, 2017).
There has also been work that focuses on annotating learner corpora and creating error taxonomies without building a grammar correction system. Dickinson and Ledbetter (2012) present an annotated learner corpus of Hungarian; Hana et al. (2010) and Rosen et al. (2014) build a learner corpus of Czech; and Abel et al. (2014) present KoKo, a corpus of essays authored by German secondary school students, some of whom are non-native writers. For an overview of learner corpora in other languages, we refer the reader to Rosen et al. (2014).
2.2 Approaches to Text Correction
There are currently two well-studied paradigms that achieve competitive results on the task in English: MT and machine-learning classification.
In the classification approach, error-specific classifiers are built. Given a confusion set, for example {a, the, zero article} for articles, each occurrence of a confusable word is represented as a vector of features derived from a context window around it. Classifiers can be trained either on learner or on native data, where each target word occurrence (e.g., the) is treated as a positive training example for the corresponding word. Given a text to correct, for each confusable word, the task is to select the most likely candidate from the relevant confusion set. Error-specific classifiers are typically trained for common learner errors—for example, article, preposition, or noun number in English (Izumi et al., 2003; Han et al., 2006; Gamon et al., 2008; De Felice and Pulman, 2008; Tetreault et al., 2010; Gamon, 2010; Rozovskaya and Roth, 2011; Dahlmeier and Ng, 2012).
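As an illustration of this setup, the sketch below (a toy example over English articles, not the paper's feature set) turns each occurrence of a confusable word into a training example labeled with the word itself, with features drawn from a small context window:

```python
# Toy illustration of confusion-set example extraction: every occurrence
# of a confusable word becomes one training example whose label is the
# word the writer used and whose features describe the surrounding window.
CONFUSION_SET = {"a", "the"}  # e.g., a simplified article confusion set

def extract_examples(tokens, window=2):
    examples = []
    for i, tok in enumerate(tokens):
        if tok.lower() not in CONFUSION_SET:
            continue
        feats = set()
        for off in range(-window, window + 1):
            if off == 0 or not (0 <= i + off < len(tokens)):
                continue
            feats.add(f"w[{off}]={tokens[i + off].lower()}")
        # right-hand bigram, often a strong cue for article choice
        if i + 2 < len(tokens):
            feats.add(f"w[1,2]={tokens[i + 1].lower()}_{tokens[i + 2].lower()}")
        examples.append((feats, tok.lower()))
    return examples
```

At correction time, the same features are extracted around each confusable word, a trained classifier picks the most likely member of the confusion set, and a correction is proposed whenever the prediction differs from what the writer used.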
In the MT approach, the error correction problem is cast as a translation task: namely, translating ungrammatical learner text into well-formed grammatical text, with original learner texts and the corresponding corrected texts acting as parallel data. MT systems for grammar correction are trained using 20M–50M words of learner texts to achieve competitive performance.
The MT approach has shown state-of-the-art results on the benchmark CoNLL-14 test set in English (Susanto et al., 2014; Junczys-Dowmunt and Grundkiewicz, 2016; Chollampatt and Ng, 2017); it is particularly good at correcting complex error patterns, which is a challenge for the classification methods (Rozovskaya and Roth, 2016). However, phrase-based MT systems do not generalize well beyond the error patterns observed in the training data. Several neural encoder–decoder approaches relying on recurrent neural networks were proposed (Chollampatt et al., 2016; Yuan and Briscoe, 2016; Ji et al., 2017). These initial attempts were not able to reach the performance of the state-of-the-art phrase-based MT systems (Junczys-Dowmunt and Grundkiewicz, 2016), but more recently neural MT approaches have shown competitive results on English grammar correction (Chollampatt and Ng, 2018; Junczys-Dowmunt et al., 2018; Junczys-Dowmunt and Grundkiewicz, 2018).4 However, neural MT systems tend to require even more supervision. For example, Junczys-Dowmunt et al. (2018) adopt the methods developed for low-resource machine translation tasks, but they still require parallel corpora in the tens of millions of tokens.
Minimal Supervision Framework As we have noted, classifiers can be trained on either native or learner data. Native data are cheap and available in large quantities. But, when training on learner data, the potentially erroneous word can also be used by the model. Because mistakes
4Single neural MT systems are still not as good as a phrase-based system (Junczys-Dowmunt and Grundkiewicz, 2018), and the top results are achieved using an ensemble of neural models (Chollampatt and Ng, 2018) or a pipeline of a phrase-based and a neural model enhanced with a spell checker (Junczys-Dowmunt and Grundkiewicz, 2018).
made by non-native speakers are not random (Montrul and Slabakova, 2002; Ionin et al., 2008), using the potentially erroneous word and the correction provides the models with knowledge about learner error patterns. For this reason, models trained on error-annotated data often outperform models trained on larger amounts of native data (Gamon, 2010; Dahlmeier and Ng, 2011). But this approach requires large amounts of annotated learner data (Gamon, 2010). The minimal supervision approach (Rozovskaya and Roth, 2014; Rozovskaya et al., 2017) incorporates the best of both modes: training on native texts to facilitate the possibility of training from large amounts of data without the need for annotation, but using a modest amount of expensive learner data that contains learner error patterns. Importantly, error patterns can be estimated robustly with a small amount of annotation (Rozovskaya et al., 2017). The error patterns can be provided to the model in the form of artificial errors or by changing the model priors. In this work, we use the artificial errors approach; it has been studied extensively for English grammar correction. Several other studies consider the effect of using artificial errors (e.g., Cahill et al., 2013; Felice and Yuan, 2014).
3 Corpus and Annotation
We annotated data from the Russian Learner Corpus of Academic Writing (RULEC, 560K words) (Alsufieva et al., 2012), which consists of essays and papers written in a university setting in the United States by students learning Russian as a foreign language and heritage speakers (those who grew up in the United States but had exposure to Russian at home). This closely mirrors the datasets used for English grammar correction. The corpus contains data from 15 foreign language learners and 13 heritage speakers. RULEC is freely available for research use.5
3.1 Russian Grammatical Categories
Russian is a fusional language with free word order, characterized by rich morphology and a high number of inflections. Nouns, adjectives, and certain pronouns are specified for gender, number, and case. Modifiers agree with the head nouns; thus, words in these grammatical categories can have up to 24 different word forms. Verbs are marked for number, gender, and person and agree with the grammatical subject. Other categories for verbs are aspect, tense, and voice. These are typically expressed through morphemes corresponding to functional words in English (shall, will, was, have, has, been, etc.).
5https://github.com/arozovskaya/RULEC-GEC.
3.2 Annotation
Two annotators, native speakers of Russian with a background in linguistics, corrected a subset of RULEC (12,480 sentences, comprising 206K words). One of the annotators is an English as a Second Language instructor and English–Russian translator. The annotation was performed using a tool built for a similar annotation project for English (Rozovskaya and Roth, 2010a). We refer to the resulting corpus as RULEC-GEC.
When selecting sentences to be annotated, we attempted to include a variety of writers from each group (foreign and heritage speakers). The annotated data include 12 foreign and 5 heritage writers. The essays of each writer were sorted alphabetically by the essay file name; the essays for annotation were selected in that order, and the sentences were selected in the order they appear in each essay. We intentionally selected more essays from non-native authors, as we conjectured that these authors would display a greater variety of grammatical errors and higher error rates. Ultimately, for each author, a subset of that writer's essays was included, but with a different number of annotated essays per author, namely, between 13 and 159 essays per author.
The data were corrected, and each mistake was assigned a type. We developed an error classification schema that addresses errors in morphology, syntax, and word usage, and takes into account linguistic properties of the Russian language, by emphasizing those that are most commonly misused. The common phenomena were identified through a pilot annotation, and with the help of sample errors that had been collected with the Russian National Corpus in the process of developing a similar annotation of Russian learner texts. The sample errors were made available to us by the authors (Klyachko et al., 2013). This study resulted in an annotated corpus, available for online search at http://web-corpora.net/ (Rakhilina et al., 2016).
Noun:case: Это зависит от *показания/показаний очевидцев
  (this depends from testimony-gen,*sg/gen,pl eyewitness-gen,pl)
  'This depends on the testimony of eyewitnesses'

Preposition: Слова *от/из прошлых уроков
  (word-nom,pl *from/out-of previous-gen,pl lesson-gen,pl)
  'Words from previous lessons'

Verb number agreement: Все новые здания *разваливается/разваливаются
  (all new-nom,pl building-nom,pl *fall-pres,imperfect,sg/fall-pres,imperfect,pl apart)
  'All new buildings are falling apart'

Verb gender agreement: Лера *пробовал/пробовала флиртовать с ним
  (Valerie *try-past,imperfect,masc/try-past,imperfect,fem to-flirt with him)
  'Valerie tried flirting with him'

Lexical choice: Тогда люди стали *спрашивать/задавать вопросы
  (then people-nom,pl started *to-inquire/to-ask questions-acc,pl)
  'Then people started to ask questions'

Table 1: Examples of common errors in the Russian learner corpus. Incorrect words are marked with an asterisk.
Annotator      Total words   Corrected words   Error rate (%)
Annotator A         77,494             5,315              6.9
Annotator B        128,764             7,732              6.0
Total              206,258            13,047              6.3

Table 2: Statistics for the annotated data in RULEC-GEC.
Our error tagset was developed independently and is smaller than the one in Rakhilina et al. (2016), in order to minimize the annotation burden, while still being able to distinguish among the most typical linguistic problems for Russian language learners. We include 23 tags that cover syntactic and morphosyntactic errors, orthography, and lexical errors. Table 1 illustrates some of the common errors, and Table 2 presents annotation statistics. Frequencies for the top 13 errors are shown in Table 3. Note that the top 10 error types account for over 80% of all errors. Not shown are the phenomena that occur less than once per 1,000 words: adj:gender, verb:voice, verb:tense, adj:other, pronoun, adj:number, conjunction, verb:other, noun:gender, noun:other.
3.3 Inter-Annotator Agreement
Because annotation for grammatical errors is
extremely variable, as there are often multiple
Error type            Total      %    Errors per 1,000 words
Spelling               2575   21.7    12.5
Noun:case              1560   13.2     7.6
Lexical choice         1451   12.3     7.0
Punctuation            1139    9.6     5.5
Missing word            989    8.4     4.8
Replace                 687    5.8     3.3
Extra word              618    5.2     3.0
Adj:case                428    3.6     2.1
Preposition             364    3.1     1.8
Word form               354    3.0     1.7
Noun:number             286    2.4     1.4
Verb:number/gender      285    2.4     1.4
Verb:aspect             208    1.8     1.0

Table 3: Distribution by error type. Total number of categories is 23. The top 13 are shown. Replace includes phenomena not covered by other categories, e.g., additional morphological phenomena, replacing multi-word expressions, and word order.
ways of correcting the same mistake (Bryant and Ng, 2015), we compute inter-rater agreement following Rozovskaya and Roth (2010a), where the texts corrected by one annotator were given to the second annotator. Agreement is computed as the percentage of sentences that did not have additional corrections on the second pass. After
Second pass    Error rate (%)   Judged correct (%)
Annotator A              2.40                 68.5
Annotator B              0.67                 91.5

Table 4: Inter-annotator agreement. Error rates based on the corrections on the second pass. Judged correct denotes the percentage of sentences that the second rater did not change.
all, our goal is to make the sentence well formed, without enforcing that errors are corrected in the same way. A total of 200 sentences from each annotator were selected and given to the other annotator. Table 4 shows that the error rate of the sentences corrected by annotator A on the second pass was 2.4%, with 68.5% of the sentences remaining unchanged. The sentences corrected by annotator B on the second pass had an error rate of less than 1%, and over 91% of the sentences did not have additional corrections. These agreement numbers are higher than those reported for English, where the percentage of unchanged sentences varied between 37% and 83% (Rozovskaya and Roth, 2010a).
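Under this protocol, agreement reduces to the fraction of already-corrected sentences that the second rater leaves untouched. A sketch of the computation (illustrative only, not the authors' tooling):

```python
def second_pass_agreement(first_pass, second_pass):
    """Share of sentences judged correct on the second pass, i.e.,
    sentences the second rater did not change at all. Both lists hold
    the sentence strings in the same order."""
    assert len(first_pass) == len(second_pass)
    unchanged = sum(1 for a, b in zip(first_pass, second_pass) if a == b)
    return unchanged / len(first_pass)
```

Note that this is a sentence-level judgment: the two raters need not make identical edits, only agree that the sentence is already well formed.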
3.4 Comparison to Other Learner Corpora
Error Rates In Table 5, we compare the error rates in RULEC-GEC to those in a learner corpus of Arabic (Zaghouani et al., 2014) and three corpora of learner English: JFLEG (Napoles et al., 2017), FCE (Yannakoudakis et al., 2011), and CoNLL (Ng et al., 2014). The error rates in RULEC are generally lower than in the other learner corpora. The Arabic data have the highest error rate of 28.7%. In the English learner corpora, the error rates range between 6.5% and 25.5%. The error rates are 17.7% (FCE); 18.5–25.5% for JFLEG, annotated independently by four raters; and 10.8–13.6% for CoNLL-test, annotated by two raters. The lowest error rate that is comparable to ours is in CoNLL-train (6.6%). We attribute the differences to the proficiency level of the RULEC writers, which is fairly advanced. Indeed, error rates vary widely by learner group (foreign vs. heritage), as discussed in Section 3.5.
Most Common Errors Table 6 lists the top five most common errors for the three corpora (the Arabic corpus and JFLEG are not annotated for types of errors). In English, lexical choice errors, articles, prepositions, punctuation, and spelling are
Corpus                    Error rate (%)
Russian (RULEC-GEC)       6.3
English (FCE)             17.7
English (CoNLL-test)      10.8–13.6
English (CoNLL-train)     6.6
English (JFLEG)           18.5–25.5
Arabic                    28.7

Table 5: Error rates in various learner corpora. The CoNLL and JFLEG corpora have two and four reference annotations, respectively. Numbers shown for each.
the most common mistake types (note that ‘‘mechanical errors’’ in CoNLL group together spelling and punctuation errors). Noun number errors are also common in CoNLL, a corpus produced by learners whose first language is Chinese, whereas these are less common in FCE, produced by learners of diverse linguistic backgrounds. In Russian, spelling, punctuation, and lexical choice are also in the top five.
In RULEC-GEC, the top five error categories are spelling, lexical choice, noun:case, punctuation, and missing word. Overall, spelling, punctuation, and lexical errors are in the top five categories for all three corpora. As for grammar-related errors, although article and preposition errors also made it to the top of the list in the English corpora, noun case usage is definitely the most challenging and common phenomenon for Russian learners.
3.5 Foreign vs. Heritage Speakers
We also compare foreign and heritage speakers. The heritage speaker subcorpus includes 42,187 words, and the foreign speaker partition comprises 164,071 words. The error rates are 4.0% and 6.9% for the two groups; foreign learners make almost twice as many mistakes as heritage speakers. In the foreign group, there is a lot of variation, with five writers exhibiting error rates of 10–13%, two writers whose error rates are below 3%, and five authors having error rates between 5% and 7%. There is not much variation in the heritage group.
The two groups also reveal differences in the error distributions (Table 7): more than 65% of the errors in the heritage group are in spelling and punctuation. Indeed, 42.4% of errors in the heritage corpus are spelling mistakes vs. 18.6% for foreign speakers. If we consider the number of errors per 1,000 words, we observe that, interestingly,
Top errors (%)
Russian              English (FCE)       English (CoNLL-test)
Spell. 21.7          Art. 11.0           Lex. choice 14.2/14.4
Noun:case 13.2       Lex. choice 9.5     Art. 13.9/13.3
Lex. choice 12.3     Prep. 9.0           Mechan. 9.6/14.9
Punc. 9.6            Spell. 8.1          Noun:number 9.0/6.8
Miss. word 8.4       Punc. 8.0           Prep. 8.8/11.7

Table 6: Comparison statistics for Russian and English learner corpora. The CoNLL-test was annotated by two annotators; numbers shown for each.
Foreign                                  Heritage
Error          (%)   Errors per 1,000    Error          (%)   Errors per 1,000
Spell.        18.6   11.7                Spell.        42.4   15.7
Noun:case     14.0    8.8                Punc.         22.9    8.5
Lex. choice   13.3    8.3                Noun:case      7.8    2.9
Miss. word     8.9    5.6                Lex. choice    5.5    2.0
Punc.          7.6    4.8                Miss. word     4.7    1.7
Replace        6.3    3.9                Replace        2.8    1.0
Extra word     5.7    3.5                Extra word     2.4    0.9
Adj:case       3.9    2.4                Adj:case       2.1    0.8
Prep.          3.3    2.1                Word form      2.1    0.8
Word form      3.1    2.0                Noun:number    1.8    0.7
Noun:number    2.6    1.6                Verb agr.      1.6    0.6
Verb agr.      2.5    1.6                Prep.          1.5    0.6

Table 7: Most common errors for foreign and heritage Russian speakers.
heritage speakers make spelling and punctuation errors more frequently (15.7 spelling and 8.5 punctuation errors per 1,000 words in the heritage group vs. 11.7 spelling and 4.8 punctuation errors in the foreign group). As for the other grammatical phenomena, although these are all more challenging for the foreign speaker group, the distributions of these phenomena are quite similar. For example, heritage speakers make 2.9 noun case errors per 1,000 words, whereas foreign speakers make 8.8 noun case errors per 1,000 words; for both types of writers, noun case errors are at the top of the list (second most common for the foreign group and third most common for the heritage group).
4 Experiments
The experiments investigate the following:
1. How do the two state-of-the-art methods compare under the conditions that we have for Russian (rich morphology and limited annotation)?
2. What is the performance on individual errors and the overall performance compared with results obtained for English grammar correction?
3. How well do the classifiers within the minimal supervision framework perform in morphologically rich languages, on grammatical phenomena that are common in highly inflectional languages such as Russian, as well as on phenomena that also occur in English?
To answer these questions, the following three approaches are implemented:
• Learner-trained classifiers: Error-specific classifiers trained on learner data
• Minimal-supervision classifiers: Error-specific classifiers trained on learner and native data with minimal supervision (see Section 2.2)
• Phrase-based machine translation system
Data We split the annotated data into training (4,980 sentences, 83,410 words), development (2,500 sentences, 41,163 words), and test (5,000
sentences, 81,693 words). For the native data, we use the Yandex corpus (Borisov and Galinskaya, 2014), a diverse corpus of newswire, fiction, and other genres (18M words). All the data were preprocessed with the Mystem morphological analyzer (Segalovich, 2003) and a part-of-speech tagger (Schmid, 1995).
4.1 Classifiers
In the classification framework, we develop classifiers for several common grammar errors: preposition, noun case, verb aspect, and verb agreement (split into number and gender). The rationale for selecting these errors is to evaluate the behavior of the classifiers on phenomena that have been well studied in English (e.g., preposition and verb number agreement), as well as those that have not received much attention (verb aspect), or those that are specific to Russian (noun case and gender agreement). For each error type, a special classifier is developed. The features include word n-grams, POS n-grams, lemma n-grams, and morphological properties of the target word and neighboring words. In addition, in line with Rozovskaya and Roth (2016), we include a punctuation module that inserts missing commas, using patterns mined from the Yandex corpus and the RULEC-GEC training data. We now provide more detail on the grammar phenomena considered.
Noun Case Errors Noun case usage is the most common error type after spelling and accounts for 14% of all errors. The Russian case system consists of six cases: Nominative, Genitive, Accusative, Dative, Instrumental, and Locative. The case classifier is thus a six-way classifier, with each class corresponding to one of the cases. The labels are obtained by extracting the case information predicted by the morphological analyzer on the original and corrected noun forms. It should be noted that the surface form of the noun may be ambiguous with respect to case. For example, the word яблоко (‘‘apple’’) in different contexts can be interpreted as nominative or accusative. In that case, the morphological analyzer will list both analyses, and both of these will be included as gold labels for the word. This is because our task is not to predict the case but the surface form of the noun. About 58% of nouns are unambiguous (have one case-related morphological analysis), 34% have two possible case analyses, and 8% of nouns have three or more analyses.
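To make the labeling concrete, the toy lookup below mimics this behavior with a hand-built mini-lexicon (the entries and the `gold_case_labels` helper are our illustration; the paper derives the analyses from Mystem):

```python
# Toy stand-in for a morphological analyzer: a surface noun form may be
# compatible with several cases, and every compatible case is kept as a
# gold label, since the task is to predict the surface form, not the case.
CASE_ANALYSES = {
    "яблоко": {"nom", "acc"},  # nominative/accusative syncretism ("apple")
    "яблока": {"gen"},         # genitive singular
    "яблоку": {"dat"},         # dative singular
}

def gold_case_labels(surface_form):
    """Return all case analyses of a noun form (empty set if unknown)."""
    return CASE_ANALYSES.get(surface_form, set())
```

A form like яблоко thus contributes two gold labels, so a classifier prediction of either nominative or accusative counts as correct for that occurrence.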
Error                 Confusion set
Noun:case             {Nom., Gen., Acc., Dat., Inst., Loc.}
Verb agr. (num.)      {Singular, Plural}
Verb agr. (gender)    {Fem., Masc., Neutral}
Aspect                {Perfect, Imperfect}
Prep.                 {15 prepositions}

Table 8: Confusion sets for the five types of errors.
Number and Gender Verb Agreement Errors Verb agreement functions in a way that is similar to English. In Russian, verbs are specified for number (singular, plural), gender (feminine, masculine, and neutral), and person. Errors in person agreement are rare, and we ignore these.
Preposition Errors Preposition errors are some of the most common errors for learners of English (Leacock et al., 2010), and are also quite common among the Russian learners, accounting for over 3% of all errors (Table 3). In the classification framework, it is common to consider the top n most frequent prepositions (Dahlmeier and Ng, 2012; Tetreault et al., 2010). In line with work in English, we consider mistakes that involve the top 15 Russian prepositions.6
Verb Aspect  The Russian verb system is dif-
ferent from English, and verb aspect errors among
Russian learners are quite common. Russian
has three tenses—present, past, and future—and
each tense can be expressed in imperfective or
perfective aspect. Although there is no direct
correspondence between Russian aspect usage
and the English tenses, aspect can be weakly
aligned with the English tense system. Prior
research in English showed that these are some of
the most difficult mistakes, as verb tense usage is
highly semantic rather than grammatical (Lee and
Seneff, 2008; Tajiri et al., 2012).
Table 8 lists the confusion sets for each error
classifier. In all cases, a discriminative learning
framework is used with the Averaged Perceptron
algorithm (Rizzolo, 2011).
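The shared learning setup is a multiclass discriminative model trained with the Averaged Perceptron. A minimal sketch of the algorithm follows; the labels and the single subject feature are illustrative toys, not the paper's actual feature templates.

```python
from collections import defaultdict

class AveragedPerceptron:
    """Minimal multiclass averaged perceptron (illustrative sketch)."""

    def __init__(self, labels):
        self.labels = list(labels)
        self.w = defaultdict(float)    # current weights: (feature, label) -> value
        self.acc = defaultdict(float)  # accumulated weights, for averaging
        self.steps = 0

    def _score(self, weights, feats, label):
        return sum(weights[(f, label)] for f in feats)

    def predict(self, feats, averaged=True):
        weights = self.acc if averaged else self.w
        return max(self.labels, key=lambda y: self._score(weights, feats, y))

    def train(self, data, epochs=3):
        for _ in range(epochs):
            for feats, gold in data:
                self.steps += 1
                guess = self.predict(feats, averaged=False)
                if guess != gold:
                    for f in feats:           # standard perceptron update
                        self.w[(f, gold)] += 1.0
                        self.w[(f, guess)] -= 1.0
                for key, value in self.w.items():
                    self.acc[key] += value    # averaging smooths the model

# Toy agreement example: pick the verb number from a subject feature.
clf = AveragedPerceptron(["Singular", "Plural"])
clf.train([({"subj=он"}, "Singular"), ({"subj=они"}, "Plural")])
print(clf.predict({"subj=они"}))  # Plural
```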
Adding Artificial Errors in the Classifiers  Within
both the learner-trained and minimally supervised
classifiers, we make use of the artificial errors
6{в (in, to), на (on, to), с (from), о (about), для (for),
к (to), из (from), по (along, on), от (from), у (at), за
(for, behind), во (in, to), между (between), до (before), об
(about)}.
Downloaded from http://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl_a_00251/1923118/tacl_a_00251.pdf by guest on 08 September 2023
                            Source
Label    Nom.     Gen.     Dat.    Acc.    Inst.   Loc.
Nom.     .9961    .00214   .0002   .0006   .0002   .0007
Gen.     .01147   .9775    .0009   .0043   .0031   .0027
Dat.     .0097    .0105    .9589   .0105   .0045   .0060
Acc.     .0064    .0056    .0004   .9837   .0008   .0031
Inst.    .0163    .0181    .0018   .0113   .9511   .0014
Loc.     .0029    .0068    –       .0087   .0017   .9800

Table 9: Confusion matrix for noun case errors based on the training and development data from the RULEC-GEC
corpus. The left column shows the correct case. Each row shows the author's case choices for that label and
Prob(source|label).
method (Rozovskaya et al., 2017) to simulate
learner errors in training. Learner error patterns (or
error statistics) are extracted from the annotated
learner data. Specifically, given an error type, we
collect all source/label pairs from the annotated
sample, where both the source and the label
belong to the confusion set, and generate a
confusion matrix, where each cell represents
Prob(source=s|label=l).
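The estimation step can be sketched as follows; the toy (source, label) pairs below are invented for illustration and are not RULEC-GEC statistics.

```python
from collections import Counter, defaultdict

def confusion_matrix(pairs, confusion_set):
    """Estimate Prob(source|label) from annotated (source, label) pairs,
    keeping only pairs whose source and label are both in the confusion set."""
    counts = defaultdict(Counter)
    for source, label in pairs:
        if source in confusion_set and label in confusion_set:
            counts[label][source] += 1
    return {label: {s: n / sum(srcs.values()) for s, n in srcs.items()}
            for label, srcs in counts.items()}

# Toy sample: for label "gen", the writer produced "gen" twice,
# "nom" once, and "loc" once.
pairs = [("gen", "gen"), ("gen", "gen"), ("nom", "gen"), ("loc", "gen")]
matrix = confusion_matrix(pairs, {"nom", "gen", "loc"})
print(matrix["gen"])  # {'gen': 0.5, 'nom': 0.25, 'loc': 0.25}
```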
Table 9 shows a confusion matrix for noun case
errors based on error statistics collected from the
training and development data in RULEC-GEC.
The values in the confusion matrix are used to
generate noun form errors in the training data.
For example, according to the table, given a noun
that needs to be in the genitive case, a learner
is four times more likely to use the nominative
case than the locative case. We use this
table both to introduce artificial errors in native
training data and to increase the error rates in
the learner data by adding artificial mistakes to
naturally occurring errors. Adding artificial errors
when training on learner data is also useful, as
increasing the error rates improves the recall of
the system. In both cases, the generated errors are
added so that the relative frequencies of different
confusions are preserved (e.g., nominative is four
times more likely than locative to be used in place
of genitive), while the error rates can be varied
(higher error rates will improve the recall of the
system at the expense of precision).
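One way to implement this injection, keeping the relative confusion frequencies fixed while scaling the overall error rate, is sketched below. This is a simplified reading of the procedure, not the authors' implementation; the confusion row is a small subset of Table 9's Gen. row.

```python
import random

def corrupt(label, conf_row, rate=1.0, rng=random):
    """Replace a correct case `label` by an erroneous source case drawn
    from Prob(source|label); `rate` scales the total error probability
    while preserving the relative frequencies of the confusions."""
    errors = {s: p for s, p in conf_row.items() if s != label}
    if rng.random() >= min(1.0, rate * sum(errors.values())):
        return label                      # no error injected
    r = rng.random() * sum(errors.values())
    cumulative = 0.0
    for source, p in errors.items():      # sample proportionally to p
        cumulative += p
        if r < cumulative:
            return source
    return label

rng = random.Random(0)
row = {"gen": .9775, "nom": .01147, "loc": .0027}  # subset of the Gen. row
sample = [corrupt("gen", row, rate=2.0, rng=rng) for _ in range(10000)]
# Doubling the rate doubles the error frequency, yet nominative remains
# roughly four times as frequent as locative among the injected errors.
```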
4.2 The MT System
One advantage of the MT approach is that error
types need not be formulated explicitly. 我们
build a phrase-based MT system that follows the
implementation in Susanto et al. (2014). Our MT
system is trained using Moses (Koehn et al., 2007).
The phrase table is trained on the training partition
of RULEC-GEC. We use two 4-gram language
models—one is trained on the Yandex corpus, 和
the other one is trained on the corrected side of
the RULEC-GEC training data. Both are trained
with KenLM (Heafield et al., 2013). Tuning is
done on the development dataset with MERT
(和, 2003). We use BLEU (Papineni et al., 2002)
as the tuning metric.
We note that several neural MT systems have
been proposed recently (see Section 2). Because
we only have a small amount of parallel data, we
adopt the phrase-based MT, as it is known that
neural MT systems have a steeper learning curve
with respect to the amount of training data,
resulting in worse quality in low-resource settings
(Koehn and Knowles, 2017). We also note that
Junczys-Dowmunt and Grundkiewicz (2016) pre-
sent a stronger SMT system for English gram-
mar correction. Their best result, due to
adding dense and sparse features, is an improve-
ment of 3 to 4 points over the baseline system
(they also rely on much larger tuning sets, as re-
quired for sparse features). The baseline system is
essentially the same as that of Susanto et al.
(2014). Because our MT result is so much lower
than that of the classification system, we do not expect
that adding sparse and dense features will close
that gap.
5 Results
We start by comparing performance on individual
errors; then the overall performance of the best
classification systems and the MT system is
compared.
Classifier Performance on Individual Errors
First, we wish to assess the contribution of
the minimal-supervision approach compared with
training on the learner data for a language with rich
Error         Training data         P     R     F0.5
Case          (1) Learner          17.9  32.8  19.7
              (2) Learner+native   34.5  36.1  34.8
Number agr.   (1) Learner          56.7   7.7  24.9
              (2) Learner+native   49.1  16.6  35.3
Gender agr.   (1) Learner          48.5   7.1  22.4
              (2) Learner+native   67.9  16.0  41.2
Aspect        (1) Learner          21.6   2.5   8.6
              (2) Learner+native   21.5   9.1  16.9
Prep.         (1) Learner          31.9   3.8  12.9
              (2) Learner+native   56.1  24.9  44.8

Table 10: Comparison of classifiers trained on (1) learner data and (2) learner + native data, using the minimal
supervision framework.
morphology. To that end, two types of classifiers
are compared: learner-trained (trained on learner
data) and minimal-supervision (trained on native
data with artificial errors based on error statistics
extracted from the learner data; Section 2).
The classifiers are tuned on the development
partition—that is, the rates at which artificial
errors are injected into the training data are
optimized on the development data. Performance
results on the test data are for models trained on
the training+development data (learner-trained
models). Similarly, the minimal-supervision
classifiers use error statistics extracted from
training+development.
Table 10 shows performance for the five types
of errors. For all errors, minimal-supervision
models outperform the learner-trained models
substantially, by 8 to 32 F0.5 points. This is
because the amount of annotation that we have
is really too small to estimate all parameters, but
it is sufficient to provide error estimates in the
minimal supervision framework. In addition, the
punctuation module achieves an F0.5 score of 30.5
(precision of 47.4 and recall of 12.6).
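Since all scores reported here are F0.5, which weights precision twice as heavily as recall, the metric can be made explicit and the punctuation module's number checked directly:

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta=0.5 weights precision twice as much as recall."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(round(f_beta(47.4, 12.6), 1))  # 30.5, the punctuation module's score
```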
Classifiers vs. MT  So far, we have evaluated the
performance of the classifiers with respect to
individual errors. Table 11 shows the performance
of the three systems on the entire dataset and
evaluates with respect to all errors in the data.
The results show that when annotation is scarce,
MT performs poorly. This result is consistent with
findings for English, showing that MT systems
outperform classifiers only when the parallel
corpus is large (30–40M words) (Rozovskaya
and Roth, 2016) but lag behind even when over
1M tokens are available.
System                      Training data      P     R    F0.5
Classifiers (learner)       Learner           22.6  4.8  12.9
Classifiers (minimal sup.)  Learner+native    38.0  7.5  21.0
MT                          Learner+native    30.6  2.9  10.6

Table 11: Performance of the three systems.
We combine the MT system and the minimally
supervised classifiers following Rozovskaya and
Roth (2016). Because MT systems are not re-
stricted to particular error types, the mistakes they
correct are typically more diverse (see also Section 6). The
F0.5 score thus improves to 23.8 for
the combined system, due to a slightly better recall
(10.2). However, the precision drops from 38.0 to
35.8, since the MT system has a lower precision
than the classifiers.
6 Discussion and Error Analysis
The current state of the art in English gram-
mar correction on the widely used benchmark
CoNLL test is 50.27 for a single system (Junczys-
Dowmunt and Grundkiewicz, 2018). System com-
bination, model ensembles, and adding a spell
checker boost these numbers by 4 to 6 points
(Chollampatt and Ng, 2018; Junczys-Dowmunt
and Grundkiewicz, 2018). These models are
trained on the CoNLL training data and additional
learner data (about 30M words). An MT system
trained on CoNLL data (1.2M words) obtains an
F0.5 score of 28.25 (Rozovskaya and Roth, 2016).
Although these MT systems differ in how they
are trained, these numbers should give an idea
of the effect the amount of parallel data has on
the performance.
                      Before               After
Error type         P     R     F0.5     P     R     F0.5
Case              34.5  36.1  34.8     38.2  36.1  37.8
Number agr.       49.1  16.6  35.3     56.7  16.6  38.2
Gender agr.       67.9  16.0  41.2     67.9  16.0  41.2
Prep.             56.1  24.9  44.8     72.2  24.9  52.3
Aspect            21.5   9.1  16.9     30.0   9.1  20.6

Table 12: Performance of minimal-supervision classifiers before and after false positive analysis.
A minimal-supervision classification system
that uses CoNLL data obtains an F0.5 score of
36.26 (Rozovskaya and Roth, 2016). In contrast,
the classification system for Russian obtains a
much lower score of 21.0. This may be due to
a larger variety of grammatical phenomena in
Russian, lower error rates, and a high proportion
of spelling errors (especially among heritage
speakers), which we currently do not specifically
target. Note also that the CoNLL-2014 results
are based on two gold references for each sen-
tence, while we evaluate with respect to one,
and having more reference annotations improves
performance (Bryant and Ng, 2015; Sakaguchi
et al., 2016; Choshen and Abend, 2018b).7 It
should also be noted that the gap between the MT
system and the classification system when both
are trained with limited supervision is larger for
Russian (10.6 vs. 20.5) than for English (28.25
vs. 36.26). This indicates that the MT system suf-
fers more than the classifiers when the amount of
supervision is particularly small and the mor-
phological complexity of the language is higher.
Considering Arabic and Chinese, where the
training data is also limited, the results are also
much lower than in English. In Arabic, where
the supervised learner data includes 43K words,
the best reported F-score is 27.32 (Rozovskaya
et al., 2015).8 In Chinese, the supervised dataset
size is about 50K sentences, and the highest
reported scores are 26.93 for error detection (Rao et al.,
2017) and 17.23 for error correction (Rao et al.,
2018). These results confirm that
the approaches that rely on large amounts of
supervision do not carry over to low-resource
7There is ongoing research on the question of the
most appropriate evaluation metric and gold references
for grammatical error correction. See Sakaguchi et al.
(2016), Choshen and Abend (2018b), and Choshen and Abend
(2018c).
8This result is based on performance that does not
take into account some trivial Arabic-specific normalization
corrections.
settings. It is thus desirable to develop approaches
that can be robust with a small amount of
supervision, especially when applied to languages
that are morphologically more complex than
English.
6.1 Error Analysis
To understand the challenges of grammar cor-
rection in a morphologically rich language such
as Russian, we perform error analysis of the MT
system and the classification system that uses
minimal supervision. The nature of grammar
correction is such that multiple different correc-
tions are often acceptable (Ng et al., 2014).
Moreover, annotators often disagree on what
constitutes a mistake, and some gold errors missed
by a system may be considered acceptable
usage by another rater. Thus, when a system is
compared against the gold truth produced by
just one annotator, performance is understated. In
fact, the F-score of a system increases with the
number of per-sentence annotations (Bryant and
Ng, 2015).
Classifiers: False Positives  We start by ana-
lyzing the cases where the system flagged an error
that was not marked in the gold annotation. False
positive cases were manually annotated by one
of the annotators, and acceptable predictions were
identified. As expected, because of the variability
in the annotators' judgments and the possibility of
multiple acceptable options, there are false pos-
itives that actually should be true positives. We
re-evaluate the performance of the classifiers
based on this error analysis in Table 12.
For all error types except gender agreement
(which has a high precision of 67.9%), precision
improvements range between 4 and 16 points.
The highest improvement is observed for pre-
position errors: about 48% of false positives are
in fact acceptable suggestions. This improvement
mirrors the results in English (precision improves
from 30% to 70% [Rozovskaya et al., 2017]) and
В этих местах мало *перспектива/перспектив
in these places few *prospects(pl,nom)/prospects(pl,gen)
'There are few prospects in these places'

Example 1: Case error on a noun following the adverbial ''few''.

Он обеспечивает *клиентов/клиентам доступ к информации
it supplies *clients(pl,gen)/clients(pl,dat) access(sg,acc) towards information(sg,gen)
'It provides clients with access to information'

Example 2: Case error on a noun governed by the verb ''provides''.

На станции *использует/используют разные приборы
at station use(3rd-person,*sg/3rd-person,pl) various tools
'At the station (they) use various tools'

Example 3: Agreement error on a verb without an explicit subject.

Она была готова *давать/дать мне все, что нужно
she was ready *give(inf,imperfect)/give(inf,perfect) to-me everything that necessary
'She was ready to give me everything that is necessary'

Example 4: Aspect error on a verb that requires wider context beyond the sentence.
can be explained by the fact that preposition usage
is highly variable (i.e., many contexts license
multiple prepositions [Tetreault and Chodorow,
2008]).
Classifiers: Errors Missed by the System
Although the precision of the classifiers is
generally quite good, the recall is much lower,
ranging from 36.1% and 24.9% for noun case
and preposition errors, respectively, down to 16%
for agreement errors and 9.1% for verb aspect errors.
Among the languages studied in grammar
correction research, noun case errors are unique
to Russian.9 But because the appropriate case
choice depends on the word governing the noun,
one can view case declension as similar to
subject-verb agreement. However, case errors are
arguably more challenging, because the target noun
may be governed by a verb, a preposition, another
noun, or even an adverbial; as a result, there is
a higher level of ambiguity both in identifying the
dependency and in determining the appropriate
case. A morphologically rich language such as
Russian uses case to express relations that are
commonly conveyed by prepositions in English;
as a result, verbs that are followed by a direct
object and a prepositional object in English appear
with two noun phrases, whose relationship to
the verb is expressed through appropriate cases.
Examples (1) and (2) illustrate two case errors,
9Case errors have certainly been considered in studies that
aim at annotating learner corpora, including Czech (Hana
等人。, 2010) and German (Abel et al., 2014).
where the first noun is governed by an adverbial,
and the other noun is governed by a verb. An
additional challenge is that prepositions and verbs
can license multiple cases. For example, the
prepositions на and в can denote location, when
followed by a noun in the locative case, and also
direction, when followed by a noun in the accusative case.
Analysis of the missed verb agreement errors
reveals several challenges, some of which are
specific to morphologically rich languages. The
main challenge here is identifying the subject of
the target verb. Thus, errors on verbs that are
located far from the subject head are typically
not handled well in both Russian and English; in
the Russian corpus, these account for 20% of all
missed errors. Because the system currently does
not use a parser, we anticipate that adding a parser
will improve performance. Moreover, because of
Russian's free word order, there are more options
for the location of the subject. It is also not
uncommon for a subject to be placed after the
verb, and 19% of currently missed errors
occur when the subject is located after the verb.
Finally, about 6% of missed errors occur on verbs
that have no explicit subject, as in Example (3). In
such cases the verb takes the form of third person
singular masculine or third person plural.
Compared with other errors, aspect errors
exhibit the lowest performance. Choosing the
appropriate aspect form may require understanding
the context around the verb, often beyond the
sentence boundary. Example (4) illustrates an error
where, without looking at the wider context, both
perfect and imperfect forms are possible. Some
verb aspect errors are similar to verb tense errors
in English. Studies in English also reported poor
performance, a precision of 20% corresponding to
a recall of about 20% on verb aspect errors (Tajiri
et al., 2012). Our expectation is that with richer
context representations, such as the identification
of temporal relations, one can do better. Some
verbs are also ambiguous with respect to aspect;
for example, проводить can be translated as
''carry out'' (imperfective) and ''accompany''
(perfective).
The MT System  Because the output of the MT
system does not specify the correction type, our
annotator manually analyzed the true positives
of the system and classified them by type. The
most common true positive corrections of the MT
system fall into the following categories: spelling
(40%), missing comma (36%), noun:case (13%),
and lexical (7%).
We also analyze the false positives. About 15%
of the false positives are in fact true positives.
As a result, the precision and the F-score of the
MT system improve from 30.6 to 41.0 and from
10.6 to 11.4, respectively. Even though the cur-
rent MT system performs poorly, the analysis
supports the findings in English that MT systems
correct a more diverse set of errors and, if trained
with sufficient supervision, should complement a
classification system well.
7 Conclusion
We address the task of correcting writing mistakes
in Russian, a morphologically rich language.
We correct and error-tag a corpus of Russian
learner data. The release of this corpus should
facilitate research efforts in grammar correction
for languages other than English that do not have
many resources available to them. Experiments
on that corpus demonstrate that the MT approach
performs poorly due to the lack of annotated data. The
MT system is outperformed substantially by a min-
imally supervised machine learning classifica-
tion approach.
Acknowledgments
The authors thank Olesya Kisselev for her help
with obtaining the RULEC corpus, and Elmira
Mustakimova for sharing the error categories
developed at the Russian National Corpus. The
authors thank Mark Sammons and the anony-
mous reviewers for their comments. This work
was partially supported by contract HR0011-15-
2-0025 with the US Defense Advanced Research
Projects Agency (DARPA). The views expressed
are those of the authors and do not reflect the
official policy or position of the Department of
Defense or the US Government.
References
Andrea Abel, Aivars Glaznieks, Lionel Nicolas,
and Egon Stemle. 2014. KoKo: An L1 learner
corpus for German. In Proceedings of LREC,
pages 2414–2421.
Anna Alsufieva, Olesya Kisselev, and Sandra
Freels. 2012. Results 2012: Using flagship data
to develop a Russian learner corpus of academic
writing. Russian Language Journal, 62:79–105.
Michele Banko and Eric Brill. 2001. Scaling
to very very large corpora for natural lan-
guage disambiguation. In Proceedings of ACL,
pages 26–33.
Alexey Borisov and Irina Galinskaya. 2014.
Yandex School of Data Analysis Russian-English
machine translation system for WMT14. In
Proceedings of the Ninth Workshop on Sta-
tistical Machine Translation, pages 66–70.
Christopher Bryant and Hwee Tou Ng. 2015. 如何
far are we from fully automatic high quality
grammatical error correction? In Proceedings
of ACL, pages 697–707.
Aoife Cahill, Nitin Madnani, Joel Tetreault, and
Diane Napolitano. 2013. Robust systems for
preposition error correction using Wikipedia
revisions. In Proceedings of NAACL-HLT,
pages 507–517.
Shamil Chollampatt and Hwee Tou Ng. 2017.
Connecting the dots: Towards human-level
grammatical error correction. In Proceedings
of the 12th Workshop on Innovative Use of
NLP for Building Educational Applications,
pages 327–333.
Shamil Chollampatt and Hwee Tou Ng. 2018.
A multilayer convolutional encoder-decoder
neural network for grammatical error correc-
tion. In Proceedings of AAAI, pages 1–8.
Shamil Chollampatt, Kaveh Taghipour, and Hwee
Tou Ng. 2016. Neural network translation
models for grammatical error correction. 在
Proceedings of IJCAI, pages 2768–2774.
Mariano Felice and Zheng Yuan. 2014. Gener-
ating artificial errors for grammatical error
correction. In Proceedings of the Student Re-
search Workshop at EACL.
Leshem Choshen and Omri Abend. 2018a.
Automatic metric validation for grammatical
error correction. In Proceedings of ACL.
Michael Gamon. 2010. Using mostly native data
to correct errors in learners’ writing. In Pro-
ceedings of NAACL-HLT, pages 163–171.
Leshem Choshen and Omri Abend. 2018b. In-
herent biases in reference-based evaluation for
grammatical error correction and text simpli-
fication. In Proceedings of ACL.
Leshem Choshen and Omri Abend. 2018c.
Reference-less measure of faithfulness for
grammatical error correction. In Proceedings
of NAACL-HLT.
Daniel Dahlmeier and Hwee Tou Ng. 2011.
Grammatical error correction with alternating
structure optimization. In Proceedings of ACL,
pages 915–923.
Daniel Dahlmeier and Hwee Tou Ng. 2012. A
beam-search decoder for grammatical error
correction. In Proceedings of EMNLP-CoNLL,
pages 568–587.
Robert Dale, Ilya Anisimoff, and George
Narroway. 2012. A report on the preposition
and determiner error correction shared task.
In Proceedings of the NAACL Workshop on
Innovative Use of NLP for Building Educational
Applications, pages 54–62.
Robert Dale and Adam Kilgarriff. 2011. Helping
Our Own: The HOO 2011 pilot shared task. 在
Proceedings of the 13th European Workshop on
Natural Language Generation, pages 242–249.
Rachele De Felice and Stephen G. Pulman. 2008.
A classifier-based approach to preposition and
determiner error correction in L2 English. 在
Proceedings of COLING, pages 169–176.
Markus Dickinson and Scott Ledbetter. 2012.
Annotating errors in a Hungarian learner cor-
pus. In Proceedings of LREC.
Mariano Felice and Ted Briscoe. 2015. Towards
a standard evaluation method for grammatical
error detection and correction. In Proceedings
of NAACL-HLT, pages 578–587.
Michael Gamon, Jianfeng Gao, Chris Brockett,
Alexander Klementiev, William Dolan, Dmitriy
Belenko, and Lucy Vanderwende. 2008. Using
contextual speller techniques and language
modeling for ESL error correction. In Pro-
ceedings of IJCNLP.
Andrew R. Golding and Dan Roth. 1996. Ap-
plying Winnow to context-sensitive spelling
correction. In Proceedings of ICML.
Andrew R. Golding and Dan Roth. 1999. A
Winnow based approach to context-sensitive
spelling correction. Machine Learning, 34(1-3):
107–130.
Na-Rae Han, Martin Chodorow, and Claudia
Leacock. 2006. Detecting errors in English
article usage by non-native speakers. Natural
Language Engineering, 12(2):115–129.
Jirka Hana, Alexandr Rosen, Svatava ˇSkodov´a,
and Barbora ˇStindlov´a. 2010. Error-tagged
learner corpus of Czech. In Proceedings of
the Fourth Linguistic Annotation Workshop,
pages 11–19.
Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H.
Clark, and Philipp Koehn. 2013. Scalable mod-
ified Kneser-Ney language model estimation.
In Proceedings of ACL, pages 690–696.
Arantza Díaz de Ilarraza, Koldo Gojenola, and
Maite Oronoz. 2008. Detecting erroneous uses
of complex postpositions in an agglutina-
tive language. In Proceedings of COLING,
pages 31–34.
Kenji Imamura, Kuniko Saito, Kugatsu
Sadamitsu, and Hitoshi Nishikawa. 2012.
Grammar error correction using pseudo-
error sentences and domain adaptation. In
Proceedings of ACL, pages 388–392.
Tania Ionin, Maria Luisa Zubizarreta, 和
Salvador Bautista Maldonado. 2008. 来源
of linguistic knowledge in the second lan-
guage acquisition of English articles. Lingua,
118:554–576.
Ross Israel, Markus Dickinson, and Sun-Hee Lee.
2013. Detecting and correcting learner Korean
particle omission errors. In Proceedings of
IJCNLP, pages 1419–1427.
Emi Izumi, Kiyotaka Uchimoto, Toyomi Saiga,
Thepchai Supnithi, and Hitoshi Isahara. 2003.
Automatic error detection in the Japanese
learners’ English spoken data. In Proceedings
of ACL, pages 145–148.
Jianshu Ji, Qinlong Wang, Kristina Toutanova,
Yongen Gong, Steven Truong, and Jianfeng
Gao. 2017. A nested attention neural hybrid
model for grammatical error correction. In
Proceedings of ACL, pages 753–762.
Marcin Junczys-Dowmunt and Roman
Grundkiewicz. 2016. Phrase-based machine
translation is state-of-the-art for automatic
grammatical error correction. In Proceedings
of EMNLP, pages 1546–1556.
Marcin Junczys-Dowmunt and Roman
Grundkiewicz. 2018. Near human-level perfor-
mance in grammatical error correction with
hybrid machine translation. In Proceedings of
NAACL-HLT, pages 284–290.
Marcin Junczys-Dowmunt, Roman Grundkiewicz,
Shubha Guha, and Kenneth Heafield. 2018.
Approaching neural grammatical error cor-
rection as a low-resource machine transla-
tion task. In Proceedings of NAACL-HLT,
pages 595–606.
Elena Klyachko, Timofey Arkhangelskiy, Olesya
Kisselev, and Ekaterina Rakhilina. 2013. Auto-
matic error detection in Russian learner lan-
guage. In Proceedings of the First Workshop
on Corpus Analysis with Noise in the Signal
(CANS).
Philipp Koehn, Hieu Hoang, Alexandra Birch,
Chris Callison-Burch, Marcello Federico,
Nicola Bertoldi, Brooke Cowan, Wade Shen,
Christine Moran, Richard Zens, Chris Dyer,
Ondˇrej Bojar, Alexandra Constantin, and Evan
Herbst. 2007. Moses: Open source toolkit for
statistical machine translation. In Proceedings
of ACL, pages 177–180.
Philipp Koehn and Rebecca Knowles. 2017. Six
challenges for neural machine translation. In
Proceedings of the First Workshop on Neural
Machine Translation, pages 28–39.
Claudia Leacock, Martin Chodorow, 迈克尔
Gamon, and Joel Tetreault. 2010. Automated
Grammatical Error Detection for Language
Learners. Morgan and Claypool Publishers.
John Lee and Stephanie Seneff. 2008. An analysis
of grammatical errors in non-native speech in
English. In Proceedings of the 2008 Spoken
Language Technology Workshop, pages 89–92.
Lung-Hao Lee, Gaoqi Rao, Liang-Chih Yu,
Endong Xun, Baolin Zhang, and Li-Ping Chang.
2016. Overview of NLP-TEA 2016 共享
task for Chinese grammatical error diagnosis.
In Proceedings of the Third Workshop on
Natural Language Processing Techniques for
Educational Applications, pages 1–6.
Tomoya Mizumoto, Mamoru Komachi, Masaaki
Nagata, and Yuji Matsumoto. 2011. Mining
revision log of language learning SNS for
automated Japanese error correction of second
language learners. In Proceedings of IJCNLP,
pages 147–155.
Behrang Mohit, Alla Rozovskaya, Nizar Habash,
Wajdi Zaghouani, and Ossama Obeid. 2014.
The first QALB shared task on automatic
text correction for Arabic. In Proceedings
of the EMNLP Workshop on Arabic Natural
Language Processing (ANLP), pages 39–47.
Silvina Montrul and Roumyana Slabakova.
2002. Acquiring morphosyntactic and seman-
tic properties of preterite and imperfect tenses
in L2 Spanish. In Pérez-Leroux, A. T. and Liceras, J.
(eds.), The Acquisition of Spanish Morphosyn-
tax: The L1-L2 Connection. Dordrecht: Kluwer,
pages 113–149.
Courtney Napoles, Keisuke Sakaguchi, Matt Post,
and Joel Tetreault. 2015. Ground truth for
grammatical error correction metrics. In Pro-
ceedings of ACL, pages 588–593.
Courtney Napoles, Keisuke Sakaguchi, and Joel
Tetreault. 2017. JFLEG: A fluency corpus and
benchmark for grammatical error correction. 在
Proceedings of EACL, pages 229–234.
Alexandr Rosen, Jirka Hana, Barbora ˇStindlov´a,
and Anna Feldman. 2014. Evaluating and
automating the annotation of a learner corpus.
In Proceedings of LREC, pages 65–92.
Hwee Tou Ng, Siew Mei Wu, Ted Briscoe,
Christian Hadiwinoto, Raymond Hendy
Susanto, and Christopher Bryant. 2014. 这
CoNLL-2014 shared task on grammatical error
correction. In Proceedings of CoNLL: Shared
任务, pages 1–14.
Alla Rozovskaya, Houda Bouamor, Wajdi Zaghouani,
Ossama Obeid, Nizar Habash, and Behrang
Mohit. 2015. The second QALB shared task
on automatic text correction for Arabic. 在
Proceedings of the ACL Workshop on Arabic
自然语言处理, pages 26–35.
Hwee Tou Ng, Siew Mei Wu, Yuanbin Wu,
Christian Hadiwinoto, and Joel Tetreault. 2013.
The CoNLL-2013 shared task on grammatical
error correction. In Proceedings of CoNLL:
Shared Task, pages 1–12.
Alla Rozovskaya and Dan Roth. 2010a. Anno-
tating ESL errors: Challenges and rewards.
In Proceedings of the NAACL Workshop on
Innovative Use of NLP for Building Educational
Applications, pages 28–36.
Franz Josef Och. 2003. Minimum error rate
training in statistical machine translation. 在
Proceedings of ACL, pages 160–167.
Alla Rozovskaya and Dan Roth. 2010b. Generat-
ing confusion sets for context-sensitive error cor-
rection. In Proceedings of EMNLP, pages 961–970.
Kishore Papineni, Salim Roukos, Todd Ward,
and Wei-Jing Zhu. 2002. BLEU: A method for
automatic evaluation of machine translation. In
Proceedings of ACL, pages 311–318.
Ekaterina Rakhilina, Anastasia Vyrenkova,
Elmira Mustakimova, Alina Ladygina, and Ivan
Smirnov. 2016. Building a learner corpus for
Russian. In Proceedings of the Joint Workshop
on NLP for Computer Assisted Language
Learning and NLP for Language Acquisition.
Loganathan Ramasamy, Alexandr Rosen, and Pavel
Stran´ak. 2015. Improvements to korektor: A
case study with native and non-native Czech. In
Proceedings of Slovenskoˇcesk´y NLP Workshop.
Gaoqi Rao, Qi Gong, Baolin Zhang, and Endong
Xun. 2018. Overview of NLPTEA-2018 shared
task: Chinese grammatical error diagnosis.
In Proceedings of the Fifth Workshop on
Natural Language Processing Techniques for
Educational Applications, pages 42–51.
Gaoqi Rao, Baolin Zhang, and Endong Xun. 2017.
IJCNLP-2017 task 1: Chinese grammatical error
diagnosis. In Proceedings of IJCNLP, pages 1–8.
Nicholas Rizzolo. 2011. Learning Based Pro-
gramming. PhD thesis. University of Illinois,
Urbana-Champaign.
Alla Rozovskaya and Dan Roth. 2010c. Training
paradigms for correcting errors in grammar and
usage. In Proceedings of NAACL, pages 154–162.
Alla Rozovskaya and Dan Roth. 2011. Algorithm
selection and model adaptation for ESL correc-
tion tasks. In Proceedings of ACL, pages 924–933.
Alla Rozovskaya and Dan Roth. 2014. 建筑
a state-of-the-art grammatical error correction
系统. In Transactions of ACL, pages 419–434.
Alla Rozovskaya and Dan Roth. 2016. Gram-
matical error correction: Machine transla-
tion and classifiers. In Proceedings of ACL,
pages 2205–2215.
Alla Rozovskaya, Dan Roth, and Mark Sammons.
2017. Adapting to learner errors with min-
imal supervision. Computational Linguistics,
43(4):723–760.
Keisuke Sakaguchi, Courtney Napoles, Matt Post,
and Joel Tetreault. 2016. Reassessing the
goals of grammatical error correction: Fluency
instead of grammaticality. In Transactions of
ACL, pages 169–182.
Helmut Schmid. 1995. Improvements in part-of-
speech tagging with an application to German.
In Proceedings of the ACL SIGDAT Workshop,
pages 47–50.
Ilya Segalovich. 2003. A fast morphological
algorithm with unknown word guessing in-
duced by a dictionary for a web search engine.
In Proceedings of the International Conference
on Machine Learning; Models, Technologies
and Applications (MLMTA), pages 273–280.
Alexey Sorokin. 2017. Spelling correction for
morphologically rich language: A case study of
Russian. In Proceedings of the Sixth Workshop
on Balto-Slavic Natural Language Processing,
pages 45–53.
Alexey Sorokin, Alexey Baytin, Irina Galinskaya,
Elena Rykunova, and Tatiana Shavrina.
2016. SpellRuEval: The first competition on
automatic spelling correction for Russian.
In Computational Linguistics and Intellectual
Technologies: Proceedings of the International
Conference ‘‘Dialogue 2016’’.
Raymond Hendy Susanto, Peter Phandi, and
Hwee Tou Ng. 2014. System combination for
grammatical error correction. In Proceedings of
EMNLP, pages 951–962.
Toshikazu Tajiri, Mamoru Komachi, and Yuji
Matsumoto. 2012. Tense and aspect error
correction for ESL learners using global
context. In Proceedings of ACL: Short Papers,
pages 198–202.
Joel Tetreault and Martin Chodorow. 2008. Native
judgments of non-native usage: Experiments in
preposition error detection. In Proceedings of
the COLING Workshop on Human Judgements
in Computational Linguistics, pages 24–32.
Joel Tetreault, Jennifer Foster, and Martin
Chodorow. 2010. Using parse features for prep-
osition selection and error detection. In Pro-
ceedings of ACL, pages 353–358.
Veronika Vincze, J´anos Zsibrita, P´eter Durst,
and Martina Katalin Szab´o. 2014. Automatic
error detection concerning the definite and
indefinite conjugation in the hunlearner corpus.
In Proceedings of LREC, pages 26–31.
Helen Yannakoudakis, Ted Briscoe, and Ben
Medlock. 2011. A new dataset and method
for automatically grading ESOL texts. In
Proceedings of ACL, pages 180–189.
Liang-Chih Yu, Lung-Hao Lee, and Li-Ping
Chang. 2014. Overview of grammatical error
diagnosis for learning Chinese as a foreign
language. In Proceedings of the Workshop on
Natural Language Processing Techniques for
Educational Applications (NLP-TEA).
Zheng Yuan and Ted Briscoe. 2016. Grammatical
error correction using neural machine
translation. In Proceedings of NAACL-HLT,
pages 380–386.
Wajdi Zaghouani, Behrang Mohit, Nizar Habash,
Ossama Obeid, Nadi Tomeh, Alla Rozovskaya,
Noura Farra, Sarah Alkuhlani, and Kemal
Oflazer. 2014. Large scale Arabic error anno-
tation: Guidelines and framework. In Pro-
ceedings of LREC.