Czech Grammar Error Correction with a Large and Diverse Corpus
Jakub Náplava† Milan Straka†
Jana Straková† Alexandr Rosen‡
†Charles University, Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics, Czech Republic
{naplava,straka,strakova}@ufal.mff.cuni.cz
‡Charles University, Faculty of Arts
Institute of Theoretical and Computational Linguistics, Czech Republic
alexandr.rosen@ff.cuni.cz
Abstract
We introduce a large and diverse Czech corpus annotated for grammatical error correction (GEC) with the aim to contribute to the still scarce data resources in this domain for languages other than English. The Grammar Error Correction Corpus for Czech (GECCC) offers a variety of four domains, covering error distributions ranging from high error density essays written by non-native speakers to website texts, where errors are expected to be much less common. We compare several Czech GEC systems, including several Transformer-based ones, setting a strong baseline for future research. Finally, we meta-evaluate common GEC metrics against human judgments on our data. We make the new Czech GEC corpus publicly available under the CC BY-SA 4.0 license at http://hdl.handle.net/11234/1-4639.
1 Introduction
Representative data both in terms of size and
domain coverage are vital for NLP systems devel-
opment. However, in the field of grammar error
correction (GEC), most GEC corpora are limited
to corrections of mistakes made by foreign or
second language learners even in the case of En-
glish (Tajiri et al., 2012; Dahlmeier et al., 2013;
Yannakoudakis et al., 2011, 2018; Ng et al., 2014;
Napoles et al., 2017). At the same time, as recently
pointed out by Flachs et al. (2020), learner cor-
pora are only a part of the full spectrum of GEC
applications. To alleviate the skewed perspective,
the authors released a corpus of website texts.
Despite recent efforts aimed at mitigating the
notorious shortage of national GEC-annotated cor-
pora (Boyd, 2018; Rozovskaya and Roth, 2019;
Davidson et al., 2020; Syvokon and Nahorna,
2021; Cotet et al., 2020; Náplava and Straka,
2019), the lack of adequate data is even more
acute in languages other than English. We aim to
address both the issue of scarcity of non-English
data and the ubiquitous need for broad domain
coverage by presenting a new, large and diverse
Czech corpus, expertly annotated for GEC.
Grammar Error Correction Corpus for Czech
(GECCC) includes texts from multiple domains
in a total of 83 058 sentences, being, to our knowl-
edge, the largest non-English GEC corpus, as well
as being one of the largest GEC corpora overall.
In order to represent a diversity of writing
styles and origins, besides essays of both native
and non-native speakers from Czech learner cor-
pora, we also scraped website texts to complement
the learner domain with supposedly lower error
density texts, encompassing a representation of
the following four domains:
• Native Formal – essays written by native
students of elementary and secondary schools
• Native Web Informal – informal website
discussions
• Romani – essays written by children and
teenagers of the Romani ethnic minority
• Second Learners – essays written by non-
native learners
Using the presented data, we compare several
state-of-the-art Czech GEC systems, including
Transformer-based ones.
Finally, we conduct a meta-evaluation of GEC
metrics against human judgments to select the
most appropriate metric for evaluating corrections
on the new dataset. The analysis is performed
across domains, in line with Napoles et al. (2019).
Transactions of the Association for Computational Linguistics, vol. 10, pp. 452–467, 2022. https://doi.org/10.1162/tacl_a_00470
Action Editor: Alice Oh. Submission batch: 6/2021; Revision batch: 11/2021; Published 4/2022.
© 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Language    Corpus            Sentences   Err. r.  Domain                                        # Refs.
English     Lang-8            1 147 451   14.1%    SL                                            1
English     NUCLE             57 151      6.6%     SL                                            1
English     FCE               33 236      11.5%    SL                                            1
English     W&I+LOCNESS       43 169      11.8%    SL, native students                           5
English     CoNLL-2014 test   1 312       8.2%     SL                                            2, 10, 8
English     JFLEG             1 511       —        SL                                            4
English     GMEG              6 000       —        web, formal articles, SL                      4
English     AESW              over 1M     —        scientific writing                            1
English     CWEB              13 574      ∼2%      web                                           2
Czech       AKCES-GEC         47 371      21.4%    SL essays, Romani ethnolect of Czech          2
German      Falko-MERLIN      24 077      16.8%    SL essays                                     1
Russian     RULEC-GEC         12 480      6.4%     SL, heritage speakers                         1
Spanish     COWS-L2H          12 336      —        SL, heritage speakers                         2
Ukrainian   UA-GEC            20 715      7.1%     natives/SL, translations and personal texts   2
Romanian    RONACC            10 119      —        native speakers transcriptions                1

Table 1: Comparison of GEC corpora in size, token error rate, domain, and number of reference
annotations in the test portion. SL = second language learners.
Our contributions include (i) a large and di-
verse Czech GEC corpus, covering learner cor-
pora and website texts, with unified and, in some
domains, completely new GEC annotations, (ii)
a comparison of Czech GEC systems, and (iii)
a meta-evaluation of common GEC metrics
against human judgment on the released corpus.
2 Related Work
2.1 Grammar Error Correction Corpora
Until recently, attention has been focused mostly
on English, while GEC data resources for other
languages were in short supply. Here we list a
few examples of English GEC corpora, collected
mostly within an English-as-a-second-language
(ESL) paradigm. For a comparison of their rele-
vant statistics see Table 1.
Lang-8 Corpus of Learner English (Tajiri et al.,
2012) is a corpus of English language learner texts
from the Lang-8 social networking system.
NUCLE (Dahlmeier et al., 2013) consists of
essays written by undergraduate students of the
National University of Singapore.
FCE (Yannakoudakis et al., 2011) includes
short essays written by non-native learners for the
Cambridge ESOL First Certificate in English.
W&I+LOCNESS is a union of two datasets, the
W&I (Write & Improve) dataset (Yannakoudakis
et al., 2018) of non-native learners’ essays, com-
plemented by the LOCNESS corpus (Granger,
1998), a collection of essays written by native
English students.
The GEC error annotations for the learner
corpora above were distributed with the BEA-
2019 Shared Task on Grammatical Error Correc-
tion (Bryant et al., 2019).
The CoNLL-2014 shared task test set (Ng et al.,
2014) is often used for GEC systems evaluation.
This small corpus consists of 50 essays written
by 25 South-East Asian undergraduates.
JFLEG (Napoles et al., 2017) is another fre-
quently used GEC corpus with fluency edits in
addition to usual grammatical edits.
To broaden the restricted variety of domains,
focused primarily on learner essays, a CWEB col-
lection (Flachs et al., 2020) of website texts was
recently released, aiming at contributing lower
error density data.
AESW (Daudaravicius et al., 2016) is a large
corpus of scientific writing (over 1M sentences),
edited by professional editors.
Finally, Napoles et al. (2019) recently released
GMEG, a corpus for the evaluation of GEC metrics
across domains.
Grammatical error correction corpora for lan-
guages other than English are less common and—
if available—usually limited in size and domain:
German Falko-MERLIN (Boyd, 2018), Russian
RULEC-GEC (Rozovskaya and Roth, 2019),
Spanish COWS-L2H (Davidson et al., 2020),
Ukrainian UA-GEC (Syvokon and Nahorna, 2021),
and Romanian RONACC (Cotet et al., 2020).
To better account for multiple correction op-
tions, datasets often contain several reference sen-
tences for each original noisy sentence in the test
set, proposed by multiple annotators. As we can
see in Table 1, the number of annotations typically
ranges between 1 and 5, with the exception of the
CoNLL-14 test set, which—on top of the official 2
reference corrections—later received 10 annotations
from Bryant and Ng (2015) and 8 alternative
annotations from Sakaguchi et al. (2016).
2.2 Czech Learner Corpora
By the early 2010s, Czech was one of a few
languages other than English to boast a series
of learner corpora, compiled under the umbrella
project AKCES, evoking the concept of acquisition
corpora (Šebesta, 2010).
The native section includes transcripts of
hand-written essays (SKRIPT 2012) and class-
room conversation (SCHOLA 2010) from ele-
mentary and secondary schools. Both have their
counterparts documenting the Roma ethnolect of
Czech:1 essays (ROMi 2013) and recordings and
transcripts of dialogues (ROMi 1.0).2
The non-native section goes by the name of
CzeSL,
the acronym of Czech as the Second
Language. CzeSL consists of transcripts of short
hand-written essays collected from non-native
learners with various levels of proficiency and na-
tive languages, mostly students attending Czech
language courses before or during their studies at
a Czech university. There are several releases of
1The Romani ethnolect of Czech is the result of contact
with Romani as the linguistic substrate. To a lesser (and
weakening) extent the ethnolect shows some influence of
Slovak or even Hungarian, because most of its speakers have
roots in Slovakia. The ethnolect can exhibit various specifics
across all linguistic levels. However, nearly all of them
are complementary with their colloquial or standard Czech
counterparts. A short written text, devoid of phonological
properties, may be hard to distinguish from texts written by
learners without the Romani background. The only striking
exception is misspellings in contexts where the latter benefit
from more exposure to written Czech. The typical example is
the omission of word boundaries within phonological words,
e.g., between a clitic and its host. In other respects, the pattern
of error distribution in texts produced by ethnolect speakers
is closer to native than to foreign learners (Bořkovcová,
2007, 2017).
2A more recent release SKRIPT 2015 includes a balanced
mix of essays from SKRIPT 2012 and ROMi 2013. For more
details and links see http://utkl.ff.cuni.cz/akces/.
CzeSL, which differ mainly in the extent to which and
how the texts are annotated (Rosen et al., 2020).3
More recently, hand-written essays have been
transcribed and annotated in TEITOK (Janssen,
2016),4 a tool combining a number of cor-
pus compilation, annotation and exploitation
functionalities.
Learner Czech is also represented in MERLIN, a
multilingual (German, Italian, and Czech) corpus
built in 2012–2014 from texts submitted as a part
of tests for language proficiency levels (Boyd
et al., 2014).5
Finally, AKCES-GEC (Náplava and Straka,
2019) is a GEC corpus for Czech created from
a subset of the above-mentioned AKCES resources
(Šebesta, 2010): the CzeSL-man corpus
(non-native Czech learners with manual annotation)
and a part of the ROMi corpus (speakers of
the Romani ethnolect).
Compared to the AKCES-GEC,
the new
GECCC corpus contains much more data (47 371
sentences vs. 83 058 sentences, respectively), by
extending data in the existing domains and also
adding two new domains: essays written by native
learners and website texts, making it the largest
non-English GEC corpus and one of the largest
GEC corpora overall.
3 Annotation
3.1 Data Selection
We draw the original uncorrected data from
the following Czech learner corpora or Czech
websites:
• Native Formal – essays written by native stu-
dents of elementary and secondary schools
from the SKRIPT 2012 learner corpus,
compiled in the AKCES project
• Native Web Informal – newly annotated
informal website discussions from Czech
Facebook Dataset (Habernal et al., 2013a,b)
and Czech news site novinky.cz.
• Romani – essays written by children and
teenagers of the Romani ethnic minority from
the ROMi corpus of the AKCES project and
the ROMi section of the AKCES-GEC corpus
3For a list of CzeSL corpora with their sizes and annotation
details see http://utkl.ff.cuni.cz/learncorp/.
4http://www.teitok.org.
5https://www.merlin-platform.eu.
Dataset           Documents   Selected
AKCES-GEC-test    188         188
AKCES-GEC-dev     195         195
MERLIN            441         385
Novinky.cz        —           2 695
Facebook          10 000      3 850
SKRIPT 2012       394         167
ROMi              1 529       218

Table 2: Data resources for the new Czech GEC
corpus. The second column (Selected) shows the
size of the selected subset from all available
documents (first column, Documents).
• Second Learners – essays written by non-
native learners, from the Foreigners section
of the AKCES-GEC corpus, and the MERLIN
corpus
Since we draw our data from several Czech corpora
originally created in different tools with
different annotation schemes and instructions, we
re-annotated the errors in a unified manner for the
entire development and test set and partially also
for the training set.
The data split was carefully designed to main-
tain representativeness, coverage and backwards
compatibility. Specifically, (i) test and develop-
ment data contain roughly the same amount of
annotated data from all domains, (ii) original
AKCES-GEC dataset splits remain unchanged,
and (iii) additional available detailed annotations
such as user proficiency level in MERLIN were
leveraged to support the split balance. Overall,
the main objective was to achieve a representative
cover over development and testing data. Table 2
presents the sizes of data resources in the num-
ber of documents. The first column (Documents)
shows the number of all available documents
collected in an initial scan. The second column
(Selected) is a selected subset from the available
documents, due to budgetary constraints and to
achieve a representative sample over all domains
and data portions. The relatively higher number of
documents selected for the Native Web Informal
domain is due to its substantially shorter texts,
yielding fewer sentences; also, we needed to pop-
ulate this part of the corpus as a completely new
domain with no previously annotated data.
To achieve more fine-grained balancing of the
splits, we used additional metadata where avail-
able: user’s proficiency levels and origin language
from MERLIN and the age group from AKCES.
3.2 Preprocessing
De/tokenization is an important part of data pre-
processing in grammar error correction. Some
formats, such as the M2 format (Dahlmeier and
Ng, 2012), require tokenized formats to track and
evaluate correction edits. On the other hand, deto-
kenized text in its natural form is required for other
applications. We therefore release our corpus in
two formats: a tokenized M2 format and a detokenized
format aligned at the sentence, paragraph, and
document level. As part of our data is drawn from
earlier, tokenized GEC corpora, AKCES-GEC
and MERLIN, this data had to be detokenized. A
slightly modified Moses detokenizer6 is attached
to the corpus. To tokenize the data for the M2
format, we use the UDPipe tokenizer (Straka
et al., 2016).
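To make both directions of this preprocessing concrete, the following sketch uses the sacremoses package as a stand-in for the modified Moses detokenizer shipped with the corpus and the ufal.udpipe bindings for tokenization; the example sentence is invented and the call details are illustrative rather than our exact pipeline.

# Minimal sketch, not the released pipeline: sacremoses stands in for the
# modified Moses detokenizer; ufal.udpipe provides the UDPipe tokenizer.
from sacremoses import MosesDetokenizer
from ufal.udpipe import Model, Pipeline, ProcessingError

# Detokenization: AKCES-GEC/MERLIN data come tokenized and are turned back
# into natural text.
detok = MosesDetokenizer(lang="cs")
natural = detok.detokenize(["Pojedeme", "do", "Prahy", "."])  # "Pojedeme do Prahy."

# Tokenization for the M2 format: the UDPipe tokenizer (model from footnote 10).
model = Model.load("czech-pdt-ud-2.5-191206.udpipe")
pipeline = Pipeline(model, "tokenize", Pipeline.NONE, Pipeline.NONE, "horizontal")
error = ProcessingError()
tokenized = pipeline.process(natural, error)  # space-separated tokens, one sentence per line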
3.3 Annotation
The test and development sets in all domains
were annotated from scratch by five in-house ex-
pert annotators,7 including re-annotations of the
development and test data of the earlier GEC cor-
pora to achieve a unified annotation style. All the
test sentences were annotated by two annotators;
one half of the development sentences received
two annotations and the second half one annota-
tion. The annotation process took about 350 hours
in total.
The annotation instructions were unified across
all domains: The corrected text must not contain
any grammatical or spelling errors and should
sound fluent. Fluency edits are allowed if the
original is incoherent. The entire document was
given as a context for the annotation. Annotators
were instructed to remove documents that were
too incomprehensible or those containing private
information.
To keep the annotation process simple for the annotators, the sentences were annotated (corrected) in a text editor and postprocessed automatically to retrieve and categorize the GEC edits by the ERRor ANnotation Toolkit (ERRANT) (Bryant et al., 2017).
6https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/detokenizer.perl.
7Our annotators are senior undergraduate students of
humanities, regularly employed for various annotation efforts
at our institute.
3.4 Train Data
The first source of the training data are the data
from SKRIPT 2012, the MERLIN corpus, and
the AKCES-GEC train set that were not re-annotated
and thus contain their original annotations. These data
cover the Native Formal, Romani, and
Second Learners domains. The second part of the
training data are newly annotated data, specifically
all Native Web Informal data and
also a small part of the Second Learners domain.
All data in the training set carry a single annotation.
3.5 Corpus Alignment
The majority of models proposed for grammatical
error correction operate over sentences.
However, preliminary studies on document-level
grammatical error correction have recently appeared
(Chollampatt et al., 2019; Yuan and Bryant, 2021).
The models were shown to benefit from larger
context, as certain errors, such as errors in articles
or tense choice, do require context beyond a single sentence. To
simplify future work with our dataset, we release
three alignment levels: (i) sentence-level, (ii)
paragraph-level, and (iii) document-level. Given
that state-of-the-art grammatical error correction
systems still operate on the sentence level despite
the initial attempts at document-level systems,
we perform model training and evaluation at the
usual sentence level.8
3.6 Inter-Annotator Agreement
As suggested by Rozovskaya and Roth (2010),
followed later by Rozovskaya and Roth (2019)
and Syvokon and Nahorna (2021), we evaluate
inter-annotator agreement by asking a second an-
notator to judge the need for a correction in a
sentence already annotated by someone else, in a
single-blind setting as to the status of the sentence
(corrected/uncorrected).9 Five annotators anno-
tated the first pass and three annotators judged
the sentence correctness in the second pass. In
8Note that even if human evaluation in Section 5 is per-
formed on sentence-aligned data, human annotators process
whole documents, and thus take the full context into account.
9A sentence-level agreement on sentence correctness
is generally preferred in GEC annotations to an exact
inter-annotator match on token edits, since different series of
corrections may possibly lead to a correct sentence (Bryant
and Ng, 2015).
First →
Second ↓
A1
A2
A3
A1
A2
A3
A4
A5
— 93.39
97.96
84.43 — 95.91
68.80
89.63
90.18
87.68 — 79.39
72.50
78.15
57.50
Table 3: Inter-annotator agreement based on
second-pass judgments. Numbers represent per-
centage of sentences judged correct in second-
pass proofreading. Five annotators annotated the
first pass, and three annotators judged the sen-
tence correctness in the second pass.
Error Type   Subtype    Example
POS (15)                tažené → řízené
             :INFL      manželka → manželkou
MORPH                   maj → mají
ORTH         :CASING    usa → USA
             :WSPACE    přes to → přesto
SPELL                   ochtnat → ochutnat
WO                      plná jsou → jsou plná
             :SPELL     blískajé zeleně → zeleně blýskají
QUOTATION               ” → „
DIACR                   tiskarna → tiskárna
OTHER                   sem → jsem ho

Table 4: Czech ERRANT error types.
the second pass, each of the three annotators
judged a disjoint set of 120 sentences. Table 3
summarizes the inter-annotator agreement based
on second-pass judgments: The numbers represent
the percentage of sentences judged correct in the
second pass.
Both the average and the standard deviation
(82.96 ± 12.12) of our inter-annotator agreement
are similar to the inter-annotator agreement measured
on English (63 ± 18.46; Rozovskaya and Roth,
2010), Russian (80 ± 16.26; Rozovskaya and Roth,
2019), and Ukrainian (69.5 ± 7.78; Syvokon and
Nahorna, 2021).
3.7 Error Type Analysis
To retrieve and categorize the correction edits
from the erroneous-corrected sentence pairs, ER-
Ror ANnotation Toolkit (ERRANT) (Bryant et al.,
2017) was used. Inspired by Boyd (2018), we
adapted the original English error types to the
Czech language. For the resulting set see Table 4.
The POS error types are based on the UD POS
tags (Nivre et al., 2020) and may contain an op-
tional :INFL subtype when the original and the
corrected words share a common lemma. The
word-order error type was extended by an optional
                      Sentence-aligned             Paragraph-aligned         Doc-aligned
                      #sentences                   #paragraphs               #docs                  Error
                      Train    Dev    Test         Train   Dev    Test       Train  Dev    Test     Rate
Native Formal         4 060    1 952  1 684        1 618   859    669        227    87     76       5.81%
Native Web Informal   6 977    2 465  2 166        3 622   1 294  1 256      3 619  1 291  1 256    15.61%
Romani                24 824   1 254  1 260        9 723   574    561        3 247  173    169      26.21%
Second Learners       30 812   2 807  2 797        8 781   865    756        2 050  167    170      25.16%
Total                 66 673   8 478  7 907        23 744  3 592  3 242      9 143  1 718  1 671    18.19%

Table 5: Corpus statistics at three alignment levels: sentence-aligned, paragraph-aligned, and doc-aligned.
Average error rate was computed on the concatenation of development and test data at all three
alignment levels.
:SPELL subtype to allow for capturing word or-
der errors including words with minor spelling
errors. The original orthography error type ORTH
covering both errors in casing and whitespaces
is now subtyped with :WSPACE and :CASING
to better distinguish between the two phenomena.
Finally, we add two error types specific to Czech:
DIACR for errors in either missing or redundant
diacritics and QUOTATION for wrongly used
quotation marks. Two original error types remain
unchanged: MORPH, indicating replacement of a
token by another with the same lemma but differ-
ent POS, and SPELL, indicating incorrect spelling.
For part-of-speech tagging and lemmatization
we rely on UDPipe (Straka et al., 2016).10 The
word list for detecting spelling errors comes from
MorfFlex (Hajič et al., 2020).11
We release the Czech ERRANT at https://github.com/ufal/errant_czech. We assume
that it is applicable to other languages with a
similar set of errors, especially Slavic languages, if
a lemmatizer, a tagger, and a morphological dictionary
are available.
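As an illustration of the intended usage, the following sketch follows the call pattern of the English ERRANT toolkit; we assume here that the Czech adaptation exposes a similar Python interface, and the language code, sentences, and printed error types are purely illustrative.

# Sketch only: the API mirrors English ERRANT; the Czech adaptation may differ.
import errant

annotator = errant.load("cs")                       # assumed language code
orig = annotator.parse("Včera sme šli do kyna .")   # invented erroneous sentence
cor = annotator.parse("Včera jsme šli do kina .")   # its correction
for edit in annotator.annotate(orig, cor):
    # Prints the source token span, error type, and correction, e.g. "1 2 R:SPELL jsme".
    print(edit.o_start, edit.o_end, edit.type, edit.c_str)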
3.8 Final Dataset
The final corpus consists of 83 058 sentences and
is distributed in two formats: the tokenized M2
format (Dahlmeier and Ng, 2012) and the deto-
kenized format with alignments at the sentence,
paragraph, and document levels. Although the
detokenized format does not include correction
edits, it does retain full information about the
original spacing.
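For readers unfamiliar with the M2 format, an entry consists of an S line holding the tokenized source sentence, followed by one A line per edit giving the token span, the error type, the correction, and bookkeeping fields (requirement flag, comment, annotator id). The following example is invented for illustration and is not taken from the corpus:

S Včera sme šli do kyna .
A 1 2|||R:SPELL|||jsme|||REQUIRED|||-NONE-|||0
A 4 5|||R:SPELL|||kina|||REQUIRED|||-NONE-|||0

The detokenized release, in contrast, stores the aligned sentence, paragraph, and document pairs as plain text, preserving the original spacing instead of explicit edits.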
The statistics of the final dataset are presented in
Table 5. The individual domains are balanced on
10Using the czech-pdt-ud-2.5-191206.udpipe model.
11We also use the aggressive variant of the stemmer from https://research.variancia.com/czech_stemmer/.
the sentence level in the development and testing
sets, each of them containing about 8 000 sen-
tences. The number of paragraphs and documents
varies: on average, the Native Web Informal do-
main contains less than 2 sentences per document,
while the Native Formal domain more than 20.
As expected, the domains differ also in the error
rate, that is, the proportion of erroneous tokens
(see Table 5). The students’ essays in the Native
Formal domain are almost 3 times less erroneous
than any other domain, while in the Romani and
Second Learners domain, approximately every
fourth token is incorrect.
Furthermore, the prevalence of error types dif-
fers for each individual domain. The 10 most
common error types in each domain are pre-
sented in Figure 1. Overall, errors in punctuation
(PUNCT) constitute the most common error type.
They are the most common error in three domains,
although their relative frequency varies. We fur-
ther estimated that of these errors, 9% (Native
Formal) to 27% (Native Web Informal) are unin-
teresting from the linguistic perspective, as they
are only omissions of the sentence formal end-
ing, probably purposeful in case of Native Web
Informal. The rest (75–91%) appears in a sen-
tence, most of which (35–68% Native Formal) is
a misplaced comma: In Czech, syntactic status of
finite clauses strictly determine the use of com-
mas in the sentence. Finally, in 5–7% cases of all
punctuation errors, a correction included joining
two sentences or splitting a sentence into two sen-
tences. Errors in either missing or wrongly used
diacritics (DIACR), spelling errors (SPELL), and
errors in orthography (ORTH) are also common,
with varying frequency across domains.
Compared to the AKCES-GEC corpus, the Gram-
mar Error Correction Corpus for Czech contains
Figure 1: Distribution of top-10 ERRANT error types per domain in the development set.
System                        Params   English             Czech        German         Russian
                                       W&I+L    CoNLL-14   AKCES-GEC    Falko-MERLIN   RULEC-GEC
Boyd (2018)                   –        –        –          –            45.22          –
Choe et al. (2019)            –        63.05    –          –            –              –
Lichtarge et al. (2019)       –        –        56.8       –            –              –
Lichtarge et al. (2020)       –        66.5     62.1       –            –              –
Omelianchuk et al. (2020)     –        72.4     65.3       –            –              –
Rothe et al. (2021) base      580M     60.2     54.10      71.88        69.21          26.24
Rothe et al. (2021) xxl       13B      69.83    65.65      83.15        75.96          51.62
Rozovskaya and Roth (2019)    –        –        –          –            –              21.00
Xu et al. (2019)              –        63.94    60.90      –            –              –
AG finetuned                  210M     69.00    63.40      80.17        73.71          50.20

Table 6: Comparison of selected single-model systems on English (W&I+L, CoNLL-2014), Czech
(AKCES-GEC), German (Falko-MERLIN GEC), and Russian (RULEC-GEC) datasets. Our reimplementation
of the AG finetuned model is from Náplava and Straka (2019). Note that models vastly differ in
training/fine-tuning data and size (e.g., Rothe et al. (2021) xxl is 50 times larger than AG finetuned).
more than 3 times as many sentences in the development
and test sets, more than 50% more sentences in
the training set, and also two new domains.
To the best of our knowledge, the newly intro-
duced GECCC dataset is the largest among GEC
corpora in languages other than English and it
is surpassed in size only by the English Lang-8
and AESW datasets. With the exclusion of these
two datasets, the GECCC dataset contains more
sentences than any other GEC corpus currently
known to us.
4 Model
In this section, we describe five systems for auto-
matic error correction in Czech and analyze their
performance on the new dataset. Four of these
systems represent previously published Czech
work (Richter et al., 2012; Náplava and Straka,
2019; Náplava et al., 2021) and one is our new
implementation. The first system is a pre-neural
approach, published and available for Czech (Richter
et al., 2012), included for historical reasons as a
previously known and available Czech GEC tool;
the following four systems represent the current
state of the art in GEC: they are all neural network
architectures based on Transformers, differing in
the training procedure, training data, or training
objective. A comparison of systems trained and
evaluated on English, Czech, German, and Russian
with the state of the art is given in Table 6.
4.1 Models
We experiment with the following models:
Korektor (Richter et al., 2012) is a pre-neural
statistical spellchecker and (occasional) grammar
checker. It uses the noisy channel approach with
a candidate model that for each word suggests its
variants up to a predefined edit distance. Inter-
nally, a hidden Markov model (Baum and Petrie,
1966) is built. Its hidden states are the variants
of words proposed by the candidate model, and
the transition costs are determined from three
N -gram language models built over word forms,
                      M2_0.5-score                             Mean human score
System                NF      NWI     R       SL      Σ        NF     NWI    R      SL     Σ
Original              —       —       —       —       —        8.47   7.99   7.76   7.18   7.61
Korektor              28.99   31.51   46.77   55.93   45.09    8.26   7.60   7.90   7.55   7.63
Synthetic trained     46.83   38.63   46.36   62.20   53.07    8.55   7.99   8.10   7.88   7.98
AG finetuned          65.77   55.20   69.71   71.41   68.08    8.97   8.22   8.35   8.91   8.38
GECCC finetuned       72.50   71.09   72.23   73.21   72.96    9.19   8.72   8.91   8.67   8.74
Joint GEC+NMT         68.14   66.64   65.21   70.43   67.40    9.06   8.37   8.69   8.19   8.35
Reference             —       —       —       —       —        9.58   9.48   9.60   9.63   9.57

Table 7: Mean score of human judgments and M2_0.5 score for each system in domains (NF = Native
Formal, NWI = Native Web Informal, R = Romani, SL = Second Learners, Σ = whole dataset). All
results in the whole dataset (the Σ column) are statistically significant with p-value < 0.001, except for
the AG finetuned and Joint GEC+NMT systems, where the p-value is less than 6.2% for the M2_0.5 score
and less than 4.3% for the human score, using the Monte Carlo permutation test with 10M samples and
probability of error at most 10−6 (Fay and Follmann, 2002; Gandy, 2009).
lemmas, and part-of-speech tags. To find an optimal
correction, the Viterbi algorithm (Forney, 1973)
is used.
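To make the decoding step concrete, the sketch below runs Viterbi search over per-word correction candidates; the two cost functions are placeholders standing in for Korektor's candidate (error) model and its three N-gram language models, not the released implementation.

def viterbi_correct(words, candidates, emission_cost, transition_cost):
    """Choose one candidate per word so that the summed cost is minimal.

    candidates[i]          -- list of proposed variants for words[i]
    emission_cost(w, c)    -- cost of rewriting word w as candidate c (error model)
    transition_cost(p, c)  -- cost of candidate c following candidate p (language model)
    """
    # best[i][c] = (cost of the cheapest path ending in candidate c at position i, backpointer)
    best = [{c: (emission_cost(words[0], c), None) for c in candidates[0]}]
    for i in range(1, len(words)):
        layer = {}
        for c in candidates[i]:
            prev, prev_cost = min(
                ((p, best[i - 1][p][0] + transition_cost(p, c)) for p in candidates[i - 1]),
                key=lambda x: x[1],
            )
            layer[c] = (prev_cost + emission_cost(words[i], c), prev)
        best.append(layer)
    # Backtrack from the cheapest final candidate.
    path = [min(best[-1], key=lambda c: best[-1][c][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))

In such a formulation, the language-model transition costs are what allow the decoder to prefer, for example, a word form that agrees with its neighbors over an isolated per-word correction.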
Synthetic trained (N´aplava and Straka, 2019)
is a neural-based Transformer model that is trained
to translate the original ungrammatical text to a
well formed text. The original Transformer model
(Vaswani et al., 2017) is regularized with an ad-
ditional source and target word dropout and the
training objective is modified to focus on tokens
that should change (Grundkiewicz and Junczys-
Dowmunt, 2019). As the amount of existing anno-
tated data is small, an unsupervised approach with
a spelling dictionary is used to generate a large
amount of synthetic training data. The model is
trained solely on these synthetic data.
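For illustration, a heavily simplified sketch of this style of unsupervised noising is shown below; the probabilities, character inventory, and operations are placeholders and do not reproduce the actual error statistics or the full procedure of Náplava and Straka (2019).

import random

CZECH_CHARS = "aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž"

def noise_word(word, confusions, p_confusion=0.3, p_char=0.05):
    """Corrupt a clean word: either substitute a confusable form taken from a
    spelling dictionary, or apply random character-level edits."""
    if word in confusions and random.random() < p_confusion:
        return random.choice(confusions[word])
    chars = list(word)
    for i in range(len(chars)):
        if random.random() < p_char:
            op = random.choice(("delete", "replace", "insert"))
            if op == "delete":
                chars[i] = ""
            elif op == "replace":
                chars[i] = random.choice(CZECH_CHARS)
            else:
                chars[i] += random.choice(CZECH_CHARS)
    return "".join(chars)

def make_training_pair(clean_sentence, confusions):
    """Produce a (noisy source, clean target) pair for the translation model."""
    noisy = " ".join(noise_word(w, confusions) for w in clean_sentence.split())
    return noisy, clean_sentence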
AKCES-GEC (AG) finetuned (Náplava and
Straka, 2019) is based on Synthetic trained, but
finetunes its weights on a mixture of synthetic
and authentic data from the AKCES-GEC corpus,
namely, on data from the Romani and Second
Learners domains. See Table 6 for comparison
with state of the art in English, Czech, German
and Russian.
GECCC finetuned uses the same architecture
as Synthetic trained, but we finetune its weights on
a mixture of synthetic and (much larger) authen-
tic data from the newly released GECCC corpus.
We use the official code of Náplava and Straka
(2019) with the default settings and mix the syn-
thetic and new authentic data in a ratio of 2:1.
Joint GEC+NMT (Náplava et al., 2021) is a
Transformer model trained in a multi-task setting.
It pursues two objectives: (i) to correct Czech and
English texts; (ii) to translate the noised Czech
texts into English texts and the noised English
texts into Czech texts. The source data come from
the CzEng v2.0 corpus (Kocmi et al., 2020) and
were noised using a statistical system, KaziText
(Náplava et al., 2021), that tries to model several
of the most frequently occurring errors, such as
diacritics, spelling, or word ordering. The statistics of the
Czech noise were estimated on the new training
set, therefore, the system was indirectly trained
also on data from Native Formal and Native Web
Informal domains, unlike the AG finetuned sys-
tem. The statistics of the English noise were esti-
mated on NUCLE (Dahlmeier et al., 2013), FCE
(Yannakoudakis et al., 2011), and W&I+LOCNESS
(Yannakoudakis et al., 2018; Granger, 1998).
4.2 Results and Analysis
Table 7 summarizes the evaluation of the five
grammar error correction systems (described in
the previous Section 4.1), evaluated with the
highest-correlating and widely used metric, the M2 score
with β = 0.5, denoted M2_0.5 (left), and with
human judgments (right). For the meta-evaluation
of GEC metrics against human judgments, see the
following Section 5.
Clearly, learning on GEC annotated data im-
proves performance significantly, as evidenced
by a giant leap between the systems without GEC
data (Korektor, Synthetic trained) and the systems
trained on GEC data (AG finetuned, GECCC fine-
tuned, and Joint GEC+NMT). Further addition
Error Type     #       P      R      F0.5
DIACR          3 617   86.84  88.77  87.22
MORPH          610     73.58  55.91  69.20
ORTH:CASING    1 058   81.60  55.15  74.46
ORTH:WSPACE    385     64.44  74.36  66.21
OTHER          3 719   23.59  20.04  22.78
POS            2 735   56.50  22.12  43.10
POS:INFL       1 276   74.47  48.22  67.16
PUNCT          4 709   71.42  61.17  69.10
QUOTATION      223     89.44  61.06  81.83
SPELL          1 816   77.27  75.76  76.96
WO             662     60.00  29.89  49.94

Table 8: Analysis of GECCC finetuned model
performance on individual error types. For this
analysis, all POS-error types were merged into a
single error type POS.
of GEC data volume and domains brings a statistically
significant improvement (p < 0.001), as the only
difference between AG finetuned and GECCC
finetuned systems is that the former uses the
AKCES-GEC corpus, while the latter is trained
on larger and domain-richer GECCC. Access to
larger data and more domains in the multi-task
setting is useful (compare Joint GEC+NMT and
AG finetuned on newly added Native Formal and
Native Web Informal domains), although direct
training seems superior (GECCC finetuned over
Joint GEC+NMT).
We further analyze the best model (GECCC
finetuned) and inspect its performance with re-
spect to individual error types. For simpler anal-
ysis, we grouped all POS-related errors into two
error types: POS and POS:INFL for words that are
erroneous only in inflection and share the same
lemma with their correction.
As we can see in Table 8, the model is very good
at correcting local errors in diacritics (DIACR),
quotation (QUOTATION), spelling (SPELL), and
casing (ORTH:CASING). Unsurprisingly, small
changes are easier than longer edits; similarly, the
system is better in inflection corrections (POS:
INFL, words with the same lemma) than on POS
(correction involves finding a word with a dif-
ferent lemma).
Should the word be split or joined with an
adjacent word, the model does so with a relatively
high success rate (ORTH:WSPACE). The model
is also able to correctly reorder words (WO), but
here its recall is rather low. The model performs
worst on errors categorized as OTHER, which
includes edits that often require rewriting larger
pieces of text. Generally, the model has higher
precision than recall, which suits the needs of
standard GEC, where proposing a bad correction
for a good text is worse than being inert to an
existing error.
5 Meta-evaluation of Metrics
There are several automatic metrics used for
evaluating system performance on GEC datasets,
although it is not clear which of them is preferable
in terms of high correlation with human judgments
on our dataset.
The most popular GEC metrics are the Max-
Match (M2) scorer (Dahlmeier and Ng, 2012) and
the ERRANT scorer (Bryant et al., 2017).
The MaxMatch (M2) scorer reports the F-score
over the optimal phrasal alignment between a
source sentence and a system hypothesis reaching
the highest overlap with the gold standard anno-
tation. It was used as the official metric for the
CoNLL 2013 and 2014 Shared Tasks (Ng et al.,
2013, 2014) and is also used on various other
datasets such as the German Falko-MERLIN
GEC (Boyd, 2018) or Russian RULEC-GEC
(Rozovskaya and Roth, 2019).
The ERRANT scorer was used as the official
metric of the recent Building Educational Appli-
cation 2019 Shared Task on GEC (Bryant et al.,
2019). The ERRANT scorer also contains a set
of rules operating over a set of linguistic annota-
tions to construct the alignment and extract indi-
vidual edits.
Other popular automatic metrics are the Gen-
eral Language Evaluation Understanding (GLEU)
metric (Napoles et al., 2015), which additionally
measures text fluency, and I-Measure (Felice and
Briscoe, 2015), which calculates weighted accu-
racy of both error detection and correction.
5.1 Human Judgments Annotation
In order to evaluate the correlation of several GEC
metrics with human judgments, we collected an-
notations of the original erroneous sentences, the
manually corrected gold references, and automatic
corrections made by five GEC systems described
in Section 4. We used the hybrid partial ranking
with scalars (Sakaguchi and Van Durme, 2018),
in which the annotators judged the sentences
on a scale from 0–10 (from ungrammatical to
correct).12 The sentences were evaluated with
respect to the context of the document. In total,
three annotators judged 1 100 documents, sampled
from the test set comprising about 4 300 original
sentences and about 15 500 unique corrected vari-
ants and gold references of the sentences. The
annotators annotated 127 documents jointly and
the rest was annotated by a single annotator. This
annotation process took about 170 hours. To-
gether with the model training, data preparation,
and management of the annotation process, our
rough estimation is about 300+ man-hours for the
correlation analysis per corpus (language).
5.2 Agreement in Human Judgments
For the agreement in human judgments, we report
the Pearson correlation and Spearman's rank correlation
coefficient between 3 human judgments
of 5 automatic sentence corrections at the system
and sentence level. At the sentence level, the
correlation of the judgments about the 5 sentence
corrections is calculated for each sentence
and each pair of the three annotators. The final
sentence-level annotator agreement is the mean of
these values over all sentences.
At the system level, the annotators' judgments
for each system are averaged over the sentences,
and the correlation of these averaged judgments is
computed for each pair of the three annotators. In
order to obtain smoother estimates (especially for
Spearman's ρ), we utilize bootstrap resampling
with 100 samples of the test set.
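A sketch of the system-level computation under these definitions, using scipy for the correlation coefficients, is shown below; the data layout is illustrative and the paper's own implementation may differ in details.

import numpy as np
from scipy.stats import pearsonr, spearmanr

def system_level_agreement(scores, n_boot=100, seed=0):
    """scores[annotator][system] -> 1-D numpy array of judgments (0-10),
    one value per judged sentence, in the same sentence order everywhere."""
    rng = np.random.default_rng(seed)
    annotators = list(scores)
    systems = list(scores[annotators[0]])
    n_sent = len(scores[annotators[0]][systems[0]])
    rs, rhos = [], []
    for _ in range(n_boot):
        idx = rng.integers(0, n_sent, size=n_sent)  # bootstrap resample of sentences
        for i, a in enumerate(annotators):
            for b in annotators[i + 1:]:
                # Average each system's judgments over the resampled sentences.
                sys_a = [scores[a][s][idx].mean() for s in systems]
                sys_b = [scores[b][s][idx].mean() for s in systems]
                rs.append(pearsonr(sys_a, sys_b)[0])
                rhos.append(spearmanr(sys_a, sys_b)[0])
    return float(np.mean(rs)), float(np.mean(rhos))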
The human judgments agreement across do-
mains is shown in Table 9. On the sentence level,
the human judgments correlation is high on the
least erroneous domain Native Formal, implying
that it is easier to judge the corrections in a low
error density setting, and it is more difficult in
high error density domains, such as Romani and
Second Learners (compare error rates in Table 5).
5.3 Metrics Correlations with Judgments
Following Napoles et al. (2019), we provide a
meta-evaluation of the following common GEC
metrics robustness on our corpus:
• MaxMatch (M2) (Dahlmeier and Ng, 2012)
12Recent work (Sakaguchi and Van Durme, 2018;
Novikova et al., 2018) found partial ranking with scalars
to be more reliable than direct assessment framework used
by WMT (Bojar et al., 2016) and earlier GEC evaluation
approaches (Grundkiewicz et al., 2015; Napoles et al., 2015).
                    Sentence level      System level
Domain              r       ρ           r       ρ
Native Formal       87.13   88.76       92.01   92.52
Native Web Inf.     80.23   81.47       95.33   91.80
Romani              86.57   86.57       88.73   85.90
Second Learners     78.50   79.97       96.50   97.23
Whole Dataset       79.07   80.40       96.11   95.54

Table 9: Human judgments agreement: Pearson
(r) and Spearman (ρ) mean correlation between
3 human judgments of 5 sentence versions at
sentence- and system-level.
• ERRANT (Bryant et al., 2017)
• GLEU (Napoles et al., 2015)
• I-measure (Felice and Briscoe, 2015)
Moreover, we vary the relative weight of recall and
precision by ranging β from 0 to 2.0 for the M2-scorer
and ERRANT, as Grundkiewicz et al. (2015)
report that the standard choice of considering
precision two times as important as recall may be
sub-optimal.
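For reference, both scorers report an F_β measure computed from edit-level precision P and recall R, where β controls the weight of recall relative to precision, so that β = 0.5 values precision twice as much as recall:

F_\beta = \frac{(1 + \beta^2)\, P\, R}{\beta^2 P + R}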
While we considered both sentence-level and
system-level evaluation in Section 5.2, the au-
tomatic metrics should by design be used on a
whole corpus, leaving us with only system-level
evaluation. Given that the GEC systems perform
differently on the individual domains (as indicated
by Table 7), we perform the correlation compu-
tation on each domain separately and report
the average.
For a given domain and metric, we compute
the correlation between the automatic metric eval-
uations of the five systems on one side and
the (average of) human judgments on the other
side. In order to obtain a smoother estimate of
Spearman’s ρ and also to estimate standard devia-
tions, we employ bootstrap resampling again, with
100 samples.
The results are presented in Table 10. While
Spearman’s ρ has more straightforward interpre-
tation, it also has a much higher variance, because
it harshly penalizes the differences in the ranking
of systems with similar performance (namely, AG
finetuned and Joint GEC+NMT in our case). This
fact has previously been observed by Macháček
and Bojar (2013).
Therefore, we choose the most suitable GEC
metric for our GECCC dataset according to Pearson
r, which implies that M2_0.5 and ERRANT0.5 are
Figure 2: Left: System-level Pearson correlation coefficient r between human annotation and the M2_β scorer
for various values of β. Right: The same correlation for ERRANT_β.
                 System level
Metric           r               ρ
GLEU             97.37 ± 1.52    92.28 ± 6.19
I-measure        95.37 ± 2.16    98.66 ± 3.21
M2_0.2           96.25 ± 1.71    93.27 ± 9.45
M2_0.5           98.28 ± 1.03    97.77 ± 4.27
M2_1.0           95.62 ± 1.81    93.22 ± 4.30
ERRANT0.2        94.66 ± 2.44    91.19 ± 4.76
ERRANT0.5        98.28 ± 1.04    98.35 ± 4.81
ERRANT1.0        95.70 ± 1.80    93.61 ± 4.47

Table 10: System-level Pearson (r) and Spearman
(ρ) correlation between the automatic metric
scores and human annotations.
the metrics most correlating with human judg-
ments. Of those two, we prefer the M2_0.5 score,
not due to its marginal superiority in correlation
(Table 10), but rather because it is much more
language-agnostic compared to ERRANT, which
requires a POS tagger, lemmatizer, morphological
dictionary, and language-specific rules.
Our results confirm that both M2-scorer and
ERRANT with β = 0.5 (chosen only by intu-
ition for the CoNLL 2014 Shared task; Ng et al.,
2014) correlate much better with human judg-
ments, compared to β = 0.2 and β = 1. The
detailed plots of correlations of the M2_β score and
the ERRANT_β score with human judgments for β
ranging between 0 and 2, presented in Figure 2,
show that optimal β in our case lies between 0.4
and 0.5. However, we opt to employ the widely
used β = 0.5 because of its prevalence and be-
cause the difference to the optimal β is marginal.
Our results are distinct from the results of
Grundkiewicz et al. (2015), where β = 0.18
correlates best on the CoNLL 14 test set. Nev-
ertheless, Napoles et al. (2019) demonstrate that
β = 0.5 correlates slightly better than β = 0.2
on the FCE dataset, but that β = 0.2 correlates
substantially better than β = 0.5 on Wikipedia
and also on Yahoo discussions (a dataset contain-
ing paragraphs of Yahoo! Answers, which are in-
formal user answers to other users’ questions).
In the latter work, Napoles et al. (2019) propose
that the larger β = 0.5 correlates better on datasets
with higher error rates and vice versa, given that
the FCE dataset has 20.2% token error rate, com-
pared to the error rates of 9.8% and 10.5% of
Wikipedia and Yahoo, respectively. The hypothe-
sis seems to extend to our results and the results of
Grundkiewicz et al. (2015), considering that the
GECCC dataset and the CoNLL 14 test set have
token error rates of 18.2% and 8.2%, respectively.
5.4 GEC Systems Results
Table 7 presents both human scores for the GEC
systems described in Section 4 and also results
obtained by the chosen M2_0.5 metric. The results are
presented both for the individual domains and for the
entire dataset. Measured over the entire dataset,
human judgments and the M2-scorer rank the
systems identically.
Judged by the human annotators, all systems
are better than the ‘‘do nothing’’ baseline (the
Original) measured over the entire dataset, al-
though Korektor makes harmful changes in two
domains: Native Formal and Native Web Infor-
mal. These two domains contain frequent named
entities, which upon an eager change disturb the
meaning of a sentence, leading to severe penal-
ization by human annotators. Korektor is also not
capable of deleting, inserting, splitting or joining
tokens. The fact that Korektor sometimes per-
forms detrimental changes cannot be revealed by
the M 2-scorer as it assigns zero score to the Orig-
inal baseline and does not allow negative scores.
The human judgments confirm that there is
still a large gap between the optimal Reference
score and the best performing models. Regarding
the domains, the neural models in the finetuned
mode that had access to data from all domains
seemed to improve the results consistently across
each domain. However, given the fact that the
source sentences in the Second Learners domain
received the worst scores by human annotators,
this domain seems to hold the greatest potential
for future improvements.
6 Conclusions
We release a new Czech GEC corpus,
the
Grammar Error Correction Corpus for Czech
(GECCC). This large corpus with 83 058 sen-
tences covers four diverse domains, including
essays written by native students, informal website
texts, essays written by Romani ethnic minor-
ity children and teenagers and essays written by
non-native speakers. All domains are profession-
ally annotated for GEC errors in a unified man-
ner, and errors were automatically categorized
with a Czech-specific version of ERRANT re-
leased at https://github.com/ufal/errant_czech. We compare several strong Czech GEC
systems, and finally, we provide a meta-evaluation
of common GEC metrics across domains in our
data. We conclude that M2 and ERRANT scores
with β = 0.5 are the measures most correlating
with human judgments on our dataset, and we
choose the M2_0.5 as the preferred metric for the
GECCC dataset. The corpus is publicly available
under the CC BY-SA 4.0 license at http://hdl.handle.net/11234/1-4639.
Acknowledgments
This work has been supported by the Grant
Agency of the Czech Republic, project EX-
PRO LUSyD (GX20-16819X). This research was
also partially supported by SVV project num-
ber 260 575 and GAUK 578218 of the Charles
University. The work described herein has been
supported by and has been using language re-
sources stored by the LINDAT/CLARIAH-CZ
Research Infrastructure (https://lindat.cz) of
the Ministry of Education, Youth and Sports of
the Czech Republic (project no. LM2018101).
This work was supported by the European Re-
gional Development Fund project ‘‘Creativity
and Adaptability as Conditions of the Success
of Europe in an Interrelated World’’ (reg. no.:
CZ.02.1.01/0.0/0.0/16 019/0000734).
We would also like to thank the reviewers
and the TACL action editor for their thoughtful
comments, which helped to improve this work.
References
Leonard E. Baum and Ted Petrie. 1966. Statistical
inference for probabilistic functions of finite
state Markov chains. The Annals of Mathe-
matical Statistics, 37(6):1554–1563. https://
doi.org/10.1214/aoms/1177699147
Ondˇrej Bojar, Rajen Chatterjee, Christian
Federmann, Yvette Graham, Barry Haddow,
Matthias Huck, Antonio Jimeno Yepes, Philipp
Koehn, Varvara Logacheva, Christof Monz,
Matteo Negri, Aur´elie N´ev´eol, Mariana Neves,
Martin Popel, Matt Post, Raphael Rubino,
Carolina Scarton, Lucia Specia, Marco Turchi,
Karin Verspoor, and Marcos Zampieri. 2016.
Findings of the 2016 conference on machine
translation. In Proceedings of the First Confer-
ence on Machine Translation: Volume 2, Shared
Task Papers, pages 131–198, Berlin, Germany.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/W16
-2301
Máša Bořkovcová. 2007. Romský etnolekt češtiny. Signeta, Praha.
Máša Bořkovcová. 2017. Romský etnolekt češtiny. In Petr Karlík, Marek Nekula, and Jana Pleskalová, editors, Nový encyklopedický slovník češtiny. Nakladatelství Lidové noviny.
Adriane Boyd. 2018. Using Wikipedia edits in
low resource grammatical error correction. In
Proceedings of the 4th Workshop on Noisy
User-generated Text. Association for Compu-
tational Linguistics. https://doi.org/10
.18653/v1/W18-6111
Adriane Boyd, Jirka Hana, Lionel Nicolas,
Detmar Meurers, Katrin Wisniewski, Andrea
Abel, Karin Schöne, Barbora Štindlová, and
Chiara Vettori. 2014. The MERLIN corpus:
Learner language and the CEFR. In Proceed-
ings of the Ninth International Conference on
Language Resources and Evaluation (LREC’
14), pages 1281–1288, Reykjavik,
Iceland.
European Language Resources Association
(ELRA).
Christopher Bryant, Mariano Felice, Øistein E. Andersen, and Ted Briscoe. 2019. The BEA-2019 shared task on grammatical error correction. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 52–75, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/W19-4406
Christopher Bryant, Mariano Felice, and Ted
Briscoe. 2017. Automatic annotation and eval-
uation of error types for grammatical error
correction. In Proceedings of the 55th Annual
Meeting of
the Association for Computa-
tional Linguistics (Volume 1: Long Papers),
pages 793–805, Vancouver, Canada. Associa-
tion for Computational Linguistics. https://
doi.org/10.18653/v1/P17-1074
Christopher Bryant and Hwee Tou Ng. 2015.
How far are we from fully automatic high
quality grammatical error correction? In Pro-
ceedings of
the 53rd Annual Meeting of
the Association for Computational Linguistics
and the 7th International Joint Conference
on Natural Language Processing (Volume 1:
Long Papers), pages 697–707, Beijing, China.
Association for Computational Linguistics.
https://doi.org/10.3115/v1/P15-1068
Yo Joong Choe, Jiyeon Ham, Kyubyong Park,
and Yeoil Yoon. 2019. A neural grammatical
error correction system built on better pre-
training and sequential
transfer learning. In
Proceedings of the Fourteenth Workshop on
Innovative Use of NLP for Building Educa-
tional Applications, pages 213–227, Florence,
Italy. Association for Computational Linguistics.
https://doi.org/10.18653/v1/W19
-4423
Shamil Chollampatt, Weiqi Wang, and Hwee Tou
Ng. 2019. Cross-sentence grammatical error
correction. In Proceedings of the 57th Annual
Meeting of the Association for Computational
Linguistics, pages 435–445. https://doi
.org/10.18653/v1/P19-1042
Teodor-Mihai Cotet, Stefan Ruseti, and Mihai
Dascalu. 2020. Neural grammatical error cor-
rection for Romanian. In 2020 IEEE 32nd In-
ternational Conference on Tools with Artificial
Intelligence (ICTAI), pages 625–631.
Daniel Dahlmeier and Hwee Tou Ng. 2012. Better
evaluation for grammatical error correction. In
Proceedings of the 2012 Conference of the
North American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, pages 568–572, Montr´eal, Canada.
Association for Computational Linguistics.
Daniel Dahlmeier, Hwee Tou Ng, and Siew Mei
Wu. 2013. Building a large annotated cor-
pus of learner English: The NUS corpus of
learner English. In Proceedings of the Eighth
Workshop on Innovative Use of NLP for Build-
ing Educational Applications, pages 22–31,
Atlanta, Georgia. Association for Computa-
tional Linguistics.
Vidas Daudaravicius, Rafael E. Banchs, Elena
Volodina, and Courtney Napoles. 2016. A re-
port on the automatic evaluation of scientific
writing shared task. In Proceedings of the
11th Workshop on Innovative Use of NLP for
Building Educational Applications, pages 53–62,
San Diego, CA. Association for Computational
Linguistics. https://doi.org/10.18653
/v1/W16-0506
Sam Davidson, Aaron Yamada,
Paloma
Fernandez Mira, Agustina Carando, Claudia
H. Sanchez Gutierrez, and Kenji Sagae. 2020.
Developing NLP tools with a new corpus of
learner Spanish. In Proceedings of the 12th Lan-
guage Resources and Evaluation Conference,
pages 7238–7243, Marseille, France. European
Language Resources Association.
Michael P. Fay and Dean A. Follmann. 2002.
Designing Monte Carlo implementations of
permutation or bootstrap hypothesis tests. The
American Statistician, 56(1):63–70. https://
doi.org/10.1198/000313002753631385
Mariano Felice and Ted Briscoe. 2015. Towards
a standard evaluation method for grammatical
error detection and correction. In Proceedings
of the 2015 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
pages 578–587, Denver, Colorado. Association
for Computational Linguistics. https://doi
.org/10.3115/v1/N15-1060
Simon Flachs, Ophélie Lacroix, Helen Yannakoudakis, Marek Rei, and Anders Søgaard. 2020. Grammatical error correction in low error density domains: A new benchmark and analyses. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8467–8478, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.680
G. David Forney. 1973. The Viterbi algorithm.
Proceedings of the IEEE, 61(3):268–278.
Axel Gandy. 2009. Sequential
implementation
of Monte Carlo tests with uniformly bounded
resampling risk. Journal of
the American
Statistical Association, 104(488):1504–1511.
https://doi.org/10.1198/jasa.2009
.tm08368
Sylvianne Granger. 1998. Learner English on Computer, chapter The computer learner corpus: A versatile new source of data for SLA research. Addison Wesley Longman, London & New York.
Roman Grundkiewicz and Marcin Junczys-
Dowmunt. 2019. Minimally-augmented gram-
matical error correction. In Proceedings of the
5th Workshop on Noisy User-generated Text
(W-NUT 2019), pages 357–363, Hong Kong,
China. Association for Computational Lin-
guistics. https://doi.org/10.18653/v1
/D19-5546
Roman Grundkiewicz, Marcin Junczys-Dowmunt, and Edward Gillian. 2015. Human evaluation of grammatical error correction systems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 461–470, Lisbon, Portugal. Association for Computational Linguistics. https://doi.org/10.18653/v1/D15-1052
Ivan Habernal, Tomáš Ptáček, and Josef Steinberger. 2013a. Facebook data for sentiment analysis. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.
Ivan Habernal, Tomáš Ptáček, and Josef Steinberger. 2013b. Sentiment analysis in Czech social media using supervised machine learning. In Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 65–74, Atlanta, Georgia. Association for Computational Linguistics.
Jan Hajič, Jaroslava Hlaváčová, Marie Mikulová, Milan Straka, and Barbora Štěpánková. 2020. MorfFlex CZ 2.0. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.
Maarten Janssen. 2016. TEITOK: Text-faithful annotated corpora. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 4037–4043, Paris, France. European Language Resources Association (ELRA).
Tom Kocmi, Martin Popel, and Ondřej Bojar. 2020. Announcing CzEng 2.0 parallel corpus with over 2 gigawords. CoRR, abs/2007.03006v1.
Jared Lichtarge, Chris Alberti, and Shankar
Kumar. 2020. Data weighted training strategies
for grammatical error correction. Transac-
tions of
the Association for Computational
Linguistics, 8:634–646. https://doi.org
/10.1162/tacl_a_00336
Jared Lichtarge, Chris Alberti, Shankar Kumar,
Noam Shazeer, Niki Parmar, and Simon Tong.
2019. Corpora generation for grammatical er-
ror correction. In Proceedings of
the 2019
Conference of the North American Chapter
of the Association for Computational Linguis-
tics: Human Language Technologies, Volume 1
(Long and Short Papers), pages 3291–3301,
Minneapolis, Minnesota. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/N19-1333
Matouš Macháček and Ondřej Bojar. 2013. Re-
sults of the WMT13 metrics shared task. In
Proceedings of the Eighth Workshop on Sta-
tistical Machine Translation, pages 45–51,
Sofia, Bulgaria. Association for Computational
Linguistics.
Jakub Náplava, Martin Popel, Milan Straka, and Jana Straková. 2021. Understanding model
robustness to user-generated noisy texts. In Proceedings of the Seventh Workshop on
Noisy User-generated Text (W-NUT 2021),
pages 340–350, Online. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/2021.wnut-1.38
Jakub Náplava and Milan Straka. 2019. Grammatical error correction in low-resource
scenarios. In Proceedings of the 5th Workshop
on Noisy User-generated Text (W-NUT 2019),
pages 346–356, Stroudsburg, PA, USA. Associ-
ation for Computational Linguistics. https://
doi.org/10.18653/v1/D19-5545
Courtney Napoles, Maria N˘adejde, and Joel
Tetreault. 2019. Enabling robust grammati-
cal error correction in new domains: Data
sets, metrics, and analyses. Transactions of
the Association for Computational Linguistics,
7:551–566. https://doi.org/10.1162
/tacl_a_00282
Courtney Napoles, Keisuke Sakaguchi, Matt Post,
and Joel Tetreault. 2015. Ground truth for
grammatical error correction metrics. In Pro-
ceedings of the 53rd Annual Meeting of the
Association for Computational Linguistics and
the 7th International Joint Conference on Nat-
ural Language Processing (Volume 2: Short
Papers), pages 588–593, Beijing, China. Associa-
tion for Computational Linguistics. https://
doi.org/10.3115/v1/P15-2097
Courtney Napoles, Keisuke Sakaguchi, and Joel
Tetreault. 2017. JFLEG: A fluency corpus and
benchmark for grammatical error correction.
In Proceedings of the 15th Conference of the
European Chapter of the Association for Com-
putational Linguistics: Volume 2, Short Papers,
pages 229–234, Valencia, Spain. Association for Computational Linguistics. https://doi
.org/10.18653/v1/E17-2037
Hwee Tou Ng, Siew Mei Wu, Ted Briscoe,
Christian Hadiwinoto, Raymond Hendy
Susanto, and Christopher Bryant. 2014. The
CoNLL-2014 shared task on grammatical error
correction. In Proceedings of the Eighteenth
Conference on Computational Natural Lan-
guage Learning: Shared Task, pages 1–14,
Baltimore, Maryland. Association for Compu-
tational Linguistics.
Hwee Tou Ng, Siew Mei Wu, Yuanbin Wu, Christian Hadiwinoto, and Joel Tetreault. 2013. The CoNLL-2013 shared task on grammatical error correction. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task, pages 1–12, Sofia, Bulgaria. Association for Computational Linguistics.
Joakim Nivre, Marie-Catherine de Marneffe, Filip
Ginter, Jan Hajič, Christopher D. Manning,
Sampo Pyysalo, Sebastian Schuster, Francis
Tyers, and Daniel Zeman. 2020. Universal
dependencies v2: An evergrowing multilin-
gual treebank collection. In Proceedings of the
12th Language Resources and Evaluation Con-
ference, pages 4034–4043, Marseille, France.
European Language Resources Association.
Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. 2018. RankME: Reliable human ratings for natural language generation. In Proceedings of the 2018 Conference of the North
American Chapter of the Association for Com-
putational Linguistics: Human Language Tech-
nologies, Volume 2 (Short Papers), pages 72–78,
New Orleans, Louisiana. Association for Com-
putational Linguistics.
Kostiantyn Omelianchuk, Vitaliy Atrasevych, Artem Chernodub, and Oleksandr Skurzhanskyi. 2020. GECToR – grammatical error correction: Tag, not rewrite. In Proceedings of the Fifteenth Workshop on Innovative Use of
NLP for Building Educational Applications,
pages 163–170, Seattle, WA, USA, Online.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020
.bea-1.16
Michal Richter, Pavel Straňák, and Alexandr
Rosen. 2012. Korektor – A system for contex-
tual spell-checking and diacritics completion.
In Proceedings of COLING 2012: Posters,
pages 1019–1028, Mumbai, India. The COL-
ING 2012 Organizing Committee.
Alexandr Rosen, Jiří Hana, Barbora Hladká, Tomáš Jelínek, Svatava Škodová, and Barbora Štindlová. 2020. Compiling and annotating a
learner corpus for a morphologically rich lan-
guage – CzeSL, a corpus of non-native Czech.
Karolinum, Charles University Press, Praha.
Sascha Rothe, Jonathan Mallinson, Eric Malmi,
Sebastian Krause, and Aliaksei Severyn. 2021.
A simple recipe for multilingual grammatical
error correction. In Proceedings of the 59th
Annual Meeting of the Association for Com-
putational Linguistics and the 11th Interna-
tional Joint Conference on Natural Language
Processing (Volume 2: Short Papers), pages 702–707, Online. Association for Computational Linguistics.
Alla Rozovskaya and Dan Roth. 2010. Annotat-
ing ESL errors: Challenges and rewards. In
Proceedings of the NAACL HLT 2010 Fifth
Workshop on Innovative Use of NLP for Build-
ing Educational Applications, pages 28–36,
Los Angeles, California. Association for Com-
putational Linguistics.
Alla Rozovskaya and Dan Roth. 2019. Grammar error correction in morphologically rich languages: The case of Russian. Transactions of the Association for Computational Linguistics, 7:1–17. https://doi.org/10.1162/tacl_a_00251
Keisuke Sakaguchi, Courtney Napoles, Matt Post, and Joel Tetreault. 2016. Reassessing the goals of grammatical error correction: Fluency instead of grammaticality. Transactions of the Association for Computational Linguistics, 4:169–182. https://doi.org/10.1162/tacl_a_00091
Keisuke Sakaguchi and Benjamin Van Durme. 2018. Efficient online scalar annotation with bounded support. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 208–218, Melbourne, Australia. Association for Computational Linguistics. https://doi.org/10.18653/v1/P18-1020
Karel Šebesta. 2010. Korpusy češtiny a osvojování jazyka [Corpora of Czech and language acquisition]. Studie z aplikované lingvistiky/Studies in Applied Linguistics, 1:11–34.
Milan Straka, Jan Hajič, and Jana Straková. 2016. UDPipe: Trainable pipeline for processing CoNLL-U files performing tokenization, morphological analysis, POS tagging and parsing. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 4290–4297, Portorož, Slovenia. European Language Resources Association (ELRA).
Oleksiy Syvokon and Olena Nahorna. 2021. UA-GEC: Grammatical error correction and fluency corpus for the Ukrainian language. CoRR, abs/2103.16997v1.
Toshikazu Tajiri, Mamoru Komachi, and Yuji Matsumoto. 2012. Tense and aspect error correction for ESL learners using global context. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 198–202, Jeju Island, Korea. Association for Computational Linguistics.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
Shuyao Xu, Jiehao Zhang, Jin Chen, and Long
Qin. 2019. Erroneous data generation for
grammatical error correction. In Proceedings
of the Fourteenth Workshop on Innovative Use
of NLP for Building Educational Applications,
pages 149–158, Florence, Italy. Association for
Computational Linguistics.
Helen Yannakoudakis, Øistein Andersen,
Ardeshir Geranpayeh, Ted Briscoe, and Diane
Nicholls. 2018. Developing an automated writ-
ing placement system for ESL learners. Applied
Measurement in Education, 31. https://doi
.org/10.1080/08957347.2018.1464447
Helen Yannakoudakis, Ted Briscoe, and Ben
Medlock. 2011. A new dataset and method for
automatically grading ESOL texts. In Pro-
ceedings of the 49th Annual Meeting of the
Association for Computational Linguistics: Hu-
man Language Technologies, pages 180–189,
Portland, Oregon, USA. Association for Com-
putational Linguistics.
Zheng Yuan and Christopher Bryant. 2021.
Document-level grammatical error correction.
In Proceedings of the 16th Workshop on In-
novative Use of NLP for Building Educational
Applications, pages 75–84.