Czech Grammar Error Correction with a Large and Diverse Corpus

Jakub Náplava†  Milan Straka†  Jana Straková†  Alexandr Rosen‡

†Charles University, Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics, Czech Republic
{naplava,straka,strakova}@ufal.mff.cuni.cz

‡Charles University, Faculty of Arts
Institute of Theoretical and Computational Linguistics, Czech Republic
alexandr.rosen@ff.cuni.cz

Abstract

We introduce a large and diverse Czech cor-
pus annotated for grammatical error correction
(GEC) with the aim to contribute to the still
scarce data resources in this domain for lan-
guages other than English. The Grammar
Error Correction Corpus for Czech (GECCC)
offers a variety of four domains, covering error
distributions ranging from high error density
essays written by non-native speakers, to web-
site texts, where errors are expected to be much
less common. We compare several Czech GEC
systems, including several Transformer-based
ones, setting a strong baseline for future research. Finally, we meta-evaluate common
GEC metrics against human judgments on our
data. We make the new Czech GEC corpus
publicly available under the CC BY-SA 4.0 li-
cense at http://hdl.handle.net/11234/1-4639.

1 Introduction

Representative data both in terms of size and
domain coverage are vital for NLP systems devel-
opment. However, in the field of grammar error
correction (GEC), most GEC corpora are limited
to corrections of mistakes made by foreign or
second language learners even in the case of En-
glish (Tajiri et al., 2012; Dahlmeier et al., 2013;
Yannakoudakis et al., 2011, 2018; Ng et al., 2014;
Napoles et al., 2017). At the same time, as recently
pointed out by Flachs et al. (2020), learner cor-
pora are only a part of the full spectrum of GEC
applications. To alleviate the skewed perspective,
the authors released a corpus of website texts.

Despite recent efforts aimed to mitigate the
notorious shortage of national GEC-annotated cor-
pora (Boyd, 2018; Rozovskaya and Roth, 2019;
Davidson et al., 2020; Syvokon and Nahorna, 2021; Cotet et al., 2020; Náplava and Straka,
2019), the lack of adequate data is even more
acute in languages other than English. We aim to
address both the issue of scarcity of non-English
data and the ubiquitous need for broad domain
coverage by presenting a new, large and diverse
Czech corpus, expertly annotated for GEC.

Grammar Error Correction Corpus for Czech
(GECCC) includes texts from multiple domains
in a total of 83 058 sentences, being, to our knowl-
edge, the largest non-English GEC corpus, as well
as being one of the largest GEC corpora overall.

In order to represent a diversity of writing
styles and origins, besides essays of both native
and non-native speakers from Czech learner cor-
pora, we also scraped website texts to complement
the learner domain with supposedly lower error
density texts, encompassing a representation of
the following four domains:

• Native Formal – essays written by native students of elementary and secondary schools

• Native Web Informal – informal website discussions

• Romani – essays written by children and teenagers of the Romani ethnic minority

• Second Learners – essays written by non-native learners

Using the presented data, we compare several
state-of-the-art Czech GEC systems, including several Transformer-based ones.

Finally, we conduct a meta-evaluation of GEC
metrics against human judgments to select the
most appropriate metric for evaluating corrections
on the new dataset. The analysis is performed
across domains, in line with Napoles et al. (2019).

Transactions of the Association for Computational Linguistics, vol. 10, pp. 452–467, 2022. https://doi.org/10.1162/tacl_a_00470
Action Editor: Alice Oh. Submission batch: 6/2021; Revision batch: 11/2021; Published 4/2022.
© 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Language  | Corpus          | Sentences | Err. r. | Domain                                      | # Refs.
English   | Lang-8          | 1 147 451 | 14.1%   | SL                                          | 1
English   | NUCLE           | 57 151    | 6.6%    | SL                                          | 1
English   | FCE             | 33 236    | 11.5%   | SL                                          | 1
English   | W&I+LOCNESS     | 43 169    | 11.8%   | SL, native students                         | 5
English   | CoNLL-2014 test | 1 312     | 8.2%    | SL                                          | 2, 10, 8
English   | JFLEG           | 1 511     | —       | SL                                          | 4
English   | GMEG            | 6 000     | —       | web, formal articles, SL                    | 4
English   | AESW            | over 1M   | —       | scientific writing                          | 1
English   | CWEB            | 13 574    | ∼2%     | web                                         | 2
Czech     | AKCES-GEC       | 47 371    | 21.4%   | SL essays, Romani ethnolect of Czech        | 2
German    | Falko-MERLIN    | 24 077    | 16.8%   | SL essays                                   | 1
Russian   | RULEC-GEC       | 12 480    | 6.4%    | SL, heritage speakers                       | 1
Spanish   | COWS-L2H        | 12 336    | —       | SL, heritage speakers                       | 2
Ukrainian | UA-GEC          | 20 715    | 7.1%    | natives/SL, translations and personal texts | 2
Romanian  | RONACC          | 10 119    | —       | native speakers transcriptions              | 1

Table 1: Comparison of GEC corpora in size, token error rate, domain, and number of reference
annotations in the test portion. SL = second language learners.

Our contributions include (i) a large and di-
verse Czech GEC corpus, covering learner cor-
pora and website texts, with unified and, in some
domains, completely new GEC annotations, (ii)
a comparison of Czech GEC systems, and (iii)
a meta-evaluation of common GEC metrics
against human judgment on the released corpus.

2 Related Work

2.1 Grammar Error Correction Corpora

Until recently, attention has been focused mostly
on English, while GEC data resources for other
languages were in short supply. Here we list a
few examples of English GEC corpora, collected
mostly within an English-as-a-second-language
(ESL) paradigm. For a comparison of their rele-
vant statistics see Table 1.

Lang-8 Corpus of Learner English (Tajiri et al.,
2012) is a corpus of English language learner texts
from the Lang-8 social networking system.

NUCLE (Dahlmeier et al., 2013) consists of
essays written by undergraduate students of the
National University of Singapore.

FCE (Yannakoudakis et al., 2011) includes
short essays written by non-native learners for the
Cambridge ESOL First Certificate in English.

W&I+LOCNESS is a union of two datasets, the
W&I (Write & Improve) dataset (Yannakoudakis
et al., 2018) of non-native learners’ essays, com-

plemented by the LOCNESS corpus (Granger,
1998), a collection of essays written by native
English students.

The GEC error annotations for the learner
corpora above were distributed with the BEA-
2019 Shared Task on Grammatical Error Correc-
tion (Bryant et al., 2019).

The CoNLL-2014 shared task test set (Ng et al.,
2014) is often used for GEC systems evaluation.
This small corpus consists of 50 essays written
by 25 South-East Asian undergraduates.

JFLEG (Napoles et al., 2017) is another fre-
quently used GEC corpus with fluency edits in
addition to usual grammatical edits.

To broaden the restricted variety of domains,
focused primarily on learner essays, a CWEB col-
lection (Flachs et al., 2020) of website texts was
recently released, aiming at contributing lower
error density data.

AESW (Daudaravicius et al., 2016) is a large
corpus of scientific writing (over 1M sentences),
edited by professional editors.

Finally, Napoles et al. (2019) recently released
GMEG, a corpus for the evaluation of GEC metrics
across domains.

Grammatical error correction corpora for lan-
guages other than English are less common and—
if available—usually limited in size and domain:
German Falko-MERLIN (Boyd, 2018), Russian
RULEC-GEC (Rozovskaya and Roth, 2019),


Spanish COWS-L2H (Davidson et al., 2020),
Ukrainian UA-GEC (Syvokon and Nahorna, 2021),
and Romanian RONACC (Cotet et al., 2020).

To better account for multiple correction op-
tions, datasets often contain several reference sen-
tences for each original noisy sentence in the test
set, proposed by multiple annotators. As we can
see in Table 1, the number of annotations typically
ranges between 1 and 5 with an exception of the
CoNLL14 test set, which—on top of the official 2
reference corrections—later received 10 annota-
tions from Bryant and Ng (2015) and 8 alternative
annotations from Sakaguchi et al. (2016).

2.2 Czech Learner Corpora

By the early 2010s, Czech was one of a few
languages other than English to boast a series
of learner corpora, compiled under the umbrella
project AKCES, evoking the concept of acquisition
corpora (Šebesta, 2010).

The native section includes transcripts of
hand-written essays (SKRIPT 2012) and class-
room conversation (SCHOLA 2010) from ele-
mentary and secondary schools. Both have their
counterparts documenting the Roma ethnolect of
Czech:1 essays (ROMi 2013) and recordings and
transcripts of dialogues (ROMi 1.0).2

The non-native section goes by the name of
CzeSL, the acronym of Czech as the Second
Language. CzeSL consists of transcripts of short
hand-written essays collected from non-native
learners with various levels of proficiency and na-
tive languages, mostly students attending Czech
language courses before or during their studies at
a Czech university. There are several releases of

1The Romani ethnolect of Czech is the result of contact
with Romani as the linguistic substrate. To a lesser (and
weakening) extent the ethnolect shows some influence of
Slovak or even Hungarian, because most of its speakers have
roots in Slovakia. The ethnolect can exhibit various specifics
across all linguistic levels. However, nearly all of them
are complementary with their colloquial or standard Czech
counterparts. A short written text, devoid of phonological
properties, may be hard to distinguish from texts written by
learners without the Romani background. The only striking
exception is misspellings in contexts where the latter benefit
from more exposure to written Czech. The typical example is
the omission of word boundaries within phonological words,
e.g., between a clitic and its host. In other respects, the pattern
of error distribution in texts produced by ethnolect speakers
is closer to that of native rather than foreign learners (Bořkovcová,
2007, 2017).

2A more recent release SKRIPT 2015 includes a balanced
mix of essays from SKRIPT 2012 and ROMi 2013. For more
details and links see http://utkl.ff.cuni.cz/akces/.

CzeSL, which differ mainly in the extent and manner of text annotation (Rosen et al., 2020).3
More recently, hand-written essays have been
transcribed and annotated in TEITOK (Janssen,
2016),4 a tool combining a number of cor-
pus compilation, annotation and exploitation
functionalities.

Learner Czech is also represented in MERLIN, a
multilingual (German, Italian, and Czech) corpus
built in 2012–2014 from texts submitted as a part
of tests for language proficiency levels (Boyd
et al., 2014).5

Finally, AKCES-GEC (Náplava and Straka, 2019) is a GEC corpus for Czech created from
a subset of the above-mentioned AKCES resources (Šebesta, 2010): the CzeSL-man corpus
(non-native Czech learners with manual annota-
tion) and a part of the ROMi corpus (speakers of
the Romani ethnolect).

Compared to the AKCES-GEC, the new GECCC corpus contains much more data (47 371 vs. 83 058 sentences, respectively), by
extending data in the existing domains and also
adding two new domains: essays written by native
learners and website texts, making it the largest
non-English GEC corpus and one of the largest
GEC corpora overall.

3 Annotation

3.1 Data Selection

We draw the original uncorrected data from
the following Czech learner corpora or Czech
websites:

• Native Formal – essays written by native stu-
dents of elementary and secondary schools
from the SKRIPT 2012 learner corpus,
compiled in the AKCES project

• Native Web Informal – newly annotated
informal website discussions from Czech
Facebook Dataset (Habernal et al., 2013a,b)
and Czech news site novinky.cz.

• Romani – essays written by children and
teenagers of the Romani ethnic minority from
the ROMi corpus of the AKCES project and
the ROMi section of the AKCES-GEC corpus

3For a list of CzeSL corpora with their sizes and annotation

details see http://utkl.ff.cuni.cz/learncorp/.

4http://www.teitok.org.
5https://www.merlin-platform.eu.


Dataset        | Documents | Selected
AKCES-GEC-test | 188       | 188
AKCES-GEC-dev  | 195       | 195
MERLIN         | 441       | 385
Novinky.cz     | —         | 2 695
Facebook       | 10 000    | 3 850
SKRIPT 2012    | 394       | 167
ROMi           | 1 529     | 218

Table 2: Data resources for the new Czech GEC
corpus. The second column (Selected) shows the
size of the selected subset from all available
documents (first column, Documents).

• Second Learners – essays written by non-
native learners, from the Foreigners section
of the AKCES-GEC corpus, and the MERLIN
corpus

Since we draw our data from several Czech corpora originally created in different tools with
different annotation schemes and instructions, we re-annotated the errors in a unified manner for the
entire development and test set and partially also for the training set.

The data split was carefully designed to main-
tain representativeness, coverage and backwards
compatibility. Specifically, (i) test and develop-
ment data contain roughly the same amount of
annotated data from all domains, (ii) original
AKCES-GEC dataset splits remain unchanged,
and (iii) additional available detailed annotations
such as user proficiency level in MERLIN were
leveraged to support the split balance. Overall,
the main objective was to achieve representative coverage of the development and testing data. Table 2
presents the sizes of data resources in the num-
ber of documents. The first column (Documents)
shows the number of all available documents
collected in an initial scan. The second column
(Selected) is a selected subset from the available
documents, due to budgetary constraints and to
achieve a representative sample over all domains
and data portions. The relatively higher number of
documents selected for the Native Web Informal
domain is due to its substantially shorter texts,
yielding fewer sentences; also, we needed to pop-
ulate this part of the corpus as a completely new
domain with no previously annotated data.

To achieve more fine-grained balancing of the
splits, we used additional metadata where avail-
able: user’s proficiency levels and origin language
from MERLIN and the age group from AKCES.
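The paper does not prescribe a particular splitting tool; purely as an illustration of how such metadata can drive a balanced split, the following minimal sketch stratifies documents on a combined key (the document dictionaries and field names such as `proficiency` are hypothetical):

```python
# Illustrative sketch (not the authors' actual procedure): split documents into
# train/dev/test while keeping metadata such as proficiency level balanced.
from sklearn.model_selection import train_test_split

def stratified_document_split(documents, dev_size=0.1, test_size=0.1, seed=42):
    """documents: list of dicts with hypothetical 'domain'/'proficiency' fields."""
    # Combine the metadata fields that should be balanced across splits.
    strata = [f"{d.get('domain')}|{d.get('proficiency', 'NA')}" for d in documents]

    train_docs, rest_docs, _, rest_strata = train_test_split(
        documents, strata, test_size=dev_size + test_size,
        stratify=strata, random_state=seed)
    # Split the held-out part into dev and test halves, again stratified.
    dev_docs, test_docs = train_test_split(
        rest_docs, test_size=test_size / (dev_size + test_size),
        stratify=rest_strata, random_state=seed)
    return train_docs, dev_docs, test_docs
```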

3.2 Preprocessing

De/tokenization is an important part of data pre-
processing in grammar error correction. Some
formats, such as the M2 format (Dahlmeier and
Ng, 2012), require tokenized formats to track and
evaluate correction edits. On the other hand, deto-
kenized text in its natural form is required for other
applications. We therefore release our corpus in
two formats: a tokenized M2 format and deto-
kenized format aligned at sentence, paragraph, and
document level. As part of our data is drawn from the earlier tokenized GEC corpora AKCES-GEC
and MERLIN, this data had to be detokenized. A
slightly modified Moses detokenizer6 is attached
to the corpus. To tokenize the data for the M2
format, we use the UDPipe tokenizer (Straka
et al., 2016).
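As an illustration of this pipeline, the sketch below tokenizes text with the UDPipe Python bindings (ufal.udpipe) and detokenizes with the sacremoses package standing in for the slightly modified Moses detokenizer attached to the corpus; the model file name follows footnote 10, and the behavior of the released detokenizer script may differ in detail:

```python
# Sketch of the de/tokenization step, assuming the ufal.udpipe bindings and
# sacremoses as a stand-in for the modified Moses detokenizer.
from ufal.udpipe import Model, Pipeline, ProcessingError
from sacremoses import MosesDetokenizer

model = Model.load("czech-pdt-ud-2.5-191206.udpipe")   # adjust the path as needed
tokenizer = Pipeline(model, "tokenize", Pipeline.NONE, Pipeline.NONE, "horizontal")
detok = MosesDetokenizer(lang="cs")

def tokenize(text: str) -> list[list[str]]:
    """Tokenize raw text into sentences of tokens (as needed for the M2 format)."""
    error = ProcessingError()
    output = tokenizer.process(text, error)
    if error.occurred():
        raise RuntimeError(error.message)
    return [line.split() for line in output.splitlines() if line]

def detokenize(tokens: list[str]) -> str:
    """Produce the natural, detokenized form of a tokenized sentence."""
    return detok.detokenize(tokens)
```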

3.3 Annotation

The test and development sets in all domains
were annotated from scratch by five in-house ex-
pert annotators,7 including re-annotations of the
development and test data of the earlier GEC cor-
pora to achieve a unified annotation style. All the
test sentences were annotated by two annotators;
one half of the development sentences received
two annotations and the second half one annota-
tion. The annotation process took about 350 hours
in total.

The annotation instructions were unified across
all domains: The corrected text must not contain
any grammatical or spelling errors and should
sound fluent. Fluency edits are allowed if the
original is incoherent. The entire document was
given as a context for the annotation. Annotators
were instructed to remove documents that were
too incomprehensible or those containing private
information.

To keep the annotation process simple for the
annotators, the sentences were annotated (cor-
rected) in a text editor and postprocessed auto-
matically to retrieve and categorize the GEC edits

6https://github.com/moses-smt/mosesdecoder
/blob/master/scripts/tokenizer/detokenizer.perl.
7Our annotators are senior undergraduate students of
humanities, regularly employed for various annotation efforts
at our institute.


by the ERRor ANnotation Toolkit (ERRANT)
(Bryant et al., 2017).

3.4 Train Data

The first source of the training data are the portions of SKRIPT 2012, the MERLIN corpus, and
the AKCES-GEC train set that were not re-annotated and thus retain their original annotations.
These data cover the Native Formal, the Romani, and the
Second Learners domains. The second part of the
training data are newly annotated data. Specifi-
cally, these are all Native Web Informal data and
also a small part in the Second Learners domain.
All data in the training set were annotated with
one annotation.

3.5 Corpus Alignment

The majority of models proposed for grammatical error correction operate over sentences.
However, preliminary studies on document-level
grammatical error correction have recently appeared
(Chollampatt et al., 2019; Yuan and Bryant, 2021).
These models were shown to benefit from larger
context, as certain errors, such as errors in articles or tense choice, cannot be resolved within a single sentence. To
simplify future work with our dataset, we release
three alignment levels: (i) sentence-level, (ii)
paragraph-level, and (iii) document-level. Given
that the state-of-the-art grammatical error correc-
tion systems still operate on sentence level despite
the initial attempts with document-level systems,
we perform model training and evaluation at the
usual sentence level.8

3.6 Inter-Annotator Agreement

As suggested by Rozovskaya and Roth (2010),
followed later by Rozovskaya and Roth (2019)
and Syvokon and Nahorna (2021), we evaluate
inter-annotator agreement by asking a second an-
notator to judge the need for a correction in a
sentence already annotated by someone else, in a
single-blind setting as to the status of the sentence
(corrected/uncorrected).9 Five annotators anno-
tated the first pass and three annotators judged
the sentence correctness in the second pass. In

8Note that even if human evaluation in Section 5 is per-
formed on sentence-aligned data, human annotators process
whole documents, and thus take the full context into account.
9A sentence-level agreement on sentence correctness
is generally preferred in GEC annotations to an exact
inter-annotator match on token edits, since different series of
corrections may possibly lead to a correct sentence (Bryant
and Ng, 2015).

First →
Second ↓
A1
A2
A3

A1

A2

A3

A4

A5

— 93.39

97.96
84.43 — 95.91
68.80

89.63
90.18
87.68 — 79.39

72.50
78.15
57.50

Table 3: Inter-annotator agreement based on
second-pass judgments. Numbers represent per-
centage of sentences judged correct in second-
pass proofreading. Five annotators annotated the
first pass, and three annotators judged the sen-
tence correctness in the second pass.

Error Type | Subtype | Example
POS (15)   |         | tažené → řízené
           | :INFL   | manželka → manželkou
MORPH      |         | maj → mají
ORTH       | :CASING | usa → USA
           | :WSPACE | přes to → přesto
SPELL      |         | ochtnat → ochutnat
WO         |         | plná jsou → jsou plná
           | :SPELL  | blískajé zeleně → zeleně blýskají
QUOTATION  |         | " → „
DIACR      |         | tiskarna → tiskárna
OTHER      |         | sem → jsem ho

Table 4: Czech ERRANT error types.

the second pass, each of the three annotators
judged a disjoint set of 120 sentences. Table 3
summarizes the inter-annotator agreement based
on second-pass judgments: The numbers represent
the percentage of sentences judged correct in the
second pass.

Both the average and the standard deviation
(82.96 ± 12.12) of our inter-annotator agreement
are similar to inter-annotator agreement measured
on English (63 ± 18.46, Rozovskaya and Roth
2010), Russian (80 ± 16.26, Rozovskaya and Roth
2019), and Ukrainian (69.5 ± 7.78 Syvokon and
Nahorna 2021).
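A minimal sketch of the agreement computation described above, assuming the second-pass judgments are available as per-annotator-pair lists of boolean verdicts (this data layout is our assumption):

```python
# For each (second-pass, first-pass) annotator pair, the percentage of already
# corrected sentences that the second annotator judged as correct, plus the
# overall mean and standard deviation reported in the text.
from statistics import mean, stdev

def agreement_table(judgments):
    """judgments: dict mapping annotator pairs to lists of booleans
    (True = corrected sentence judged correct in the second pass)."""
    table = {pair: 100 * mean(flags) for pair, flags in judgments.items()}
    values = list(table.values())
    return table, mean(values), stdev(values)

# Toy example with hypothetical judgments.
example = {("A1", "A2"): [True, True, False, True],
           ("A2", "A1"): [True, False, True, True]}
table, avg, sd = agreement_table(example)
print(table, f"{avg:.2f} ± {sd:.2f}")
```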

3.7 Error Type Analysis

To retrieve and categorize the correction edits
from the erroneous-corrected sentence pairs, ER-
Ror ANnotation Toolkit (ERRANT) (Bryant et al.,
2017) was used. Inspired by Boyd (2018), we
adapted the original English error types to the
Czech language. For the resulting set see Table 4.
The POS error types are based on the UD POS
tags (Nivre et al., 2020) and may contain an op-
tional :INFL subtype when the original and the
corrected words share a common lemma. The
word-order error type was extended by an optional


                    | Sentence-aligned #sentences | Paragraph-aligned #paragraphs | Doc-aligned #docs     | Error Rate
                    | Train    Dev     Test       | Train    Dev     Test         | Train   Dev    Test   |
Native Formal       | 4 060    1 952   1 684      | 1 618    859     669          | 227     87     76     | 5.81%
Native Web Informal | 6 977    2 465   2 166      | 3 622    1 294   1 256        | 3 619   1 291  1 256  | 15.61%
Romani              | 24 824   1 254   1 260      | 9 723    574     561          | 3 247   173    169    | 26.21%
Second Learners     | 30 812   2 807   2 797      | 8 781    865     756          | 2 050   167    170    | 25.16%
Total               | 66 673   8 478   7 907      | 23 744   3 592   3 242        | 9 143   1 718  1 671  | 18.19%

Table 5: Corpus statistics at three alignment levels: sentence-aligned, paragraph-aligned, and doc-aligned. Average error rate was computed on the concatenation of development and test data at all three alignment levels.

:SPELL subtype to allow for capturing word or-
der errors including words with minor spelling
errors. The original orthography error type ORTH
covering both errors in casing and whitespaces
is now subtyped with :WSPACE and :CASING
to better distinguish between the two phenomena.
Finally, we add two error types specific to Czech:
DIACR for errors in either missing or redundant
diacritics and QUOTATION for wrongly used
quotation marks. Two original error types remain
unchanged: MORPH, indicating replacement of a
token by another with the same lemma but differ-
ent POS, and SPELL, indicating incorrect spelling.
For part-of-speech tagging and lemmatization
we rely on UDPipe (Straka et al., 2016).10 The
word list for detecting spelling errors comes from
MorfFlex (Hajič et al., 2020).11

We release the Czech ERRANT at https://github.com/ufal/errant_czech. We assume
that it is applicable to other languages with a
similar set of errors, especially Slavic languages, if
lemmatizer, tagger, and morphological dictionary
are available.
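For illustration, the upstream ERRANT Python interface extracts and classifies edits as sketched below; we assume the Czech fork exposes a similar entry point (loading a "cs" annotator is hypothetical and shown here only by analogy with the English original):

```python
# Sketch of extracting and categorizing edits with ERRANT (Bryant et al., 2017).
# The upstream English API is shown; the Czech fork at github.com/ufal/errant_czech
# is assumed to offer a comparable interface.
import errant

annotator = errant.load("en")                     # e.g., "cs" with the Czech fork
orig = annotator.parse("This are a example .")
cor = annotator.parse("This is an example .")

for edit in annotator.annotate(orig, cor):
    # Each edit carries token offsets, the original/corrected strings, and an
    # error type such as R:VERB:SVA (or POS:INFL, DIACR, ... in the Czech version).
    print(edit.o_start, edit.o_end, edit.o_str, "->", edit.c_str, edit.type)
```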

3.8 Final Dataset

The final corpus consists of 83 058 sentences and
is distributed in two formats: the tokenized M2
format (Dahlmeier and Ng, 2012) and the deto-
kenized format with alignments at the sentence,
paragraph, and document levels. Although the
detokenized format does not include correction
edits, it does retain full information about the
original spacing.

The statistics of the final dataset are presented in
Table 5. The individual domains are balanced on

10Using the czech-pdt-ud-2.5-191206.udpipe model.
11We also use the aggressive variant of the stemmer from https://research.variancia.com/czech_stemmer/.

the sentence level in the development and testing
sets, each of them containing about 8 000 sen-
tences. The number of paragraphs and documents
varies: on average, the Native Web Informal do-
main contains fewer than 2 sentences per document,
while the Native Formal domain contains more than 20.

As expected, the domains differ also in the error
rate, that is, the proportion of erroneous tokens
(see Table 5). The students’ essays in the Native
Formal domain are almost 3 times less erroneous
than any other domain, while in the Romani and
Second Learners domain, approximately every
fourth token is incorrect.
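The error rate can be estimated directly from the released M2 files. The sketch below counts a source token as erroneous when some gold edit covers it; this is one plausible reading of "proportion of erroneous tokens", not necessarily the exact procedure behind Table 5:

```python
# Estimate the token error rate from an M2 file: each block has an "S" source
# line followed by "A start end|||type|||correction|||...|||annotator" edit lines.
def token_error_rate(m2_path, annotator_id="0"):
    """Proportion of source tokens covered by at least one gold edit."""
    total_tokens, erroneous_tokens = 0, 0
    with open(m2_path, encoding="utf-8") as f:
        for block in f.read().strip().split("\n\n"):
            lines = block.splitlines()
            tokens = lines[0][2:].split()              # strip the "S " prefix
            covered = set()
            for line in lines[1:]:
                if not line.startswith("A "):
                    continue
                span, etype, *rest = line[2:].split("|||")
                if etype == "noop" or rest[-1] != annotator_id:
                    continue
                start, end = map(int, span.split())
                covered.update(range(start, end))      # insertions cover no tokens
            total_tokens += len(tokens)
            erroneous_tokens += len(covered)
    return erroneous_tokens / total_tokens if total_tokens else 0.0

# e.g., token_error_rate("geccc_test.m2")  -- file name is illustrative
```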

Furthermore, the prevalence of error types dif-
fers for each individual domain. The 10 most
common error types in each domain are pre-
sented in Figure 1. Overall, errors in punctuation
(PUNCT) constitute the most common error type.
They are the most common error in three domains,
although their relative frequency varies. We fur-
ther estimated that of these errors, 9% (Native
Formal) to 27% (Native Web Informal) are unin-
teresting from the linguistic perspective, as they
are only omissions of the formal sentence ending, probably purposeful in the case of Native Web
Informal. The rest (75–91%) appears within a sentence, most of which (35–68% Native Formal) is
a misplaced comma: In Czech, the syntactic status of
finite clauses strictly determines the use of commas in the sentence. Finally, in 5–7% of all
punctuation errors, a correction included joining
two sentences or splitting a sentence into two sen-
tences. Errors in either missing or wrongly used
diacritics (DIACR), spelling errors (SPELL), and
errors in orthography (ORTH) are also common,
with varying frequency across domains.



Figure 1: Distribution of top-10 ERRANT error types per domain in the development set.

System                     | Params | English W&I+L | English CoNLL 14 | Czech AKCES-GEC | German Falko-Merlin | Russian RULEC-GEC
Boyd (2018)                | —      | —             | —                | —               | 45.22               | —
Choe et al. (2019)         | —      | 63.05         | —                | —               | —                   | —
Lichtarge et al. (2019)    | —      | —             | 56.8             | —               | —                   | —
Lichtarge et al. (2020)    | —      | 66.5          | 62.1             | —               | —                   | —
Omelianchuk et al. (2020)  | —      | 72.4          | 65.3             | —               | —                   | —
Rothe et al. (2021) base   | 580M   | 60.2          | 54.10            | 71.88           | 69.21               | 26.24
Rothe et al. (2021) xxl    | 13B    | 69.83         | 65.65            | 83.15           | 75.96               | 51.62
Rozovskaya and Roth (2019) | —      | —             | —                | —               | —                   | 21.00
Xu et al. (2019)           | —      | 63.94         | 60.90            | —               | —                   | —
AG finetuned               | 210M   | 69.00         | 63.40            | 80.17           | 73.71               | 50.20

Table 6: Comparison of selected single-model systems on English (W&I+L, CoNLL-2014), Czech
(AKCES-GEC), German (Falko-Merlin GEC), and Russian (RULEC-GEC) datasets. Our reimplementation of the AG finetuned model is from Náplava and Straka (2019). Note that models vastly differ in
training/fine-tuning data and size (e.g., Rothe et al. (2021) xxl is 50 times larger than AG finetuned).

Compared to the AKCES-GEC corpus, the Grammar Error Correction Corpus for Czech contains
more than 3 times as many sentences in the development and test sets, more than 50% more sentences in
the training set, and also two new domains.

To the best of our knowledge, the newly intro-
duced GECCC dataset is the largest among GEC
corpora in languages other than English and it
is surpassed in size only by the English Lang-8
and AESW datasets. With the exclusion of these
two datasets, the GECCC dataset contains more
sentences than any other GEC corpus currently
known to us.

4 Model

In this section, we describe five systems for auto-
matic error correction in Czech and analyze their
performance on the new dataset. Four of these
systems represent previously published Czech
work (Richter et al., 2012; Náplava and Straka,
2019; Náplava et al., 2021) and one is our new
implementation. The first system is a pre-neural ap-
proach, published and available for Czech (Richter

et al., 2012), included for historical reasons as a
previously known and available Czech GEC tool;
the following four systems represent the current
state of the art in GEC: They are all neural network
architectures based on Transformers, differing in
the training procedure, training data, or training
objective. A comparison of systems, trained and
evaluated on English, Czech, German, and Rus-
sian, with state of the art, is given in Table 6.

4.1 Models

We experiment with the following models:

Korektor (Richter et al., 2012) is a pre-neural
statistical spellchecker and (occasional) grammar
checker. It uses the noisy channel approach with
a candidate model that for each word suggests its
variants up to a predefined edit distance. Internally, a hidden Markov model (Baum and Petrie,
1966) is built. Its hidden states are the variants
of words proposed by the candidate model, and
the transition costs are determined from three
N-gram language models built over word forms,
lemmas, and part-of-speech tags. To find an optimal correction, the Viterbi algorithm (Forney, 1973) is used.
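For intuition, a toy noisy-channel decoder in this spirit can be written in a few lines; the candidate generator, error model, and the single bigram language model below are placeholders for Korektor's actual components (which combine three N-gram models over forms, lemmas, and tags):

```python
# Schematic noisy-channel decoding in the spirit of Korektor: candidate
# corrections are the hidden states, a bigram LM scores transitions, an error
# model scores emissions, and Viterbi selects the cheapest path.
import math

def viterbi_correct(words, candidates, lm_bigram, error_model):
    """words: observed tokens; candidates(w) -> correction variants of w;
    lm_bigram(prev, cur) and error_model(observed, variant) -> probabilities."""
    best = {"<s>": (0.0, [])}           # variant -> (neg log prob, path so far)
    for w in words:
        new_best = {}
        for cand in candidates(w):
            emit = -math.log(error_model(w, cand))
            # Extend every surviving path with this candidate, keep the cheapest.
            cost, path = min(
                (prev_cost - math.log(lm_bigram(prev, cand)) + emit, prev_path)
                for prev, (prev_cost, prev_path) in best.items())
            new_best[cand] = (cost, path + [cand])
        best = new_best
    return min(best.values())[1]

# Toy usage with placeholder models (all probabilities are made up).
corrected = viterbi_correct(
    ["kdyz", "prsi"],
    candidates=lambda w: {"kdyz": ["když", "kdyz"], "prsi": ["prší", "prsi"]}[w],
    lm_bigram=lambda p, c: 0.9 if c in ("když", "prší") else 0.1,
    error_model=lambda o, c: 0.6 if o != c else 0.4)
print(corrected)  # -> ['když', 'prší']
```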


Synthetic trained (Náplava and Straka, 2019) is a neural-based Transformer model that is trained to translate the original ungrammatical text to a well-formed text. The original Transformer model (Vaswani et al., 2017) is regularized with an additional source and target word dropout, and the training objective is modified to focus on tokens that should change (Grundkiewicz and Junczys-Dowmunt, 2019). As the amount of existing annotated data is small, an unsupervised approach with a spelling dictionary is used to generate a large amount of synthetic training data. The model is trained solely on these synthetic data.

AKCES-GEC (AG) finetuned (Náplava and Straka, 2019) is based on Synthetic trained, but finetunes its weights on a mixture of synthetic and authentic data from the AKCES-GEC corpus, namely, on data from the Romani and Second Learners domains. See Table 6 for a comparison with the state of the art in English, Czech, German, and Russian.

GECCC finetuned uses the same architecture as Synthetic trained, but we finetune its weights on a mixture of synthetic and (much larger) authentic data from the newly released GECCC corpus. We use the official code of Náplava and Straka (2019) with the default settings and mix the synthetic and new authentic data in a ratio of 2:1.

Joint GEC+NMT (Náplava et al., 2021) is a Transformer model trained in a multi-task setting. It pursues two objectives: (i) to correct Czech and English texts; (ii) to translate the noised Czech texts into English texts and the noised English texts into Czech texts. The source data come from the CzEng v2.0 corpus (Kocmi et al., 2020) and were noised using a statistical system, KaziText (Náplava et al., 2021), that tries to model several of the most frequently occurring errors, such as diacritics, spelling, or word ordering. The statistics of the Czech noise were estimated on the new training set; therefore, the system was indirectly trained also on data from the Native Formal and Native Web Informal domains, unlike the AG finetuned system. The statistics of the English noise were estimated on NUCLE (Dahlmeier et al., 2013), FCE (Yannakoudakis et al., 2011), and W&I+LOCNESS (Yannakoudakis et al., 2018; Granger, 1998).

System            | M2 0.5-score                      | Mean human score
                  | NF     NWI    R      SL     Σ     | NF    NWI   R     SL    Σ
Original          | —      —      —      —      —     | 8.47  7.99  7.76  7.18  7.61
Korektor          | 28.99  31.51  46.77  55.93  45.09 | 8.26  7.60  7.90  7.55  7.63
Synthetic trained | 46.83  38.63  46.36  62.20  53.07 | 8.55  7.99  8.10  7.88  7.98
AG finetuned      | 65.77  55.20  69.71  71.41  68.08 | 8.97  8.22  8.35  8.91  8.38
GECCC finetuned   | 72.50  71.09  72.23  73.21  72.96 | 9.19  8.72  8.91  8.67  8.74
Joint GEC+NMT     | 68.14  66.64  65.21  70.43  67.40 | 9.06  8.37  8.69  8.19  8.35
Reference         | —      —      —      —      —     | 9.58  9.48  9.60  9.63  9.57

Table 7: Mean score of human judgments and M2 0.5-score for each system and domain (NF = Native Formal, NWI = Native Web Informal, R = Romani, SL = Second Learners, Σ = whole dataset). All results on the whole dataset (the Σ column) are statistically significant with p-value < 0.001, except for the AG finetuned and Joint GEC+NMT systems, where the p-value is less than 6.2% for the M2 0.5 score and less than 4.3% for the human score, using the Monte Carlo permutation test with 10M samples and probability of error at most 10−6 (Fay and Follmann, 2002; Gandy, 2009).

4.2 Results and Analysis

Table 7 summarizes the evaluation of the five grammar error correction systems (described in the previous Section 4.1), evaluated with the highest-correlating and widely used metric, the M2 score with β = 0.5, denoted as M2 0.5 (left), and with human judgments (right). For the meta-evaluation of GEC metrics against human judgments, see the following Section 5.

Clearly, learning on GEC annotated data improves performance significantly, as evidenced by a giant leap between the systems without GEC data (Korektor, Synthetic trained) and the systems trained on GEC data (AG finetuned, GECCC finetuned, and Joint GEC+NMT). Further addition
1 0 1 1 6 2 / t l a c _ a _ 0 0 4 7 0 2 0 0 8 0 5 0 / / t l a c _ a _ 0 0 4 7 0 p d . f b y g u e s t t o n 0 9 S e p e m b e r 2 0 2 3 Error Type DIACR MORPH ORTH:CASING ORTH:WSPACE OTHER POS POS:INFL PUNCT QUOTATION SPELL WO # 3 617 610 1 058 385 3 719 2 735 1 276 4 709 223 1 816 662 P 86.84 73.58 81.60 64.44 23.59 56.50 74.47 71.42 89.44 77.27 60.00 R 88.77 55.91 55.15 74.36 20.04 22.12 48.22 61.17 61.06 75.76 29.89 F0.5 87.22 69.20 74.46 66.21 22.78 43.10 67.16 69.10 81.83 76.96 49.94 Table 8: Analysis of GECCC finetuned model performance on individual error types. For this analysis, all POS-error types were merged into a single error type POS. of GEC data volume and domains is statisti- cally significantly better (p < 0.001), as the only difference between AG finetuned and GECCC finetuned systems is that the former uses the AKCES-GEC corpus, while the latter is trained on larger and domain-richer GECCC. Access to larger data and more domains in the multi-task setting is useful (compare Joint GEC+NMT and AG finetuned on newly added Native Formal and Native Web Informal domains), although direct training seems superior (GECCC finetuned over Joint GEC+NMT). We further analyze the best model (GECCC finetuned) and inspect its performance with re- spect to individual error types. For simpler anal- ysis, we grouped all POS-related errors into two error types: POS and POS:INFL for words that are erroneous only in inflection and share the same lemma with their correction. As we can see in Table 8, the model is very good at correcting local errors in diacritics (DIACR), quotation (QUOTATION), spelling (SPELL), and casing (ORTH:CASING). Unsurprisingly, small changes are easier than longer edits; similarly, the system is better in inflection corrections (POS: INFL, words with the same lemma) than on POS (correction involves finding a word with a dif- ferent lemma). Should the word be split or joined with an adjacent word, the model does so with a relatively high success rate (ORTH:WSPACE). The model is also able to correctly reorder words (WO), but here its recall is rather low. The model performs worst on errors categorized as OTHER, which includes edits that often require rewriting larger pieces of text. Generally, the model has higher precision than recall, which suits the needs of standard GEC, where proposing a bad correction for a good text is worse than being inert to an existing error. 5 Meta-evaluation of Metrics There are several automatic metrics used for evaluating system performance on GEC dataset, although it is not clear which of them is prefera- ble in terms of high correlation with human judg- ments on our dataset. The most popular GEC metrics are the Max- Match (M2) scorer (Dahlmeier and Ng, 2012) and the ERRANT scorer (Bryant et al., 2017). The MaxMatch (M2) scorer reports the F-score over the optimal phrasal alignment between a source sentence and a system hypothesis reaching the highest overlap with the gold standard anno- tation. It was used as the official metric for the CoNLL 2013 and 2014 Shared Tasks (Ng et al., 2013, 2014) and is also used on various other datasets such as the German Falko-MERLIN GEC (Boyd, 2018) or Russian RULEC-GEC (Rozovskaya and Roth, 2019). The ERRANT scorer was used as the official metric of the recent Building Educational Appli- cation 2019 Shared Task on GEC (Bryant et al., 2019). The ERRANT scorer also contains a set of rules operating over a set of linguistic annota- tions to construct the alignment and extract indi- vidual edits. 
Other popular automatic metrics are the Gen- eral Language Evaluation Understanding (GLEU) metric (Napoles et al., 2015), which additionally measures text fluency, and I-Measure (Felice and Briscoe, 2015), which calculates weighted accu- racy of both error detection and correction. 5.1 Human Judgments Annotation In order to evaluate the correlation of several GEC metrics with human judgments, we collected an- notations of the original erroneous sentences, the manually corrected gold references, and automatic corrections made by five GEC systems described in Section 4. We used the hybrid partial ranking with scalars (Sakaguchi and Van Durme, 2018), in which the annotators judged the sentences on a scale from 0–10 (from ungrammatical to 460 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 7 0 2 0 0 8 0 5 0 / / t l a c _ a _ 0 0 4 7 0 p d . f b y g u e s t t o n 0 9 S e p e m b e r 2 0 2 3 correct).12 The sentences were evaluated with respect to the context of the document. In total, three annotators judged 1 100 documents, sampled from the test set comprising about 4 300 original sentences and about 15 500 unique corrected vari- ants and gold references of the sentences. The annotators annotated 127 documents jointly and the rest was annotated by a single annotator. This annotation process took about 170 hours. To- gether with the model training, data preparation, and management of the annotation process, our rough estimation is about 300+ man-hours for the correlation analysis per corpus (language). 5.2 Agreement in Human Judgments For the agreement in human judgments, we report the Pearson correlation and Spearman’s rank cor- relation coefficient between 3 human judgments of 5 automatic sentence corrections at the system- and sentence-level. At the correlation of the judgments about the 5 sen- tence corrections is calculated for each sentence and each pair of the three annotators. The final sentence-level annotator agreement is the mean of these values over all sentences. the sentence level, At the system level, the annotators’ judgments for each system are averaged over the sentences, and the correlation of these averaged judgments is computed for each pair of the three annotators. In order to obtain smoother estimates (especially for Spearman’s ρ), we utilize bootstrap resampling with 100 samples of a test set. The human judgments agreement across do- mains is shown in Table 9. On the sentence level, the human judgments correlation is high on the least erroneous domain Native Formal, implying that it is easier to judge the corrections in a low error density setting, and it is more difficult in high error density domains, such as Romani and Second Learners (compare error rates in Table 5). 5.3 Metrics Correlations with Judgments Following Napoles et al. (2019), we provide a meta-evaluation of the following common GEC metrics robustness on our corpus: • MaxMatch (M2) (Dahlmeier and Ng, 2012) 12Recent work (Sakaguchi and Van Durme, 2018; Novikova et al., 2018) found partial ranking with scalars to be more reliable than direct assessment framework used by WMT (Bojar et al., 2016) and earlier GEC evaluation approaches (Grundkiewicz et al., 2015; Napoles et al., 2015). Domain Native Formal Native Web Inf. 
Romani Second Learners Whole Dataset Sentence level System level r 87.13 80.23 86.57 78.50 79.07 ρ 88.76 81.47 86.57 79.97 80.40 r 92.01 95.33 88.73 96.50 96.11 ρ 92.52 91.80 85.90 97.23 95.54 Table 9: Human judgments agreement: Pearson (r) and Spearman (ρ) mean correlation between 3 human judgments of 5 sentence versions at sentence- and system-level. • ERRANT (Bryant et al., 2017) • GLEU (Napoles et al., 2015) • I-measure (Felice and Briscoe, 2015) Moreover, we vary the proportion of recall and precision, ranging from 0 to 2.0 for M2-scorer and ERRANT, as Grundkiewicz et al. (2015) report that the standard choice of considering precision two times as important as recall may be sub-optimal. While we considered both sentence-level and system-level evaluation in Section 5.2, the au- tomatic metrics should by design be used on a whole corpus, leaving us with only system-level evaluation. Given that the GEC systems perform differently on the individual domains (as indicated by Table 7), we perform the correlation compu- tation on each domain separately and report the average. For a given domain and metric, we compute the correlation between the automatic metric eval- uations of the five systems on one side and the (average of) human judgments on the other side. In order to obtain a smoother estimate of Spearman’s ρ and also to estimate standard devia- tions, we employ bootstrap resampling again, with 100 samples. The results are presented in Table 10. While Spearman’s ρ has more straightforward interpre- tation, it also has a much higher variance, because it harshly penalizes the differences in the ranking of systems with similar performance (namely, AG finetuned and Joint GEC+NMT in our case). This fact has previously been observed by Mach´aˇcek and Bojar (2013). Therefore, we choose the most suitable GEC metric for our GECCC dataset according to Pearson r, which implies that M2 0.5 and ERRANT0.5 are 461 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 7 0 2 0 0 8 0 5 0 / / t l a c _ a _ 0 0 4 7 0 p d . f b y g u e s t t o n 0 9 S e p e m b e r 2 0 2 3 Figure 2: Left: System-level Pearson correlation coefficient r between human annotation and M2 various values of β. Right: The same correlation for ERRANTβ. β-scorer for Metric GLEU I-measure M2 0.2 M2 0.5 M2 1.0 ERRANT0.2 ERRANT0.5 ERRANT1.0 System level r 97.37 ± 1.52 95.37 ± 2.16 96.25 ± 1.71 98.28 ± 1.03 95.62 ± 1.81 94.66 ± 2.44 98.28 ± 1.04 95.70 ± 1.80 ρ 92.28 ± 6.19 98.66 ± 3.21 93.27 ± 9.45 97.77 ± 4.27 93.22 ± 4.30 91.19 ± 4.76 98.35 ± 4.81 93.61 ± 4.47 Table 10: System-level Pearson (r) and Spearman (ρ) correlation between the automatic metric scores and human annotations. the metrics most correlating with human judg- ments. Of those two, we prefer the M2 0.5 score, not due to its marginal superiority in correlation (Table 10), but rather because it is much more language-agnostic compared to ERRANT, which requires a POS tagger, lemmatizer, morphological dictionary, and language-specific rules. Our results confirm that both M2-scorer and ERRANT with β = 0.5 (chosen only by intu- ition for the CoNLL 2014 Shared task; Ng et al., 2014) correlate much better with human judg- ments, compared to β = 0.2 and β = 1. The detailed plots of correlations of M2 β score and ERRANTβ score with human judgments for β ranging between 0 and 2, presented in Figure 2, show that optimal β in our case lies between 0.4 and 0.5. 
However, we opt to employ the widely used β = 0.5 because of its prevalence and be- cause the difference to the optimal β is marginal. Our results are distinct from the results of Grundkiewicz et al. (2015), where β = 0.18 correlates best on the CoNLL 14 test set. Nev- ertheless, Napoles et al. (2019) demonstrate that β = 0.5 correlates slightly better than β = 0.2 on the FCE dataset, but that β = 0.2 correlates substantially better than β = 0.5 on Wikipedia and also on Yahoo discussions (a dataset contain- ing paragraphs of Yahoo! Answers, which are in- formal user answers to other users’ questions). In the latter work, Napoles et al. (2019) propose that larger β = 0.5 correlate better on datasets with higher error rate and vice versa, given that the FCE dataset has 20.2% token error rate, com- pared to the error rates of 9.8% and 10.5% of Wikipedia and Yahoo, respectively. The hypothe- sis seems to extend to our results and the results of Grundkiewicz et al. (2015), considering that the GECCC dataset and the CoNLL 14 test set have token error rates of 18.2% and 8.2%, respectively. 5.4 GEC Systems Results Table 7 presents both human scores for the GEC systems described in Section 4 and also results obtained by the chosen M2 0.5 metric. The results are presented both on the individual domains and the entire dataset. Measuring over the entire dataset, human judgments and the M 2-scorer rank the systems in accordance. Judged by the human annotators, all systems are better than the ‘‘do nothing’’ baseline (the Original) measured over the entire dataset, al- though Korektor makes harmful changes in two domains: Native Formal and Native Web Infor- mal. These two domains contain frequent named entities, which upon an eager change disturb the meaning of a sentence, leading to severe penal- ization by human annotators. Korektor is also not capable of deleting, inserting, splitting or joining 462 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 7 0 2 0 0 8 0 5 0 / / t l a c _ a _ 0 0 4 7 0 p d . f b y g u e s t t o n 0 9 S e p e m b e r 2 0 2 3 tokens. The fact that Korektor sometimes per- forms detrimental changes cannot be revealed by the M 2-scorer as it assigns zero score to the Orig- inal baseline and does not allow negative scores. The human judgments confirm that there is still a large gap between the optimal Reference score and the best performing models. Regarding the domains, the neural models in the finetuned mode that had access to data from all domains seemed to improve the results consistently across each domain. However, given the fact that the source sentences in the Second Learners domain received the worst scores by human annotators, this domain seems to hold the greatest potential for future improvements. 6 Conclusions We release a new Czech GEC corpus, the Grammar Error Correction Corpus for Czech (GECCC). This large corpus with 83 058 sen- tences covers four diverse domains, including essays written by native students, informal website texts, essays written by Romani ethnic minor- ity children and teenagers and essays written by non-native speakers. All domains are profession- ally annotated for GEC errors in a unified man- ner, and errors were automatically categorized with a Czech-specific version of ERRANT re- leased at https://github.com/ufal/errant czech. 
We compare several strong Czech GEC systems, and finally, we provide a meta-evaluation of common GEC metrics across domains in our data. We conclude that M2 and ERRANT scores with β = 0.5 are the measures most correlating with human judgments on our dataset, and we choose the M2 0.5 as the preferred metric for the GECCC dataset. The corpus is publicly available under the CC BY-SA 4.0 license at http://hdl .handle.net/11234/1-4639. Acknowledgments This work has been supported by the Grant Agency of the Czech Republic, project EX- PRO LUSyD (GX20-16819X). This research was also partially supported by SVV project num- ber 260 575 and GAUK 578218 of the Charles University. The work described herein has been supported by and has been using language re- sources stored by the LINDAT/CLARIAH-CZ Research Infrastructure (https://lindat.cz) of the Ministry of Education, Youth and Sports of the Czech Republic (project no. LM2018101). This work was supported by the European Re- gional Development Fund project ‘‘Creativity and Adaptability as Conditions of the Success of Europe in an Interrelated World’’ (reg. no.: CZ.02.1.01/0.0/0.0/16 019/0000734). We would also like to thank the reviewers and the TACL action editor for their thoughtful comments, which helped to improve this work. References Leonard E. Baum and Ted Petrie. 1966. Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathe- matical Statistics, 37(6):1554–1563. https:// doi.org/10.1214/aoms/1177699147 Ondˇrej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aur´elie N´ev´eol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. 2016. Findings of the 2016 conference on machine translation. In Proceedings of the First Confer- ence on Machine Translation: Volume 2, Shared Task Papers, pages 131–198, Berlin, Germany. Association for Computational Linguistics. https://doi.org/10.18653/v1/W16 -2301 M´aˇsa Boˇrkovcov´a. 2007. Romsk´y etnolekt ˇceˇstiny. Signeta, Praha. M´aˇsa Boˇrkovcov´a. 2017. Romsk´y etnolekt ˇceˇstiny. In Petr Karl´yk, Marek Nekula, and Jana Pleskalov´a, editors, Nov´y encyklopedick´y slovn´ık ˇceˇstiny. Nakladatelstv´y Lidov´e Noviny. Adriane Boyd. 2018. Using Wikipedia edits in low resource grammatical error correction. In Proceedings of the 4th Workshop on Noisy User-generated Text. Association for Compu- tational Linguistics. https://doi.org/10 .18653/v1/W18-6111 Adriane Boyd, Jirka Hana, Lionel Nicolas, Detmar Meurers, Katrin Wisniewski, Andrea Abel, Karin Sch¨one, Barbora ˇStindlov´a, and 463 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 7 0 2 0 0 8 0 5 0 / / t l a c _ a _ 0 0 4 7 0 p d . f b y g u e s t t o n 0 9 S e p e m b e r 2 0 2 3 Chiara Vettori. 2014. The MERLIN corpus: Learner language and the CEFR. In Proceed- ings of the Ninth International Conference on Language Resources and Evaluation (LREC’ 14), pages 1281–1288, Reykjavik, Iceland. European Language Resources Association (ELRA). In Proceedings of Christopher Bryant, Mariano Felice, Øistein E. Andersen, and Ted Briscoe. 2019. The BEA- 2019 Shared task on grammatical error cor- the Fourteenth rection. Workshop on Innovative Use of NLP for Build- ing Educational Applications, pages 52–75, Florence, Italy. 
Association for Computational Linguistics. https://doi.org/10.18653 /v1/W19-4406 Christopher Bryant, Mariano Felice, and Ted Briscoe. 2017. Automatic annotation and eval- uation of error types for grammatical error correction. In Proceedings of the 55th Annual Meeting of the Association for Computa- tional Linguistics (Volume 1: Long Papers), pages 793–805, Vancouver, Canada. Associa- tion for Computational Linguistics. https:// doi.org/10.18653/v1/P17-1074 Christopher Bryant and Hwee Tou Ng. 2015. How far are we from fully automatic high quality grammatical error correction? In Pro- ceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 697–707, Beijing, China. Association for Computational Linguistics. https://doi.org/10.3115/v1/P15-1068 Yo Joong Choe, Jiyeon Ham, Kyubyong Park, and Yeoil Yoon. 2019. A neural grammatical error correction system built On better pre- training and sequential transfer learning. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educa- tional Applications, pages 213–227, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/W19 -4423 Shamil Chollampatt, Weiqi Wang, and Hwee Tou Ng. 2019. Cross-sentence grammatical error correction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 435–445. https://doi .org/10.18653/v1/P19-1042 Teodor-Mihai Cotet, Stefan Ruseti, and Mihai Dascalu. 2020. Neural grammatical error cor- rection for romanian. In 2020 IEEE 32nd In- ternational Conference on Tools with Artificial Intelligence (ICTAI), pages 625–631. Daniel Dahlmeier and Hwee Tou Ng. 2012. Better evaluation for grammatical error correction. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 568–572, Montr´eal, Canada. Association for Computational Linguistics. Daniel Dahlmeier, Hwee Tou Ng, and Siew Mei Wu. 2013. Building a large annotated cor- pus of learner English: The NUS corpus of learner English. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Build- ing Educational Applications, pages 22–31, Atlanta, Georgia. Association for Computa- tional Linguistics. Vidas Daudaravicius, Rafael E. Banchs, Elena Volodina, and Courtney Napoles. 2016. A re- port on the automatic evaluation of scientific the writing shared task. In Proceedings of 11th Workshop on Innovative Use of NLP for Building Educational Applications, pages 53–62, San Diego, CA. Association for Computational Linguistics. https://doi.org/10.18653 /v1/W16-0506 Sam Davidson, Aaron Yamada, Paloma Fernandez Mira, Agustina Carando, Claudia H. Sanchez Gutierrez, and Kenji Sagae. 2020. Developing NLP tools with a new corpus of learner Spanish. In Proceedings of the 12th Lan- guage Resources and Evaluation Conference, pages 7238–7243, Marseille, France. European Language Resources Association. Michael P. Fay and Dean A. Follmann. 2002. Designing Monte Carlo implementations of permutation or bootstrap hypothesis tests. The American Statistician, 56(1):63–70. https:// doi.org/10.1198/000313002753631385 Mariano Felice and Ted Briscoe. 2015. Towards a standard evaluation method for grammatical error detection and correction. 
In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational 464 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 7 0 2 0 0 8 0 5 0 / / t l a c _ a _ 0 0 4 7 0 p d . f b y g u e s t t o n 0 9 S e p e m b e r 2 0 2 3 Linguistics: Human Language Technologies, pages 578–587, Denver, Colorado. Association for Computational Linguistics. https://doi .org/10.3115/v1/N15-1060 Simon Flachs, Oph´elie Lacroix, Helen and Anders Yannakoudakis, Marek Rei, Søgaard. 2020. Grammatical error correction in low error density domains: A new benchmark and analyses. In Proceedings of the 2020 Con- ference on Empirical Methods in Natural Lan- guage Processing (EMNLP), pages 8467–8478, Online. Association for Computational Lin- guistics. https://doi.org/10.18653 /v1/2020.emnlp-main.680 G. David Forney. 1973. The viterbi algorithm. Proceedings of the IEEE, 61(3):268–278. Axel Gandy. 2009. Sequential implementation of Monte Carlo tests with uniformly bounded resampling risk. Journal of the American Statistical Association, 104(488):1504–1511. https://doi.org/10.1198/jasa.2009 .tm08368 Sylvianne Granger. 1998. Learner English on Computer, chapter. The computer learner cor- pus: A versatile new source of data for SLA research. Addison Wesley Longman, London & New York. Roman Grundkiewicz and Marcin Junczys- Dowmunt. 2019. Minimally-augmented gram- matical error correction. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pages 357–363, Hong Kong, China. Association for Computational Lin- guistics. https://doi.org/10.18653/v1 /D19-5546 Roman Grundkiewicz, Marcin Junczys- Dowmunt, and Edward Gillian. 2015. Hu- man evaluation of grammatical error correction systems. In Proceedings of the 2015 Confer- ence on Empirical Methods in Natural Lan- guage Processing, pages 461–470, Lisbon, Portugal. Association for Computational Lin- guistics. https://doi.org/10.18653/v1 /D15-1052. and Ivan Habernal, Tom´aˇs Pt´aˇcek, Josef Steinberger. 2013a. Facebook data for senti- ment analysis. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics ( ´UFAL), Faculty of Mathematics and Physics, Charles University. and Ivan Habernal, Tom´aˇs Pt´aˇcek, Josef in Steinberger. 2013b. Sentiment analysis Czech social media using supervised machine learning. In Proceedings of the 4th Workshop on Computational Approaches to Subjectiv- ity, Sentiment and Social Media Analysis, pages 65–74, Atlanta, Georgia. Association for Computational Linguistics. Jan Hajiˇc, Jaroslava Hlav´aˇcov´a, Marie Mikulov´a, Milan Straka, and Barbora ˇStˇep´ankov´a. 2020. MorfFlex CZ 2.0. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics ( ˇUFAL), Faculty of Mathematics and Physics, Charles University. In Proceedings of Maarten Janssen. 2016. TEITOK: Text-faithful the annotated corpora. Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 4037–4043, Paris, France. European Language Resources Association (ELRA). Tom Kocmi, Martin Popel, and Ondrej Bojar. 2020. Announcing CzEng 2.0 parallel corpus with over 2 gigawords. CoRR, abs/2007 .03006v1. Jared Lichtarge, Chris Alberti, and Shankar Kumar. 2020. Data weighted training strategies for grammatical error correction. Transac- tions of the Association for Computational Linguistics, 8:634–646. 
Jared Lichtarge, Chris Alberti, Shankar Kumar, Noam Shazeer, Niki Parmar, and Simon Tong. 2019. Corpora generation for grammatical error correction. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3291–3301, Minneapolis, Minnesota. Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1333

Matouš Macháček and Ondřej Bojar. 2013. Results of the WMT13 metrics shared task. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 45–51, Sofia, Bulgaria. Association for Computational Linguistics.

Jakub Náplava, Martin Popel, Milan Straka, and Jana Straková. 2021. Understanding model robustness to user-generated noisy texts. In Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021), pages 340–350, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.wnut-1.38

Jakub Náplava and Milan Straka. 2019. Grammatical error correction in low-resource scenarios. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pages 346–356, Stroudsburg, PA, USA. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-5545

Courtney Napoles, Maria Nădejde, and Joel Tetreault. 2019. Enabling robust grammatical error correction in new domains: Data sets, metrics, and analyses. Transactions of the Association for Computational Linguistics, 7:551–566. https://doi.org/10.1162/tacl_a_00282

Courtney Napoles, Keisuke Sakaguchi, Matt Post, and Joel Tetreault. 2015. Ground truth for grammatical error correction metrics. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 588–593, Beijing, China. Association for Computational Linguistics. https://doi.org/10.3115/v1/P15-2097

Courtney Napoles, Keisuke Sakaguchi, and Joel Tetreault. 2017. JFLEG: A fluency corpus and benchmark for grammatical error correction. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 229–234, Valencia, Spain. Association for Computational Linguistics. https://doi.org/10.18653/v1/E17-2037

Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto, and Christopher Bryant. 2014. The CoNLL-2014 shared task on grammatical error correction. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, pages 1–14, Baltimore, Maryland. Association for Computational Linguistics.

Hwee Tou Ng, Siew Mei Wu, Yuanbin Wu, Christian Hadiwinoto, and Joel Tetreault. 2013. The CoNLL-2013 shared task on grammatical error correction. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task, pages 1–12, Sofia, Bulgaria. Association for Computational Linguistics.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, and Daniel Zeman. 2020. Universal Dependencies v2: An evergrowing multilingual treebank collection. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4034–4043, Marseille, France. European Language Resources Association.
Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. 2018. RankME: Reliable human ratings for natural language generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 72–78, New Orleans, Louisiana. Association for Computational Linguistics.

Kostiantyn Omelianchuk, Vitaliy Atrasevych, Artem Chernodub, and Oleksandr Skurzhanskyi. 2020. GECToR – grammatical error correction: Tag, not rewrite. In Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 163–170, Seattle, WA, USA, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.bea-1.16

Michal Richter, Pavel Straňák, and Alexandr Rosen. 2012. Korektor – A system for contextual spell-checking and diacritics completion. In Proceedings of COLING 2012: Posters, pages 1019–1028, Mumbai, India. The COLING 2012 Organizing Committee.

Alexandr Rosen, Jiří Hana, Barbora Hladká, Tomáš Jelínek, Svatava Škodová, and Barbora Štindlová. 2020. Compiling and annotating a learner corpus for a morphologically rich language – CzeSL, a corpus of non-native Czech. Karolinum, Charles University Press, Praha.

Sascha Rothe, Jonathan Mallinson, Eric Malmi, Sebastian Krause, and Aliaksei Severyn. 2021. A simple recipe for multilingual grammatical error correction. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 702–707, Online. Association for Computational Linguistics.

Alla Rozovskaya and Dan Roth. 2010. Annotating ESL errors: Challenges and rewards. In Proceedings of the NAACL HLT 2010 Fifth Workshop on Innovative Use of NLP for Building Educational Applications, pages 28–36, Los Angeles, California. Association for Computational Linguistics.

Alla Rozovskaya and Dan Roth. 2019. Grammar error correction in morphologically rich languages: The case of Russian. Transactions of the Association for Computational Linguistics, 7:1–17. https://doi.org/10.1162/tacl_a_00251
Keisuke Sakaguchi, Courtney Napoles, Matt Post, and Joel Tetreault. 2016. Reassessing the goals of grammatical error correction: Fluency instead of grammaticality. Transactions of the Association for Computational Linguistics, 4:169–182. https://doi.org/10.1162/tacl_a_00091

Keisuke Sakaguchi and Benjamin Van Durme. 2018. Efficient online scalar annotation with bounded support. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 208–218, Melbourne, Australia. Association for Computational Linguistics. https://doi.org/10.18653/v1/P18-1020

Karel Šebesta. 2010. Korpusy češtiny a osvojování jazyka [Czech corpora and language acquisition]. Studie z aplikované lingvistiky / Studies in Applied Linguistics, 1:11–34.

Milan Straka, Jan Hajič, and Jana Straková. 2016. UDPipe: Trainable pipeline for processing CoNLL-U files performing tokenization, morphological analysis, POS tagging and parsing. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 4290–4297, Portorož, Slovenia. European Language Resources Association (ELRA).

Oleksiy Syvokon and Olena Nahorna. 2021. UA-GEC: Grammatical error correction and fluency corpus for the Ukrainian language. CoRR, abs/2103.16997v1.

Toshikazu Tajiri, Mamoru Komachi, and Yuji Matsumoto. 2012. Tense and aspect error correction for ESL learners using global context. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 198–202, Jeju Island, Korea. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Shuyao Xu, Jiehao Zhang, Jin Chen, and Long Qin. 2019. Erroneous data generation for grammatical error correction. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 149–158, Florence, Italy. Association for Computational Linguistics.

Helen Yannakoudakis, Øistein Andersen, Ardeshir Geranpayeh, Ted Briscoe, and Diane Nicholls. 2018. Developing an automated writing placement system for ESL learners. Applied Measurement in Education, 31. https://doi.org/10.1080/08957347.2018.1464447

Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. 2011. A new dataset and method for automatically grading ESOL texts. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 180–189, Portland, Oregon, USA. Association for Computational Linguistics.

Zheng Yuan and Christopher Bryant. 2021. Document-level grammatical error correction. In Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications, pages 75–84.