Neural OCR Post-Hoc Correction of Historical Corpora
Lijun Lyu1, Maria Koutraki1, Martin Krickl2, Besnik Fetahu1,3
1L3S Research Center, Leibniz University of Hannover / Hannover, Germany
2Austrian National Library / Vienna, Austria
3Amazon / Seattle, WA, USA
lyu@L3S.de, koutraki@L3S.de, martin.krickl@onb.ac.at, besnikf@amazon.com
Abstract
Optical character recognition (OCR) is crucial for a deeper access to historical collections. OCR needs to account for orthographic variations, typefaces, or language evolution (i.e., new letters, word spellings), as the main source of character, word, or word segmentation transcription errors. For digital corpora of historical prints, the errors are further exacerbated due to low scan quality and lack of language standardization.
For the task of OCR post-hoc correction, we propose a neural approach based on a combination of recurrent (RNN) and deep convolutional (ConvNet) networks to correct OCR transcription errors. At the character level we flexibly capture errors, and decode the corrected output based on a novel attention mechanism. Accounting for the input and output similarity, we propose a new loss function that rewards the model's correcting behavior.
Evaluation on a historical book corpus in German language shows that our models are robust in capturing diverse OCR transcription errors and reduce the word error rate of 32.3% by more than 89%.
1 Introduction
OCR is at the forefront of digitization projects for cultural heritage preservation. The main task is to identify characters from their visual form into their textual representation.
Scan quality, book layout, and visual character similarity are some of the factors that impact the output quality of OCR systems. This problem is severe for historical corpora, which is the case in this work. We deal with historical books in German language from the 16th–18th century, where characters are added or removed (e.g., long s – ſ) and word spellings change (e.g., ‘‘vnd’’ vs. ‘‘und’’), which often leads to word and character transcription errors. Figure 1 shows example pages conveying the complexity of this task.
There are several strategies to correct OCR transcription errors. Post-hoc correction is the most common setup (Dong and Smith, 2018; Xu and Smith, 2017). The input is an OCR transcribed text, and the output is its corrected version according to the error-free ground-truth transcription. For example, Dong and Smith (2018) use a multi-input attention to leverage redundancy among textual snippets for correction. Alternatively, domain specific OCR engines can be trained (Reul et al., 2018a), by using manually aligned line image segments and line text (Reul et al., 2018b). However, manually acquiring such ground-truth is highly expensive, and furthermore, typically, historical corpora do not contain redundant information. Moreover, each book has its own characteristics—typeface styles, regional and publisher's use of language, and so forth.
In this work, we propose a post-hoc approach to correct OCR transcription errors, and apply it to a historical collection of books in German language. As input we have only the OCR transcription of books from their scans, for which we output the corrected text, which we assess with respect to the ground-truth transcription carried out by human annotators without any spelling change, language normalization, or any other form of interpretation. By considering only the textual modality for our approach, we provide greater flexibility of applying our approach to historical collections where the image scans are not available. However, note that since orthography was not standardized, there can be parallel spellings of the ‘‘same’’ word (e.g., ‘‘und’’ vs. ‘‘vnd’’) within the same book, which may pose challenges for approaches that use the text modality only.
Our approach, CR, consists of an encoder-
decoder architecture. It encodes the erroneous
input text at character level, and outputs the corrected text during the decoding phase. Representation at character level is necessary given that OCR transcription errors at the most basic level are at character level. The input is encoded through a combination of RNN and deep ConvNet (LeCun et al., 1995) networks. Our encoder architecture allows us to flexibly encode the erroneous input for post-hoc correction. RNNs capture the global input context, whereas ConvNets construct local sub-word and word compound structures. During decoding the errors are corrected through an RNN decoder, which at each step through an attention mechanism combines the RNN and ConvNet representations and outputs the corrected text. Finally, since the input and output snippets are highly similar, loss functions like cross-entropy lean heavily towards rewarding copying behavior. We propose a custom loss function that rewards the model's ability to correct transcription errors.

Figure 1: Pages with coexisting typefaces (Fraktur and Antiqua), double columns, and images surrounded by texts.

In this work, we make the following contributions:

• a data collection approach with a parallel corpus of 800k sentences from 12 books (16th–18th century) in German language;

• an error analysis, emphasizing the diversity and difficulty of OCR errors;

• an approach that flexibly captures erroneous transcribed OCR textual snippets and robustly corrects character and word errors for historical corpora.

2 Related Work

Redundancy Based. The works in Lund et al. (2013), Lund et al. (2011), Xu and Smith (2017), and Lund et al. (2014) view the problem of post-hoc correction under the assumption of redundant text snippets. That is, multiple redundant text snippets are combined and under the majority voting scheme the correction is carried out. Dong and Smith (2018) propose a multi-input attention model, which uses redundant textual snippets to determine the correct transcription during the training phase. While there is redundancy for contemporary texts, this cannot be assumed in our case, where only the OCR transcriptions are available. Our approach can be seen as complementary to data augmentation techniques that exploit redundancy.

Rule Based Correction. Rule based approaches compute the edit cost between two text snippets based on weighted finite state machines (WFSM) (Brill and Moore, 2000; Dreyer et al., 2008; Wang et al., 2014; Silfverberg et al., 2016; Farra et al., 2014). WFSM require predefined rules (insertion, deletion, etc., of characters) and a lexicon, which is used to assess the transformations. The rewrite rules require the mapping to be done at the word and character level (Wang et al., 2014; Silfverberg et al., 2016). This process is expensive and prohibits learning rules at scale. Furthermore, lexicons are severely affected by out-of-vocabulary (OOV) problems, especially for historical corpora. A similar strategy is followed by Barbaresi (2016), who uses a spell checker to detect OCR errors and generate correction candidates by computing the edit distance. OCR transcription errors are highly contextual and there are no one-to-one mappings of misrecognized characters that can be addressed by rules (cf. Figure 6).

Machine Translation. Post-hoc correction can also be viewed as a special form of machine translation (Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014). For post-hoc correction of OCR transcription errors, the only reasonable representation is based on characters. This is due to the character errors and word segmentation issues, which can only be detected when encoding the input text at character level. Results from spelling correction (Xie et al., 2016) and machine translation (Ling et al., 2015; Ballesteros et al., 2015; Chung et al., 2016; Kim et al., 2016; Sahin and Steedman, 2018) indicate that character based models perform the best. Methods based on statistical machine translation (SMT) (Afli et al., 2016) use a combined set of features at word level and language models for post-hoc correction. Schulz and Kuhn (2017) use a multi-modular approach combining dictionary lookup and SMT for word segmentation and error correction. However, the dataset used for training is limited to books of the same topic, and requires manual supervision in terms of feature engineering.
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
mi
d
tu
/
t
a
C
yo
/
yo
a
r
t
i
C
mi
–
pag
d
F
/
d
oh
i
/
.
1
0
1
1
6
2
/
t
yo
a
C
_
a
_
0
0
3
7
9
1
9
2
4
0
6
2
/
/
t
yo
a
C
_
a
_
0
0
3
7
9
pag
d
.
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
ID | Barcode     | Year | Author           | Location   | Layout                              | Pages | WER   | CER
1  | Z165780108  | 1557 | H. Staden        | Marburg    | single column                       | 177   | 66.8% | 16.3%
2  | Z205600207  | 1562 | M. Walther       | Wittenberg | single column                       | 75    | 54.1% | 12%
3  | Z176886605  | 1603 | B. Valentinus    | Leipzig    | single column                       | 134   | 46.8% | 14.3%
4  | Z185343407  | 1607 | W. Dilich        | Kassel     | single column                       | 313   | 60.8% | 17.1%
5  | Z95575503   | 1616 | J. Kepler        | Linz       | single column w. pg. margin         | 119   | 59%   | 17.2%
6  | Z158515308  | 1647 | A. Olearius      | Schleswig  | single column                       | 600   | 51.8% | 17.7%
7  | Z176799204  | 1652 | S. von Birken    | Nürnberg   | single column                       | 190   | 55.9% | 13.8%
8  | Z165708902  | 1672 | J. Jacob Saar    | Nürnberg   | single column w. pg. margin         | 186   | 33%   | 10.4%
9  | Z22179990X  | 1691 | S. von Pufendorf | Hamburg    | single column                       | 665   | 32.7% | 7.7%
10 | Z172274605  | 1693 | A. von Schönberg | Leipzig    | single column                       | 341   | 67.6% | 30.5%
11 | Z221142405  | 1699 | A. a Santa Clara | Köln       | single/double column w. pg. margin  | 794   | 51.4% | 16.1%
12 | Z124117102  | 1708 | W. Bosman        | Leipzig    | single column                       | 601   | 37.8% | 6.7%

Table 1: Detailed book information can be accessed from the ÖNB portal using the barcode.
Sequence Learning. As is shown later, character based RNN models (Xie et al., 2016; Schnober et al., 2016) are insufficient to capture the complexity of compound-rich languages like German. Alternatively, ConvNets have been successfully applied in sequence learning (Gehring et al., 2017b,a). Although the performance of ConvNets alone is insufficient for post-hoc correction, we show that their combination yields optimal post-hoc correction performance.
OCR Engines. Slightly related are the works of Reul et al. (2018a,b), which retrain OCR engines on a specific domain. The assumption is that clean line scans with the same fontface are available. In this way, the trained OCR engines are more robust in transcribing text scans of the same fontface. Figure 1 shows that this is rarely the case, and many characters induce orthographic ambiguity. Furthermore, in many cases the OCR process is unknown, with image scans being the only material available.
3 Data Collection & Ground-Truth
In this section, we describe our data collection efforts and the ground-truth construction process. Currently, there is no large-scale historical corpus in German language that can be used for post-hoc correction of OCR transcribed texts. The collected corpus and constructed ground-truth of more than 854k pairs of OCR transcribed textual snippets and their corresponding manual transcriptions, together with the source code, are available.1
3.1 Book Corpus
We first describe the process behind selecting our corpus of historical books in German language. As our input textual snippets for OCR post-hoc correction we consider the publicly available historical collection of transcribed books, which are freely accessible by the Austrian National Library (OeNB).2 The transcription of books from their image scans is done in partnership with the Google Books project, which uses Google's proprietary OCR frameworks. Given that this is an automated process, the transcriptions are not error free.
For the ground-truth transcriptions we turn to another publicly available collection, namely, Deutsches Textarchiv (DTA).3 It contains manually transcribed books based on community efforts. The transcriptions are error free and as such are suitable to be used as our ground-truth. We consider the overlap of books present in both DTA and OeNB, providing us with the erroneous input textual snippet from OeNB and the corresponding target error-free transcription from DTA.
Table 1 shows our book corpus, consisting of the overlap between these two repositories,
1https://github.com/GarfieldLyu/OCR_POST_DE.
2https://www.onb.ac.at/.
3http://www.deutschestextarchiv.de/.
Figure 2: Vocabulary overlap between books.
with 12 books in German language from the 16th–18th centuries. Understandably, considering the publication period, there is little overlap across the different books. Figure 2 shows the vocabulary overlap between books, which on average is around 20%. This is an indicator of a corpus with high diversity and low redundancy, representing a realistic and challenging evaluation scenario for post-hoc correction.
3.2 Ground-Truth Construction
The constructed ground-truth consists of the mapped OCR transcribed texts and their manually transcribed counterparts, resulting in a parallel corpus of OCRed input text and the target manual transcriptions.
Constructing the parallel corpus is challenging. OCR transcribed books contain all pages (e.g., content and blank pages), while the manually transcribed books keep only the content pages. Furthermore, books are typically transcribed line by line by OCR systems, which often fail to detect page layout boundaries (multi-column layouts or printed margins). Therefore, accurate ground-truth construction even at page level is challenging.
An important aspect is the granularity of parallel snippets. Figure 3 shows the average sentence length distribution for OCR and manually transcribed books. We consider sentences, which are demarcated by the symbol ‘‘/’’; when this information is not available we fall back to text lines. The average sentence length is 5–6 tokens, with an average of up to 100 characters.
Therefore, we consider snippets of 5 tokens for mapping, as longer ones (e.g., paragraphs) are
Figure 3: Sentence length distributions.
highly error prone. Moreover, depending on scan quality and page content (e.g., if it contains figures or tables), the error rates from OCR transcriptions can vary greatly from page to page, making it impossible to consider lengthier snippets for the automated and large-scale ground-truth construction.
To construct an accurate ground-truth for OCR post-hoc correction, we propose the following two steps: (i) approximate matching, and (ii) accurate refinement.
3.2.1 Approximate Snippet Matching
From the OCR transcribed books, we generate textual snippets of 5 tokens length and compute approximate matches to snippets of 5–10 tokens4 from the manually transcribed books. Approximate matching at this stage is required for two reasons: (i) text lines from OCR and manually transcribed books are not aligned at line level in the books, and (ii) an exhaustive pair-wise comparison of all possible snippets of length 5 is very expensive.
We rely on an efficient technique known as locality sensitive hashing (LSH) (Rajaraman and Ullman, 2011) to put textual snippets that are loosely similar into the same bucket, and then based on the Jaccard similarity determine the highest matching pair. The hashing signatures and the Jaccard similarity are computed on character tri-grams.
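For illustration, the bucketing-and-matching step can be sketched as follows with a minimal, self-contained MinHash-LSH over character tri-grams. The banding parameters and all helper names are our own assumptions; only the tri-gram shingling and the 0.8 Jaccard threshold come from the text:

```python
import random
from typing import Dict, List, Set, Tuple

def char_trigrams(text: str) -> Set[str]:
    """Character tri-grams, used both for MinHash signatures and Jaccard."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def jaccard(a: Set[str], b: Set[str]) -> float:
    return len(a & b) / len(a | b) if (a or b) else 0.0

class SnippetLSH:
    """Banded MinHash-LSH (Rajaraman and Ullman, 2011): loosely similar
    snippets land in the same bucket; exact Jaccard then picks the best pair."""

    def __init__(self, num_perm: int = 64, bands: int = 16, seed: int = 13):
        rng = random.Random(seed)
        self.bands, self.rows = bands, num_perm // bands
        self.prime = (1 << 61) - 1  # universal hashing: h(x) = (a*x + b) mod p
        self.params = [(rng.randrange(1, self.prime), rng.randrange(self.prime))
                       for _ in range(num_perm)]
        self.buckets: Dict[Tuple[int, ...], List[str]] = {}

    def signature(self, shingles: Set[str]) -> List[int]:
        xs = [hash(s) & 0xFFFFFFFF for s in shingles]
        return [min((a * x + b) % self.prime for x in xs) for a, b in self.params]

    def _keys(self, sig: List[int]):
        # one bucket key per band; the band index avoids cross-band collisions
        for band in range(self.bands):
            yield (band, *sig[band * self.rows:(band + 1) * self.rows])

    def insert(self, snippet_id: str, shingles: Set[str]) -> None:
        for key in self._keys(self.signature(shingles)):
            self.buckets.setdefault(key, []).append(snippet_id)

def best_match(ocr_snippet: str, lsh: SnippetLSH,
               gt_shingles: Dict[str, Set[str]], threshold: float = 0.8):
    """Highest-Jaccard ground-truth candidate; pairs below 0.8 are dropped."""
    q = char_trigrams(ocr_snippet)
    cands = {c for key in lsh._keys(lsh.signature(q))
             for c in lsh.buckets.get(key, [])}
    scored = [(c, jaccard(q, gt_shingles[c])) for c in cands]
    best = max(scored, key=lambda cs: cs[1], default=(None, 0.0))
    return best if best[1] >= threshold else (None, 0.0)
```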
The resulting mappings are not error free, and often contain extra or missing words. Such errors are often introduced due to the OCR engines breaking over the multi-column layouts of books, the inclusion of table/figure captions, and word segmentation errors (under or over segmentation). Snippets from OCR transcriptions that do not have a match above a threshold (< 0.8) are dropped.

4Lengthier snippets are necessary due to segmentation errors, resulting in longer snippets.

Figure 4: Ideal textual snippet alignment. Gap characters ‘‘-’’ and additional characters are removed from both sides.
Matching Coverage: Finally, to ensure that our ground-truth construction approach does not severely affect the coverage of the matched pairs, we conduct a manual analysis of two books with different layouts (books ID 6 and 11, cf. Table 1), for 10 randomly selected pages from each book. For book 6, which has good scan quality, for snippets of 5 tokens, we are able to find a relevant match from the manual transcription on average for 270 out of 300 snippets per page. The dropped snippets in the absolute majority of the cases consist of footnotes or page headings. In the case of book 11, which has bad scan quality and a double column layout, from 400 snippets, only 200 have a match. Upon inspection, we find that this is mostly due to the erroneous transcription by OCR systems, which mistakenly merge lines from different columns into a single line. These snippets are corrupted, and cannot be matched to snippets extracted from the manually transcribed books.
3.2.2 Accurate Refinement
The main issue with the approximate matching through LSH is that extra words appear at the head or end of either the input or output snippets. The extra tokens stem mostly from snippets that match lengthier or shorter ones due to word segmentation errors. Such additional/missing words are not desirable, and thus, in this stage we refine the above snippet mappings. We perform a local pairwise sequence alignment5 that finds the best matching local sub-snippets. The remaining extra characters are removed (e.g., tokens ‘‘fen’’ and ‘‘Willkühr’’ are removed as they are not part of a local alignment).
5https://biopython.org/DIST/docs/api/Bio.pairwise2-module.html.
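As a sketch, this refinement can be done with Biopython's pairwise2 module (the one referenced in the footnote); the scoring weights below are illustrative choices, not the paper's:

```python
from Bio import pairwise2

def refine_pair(ocr: str, gt: str):
    """Trim an (OCR, ground-truth) pair down to its best local alignment;
    characters outside the aligned region are dropped (cf. Figure 4)."""
    # match=2, mismatch=-1, gap open=-2, gap extend=-0.5: illustrative weights
    aln = pairwise2.align.localms(ocr, gt, 2, -1, -2, -0.5,
                                  one_alignment_only=True)[0]
    # seqA/seqB carry '-' gap characters; [start:end] is the aligned region
    return (aln.seqA[aln.start:aln.end].replace("-", ""),
            aln.seqB[aln.start:aln.end].replace("-", ""))
```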
Figure 5: Book OCR error type distribution.
4 Data Analysis
Based on a manual analysis of a random sample
of 100 snippet pairs taken from each book from
our ground-truth, we analyze the various OCR
transcription error types.
This is a crucial step towards developing post-hoc correction models in a systematic manner. OCR errors are highly contextual and are dependent on several factors, and as such there are no one-to-one rules that can be used to correct OCR errors. Furthermore, these errors are increased when dealing with historical corpora, as fontfaces, book layouts, and language use are highly unstandardized.
We differentiate between the following errors: (i) over-segmentation, an error where a word is split into multiple tokens, (ii) under-segmentation, where multiple words are merged into one, and (iii) word error, typically caused by misrecognized characters, converting a word into an invalid word or changing its meaning to a different valid word.
4.1 Error Types and Distribution
Figure 5 shows an overview of the error types for
the different books in our corpus.
Over-segmentation is one of the most common OCR transcription errors, with 54% of the cases. The errors often arise due to OCR systems misrecognizing spaces between words and characters in a word, as these are often not clearly distinguishable. These errors are challenging since the resulting words may represent valid words, which is an even more challenging problem for compound rich languages like German.
Figure 6: The misrecognized characters (x-axis) and their valid transcriptions (y-axis).
Under-segmentation errors are less common (with 3%), and are mostly due to line-breaks and book layouts.
Word errors represent the second most frequent OCR error category, with 43%. These errors are often caused by the orthographic visual similarity between characters, thus resulting in invalid words or changing the word's meaning altogether. Other relevant factors are the scan quality or book layouts.
Figure 6 shows that word errors are contextual, with no simple mappings between misrecognized characters. This indicates that they are not solely due to visual character similarity, as characters are often misrecognized as completely different characters.
5 Neural OCR Post-Hoc Correction
Figure 7 shows an overview of our encoder-decoder architecture for post-hoc OCR correction. At its core, the encoder combines RNN and deep ConvNets for representing the erroneous OCR transcribed snippets at the character level. During the decoding phase an RNN model corrects the errors one character at a time by using an attention mechanism that combines the encoder representations, a process repeated until an end of sentence is encountered.
5.1 Encoder Network
We encode the erroneous OCR snippets at the
character level for three reasons. First, word
representation is not feasible due to word errors.
Second, only in this way can we capture erroneous
characters. Finally, we avoid OOV issues, as there
are no vocabularies for historical corpora.
Figure 7: Approach overview.
The encoder consists of an RNN and a deep ConvNet network. The intuition is that RNNs capture the global context on how OCR errors are situated in the text, while deep ConvNets capture and enforce local sub-word/phrase context. This is necessary for word segmentation errors, which might bias RNN models towards more frequent tokens (e.g., ‘‘alle in’’ vs. ‘‘allein’’).6
5.1.1 Recurrent Encoder
First, we apply a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) that reads the erroneous OCR snippet $X = (x_1, \ldots, x_T)$, encoding it into $h_T = \left[\overrightarrow{\mathrm{LSTM}}(X), \overleftarrow{\mathrm{LSTM}}(X)\right]$.
We use recurrent models because of their ability to detect erroneous characters and to capture the global context of the OCR errors. In § 4 we showed that for most of the erroneous characters, the target transcribed characters vary, a variation that can be resolved by the general context of the snippet.
Finally, we will use $h_T$ to conditionally initialize the decoder, since the input and output snippets are highly similar.
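A minimal sketch of this encoder in PyTorch (dimensions follow § 6.3; the module itself is illustrative, not the released implementation):

```python
import torch
import torch.nn as nn

class RecurrentEncoder(nn.Module):
    """BiLSTM over character embeddings of the erroneous OCR snippet; the
    concatenated last forward/backward states form h_T, which conditionally
    initializes the decoder."""

    def __init__(self, vocab_size: int, emb_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                              bidirectional=True)

    def forward(self, x: torch.Tensor):
        # x: (batch, T) character ids
        states, (h_n, _) = self.bilstm(self.embed(x))
        h_T = torch.cat([h_n[0], h_n[1]], dim=-1)  # (batch, 2 * hidden)
        return states, h_T  # per-character states h_j and the global h_T
```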
5.1.2 Convolutional Encoder
We find that 57% of errors are word segmentation errors. Often, these errors have a local behavior, such as merging or splitting words. While in theory this information can be captured by the RNN encoder, we notice that RNNs are biased towards frequent sub-words in the corpora, with tokens being wrongly split or merged.
We apply deep ConvNets to capture the local
context (i.e., compound information) of tokens.
ConvNets through their kernels limit the influence
that characters beyond a token’s context may have
6Both tokens are correct, with the first being more
frequent.
in determining whether the subsequent decoded
characters forming a token should be split or
merged.
We set the kernel size to 3 and test several configurations in terms of ConvNet layers, which we empirically assess in § 7.2. Since we are encoding the OCR input at character level, determining the right granularity of representation is not trivial. Hence, the multiple layers $l$ will flexibly learn from fine- to coarse-grained representations of the input. The learned representation at layer $l$ is denoted as $h^l = \left(h^l_1, \ldots, h^l_T\right)$. In between each of the layers, we apply a non-linearity, namely gated linear units (Dauphin et al., 2017), to control how much information should pass from the bottom to the top layers.
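A sketch of such a convolutional encoder stack (kernel size 3 as stated above; padding and channel sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvEncoder(nn.Module):
    """Stack of character-level convolutions (kernel size 3) with gated linear
    units between layers; layer l yields h^l = (h^l_1, ..., h^l_T)."""

    def __init__(self, dim: int = 256, num_layers: int = 3):
        super().__init__()
        # each conv doubles the channels so that the GLU can gate and halve them
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, 2 * dim, kernel_size=3, padding=1)
            for _ in range(num_layers))

    def forward(self, x: torch.Tensor):
        # x: (batch, T, dim) character representations of the input snippet
        h = x.transpose(1, 2)  # (batch, dim, T) as expected by Conv1d
        layer_states = []
        for conv in self.convs:
            h = F.glu(conv(h), dim=1)  # gated linear unit (Dauphin et al., 2017)
            layer_states.append(h.transpose(1, 2))  # back to (batch, T, dim)
        return layer_states  # one h^l per layer, from fine to coarse grained
```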
5.2 Decoder Network
The decoder is a single LSTM layer, which generates the corrected textual snippet one character at a time. We initialize it with the last hidden state from the BiLSTM encoder $h_T$, that is, $o_1 = h_T$ in Equation (1), which biases the decoder to generate sequences that are similar to the input text.
$$p\left(o_i \mid o_{i-1}, \ldots, o_1, x\right) = g\left(o_{i-1}, d_i, c_i\right) \qquad (1)$$
where $d_i$ is the current hidden state of the decoder, and $o_{i-1}$ represents the previously generated character. $c_i$ is the context vector from the encoded OCR input snippet, which combines the RNN and deep ConvNet input representations through a multi-layer attention mechanism, which we explain below.
5.2.1 Multi-layer Attention
Using RNNs jointly with deep ConvNets as encoders allows for greater flexibility in capturing the complexities of OCR errors. Furthermore, the multiple layers of the ConvNets capture from fine to coarse grained local structures of the input. To harness this encoding flexibility, we compute the context vector $c_i$ for each decoder step $d_i$ as follows.
First, for each decoder state $d_i$ at step $i$, we compute the weight of the representations computed by the deep ConvNet at the different layers. The weights, computed in Equation 2, correspond to the softmax scores, which are computed based on the dot product between $d_i$
and the hidden layers $h^l_j$ from the $l$ layers of the ConvNet:

$$a^l_{ij} = \frac{\exp\left(e^l_{ij}\right)}{\sum_{k=1}^{T} \exp\left(e^l_{ik}\right)}; \qquad e^l_{ij} = d_i \cdot h^l_j \qquad (2)$$
At each layer $l$ in the ConvNet encoder, the attention weights assess the importance of the representations at the different granularity levels in correcting the OCR errors during the decoder phase. To compute $c_i$, we combine the RNN and deep ConvNet representations, scaled by the attention weights, as follows:
$$c_i = \frac{1}{L} \sum_{l=1}^{L} \sum_{j=1}^{T} a^l_{ij} \cdot \left[h^l_j, h_j\right] \qquad (3)$$
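For one decoder step, Equations (2)–(3) can be sketched as follows (the shapes assume the decoder state and the ConvNet states share a dimension; illustrative only):

```python
import torch
import torch.nn.functional as F

def multi_layer_attention(d_i, conv_states, rnn_states):
    """d_i: (batch, dim) decoder state at step i; conv_states: list of L
    tensors (batch, T, dim), one per ConvNet layer; rnn_states: (batch, T, r).
    Returns the context vector c_i of Equation (3)."""
    contexts = []
    for h_l in conv_states:  # one attention distribution per layer l
        e = torch.bmm(h_l, d_i.unsqueeze(2)).squeeze(2)  # e^l_ij = d_i . h^l_j
        a = F.softmax(e, dim=1)                          # Equation (2)
        joint = torch.cat([h_l, rnn_states], dim=-1)     # [h^l_j, h_j]
        contexts.append(torch.bmm(a.unsqueeze(1), joint).squeeze(1))
    return torch.stack(contexts).mean(dim=0)  # (1/L) * sum over layers
```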
5.3 Weighted Loss Training
Conventionally, encoder-decoder architectures are trained using the cross-entropy loss, $\mathcal{L} = -P_{tgt} \cdot \log P_{pred}$, with $P_{tgt}$ and $P_{pred}$ being the target and the predicted probability distributions over some discrete vocabulary.
For OCR post-hoc correction, cross-entropy does not properly capture the nature of this task. Models are biased to simply copy from input to output, which in this task represents the majority of cases. In this way, failures at correcting erroneous characters diminish, as all time-steps are treated equally. We propose a weighted loss function that rewards models more for their correcting behavior. The modified loss function is shown in Equation 4.
$$\mathcal{L}' = \mathcal{L} \cdot \left(1 - \lambda P_{src} \cdot P_{tgt}\right); \qquad 0 < \lambda < 1 \qquad (4)$$
The new loss function combines the cross-entropy loss $\mathcal{L}$ and an additional factor that considers the source and target characters. The second part of the equation captures the amount of desirable copying from input to output. If the input and output characters are the same, then $P_{src} \cdot P_{tgt}$ yields 1, otherwise 0, where $P_{src}$ and $P_{tgt}$ are one-hot character encodings of the input and output snippets. $\lambda$ controls by how much we want to dampen this behavior. $\mathcal{L}'$ rewards more strongly the model's ability to correct erroneous sequences.
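Per decoding step, Equation (4) amounts to down-weighting the cross-entropy wherever copying the source character would already be correct; a sketch (λ values as in § 6.4, function name is ours):

```python
import torch
import torch.nn.functional as F

def weighted_correction_loss(logits, src_ids, tgt_ids, lam=0.3):
    """logits: (batch, T, vocab); src_ids, tgt_ids: (batch, T) character ids.
    P_src . P_tgt of one-hot encodings is 1 exactly where the source and
    target characters agree, so those (copying) steps are dampened by lambda."""
    ce = F.cross_entropy(logits.transpose(1, 2), tgt_ids, reduction="none")
    copy = (src_ids == tgt_ids).float()      # inner product of one-hot vectors
    return (ce * (1.0 - lam * copy)).mean()  # Equation (4)
```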
6 Experimental Setup
In this section, we introduce the experimental
setup and the competing methods for the task of
post-hoc OCR correction.
6.1 Evaluation Scenarios
According to our error analysis in § 4 and the highly diverging vocabularies across books (cf. § 3.1), we distinguish two evaluation scenarios. Here we use part of the ground-truth, selecting instances by first sampling pages from the books and taking the instance pairs coming from the sampled pages.
We assess the performance of models for two
significant factors that may impact their correction
behavior: (i) eval–1 assesses the model’s post-hoc
correction behavior on unseen OCR transcription
errors related to the book source and publication
date, and diverging book content (cf. Figure 2),
and (ii) eval–2 tests the impact on correction
performance when models have encountered all
OCR errors based on random sampling.
eval–1: We split the data along the temporal axis, with training instances coming from books from the 16th and 18th centuries, and test instances from the 17th century. This scenario is challenging as there are diverging error types due to scan quality, and other orthographic variations related to the publishers and other book characteristics. The 17th century books have more diverse errors, as there are more books, and the initial OCR transcription error rates are higher.
We use 70% of the data for training, and 10% and 20% for validation and testing, with 269k, 27k, and 89k instances respectively.
eval–2: We randomly construct the training, validation, and testing splits, thus ensuring that the models have observed all error types, which should result in better post-hoc correction behavior. Furthermore, contrary to eval–1, where the splits are dictated by the publication date of the books, in this case we use slightly different splits for training, validation, and testing. We use 65%, 10%, and 25% for training, validation, and testing, respectively. The absolute numbers are 417k, 42k, and 166k instances, respectively.
6.2 Evaluation Metrics
To assess the post-hoc correction performance of the models, we use standard evaluation metrics for this task: (i) word error rate (WER), and (ii) character error rate (CER). The error rates measure the number of word/character substitutions, insertions, and deletions, normalized by the total length of the transcribed sequence, in characters for CER and number of words for WER.
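Both metrics reduce to a Levenshtein distance normalized by the reference length (the usual convention); a minimal sketch:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: substitutions, insertions, and deletions."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def wer(ref: str, hyp: str) -> float:
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref: str, hyp: str) -> float:
    return edit_distance(list(ref), list(hyp)) / len(ref)
```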
6.3 Baselines
In the following we describe the approaches we compare against. In all cases, the input is represented at the character level with 128 embedding dimensions. The cell units (i.e., LSTMs and ConvNets) have 256 dimensions.
CH: Xie et al. (2016) use an RNN model for spelling correction, a task slightly similar to OCR post-hoc correction. Yet, the error types and their distribution are of a different nature. CH is a standard attention based encoder-decoder (Bahdanau et al., 2015), which corresponds to our CR model without ConvNets and the custom loss function.
CHλ: To assess the impact of the introduced loss function, we train CH with the custom loss (cf. § 5.3). The optimal λ is set based on the validation set. This represents the ablated model of our approach CR without the ConvNet encoder and multi-layer attention.
PB: Cohn et al. (2016) propose a symmetric attention mechanism for RNN based encoder-decoder models. That is, encoder and decoder timesteps are strongly aligned. A similar alignment between input and output is expected for this task.
Transformer: By pretraining on large corpora, Transformers (Vaswani et al., 2017) have achieved state-of-the-art results in various NLP tasks. In our case, pretraining on historical corpora is not possible due to the scarcity of such data, while pretraining on contemporary German corpora did not show any improvement. The self-attention mechanism is highly flexible in capturing intra-input and input-output dependencies, which is very important for post-hoc correction. We use the implementation in TK (Gagnon-Marchand and LJQ., n.d.) with 3 layers, 8 attention heads, and 512 dimensions for the output model, and encode input at character level.
Other Approaches: ConvSeq (Gehring et al., 2017b), part of our encoder network, yields performance below all the other competitors; hence we do not include its results here. Similarly, rule-based models based on FST (Silfverberg et al., 2016) yield poor performance. We believe this is
due to the inability to establish one-to-one mappings of rules for correction, and the requirement for valid word vocabularies.
6.4 CR: Approach Configuration
For our approach CR, based on a validation set, the number of ConvNet layers is set to k = 1 and k = 3, and λ is set to 0.3 and 0.1, for eval–1 and eval–2, respectively.
7 Evaluation
In this section, we provide a detailed evaluation discussion and discuss limitations:
1. Post-hoc correction evaluation as measured through the WER and CER metrics.
2. An ablation study for our approach CR.
3. Performance of CR for post-hoc correction at page level.
4. Robustness and generalizability of our approach for post-hoc correction.
5. An error analysis of the CR model's behavior.
7.1 Post-Hoc OCR Error Correction
All post-hoc OCR correction approaches under
comparison significantly reduce the amount of
OCR errors. Tables 2 and 3 provide an overview
of the performance as measured through WER
and CER metrics.
eval–1. Table 2 shows the results of the competing approaches for the eval–1 scenario. This scenario mainly shows how well the models generalize in terms of language evolution, where instances come from books written in a different century. Note that, apart from the temporal dimension, another important aspect is that of publisher specific attributes. Depending on the publisher, there are orthographic variations, vocabulary, and other stylistic features, such as font-face, and so on.
In principle, low WER translates into fewer word segmentation (WS) errors, with WS errors being some of the most frequent errors (cf. Figure 5). Hence, reducing WER is critical for post-hoc OCR correction models. Our model, CR, achieves the best performance with the lowest score of WER=5.98%. This presents a relative decrease of Δ = 82% compared to the WER in the original OCR text snippets. In terms of CER
            | WER           | CER
OCR         | 33.3          | 6.1
CH          | 7.64 (↓77%)   | 2.79 (↓54%)
CHλ=0.4     | 7.46 (↓78%)   | 2.53 (↓59%)
PB          | 11.45 (↓66%)  | 3.05 (↓50%)
Transformer | 8.11 (↓76%)   | 2.24 (↓63%)
CR          | 5.98 (↓82%)∗  | 2.07 (↓66%)∗

Table 2: Correction results for eval–1. CR achieves highly significant (∗) improvements over the best baseline CHλ.

            | WER           | CER
OCR         | 32.3          | 5.4
CH          | 4.08 (↓87%)   | 1.32 (↓76%)
CHλ=0.3     | 4.09 (↓87%)   | 1.35 (↓75%)
PB          | 9.21 (↓71%)   | 1.93 (↓64%)
Transformer | 4.50 (↓86%)   | 1.07 (↓80%)∗
CR          | 3.59 (↓89%)∗  | 1.31 (↓76%)

Table 3: Correction results for eval–2. CR obtains highly significant (∗) improvements over the best baseline CHλ for WER, while Transformer has significantly the lowest CER.
we have a relative decrease of Δ = 66%, namely, CER=2.07%.
Comparing our approach CR against CHλ (the best competing approach in eval–1), we achieve highly significant (p < .001) lower WER and CER scores, as measured according to the non-parametric Wilcoxon signed-rank test with correction.7 For WER and CER, CR compared to CHλ obtains a relative error reduction of 21.7% and 25.8%, respectively. This shows that ConvNets allow for flexibility in capturing the different constituents of a word compound, which in turn may result in either over or under segmentation errors.
Against the other competitors the reduction rates are even greater. Transformer has the lowest CER among the competitors, yet compared to CR its CER has an 8% relative increase. PB performs the worst, mainly due to the character shifts (left or right) incurred due to word segmentation errors. Thus, strictly enforcing the attention mechanism along very close or the same positions
7We test for normality of distributions, and conclude that
the produced WER and CER measures do not follow a normal
distribution.
in the encoder-decoder results in sub-optimal
post-hoc OCR correction behavior.
eval–2. Table 3 shows the results for the eval–2 scenario. Due to the randomized instances for training and testing, the models have greater ability in correcting OCR errors. Contrary to eval–1, where the models were tested on instances coming from later centuries, in this scenario the models do not suffer from language evolution aspects and other book specific characteristics. Therefore, this presents an easier evaluation scenario.
Here too the models show a similar behavior as for eval–1. The only difference in this case is that our approach CR does not achieve the best CER reduction rates. CR obtains highly significant lower (p < .001) WER rates than the Transformer. On the other hand, Transformer achieves the best CER rates among all competitors (p < .001). The significance tests are measured using the non-parametric Wilcoxon signed-rank test.
This presents an interesting observation, showing that Transformers are capable of learning all the complex cases of character errors. This behavior can be attributed to their capability in learning complex intra-input and input-output dependencies. However, in terms of WER, we see that a large reduction is achieved through the ConvNets in CR, yielding the lowest WER rates, with a relative decrease of 89%. This conclusion is supported when we inspect CHλ, which is the ablated CR model without ConvNet encoders.
7.2 Ablation Study
In the ablation study we analyze the impact of the
varying components introduced in CR.
ConvNet Layers. The number of layers provides different levels of abstraction in encoding the OCR input. Table 4 shows CR's performance with a varying number of layers, trained using the standard cross-entropy loss. Increasing the number of layers for k > 5 does not yield performance improvements. We note that for the different evaluation scenarios, the number of necessary layers varies. For example, in eval–2 the optimal number of layers is 3. This can be attributed to the higher diversity of errors in the randomized validation instances, and hence, the need for more layers to capture the OCR errors.
Loss Function. The loss function in § 5.3 rewards the model's correcting behavior more strongly.
        | eval–1       | eval–2
        | WER  | CER   | WER  | CER
CRk=1   | 6.18 | 2.15  | 3.72 | 1.29
CRk=2   | 6.46 | 2.30  | 4.18 | 1.46
CRk=3   | 6.47 | 2.26  | 3.61 | 1.26
CRk=4   | 6.93 | 2.51  | 3.54 | 1.31
CRk=5   | 6.63 | 2.40  | 3.92 | 1.38
CRk=6   | 6.68 | 2.52  | 3.94 | 1.50
CRk=7   | 6.58 | 2.60  | 3.90 | 1.50
CRk=8   | 6.56 | 2.48  | 3.64 | 1.54
CRk=9   | 6.59 | 2.69  | 3.84 | 1.60
CRk=10  | 6.32 | 2.52  | 3.61 | 1.62

Table 4: WER and CER values for CR with a varying number of ConvNet layers, trained using the standard loss function.
        | eval–1       | eval–2
        | WER  | CER   | WER  | CER
CRλ=0.1 | 6.22 | 2.16  | 3.59 | 1.31
CRλ=0.2 | 6.31 | 2.17  | 3.79 | 1.42
CRλ=0.3 | 5.98 | 2.07  | 4.24 | 1.51
CRλ=0.4 | 6.37 | 2.17  | 3.90 | 1.33
CRλ=0.5 | 6.37 | 2.16  | 3.83 | 1.45
CRλ=0.6 | 6.63 | 2.22  | 3.90 | 1.41

Table 5: WER and CER results for CR with different λ for the custom loss function.
Mesa 5 shows the ablation results for CR with
varying λ values for L(cid:5) and fixed ConvNet layers
(k = 1 y k = 3) as the best performing
configurations in Table 4. Here too due to
the different characteristics of the evaluation
escenarios, different λ values are optimal for CR.
We note that for eval–1, a higher λ of 0.3 yields the
best performance. This shows that for diverging
train and test sets (p.ej., eval–1), the models need
more stringent guidance in distinguishing copying
from correcting behavior.
7.3 Page Level Performance
Evaluation results in § 7.1 convey the ability of models to correct erroneous input at snippet level. However, there are challenges in applying post-hoc correction models to real-world OCR transcriptions, which do not have their textual content separated into coherent and non-overlapping snippets.
Action | Description
S      | accuracy of token segmentation
M      | accuracy of token merging
R      | accuracy of token character replacement (insertion/update/delete)

Table 6: Page level actions are used to measure the model's performance at page level.
Page | S          | M       | R          | All
9    | 0.737 (19) | –       | 0.878 (66) | 85
10   | 0.586 (29) | –       | 0.976 (83) | 112
16   | 0.960 (50) | 0.0 (1) | 0.652 (23) | 73
17   | 0.621 (29) | –       | 0.933 (90) | 119

Table 7: Precision for the S, M, R actions. In brackets are the numbers of undertaken actions, and the rightmost column has all actions.
In this section, we assess for our model CR, at page level, the accuracy of the undertaken actions in correcting the erroneous input text to its target form. Table 6 shows the set of actions that a model can undertake. We carry out a manual evaluation on an out-of-corpus book (book code Z168355305) that is not present in our ground-truth, for which we randomly sample a set of 4 pages.
We apply CR over the OCR transcribed pages line by line with a window of 5 tokens, namely, we assess the accuracy of the correction actions during the decoding phase. For each decoding step that produces an output that is different from the input, we assess the accuracy of that action. Table 7 shows the precision of CR for the different sets of actions on the different pages. The results show that CR is robust and can be applied without much change even at page level, with high accuracy in its post-hoc correction behavior.
7.4 Robustness
We conduct a robustness test of the CR approach to check: (i) in-group post-hoc correction performance, where test instances come from the same books as the training ones, and (ii) out-of-group, where we train on one group and test on the rest of the groups. Table 8 shows the groups of books we use for (i) and (ii).
Table 9 shows the in-group and out-of-group post-hoc correction scores for CR when using
   | #Train | #dev  | #Test | Book IDs
G1 | 312k   | 34.7k | 86.1k | (8, 5, 12, 11)
G2 | 58.9k  | 6.5k  | 17.2k | (2, 1, 3, 10)
G3 | 217.3k | 24k   | 59.8k | (4, 7, 6, 9)

Table 8: Book splits for assessing CR robustness.
     | G1          | G2          | G3
     | WER  | CER  | WER  | CER  | WER  | CER
OCR  | 28.1 | 5.8  | 31.2 | 7.1  | 34.0 | 6.3
standard loss function
G1   | 10.1 | 2.9  | 24.7 | 5.6  | 18.9 | 4.9
G2   | 21.5 | 5.7  | 15.9 | 4.1  | 20.2 | 4.9
G3   | 16.9 | 4.8  | 18.9 | 4.6  | 10.4 | 2.6
custom loss function
G1   | 10.1 | 2.8  | 24.2 | 5.6  | 18.4 | 4.4
G2   | 21.5 | 5.1  | 17.0 | 4.4  | 20.3 | 4.4
G3   | 16.9 | 4.3  | 18.7 | 4.4  | 10.3 | 2.5

Table 9: CR results with k = 1 trained using the standard and custom loss functions with λ = 0.1. Rows are training groups, columns are test groups.
a single ConvNet layer, trained with the standard and the custom loss functions, respectively. It can be seen that when the models are trained on a similar corpus (in-group), the error reduction is significantly higher compared to the evaluation on the out-of-group corpus. Moreover, we note that the custom loss function consistently provides better trained models for post-hoc correction.
The results in Table 9 show that CR is robust, providing a highly significant decrease in terms of WER and CER, with an average WER decrease of 52% for in-group with both the standard and custom loss, whereas the out-of-group WER reduction is 34% and 35% using the standard and custom loss, respectively. In terms of CER, for in-group we get a CER decrease of 47.6% and 50% for the standard and custom loss, respectively. The advantage of the custom loss shows in the out-of-group evaluation, where the CER decrease is much more significant: 23.3% using the custom loss function compared to 16.71% for the standard loss function.
From the three groups, when training on G3 the out-of-group post-hoc correction performance is the highest. This shows that for historical corpora, the initial OCR error rate and possibly the error types due to a book's
characteristics significantly impact the correction performance.
7.5 Error Analysis
Here we analyze the structure of some typical
errors that we fail to correct.
Word Segmentation. In terms of over-segmentation, the importance of the ConvNet layers in CR is shown when compared against CH and CHλ. Common word segmentation errors for CH and CHλ are, for example, ‘‘Jndem’’ to ‘‘Jn’’ and ‘‘dem’’, ‘‘Jedoch’’ to ‘‘Je’’ and ‘‘doch’’, or ‘‘vorbeyftreichen’’ to ‘‘vor beyftreichen’’. Most of these errors can be traced back to frequent constituents of the compound that also exist in isolation.
Character Error. There are easy character errors, such as ‘‘mein’’ which is OCRed to ‘‘mcin’’, which are fixed by all approaches. However, some words like ‘‘löfcken’’ are corrected by models like CH and Transformer to the right word ‘‘löfeten’’, whereas CR fails to do so due to character bigrams such as ‘‘ck’’ that are very frequent in the dataset.
7.6 Dataset Limitations
The OCR quality can vary greatly across books, and from page to page. Based on manual inspection, we note that in some cases the WER can go well beyond 80%. It is expected that in such cases the post-hoc OCR correction quality will vary too. Other possible issues include competing spellings for the same word, which may cause the models to encode conflicting information; yet, for transcribing historical texts, language normalization (i.e., opting for one spelling) is not recommended, as the meaning of the texts may change.
Language Evolution. There is a significant difference between eval–1 and eval–2 in terms of correction results. One explanation is the word spelling variation across centuries. Some examples include the substitution of single characters in words, which if not known would lead to systematic correction mistakes, e.g., j → i, v → u, ſ → s, ä → e. Respectively, due to the missing information about the spelling changes in eval–1, the corresponding WER and CER rates are higher.
8 Conclusion
In this work we assessed several approaches towards post-hoc correction. We find that OCR transcription errors are contextual, and a large set are due to word segmentation, followed by word errors. Models like Transformers have limited utility in this task, as pre-training is difficult to undertake, given the scarcity of historical corpora.
We proposed an OCR post-hoc correction approach for historical corpora, which provides flexible means of capturing various OCR transcription errors that are subject to language evolution, typeface, and book layout issues. Through our approach CR we achieve large WER reduction rates of 82% and 89% for the eval–1 and eval–2 scenarios, respectively.
Moreover, ablation studies show that all the introduced components in CR yield consistent improvement over the competitors. Apart from post-hoc correction performance at snippet level, CR proved to be robust at page level too, where the undertaken correction steps are highly accurate.
Finally, we construct and release a new dataset for post-hoc correction of historical corpora in German language, consisting of more than 850k parallel textual snippets, which can help facilitate research for historical and low-resource corpora.
Acknowledgments
This work was partially funded by Travelogues
(DFG: 398697847 and FWF: I 3795-G28).
References
Haithem Afli, Zhengwei Qiu, Andy Way, and Páraic Sheridan. 2016. Using SMT for OCR error correction of historical texts. In Proceedings of the Tenth International Conference on Language Resources and Evaluation, LREC 2016, Portorož, Slovenia, May 23–28, 2016. European Language Resources Association (ELRA).
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings.
Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2015. Improved transition-based parsing
by modeling characters instead of words with LSTMs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17–21, 2015, pages 349–359. The Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D15-1041
Adrien Barbaresi. 2016. Bootstrapped OCR error detection for a less-resourced language variant. In Proceedings of the 13th Conference on Natural Language Processing, KONVENS 2016, Bochum, Germany, September 19–21, 2016, volume 16 of Bochumer Linguistische Arbeitsberichte.
Eric Brill and Robert C. Moore. 2000. An improved error model for noisy channel spelling correction. In 38th Annual Meeting of the Association for Computational Linguistics, Hong Kong, China, October 1–8, 2000, pages 286–293. ACL. DOI: https://doi.org/10.3115/1075218.1075255
Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25–29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1724–1734. ACL.
Junyoung Chung, Kyunghyun Cho, and Yoshua Bengio. 2016. A character-level decoder without explicit segmentation for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7–12, 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics. DOI: https://doi.org/10.18653/v1/P16-1160
Trevor Cohn, Cong Duy Vu Hoang, Ekaterina Vymolova, Kaisheng Yao, Chris Dyer, and Gholamreza Haffari. 2016. Incorporating structural alignment biases into an attentional neural translation model. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, June 12–17, 2016, pages 876–885. DOI: https://doi.org/10.18653/v1/N16-1102
Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. 2017. Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6–11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 933–941. PMLR.
Rui Dong and David Smith. 2018. Multi-input attention for unsupervised OCR correction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15–20, 2018, Volume 1: Long Papers, pages 2363–2372. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P18-1220
Markus Dreyer, Jason Smith, and Jason Eisner. 2008. Latent-variable modeling of string transductions with finite-state methods. In 2008 Conference on Empirical Methods in Natural Language Processing, EMNLP 2008, Proceedings of the Conference, 25–27 October 2008, Honolulu, Hawaii, USA, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1080–1089. ACL.
Noura Farra, Nadi Tomeh, Alla Rozovskaya, and Nizar Habash. 2014. Generalized character-level spelling error correction. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22–27, 2014, Baltimore, MD, USA, Volume 2: Short Papers, pages 161–167. The Association for Computer Linguistics. DOI: https://doi.org/10.3115/v1/P14-2027
Jonas Gehring, Michael Auli, David Grangier, and Yann N. Dauphin. 2017a. A convolutional encoder model for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver,
Canada, July 30 – August 4, Volume 1: Long Papers, pages 123–135. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P17-1012, PMID: 28964987, PMCID: PMC6754825
Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017b. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6–11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 1243–1252. PMLR.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780. DOI: https://doi.org/10.1162/neco.1997.9.8.1735, PMID: 9377276
Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18–21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1700–1709. ACL.
Yoon Kim, Yacine Jernite, David A. Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12–17, 2016, Phoenix, Arizona, USA, pages 2741–2749. AAAI Press.
Yann LeCun, Yoshua Bengio, and others. 1995. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361(10):1995.
Wang Ling, Isabel Trancoso, Chris Dyer, and Alan W. Black. 2015. Character-based neural machine translation. CoRR, abs/1511.04586.
Guillermo B.. Lund, douglas
binarization
thresholding
j. Kennard,
and Eric K. Ringger. 2013. Combining
valores
multiple
In Document
to improve OCR output.
Recognition and Retrieval XX, part of
el
IS&T-SPIE Electronic Imaging Symposium,
Burlingame, California, EE.UU, February 5–7,
2013, Actas, volumen 8658 of SPIE
Actas, page 86580R. SPIE.
Guillermo B.. Lund, Eric K. Ringger,
y
Daniel David Walker. 2014. How well does
multiple OCR error correction generalize? En
Document Recognition and Retrieval XXI, san
Francisco, California, EE.UU, February 5–6,
2014, volumen 9021 of SPIE Proceedings,
pages 90210A–90210A–13. SPIE.
Guillermo B.. Lund, Daniel David Walker, y
Eric K. Ringger. 2011. Progressive alignment
and discriminative error correction for multiple
OCR engines. En 2011 International Conference
on Document Analysis and Recognition, ICDAR
2011, Beijing, Porcelana, September 18–21, 2011,
pages 764–768. IEEE Computer Society.
Anand Rajaraman and Jeffrey David Ullman. 2011. Mining of Massive Datasets. Cambridge University Press.
Christian Reul, Uwe Springmann, Christoph Wick, and Frank Puppe. 2018a. Improving OCR accuracy on early printed books by utilizing cross fold training and voting. In 13th IAPR International Workshop on Document Analysis Systems, DAS 2018, Vienna, Austria, April 24–27, 2018, pages 423–428.
Christian Reul, Uwe Springmann, Christoph Wick, and Frank Puppe. 2018b. State of the art optical character recognition of 19th century fraktur scripts using open source engines. CoRR, abs/1810.03436.
Gözde Gül Sahin and Mark Steedman. 2018. Character-level models versus morphology in semantic role labeling. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15–20, 2018, Volume 1: Long Papers, pages 386–396. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P18-1036, PMID: 30102173
Carsten Schnober, Steffen Eger, Erik-Lân Do Dinh, and Iryna Gurevych. 2016. Still not there? Comparing traditional sequence-to-sequence models to encoder-decoder neural networks on monotone string translation tasks. In COLING 2016, 26th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, December 11–16, 2016, Osaka, Japan, pages 1703–1714. ACL.
Jules Gagnon-Marchand and LJQ. n.d. https://github.com/Lsdefine/attention-is-all-you-need-keras/blob/master/transformer.py.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4–9 December 2017, Long Beach, CA, USA, pages 5998–6008.
Ziqi Wang, Gu Xu, Hang Li, and Ming Zhang. 2014. A probabilistic approach to string transformation. IEEE Transactions on Knowledge and Data Engineering, 26(5):1063–1075. DOI: https://doi.org/10.1109/TKDE.2013.11
Ziang Xie, Anand Avati, Naveen Arivazhagan, Dan Jurafsky, and Andrew Y. Ng. 2016. Neural language correction with character-based attention. CoRR, abs/1603.09727.
Shaobin Xu and David A. Smith. 2017. Retrieving and combining repeated passages to improve OCR. In 2017 ACM/IEEE Joint Conference on Digital Libraries, JCDL 2017, Toronto, ON, Canada, June 19–23, 2017, pages 269–272. IEEE Computer Society.
Sarah Schulz and Jonas Kuhn. 2017. Multi-modular domain-tailored OCR post-correction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9–11, 2017, pages 2716–2726. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D17-1288
Miikka Silfverberg, Pekka Kauppinen, and Krister Lindén. 2016. Data-driven spelling correction using weighted finite-state methods. In Proceedings of the SIGFSM Workshop on Statistical NLP and Weighted Automata, pages 51–59. Association for Computational Linguistics, Berlin, Germany. DOI: https://doi.org/10.18653/v1/W16-2406
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8–13, 2014, Montréal, Quebec, Canada, pages 3104–3112.