Lexically Aware Semi-Supervised Learning for OCR Post-Correction

Lexically Aware Semi-Supervised Learning for OCR Post-Correction

Shruti Rijhwani1, Daisy Rosenblum2, Antonios Anastasopoulos3, Graham Neubig1

1Language Technologies Institute, Carnegie Mellon University, EE.UU
2University of British Columbia, Canada
3Departamento de Ciencias de la Computación, George Mason University, EE.UU
srijhwan@cs.cmu.edu, daisy.rosenblum@ubc.ca,
antonis@gmu.edu, gneubig@cs.cmu.edu

Abstracto

Much of the existing linguistic data in many
languages of the world is locked away in non-
digitized books and documents. Optical char-
acter recognition (OCR) can be used to produce
digitized text, and previous work has demon-
strated the utility of neural post-correction
methods that improve the results of general-
purpose OCR systems on recognition of less-
well-resourced languages. Sin embargo,
estos
methods rely on manually curated post-
correction data, which are relatively scarce
compared to the non-annotated raw images
that need to be digitized. en este documento, nosotros
present a semi-supervised learning method
that makes it possible to utilize these raw
images to improve performance, específicamente
through the use of self-training, a technique
where a model is iteratively trained on its own
outputs. Además, to enforce consistency
in the recognized vocabulary, we introduce
a lexically aware decoding method that aug-
ments the neural post-correction model with
a count-based language model constructed
from the recognized texts, implemented us-
ing weighted finite-state automata (WFSA)
for efficient and effective decoding. Resultados
on four endangered languages demonstrate
the utility of the proposed method, with rela-
tive error reductions of 15%–29%, where we
find the combination of self-training and lex-
ically aware decoding essential for achieving
consistent improvements.1

1

Introducción

There is a vast amount of textual data available
in printed form (Dong and Smith, 2018). En esto
paper, we address the task of digitizing printed
materials that contain text in endangered lan-
calibres, eso es, languages with small populations

1Data and code are available at https://shrutirij

.github.io/ocr-el/.

of first-language speakers and limited acquisi-
tion among younger speakers. Printed texts in
endangered languages come from various sources,
including linguistic documentation and cultural
and educational books.

Extracting text data from these documents is
valuable for a multitude of reasons. Automatic dig-
itization can aid language documentation, preser-
vación, and accessibility efforts by archiving the
texts and making them searchable for language
learners, profesores, and speakers, contributing to
essential resources for community-based language
revitalization. Más, most endangered languages
are under-represented in natural language process-
ing technologies, primarily because there is little
to no data available for training and evaluation
(Joshi et al., 2020). This challenge can be mit-
igated by converting printed materials in these
languages to a machine-readable format.

Optical character recognition (OCR) sistemas
can be used to produce digitized text, and recent
trabajar (Rijhwani et al., 2020) has demonstrated
that post-correction improves the performance of
existing general-purpose OCR systems on endan-
gered languages (an example is in Figure 1). Mayoría
state-of-the-art OCR post-correction methods use
neural sequence-to-sequence models and rely on
considerable resources such as a large number
of manual transcriptions (Schnober et al., 2016;
Rigaud et al., 2019) or substantial textual data to
train a language model (Dong and Smith, 2018).
To adapt these methods for the less-well-resourced
endangered languages setting, Rijhwani et al.
(2020) add translations and structural biases to
el modelo.

Sin embargo, even with such methods targeted
to low-resource learning, post-correction perfor-
mance is still dependent on manually curated data,
which are minimally available for most endan-
gered languages. Por otro lado, unannotated

1285

Transacciones de la Asociación de Lingüística Computacional, volumen. 9, páginas. 1285–1302, 2021. https://doi.org/10.1162/tacl a 00427
Editor de acciones: Yang Liu. Lote de envío: 6/2021; Lote de revisión: 8/2021; Publicado 11/2021.
C(cid:2) 2021 Asociación de Lingüística Computacional. Distribuido bajo CC-BY 4.0 licencia.

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
4
2
7
1
9
7
4
7
6
3

/

/
t

yo

a
C
_
a
_
0
0
4
2
7
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

increasing number of correctly predicted words,
resulting in a better count-based language model.
Como consecuencia, joint decoding reinforces the pre-
diction of more accurate words and mitigates
the noise introduced by incorrect words in the
pseudo-training data.

We conduct experiments on four endangered
idiomas: Ainu, Griko, Kwak’wala, and Yakkha.
Our proposed method reduces the character and
word error rates by 15%–29% over a state-of-
the-art OCR post-correction method for endan-
gered languages. We find that the combination
of self-training and lexically aware decoding is
essential for achieving consistent improvements
en desempeño.

2 Problem Formulation

Optical Character Recognition The OCR task
involves generating a transcription of the text
contained in an image. en este documento, we use
existing OCR tools (detailed in Section 6.4) a
obtain a first pass transcription for the images
in our dataset. The first pass transcription is
a text sequence of N characters, denoted as
x = [x1, . . . , xN ].

OCR Post-Correction Even state-of-the-art
OCR models are susceptible to making recogni-
tion errors (Dong and Smith, 2018). Errors are
particularly frequent in the case of endangered
languages because most off-the-shelf OCR tools
do not directly support these languages and train-
ing a high-performance OCR system is challeng-
ing given the small amount of data that
es
typically available (Rijhwani et al., 2020). Nosotros
use OCR post-correction to correct these errors
and improve the quality of the transcription.

The post-correction model takes the first pass
transcription x as input and generates the cor-
rected transcription, a sequence of T characters
denoted as y = [y1, . . . , yT ]:

y = arg max

y(cid:2)

pcorr(y(cid:2)|X)

3 Base Model

As the base post-correction model, we use the
model from Rijhwani et al. (2020): a sequence-
to-sequence model that uses an attention-based
LSTM encoder-decoder (Bahdanau et al., 2015),

Cifra 1: OCR post-correction on a scanned docu-
ment that contains text in the endangered language
Kwak’wala. The goal of post-correction is to fix the
recognition errors made by the first pass OCR system.

raw images that need to be digitized are relatively
less scarce; for many endangered languages, hun-
dreds of printed pages exist, with only a small
subset manually transcribed. en este documento, nosotros
propose a semi-supervised learning method for
OCR post-correction that efficiently utilizes these
unannotated pages to improve performance.

The method has two key components. We first
present a self-training method for OCR post-
correction (Sección 4) to create pseudo-training
datos. A baseline post-correction model is used to
correct the initial OCR output on the unannotated
paginas, and the generated ‘‘post-corrected’’ text is
then used as pseudo-training data to improve the
post-correction model. The self-training process
is repeated to iteratively obtain better predictions
on the unannotated pages.

While self-training is a straightforward way to
use the unannotated data, incorrect predictions
in the pseudo-training data may introduce noise
into the model (Zhu and Goldberg, 2009). A
counterbalance the influence of this noise, nosotros
propose lexically aware decoding (Sección 5),
an inference strategy that encourages the model
to generate predictions that contain ‘‘known’’
palabras. We use the pseudo-training data to train
a count-based language model, represented with
a weighted finite-state automaton (WFSA). Nuestro
proposed decoding method jointly uses an LSTM
decoder and the WFSA to make OCR post-
correction predictions.

The intuition behind the joint decoding strategy
is simple. As the model iteratively improves with
self-training, the quality of the pseudo-training
data is also likely to improve and contain an

1286

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
4
2
7
1
9
7
4
7
6
3

/

/
t

yo

a
C
_
a
_
0
0
4
2
7
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

with adaptations for low-resource OCR post-
correction. We briefly describe the method here
but refer readers to the original paper for details.
The OCR post-correction model takes the first
pass transcription x as input, with the aim of pre-
dicting an error-free transcription y. Primero, cada
character in the input sequence x is mapped to
a vector representation using character embed-
dings. This forms a sequence of vectors, x =
[x1, . . . , xN ]. The encoder is a character-level bi-
directional LSTM (Hochreiter and Schmidhuber,
1997), which transforms x into a sequence of
hidden state vectors h = [h1, . . . , hN ].2

The model’s decoding process uses an attention
mechanism to provide context from the encoder
hidden states. At each decoding timestep t, el
attention layer uses h and the decoder state from
the previous timestep, st−1, to produce the context
vector ct. The LSTM decoder, given ct, calcula
the output state st and subsequently the probability
distribution yt for generating the next character of
the target sequence y:

pag (yt) = softmax (Wst + b)

(1)

Rijhwani et al. (2020) adapt the encoder-decoder
model above for low-resource post-correction
by adding pretraining and three structural biases:

• Diagonal Attention Loss: OCR post-
correction is a monotonic sequence-to-
sequence task. Por eso, the attention weights
are expected to be higher closer to the
diagonal—adding attention elements off the
diagonal to the training loss encourages mo-
notonic attention (Cohn et al., 2016).

• Copy Mechanism: The copy mechanism
enables the model to choose between gen-
erating a character based on the decoder state
(Ecuación 1) or copying a character directly
from the input sequence x by sampling from
the attention distribution (Gu et al., 2016; Ver
et al., 2017).

2Rijhwani et al. (2020) incorporate translations into the
model with a multi-source encoder. We omit this from our
formulation, considering applicability to texts without avail-
able translations. Sin embargo, adding an encoder into our
framework remains straightforward and can be used if
translations exist.

• Coverage: The coverage vector keeps track
of attention weights from previous timesteps.
It is used as additional information when
computing ct and is added to the training loss
to discourage the model from repeatedly at-
tending to the same character (Mi et al., 2016;
Tu et al., 2016).

The model is trained in a supervised manner
with a small number of manual transcriptions:
The training data includes pairs of first-pass OCR
text with its corresponding error-free transcription.
The post-correction training loss function (de-
noted as L) is a combination of cross-entropy loss
along with the diagonal attention loss and the cov-
erage loss from the structural biases. Inference with
a trained model is performed using beam search.
In the following sections, we use the method
described above as a base model for our pro-
posed semi-supervised learning technique for
OCR post-correction. Given the minimal man-
ually transcribed data we have in endangered lan-
calibres, our approach aims to efficiently use the
relatively larger number of pages without gold tran-
scriptions to improve performance. Para tal fin,
we introduce two methodological improvements:
(1) self-training and (2) lexically aware decoding.

4 Self-Training

a

Self-training is
semi-supervised learning
método, where a trained model is used to make
predictions on unlabeled data, and the model is
then retrained on its own predictions (Zhu and
Goldberg, 2009).

Consider that we have a set of images with
manually created transcriptions and a set of images
without gold transcriptions. We can obtain a first
pass transcription for the text contained in the
images (both sets) with existing OCR tools.

Más formalmente, we have a gold-transcribed da-
taset D = {(cid:3)X(i), y(i)(cid:4)}d
yo=1, where x(i) is the first
pass transcription and y(i) is the error-free man-
ual transcription of the ith training instance.3 We
also have a dataset for which only the first pass
OCR is available (es decir., no manual transcriptions),

3In our dataset, the source and target data instances are
either at the line-level or the sentence-level (mira la sección 6.1).

1287

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
4
2
7
1
9
7
4
7
6
3

/

/
t

yo

a
C
_
a
_
0
0
4
2
7
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

U = {X(j)}tu
gered languages setting,
transcriptions is much larger, eso es, tu (cid:5) d.

j=1. For most cases in the endan-
the set without gold

Since self-training requires a baseline model to
get an initial set of predictions on U , we first train
the base model described in Section 3 on the gold-
transcribed set D. Let the trained base model be
. Próximo, we use the predictions on U from fθ to
self-train the model. We follow the self-training
strategy recommended in He et al. (2020), cual
involves two steps: ‘‘pseudo-training’’ and ‘‘fine-
tuning’’. We describe each step of the self-training
procedure in detail below:

1. Apply the initial OCR post-correction model
fθ to each instance in the set U to obtain
predictions using beam search inference.

For an instance x, let the prediction be fθ(X).

2. Create a pseudo-annotated dataset with the
predictions from step 1. Let this be S =
{(cid:3)X, (X)(cid:4) | x ∈ U }.

3. Train the model fθ on sets U and S.

This is the pseudo-training step. Aquí, nosotros
first train the encoder and the decoder com-
ponents with a language modeling objective,
and then train the end-to-end post-correction
modelo. The procedure is as follows:

(a) Train the encoder with a character-level
language modeling (LM) objective on U .
As discussed in Section 3, the encoder
component of the model is an LSTM
that operates at the character-level. Nosotros
pseudo-train this LSTM with a language
model objective on each text sequence
x ∈ U .
Eso es, at each timestep t, the LSTM is
trained to predict the next character in
the input sequence. Given a sequence of
characters x = [x1, . . . , xN ], the train-
norte
t=1 P (xt |
ing objective maximizes
x1, . . . , xt−1).
This is the standard LM objective func-
tion and has been proven helpful for
pretraining LSTMs to improve hidden
representaciones (Dai and Le, 2015;
Ramachandran et al., 2017).

(cid:2)

(b) Train the decoder LSTM with the LM
objective described above, using the
baseline model’s predictions {(X) |
x ∈ U }.

(C) Train the sequence-to-sequence model
on the pseudo-annotated dataset S with
the post-correction loss function L from
Sección 3.

4. Given the pseudo-trained model fθ, fine-tune
the model on the gold-transcribed dataset D,
with the loss function L.

5. Repeat step 1 to step 4 until a specified
stopping criterion (p.ej., no improvement in
the validation set performance or reaching
the maximum permitted iterations).

As indicated above, self-training is a straight-
forward semi-supervised technique to leverage
documents without gold transcriptions to improve
OCR post-correction performance. We note that
some self-training methods (Yarowsky, 1995; Sotavento
et al., 2013; Zoph et al., 2020, inter alia) replace
steps 4 y 5 with a single step that trains fθ on
S ∪ D. Sin embargo, this led to slightly worse perfor-
mance in our preliminary experiments. Nosotros también
observed that pseudo-training the LSTMs with an
LM objective (steps 3(a) y 3(b) arriba) is neces-
sary for good performance and that applying the
self-training steps on fθ from the previous iteration
led to better results than re-initializing the model.4
Más, as recommended in He et al. (2020) a
improve self-training for neural sequence genera-
ción, we add a dropout layer into the base model
at the encoder and decoder hidden states during
pseudo-training and fine-tuning (steps 3 y 4).

5 Lexically Aware Decoding

Although self-training is a simple approach that
leads to improvements in post-correction per-
formance without additional manual annotation,
incorrect predictions in the pseudo-annotated data
may introduce noise into the model, potentially
reinforcing the errors in the next self-training

4En experimentos preliminares, we also tried using S ∪ D
in step 3(C). Sin embargo, the post-correction performance was
approximately the same as using only the set S.

1288

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
4
2
7
1
9
7
4
7
6
3

/

/
t

yo

a
C
_
a
_
0
0
4
2
7
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

iteration (Zhu and Goldberg, 2009). Such noise is
more likely to occur in the endangered languages
configuración, where the base model is trained on min-
imal data and thus sometimes generates errone-
ous predictions.

While some self-training methods use confi-
dence scores to remove noisy predictions (como
Yarowsky, 1995), these are typically designed for
classification tasks. Designing such heuristics is
challenging for OCR post-correction because the
predictions are generated at the character-level;
specific characters may be incorrect, but discard-
ing the entire predicted sequence (es decir., the line or
oración) is inefficient, particularly in a low-
resource scenario. To mitigate these issues, nosotros pro-
pose lexically aware decoding, an inference strategy
based on our observations of the challenges asso-
ciated with the OCR post-correction task.

More specifically, our preliminary experiments
with self-training indicated that the errors made
by the model are typically inconsistent. Para
particular word, some instances may be correctly
predicted by the model. For the instances of the
word that are incorrect, we observe that they
are likely to be erroneous in different ways, es decir.,
different subsets of characters in the word are
incorrectly predicted. This is expected since the
same word can appear in varied contexts, or the
first pass OCR for the word can differ. Our empir-
ical observations on the pseudo-annotated dataset
S showed that, since the errors are inconsistent,
the correct form of the word is more frequent
than incorrect forms. Lexically aware decoding
is designed to influence the OCR post-correction
model to generate words that frequently occur in
the set S, in the expectation that these are correct
word forms.

We first describe the construction of a model
that accounts for word frequency in the predictions
along with a character n-gram model to enable the
prediction of unseen words. Entonces, we present a
joint decoding method that uses the frequency-
based models in combination with the LSTM
decoder for improved OCR post-correction.

5.1 Count-Based Language Model

From the self-training method in Section 4, nosotros
have a pseudo-annotated dataset S = {(cid:3)X, (X)(cid:4) |
x ∈ U }, where fθ(X) is the model’s prediction
for input sequence x. We train a count-based
word-level unigram language model (LM) en

{(X) | x ∈ U }. The LM is built by computing
frequency-based probabilities for each word found
in the predictions.

We also have to reserve some probability mass
in the LM to account for unknown words (palabras
unseen in the predictions). We use modified
Kneser-Ney smoothing to derive the unknown
palabra (‘‘'') probabilidad. Since the model is
a unigram LM, the smoothing process is simi-
lar to absolute discounting. Sin embargo, we use the
discount values based on the modified Kneser-
Ney method, which are derived from word counts
in the dataset, as opposed to using a fixed dis-
count value (Kneser and Ney, 1995; Chen and
Buen hombre, 1999). We denote the probability from
the smoothed LM for a known word w as
pword(w) and the unknown word probability as
pword().

A count-based unigram LM is a simple model
but is suitable given our empirical observations
on word-level errors (described earlier in this
sección) porque (1) it explicitly models word
frequency, (2) él
is straightforward to update
as the pseudo annotated dataset improves over
self-training iterations, y (3) it can be expressed
as a weighted finite-state automaton which, as we
discuss next, has several properties useful for our
decoding method.

5.2 Weighted Finite State Automaton

A weighted finite-state automaton (WFSA) is a set
of states and transitions between the states. Cada
transition accepts a particular symbol as input and
has a weight associated with it. The symbols come
from a finite alphabet Σ. A sequence of consecu-
tive transitions is referred to as a ‘‘path’’, y el
label of a path is the concatenation of all sym-
bols consumed by its constituent transitions. El
WFSA has a start state and a set of final states. A
successful path is a path from the start state to a fi-
nal state, and a sequence of symbols is ‘‘accepted’’
by the WFSA if there exists a successful path that
consumes this sequence (Mohri et al., 2002).

Since we are focused on decoding and only need
the best scoring path for any given sequence (es decir.,
Viterbi search), we consider the weights over the
tropical semiring. Eso es, the weight of a path is
the sum of its transition weights, and the score of
a sequence of symbols is the minimum weight of
all the successful paths that accept that sequence.

1289

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
4
2
7
1
9
7
4
7
6
3

/

/
t

yo

a
C
_
a
_
0
0
4
2
7
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
4
2
7
1
9
7
4
7
6
3

/

/
t

yo

a
C
_
a
_
0
0
4
2
7
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Cifra 2: El (a) WFSA and (b) minimized WFSA we construct, for a hypothetical language model with a two
word vocabulary: PAG (dog) = 0.75; PAG (door) = 0.2; PAG () = 0.05. The transition weights are negative log
probabilities. En (b), por simplicidad, we show only the known word states after determinization and minimization.

Decoding with the post-correction model is at
the character-level (Ecuación 1), so in order to
leverage word frequency in the decoding process,
we convert the count-based word-level LM de-
scribed in Section 5.1 to a WFSA representation
that consumes and scores sequences at
el
character-level.

The WFSA is constructed to accept the words
known to the LM by consuming each character in
la palabra (in sequence) as input. The score of the
path that accepts a known word w is the negative
log of its probability from the LM: −log pword(w).
A simple example is shown in Figure 2(a).

The WFSA, as described above, can only accept
a single word. Sin embargo, the input and correspond-
ing predictions of the post-correction model are
sequences of words, typically lines or sentences.
To enable the WFSA to accept such sequences,
we add transitions that accept a set B of word
boundary symbols (whitespace, punctuation, y
end-of-sequence) from the states at the end of the
known words (p.ej., estados 1 y 2 En figura 2(a))
back to the start state. Once in the start state, el
model can begin consuming characters from the
next word.

Más, we modify the WFSA such that the start
state is also the only final (accepting) state since
the predicted sequence is considered complete
only when the model predicts an end-of-sequence
symbol after the last character.

based LM, we include an unknown word state in
the WFSA, as shown in Figure 2(a). We add an
(cid:2)-transition (a transition that consumes no input),
with an associated cost −log pword() (es decir.,
the probability mass reserved for unknown words
en la sección 5.1) to enter the unknown word state
from the start state. The model remains in the
unknown state, accepting symbols that form an
unseen word, until a word boundary symbol from
the set B is consumed to return to the start state.

We design the unknown word state to accept
any combination of the symbols in Σ, thereby
permitting the prediction of words unseen by the
word-level LM. To score each character consumed
at the unknown word state, we use a character-
level n-gram language model.5 We denote the
probabilities from this character n-gram LM as
pchar. The probability distribution is estimated
with modified Kneser-Ney smoothing on char-
acter n-grams from unique word forms in the
colocar {(X) | x ∈ U }. We use unique word
forms because unknown words are likely rare, y
using count-based word forms would undesir-
ably shift the probability mass towards more fre-
quent words.

5.3 Efficient Scoring with the WFSA

The constructed WFSA has states to score char-
acter sequences that form known words and an

Character LM for Unknown Words To enable
the prediction of words unknown to the count-

5We use n = 6 in this paper. We experimented with
different values of n in early experiments but found that
norte = 6 gave the best results for all languages in the dataset.

1290

unknown word state that relies on a character
n-gram LM to score unknown sequences.

During inference, we independently score the
next character through the known word model
and the unknown word model and then choose the
best scoring path. This formulation has two advan-
tages: (1) separate scoring allows us to compactly
represent the WFSA states for known words and
(2) instead of representing the character n-gram
LM directly in the WFSA, leading to the number
of states exponentially increasing with n, podemos
use highly optimized LM toolkits such as KenLM
(Heafield et al., 2013) for scoring unknown words.

Known Word Model Consider the WFSA with
only known word states. We apply standard al-
gorithms for determinization and minimization on
these states, which leads to an efficient and com-
pact representation of the count-based language
modelo (Mohri, 1996). As shown in Figure 2(b), el
resultant minimized WFSA has several properties
useful for our decoding method, discussed below.
Determinization ensures that each state has
at most one outgoing transition that consumes a
given input symbol, and minimization eliminates
redundant states and transitions, reducing the time
and space needed to process an input sequence.

Más, minimization includes pushing the
transition weights towards the start state of the
WFSA as much as possible (Mohri et al., 2002).
This lends itself well to our method since inference
in the OCR post-correction model is performed
with beam search; if the cost of a path is es-
tablished closer to the start state, unfavorable
hypotheses can be pruned at an earlier timestep,
which allows us to avoid errors more effec-
tively within an approximate search algorithm like
beam search.

Por último, since each state in the WFSA has at
most one outgoing transition for each symbol,
the transition scores can be precomputed and
stored as a matrix, allowing efficient retrieval
during decoding.

At decoding timestep t, let the previous timestep
score from the known word model be known(yt−1)
and the current WFSA state be st−1. The score for
predicting the next character yt is the weight of
the transition from state st−1 that consumes yt in
the minimized WFSA (ver figura 2(b)). De este modo,

conocido(yt) = known(yt−1) + scorewfsa(yt | st−1)

where known(y0) = 0. If yt does not continue
the path of any known word, then scorewfsa(yt)
is inf.

Unknown Word Model We use the probability
pchar from the character n-gram language model
to score unknown words. En general, at decod-
ing timestep t, the unknown model score for yt
will be:

unk(yt) = unk(yt−1) − log pchar(yt | yt−n, . . . , yt−1)

Sin embargo, if yt−1 ∈ B (es decir., the previous word is
complete) or t = 0, the WFSA is currently in the
start state. To begin an unknown word, nosotros también
need to add the weight of entering the unknown
word state to unk(yt), eso es, − log pword().

Best Scoring Path The scores are in the tropical
semiring (negative log probabilities). At timestep
t, the best score for yt from the lexical models is:

scorelex(yt) = min(conocido(yt), unk(yt))

(2)

During decoding, we keep track of both the
known and unknown model scores for the current
word being generated in the hypothesis. Cuando
the word is completed (when yt ∈ B), both the
known and unknown word models return to the
start state of the WFSA (ver figura 2). Desde
the two paths are in the same state and are thus
indistinguishable with respect to future predictions
in the hypothesis, we choose the best scoring path
to continue decoding. This is known as hypoth-
esis recombination.

The WFSA framework, de este modo, allows us to effi-
ciently represent the word-level LM in a manner
that scores symbols at the character-level and
leverage a character n-gram model to score un-
known words. This enables joint inference with
the character-level LSTM decoder in the OCR
post-correction model, as discussed below.

5.4 Joint Decoding with the LSTM

At decoding timestep t, let plstm(yt) be the prob-
ability of generating a character yt based on the
LSTM decoder’s hidden state (Ecuación 1). Nosotros
also compute scorelex(yt), which is a negative log
probabilidad, as defined in Equation 2. The final
probability of predicting yt is obtained through

1291

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
4
2
7
1
9
7
4
7
6
3

/

/
t

yo

a
C
_
a
_
0
0
4
2
7
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

linear interpolation between these two scores,6
weighted by a hyperparameter λ:

pag(yt) = (1 − λ) · plstm(yt) + λ · plex(yt)

(3)

where plex(yt) = exp (−scorelex(yt)).

This joint decoding strategy is applied when
performing inference with beam search using a
trained OCR post-correction model. When used
in combination with self-training, the predictions
made by the model improve as we repeat the self-
training process, iteratively improving the count-
based LM and resulting in a better distribution
of plex(yt).

6 experimentos

En esta sección, we present experiments with our
semi-supervised post-correction method on four
typologically diverse endangered languages.

6.1 Datasets

We use the OCR post-correction dataset from
Rijhwani et al. (2020), which contains transcribed
documents in three endangered languages: Ainu,
Griko, and Yakkha. Además, in this paper,
we create a similar dataset in the endangered
language Kwak’wala. We describe the datasets
abajo, including the sizes of the gold transcribed
and unannotated sets we use for semi-supervised
training:

Ainu (ain) is a severely endangered language
from northern Japan. The dataset contains pages
from a book of Ainu epic poetry (Kindaichi, 1931).
The Ainu text is written in the Latin script. El
dataset contains 816 manually transcribed lines as
well as 7,646 lines without gold transcriptions.

Griko (grk), an endangered Greek dialect spo-
ken in southern Italy, is written with a combination
of Latin and Greek alphabet. The document in
the dataset is a book of Griko folk tales (Stomeo,
1980). Hay 807 y 3,084 sentences with and
without gold transcriptions, respectivamente.

Yakkha (ybh) is an endangered language spo-
ken in Nepal and is written in the Devanagari
script. The dataset contains transcriptions of three
children’s books (Schackow, 2012). In total, allá
son 159 manually transcribed sentences and no

6We leave other interpolation techniques like log-linear
interpolation and more complex combinations of scores
(such as the WFSA-based reranking and rescoring methods
proposed by Ryskina et al. [2021]) as potential future work.

unannotated lines in the dataset. Por lo tanto, como el
unannotated set, we use the first pass OCR on
the validation and test sets in a transductive learn-
ing setting (≈ 30 oraciones: mira la sección 6.2 para
data splits).

Kwak’wala (kwk) is spoken on Northern Van-
couver Island, nearby small
islands, y el
opposing mainland. The language is severely en-
dangered, with estimates of ≈150 first-language
speakers, all over the age of 70 años. The Kwak’
wala language includes 42 consonantal pho-
nemes (twice as many as English) and a wide
range of allophonic vowels. Several writing sys-
tems exist and community preference varies
between two orthographies:
the U’mista and
Liq’wala systems.

Sin embargo, much of the written documentation
for Kwak’wala is in another orthography that was
developed by anthropologist Franz Boas. El
Boas orthography (Boas, 1900) was used in the
extensive documentation of the Kwak’wala lan-
guage and its speakers produced by Boas in col-
laboration with native-speaker George Hunt. El
Boas writing system uses Latin script characters
as well as diacritics and digraphs to represent
phonemic differences. Although the Boas orthog-
raphy is not widely used today, the cultural and
linguistic materials previously written by Boas
are of tremendous value to community-based re-
searchers. Sin embargo, they are minimally accessible
since they currently exist only as non-searchable
scanned images.

In consultation with members of language revi-
talization projects in three Kwakiutl communities
(Tsulquate, Fort Rupert, Quatsino), we focus on
digitizing these significant cultural resources. Nosotros
create a dataset with pages from the ‘‘Ethnology
of the Kwakiutl’’ (Boas, 1921), containing 262
gold-transcribed lines and 2,255 unannotated lines.

6.2 Experimental Setup

Data Splits We follow Rijhwani et al. (2020)
and perform 10-fold cross-validation for all ex-
perimentos. For each language, the gold-transcribed
data is split into 10 segments, and for each cross-
validation fold, eight segments are used for train-
En g, one for validation, and one for testing.

Metrics We evaluate our systems in terms of
character error rate (CER) and word error rate
(WER), both standard metrics for measuring
OCR and OCR post-correction performance

1292

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
4
2
7
1
9
7
4
7
6
3

/

/
t

yo

a
C
_
a
_
0
0
4
2
7
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

(Berg-Kirkpatrick et al., 2013; Schulz and Kuhn,
2017). CER is the character-level edit distance
between the predicted text and the corresponding
gold transcription, divided by the total number
of characters in the gold transcription. WER is
similar but is calculated at the word-level. Para
readability, we report CER and WER as percent-
ages for all experiments.

Métodos
performance of the following methods:

En nuestros experimentos, we compare the

• FIRST-PASS: To obtain the first pass OCR
transcription, we experiment with two ex-
isting OCR systems: Google Vision (Fujii
et al., 2017) and Ocular (Berg-Kirkpatrick
et al., 2013).

For each language, we choose the best per-
forming first pass system,
the details of
which are in Section 6.4. We use Ocular
for Kwak’wala and Google Vision for Ainu,
Griko, and Yakkha.

• BASE: The current state-of-the-art in OCR
post-correction for endangered language
textos (Rijhwani et al., 2020; described in
Sección 3).

• SEMI-SUPERVISED: Our proposed method as

descrito en la Sección 4 y Sección 5.

Implementation The neural post-correction
models are implemented using the DyNet neural
network toolkit (Neubig et al., 2017). The WFSA
is implemented using the MFST Python wrapper
on OpenFST (Francis-Landau, 2020), and we use
the KenLM toolkit (Heafield et al., 2013) to train
and query the character n-gram language model.
Following Rijhwani et al. (2020), results reported
are the average of five randomly seeded runs (es decir.,
five runs for each of the 10 cross-validation folds).

6.3 Main Results

Mesa 1 shows the performance of the baselines
and our proposed semi-supervised approaches for
the four languages in the dataset. For all languages,
using semi-supervised learning leads to substantial
reductions in both CER and WER.

best models based on the validation set WER. Ex-
tensive analysis of these factors is in Section 6.6.
Primero, we note that the BASE post-correction
method improves error rates over the first pass for
all languages. With our proposed semi-supervised
learning method, combining self-training with
lexically aware decoding leads to the best per-
formance across all the languages, with error rate
reductions in the range of 15%–29%.

This is especially noticeable in Ainu, dónde
using either self-training or lexical decoding in-
dependently results in worse performance than
the BASE system, but jointly using them improves
the CER by 21%. For the other languages, el
independent components improve over the base
model but less so than their combination. Este
indicates the complementary nature of the two
componentes: The language model used for lex-
ically aware decoding is improved by self-training.
Sucesivamente, it reinforces correctly predicted words
to counteract the influence of incorrect pseudo-
annotated instances.

6.4 First Pass OCR Systems

We experiment with two existing OCR systems
to obtain a first pass transcription on our dataset.
The first of these is the Google Vision system
(Fujii et al., 2017; Ingle et al., 2019). This off-
the-shelf OCR model supports 60 languages in
27 scripts—these are primarily higher-resourced
languages and do not include our target endan-
gered languages.

The second system is Ocular (Berg-Kirkpatrick
et al., 2013). Ocular uses a generative model to
transcribe scanned documents: The model gen-
erates the image by learning the font of the
documento. Ocular relies on a character n-gram
language model trained on the target language.
We initialize the LM with the small number of
gold-transcribed pages in our dataset. Para esto, nosotros
use the 10-fold cross-validation setup described in
Sección 6.2: We use the training segments to train
the Ocular LM and the test segment to evaluate
OCR performance. The font model has parame-
ters to learn the shape of each character in the LM
vocabulary. After initialization, the parameters
are updated in an unsupervised manner with EM
until convergence.

We note that we did a hyperparameter search
over the number of self-training iterations and the
weight of the WFSA λ, and Table 1 presenta el

Results are presented in Table 2. We note that
although the Google OCR system is not trained on
our target languages, it is trained on large amounts

1293

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
4
2
7
1
9
7
4
7
6
3

/

/
t

yo

a
C
_
a
_
0
0
4
2
7
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Modelo
FIRST-PASS
BASE
SEMI-SUPERVISED
Self-Training
Lexical Decoding
Ambos
Error Reduction

(cid:3)

BASE−Both
BASE

% Character Error Rate

% Word Error Rate

ain
1.34
0.80

grk
3.27
1.70

ybh
8.90
8.44

kwk
7.90
4.97

ain
6.27
5.19

grk
15.63
7.51

ybh
31.64
21.33

kwk
38.22
27.65

1.45
1.51

6.47
5.31
7.20
0.82
6.60
0.81
5.18
7.56
6.36
0.63 1.37 5.98 3.82 4.43
21% 19% 29% 23% 15% 15%

4.00
4.28

23.98
18.09
19.13
25.09
16.65 22.61
18%
22%

(cid:4)

Mesa 1: Our semi-supervised approach improves performance over the baselines (10-fold
cross-validation averaged over five randomly seeded runs). ‘‘Self-Training’’ and ‘‘Lexical
Decoding’’ refer to experiments where we use these methods independently. ‘‘Both’’ refers to
their combination. We highlight the best model for each language.

% Character Error Rate

OCR System
Ocular
Google Vision

ain
10.49
1.34

grk
4.58
3.27

ybh
75.60
8.90

kwk
7.90
21.12

ain
47.47
6.27

% Word Error Rate
ybh
99.37
31.64

grk
15.71
15.63

kwk
38.22
82.08

Mesa 2: First pass OCR system performance. If the language’s script is not covered by Google Vision
(as for Kwak’wala), then Ocular results in better recognition. De lo contrario, Google Vision OCR is usually
significantly better.

of data in high-resource languages that share writ-
ing systems with Ainu, Griko, and Yakkha (latín,
Griego, Devanagari scripts) and thus can recognize
characters in these scripts with reasonable accu-
racy (see Rijhwani et al. [2020] for a more detailed
análisis). Por otro lado, the performance is
much worse on Kwak’wala since the system has
not been trained on the Boas orthography, cual
is unique to the Kwak’wala language. The Boas
orthography uses several Latin script characters,
which the system is able to recognize, but it also
includes characters unique to the writing system
that are incorrectly transcribed by the OCR model.
We find that the performance on Kwak’wala
is considerably better with the Ocular system
because the LM is trained on Kwak’wala text.
De este modo, unlike Google Vision, the model vocabu-
lary contains the Boas writing system’s alphabet.
Por otro lado, Ocular’s performance on Ainu
and Griko is worse than Google Vision, likely
due to the minimal data available for training it.
Además, we observe that performance is cor-
related with the word overlap between test data
and the data used for training the LM, demon-
strating Ocular’s reliance on a strong language

model—the word overlap is 73% for Griko, 56%
for Kwak’wala, y 48% for Ainu.

Finalmente, we find that Ocular does not perform
well on the Yakkha dataset. This is because the
design of Ocular’s font model does not work
with how the Devanagari script is written. Más
específicamente, when a vowel diacritic is applied
the characters are combined:
to a consonant,
. In Unicode, this is repre-
=
+
p.ej.,
'', dónde
sented by two characters ‘‘
the dotted circle is the character combination
marker in the Unicode Standard.7

’’ and ‘‘

Sin embargo, since Ocular’s font model operates at
the character-level, it tries to generate the images
of these two characters separately. Generating the
’’ on its own is not meaningful: El
diacritic ‘‘
dotted circle never appears in the input image
because it is supposed to be combined. De este modo, el
font model is unable to converge as it cannot
handle character combinations when generating
the image.

7https://www.unicode.org/versions/Unicode

13.0.0/ch02.pdf.

1294

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
4
2
7
1
9
7
4
7
6
3

/

/
t

yo

a
C
_
a
_
0
0
4
2
7
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Known Word Unknown Word
Modelo
CHARLM
WORDLM
OURS

ain
Modelo
(not needed)
0.64
Character uniform 0.64
Character n-gram 0.63

% Character Error Rate

% Word Error Rate

grk
1.43
1.42
1.37

ybh
6.22
6.12
5.98

kwk
3.85
3.95
3.82

ain
4.50
4.50
4.43

grk
6.44
6.39
6.36

ybh
16.78
16.71
16.65

kwk
22.90
23.11
22.61

Mesa 3: A more informed unknown word model (character n-gram) in combination with the word-level
known word model consistently performs better than the alternatives for all four languages in our
conjunto de datos.

6.5 Comparing Language Models

Our proposed decoding method uses a count-based
word-level LM in combination with a character
n-gram LM to compute plex for joint decoding with
the LSTM decoder (Ecuación 3). En esta sección,
we substitute this model with two other variants
of count-based LMs to compute plex and evaluate
their performance:

Lang.
Code
ain
grk
ybh
kwk
Average

LM

Known

Unknown
Coverage BASE OURS BASE OURS
0.25
0.95
0.71
0.89
0.59
0.90
0.89
0.58
0.91 0.95 0.40 0.53

0.97
0.94
0.68
0.59
0.80

0.08
0.51
0.51
0.50

0.98
0.96
0.95
0.92

• CHARLM: We use a character 6-gram lan-
guage model trained on the model predictions
from self-training {(X) | x ∈ U }, esti-
mated with modified Kneser-Ney smoothing.

• WORDLM: We use the word-level LM de-
scribed in Section 5.1, but do not use a
character n-gram model for unknown words.
En cambio, we score unknown words with a sim-
ple uniform probability over all characters in
the vocabulary.

We tune λ on the validation set for each model
independently and report results with the best set-
ting in Table 3. Using either CHARLM or WORDLM
for lexically aware decoding improves the er-
ror rates with respect to the BASE model. El
word-level model performs better for all languages
except Kwak’wala, likely due to the large per-
centage of unknown words in this language. Nosotros
also see that our proposed method, which lever-
ages a count-based word-level LM for known
words combined with a character-level LM for
scoring unknown words, results in the best per-
formance overall.

Although not observed in our dataset, we note
that some printed materials have a high degree
of spelling variation or contain texts for which
word tokenization is difficult. In such cases, el
word-level model may not be as effective, pero
CHARLM can still be used with the proposed

Mesa 4: Our method improves over the base
model on words that are both known and unknown
to the WFSA. We show the fraction of known test
palabras, and the fraction of correctly predicted
known and unknown words.

lexically aware decoding framework to obtain
improved performance over the baseline method.

6.6 Análisis

We analyze specific components of our model
to understand the advantages of our proposed
acercarse.

Known vs. Unknown Words We first identify
the source of the improvements that our approach
makes over the baseline. Mesa 4 presenta el
fraction of correctly predicted words, split on
whether these words are ‘‘known’’ to the WFSA
(es decir., in the vocabulary of the word-level LM) o
‘‘unknown’’. Intuitivamente, we expect that decoding
with the WFSA will improve prediction on the
known words.

Compared to the baseline, our method improves
on words known to the WFSA, moving from
91% a 95% accuracy on average. Our method
also improves unknown word prediction over the
baseline from an average accuracy of 40% a 53%.
In cases like Kwak’wala, dónde, due to the rich
morphology of the language, más que 40% del

1295

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
4
2
7
1
9
7
4
7
6
3

/

/
t

yo

a
C
_
a
_
0
0
4
2
7
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
4
2
7
1
9
7
4
7
6
3

/

/
t

yo

a
C
_
a
_
0
0
4
2
7
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Cifra 3: The weight of the WFST during joint decod-
ing can affect word error rate (sometimes significantly,
as in Griko; arriba). All other hyperparameters are kept
equal and correspond to the best systems in each
idioma.

Cifra 5: Even a small amount of unannotated data
is useful for our semi-supervised method, improv-
ing WER over BASE (WER=7.51) en (a). Varying the
size of gold-annotated data has a stronger effect on
post-correction performance in (b). Results are shown
with Griko.

This hyperparameter is less important in the
other three languages, leading to smaller variations
en desempeño. Como ejemplo, we depict the
effect on Yakkha in Figure 3, where increasing λ
does not affect performance as much as in Griko.

Self-Training Iterations The
evolution of
WER across 5 self-training iterations for Ainu
and Yakkha is shown in Figure 4. Particularly
for Ainu, we see that combination with lexically
aware decoding is crucial for the success of
self-training does
self-training. For Yakkha,
improve performance independently but is more
effective when lexically aware decoding is used
(error rates on Griko and Kwak’wala follow a
similar trend).

Dataset Size We study the effect of varying the
amount of gold-transcribed and unannotated data
used for training. The WER when varying the size
of the Griko datasets is shown in Figure 5 (the size
of each set is varied while keeping the other set at
its full size). We see that reducing the amount of
gold-transcribed data worsens WER significantly.
Por otro lado, reducing the unannotated data
has a smaller effect: even with a little unannotated
datos, our method improves over the BASE model.

Error Rate in the First Pass OCR To evaluate
how the error rate in the first pass OCR tran-
scription affects subsequent post-correction, nosotros
measure the performance of our proposed method
when applied to first pass outputs from two OCR

Cifra 4: Integrating lexically aware decoding through
interpolation with a WFSA (red lines) aids self-training
in improving WER across iterations. Black dashed lines
correspond to self-training without lexical decoding.

test words are unseen, including an unknown word
model in the WFSA is particularly important.

WFSA Weight One of the important hyperpa-
rameters of our lexically aware method is the
weight that we place on the WFSA score dur-
ing inference (λ in Equation 3). Específicamente,
in the case of Griko, we find that the value of
this hyperparameter can significantly affect per-
rendimiento. As shown in Figure 3, high weights
de λ (es decir., more weight on the WFSA) lead to
suboptimal WER, while lower λ leads to much
better performance.

1296

Cifra 6: Our post-correction model fixes many of the first pass OCR errors that the base model does not fix such
como (a) y (b). En casos raros, our method introduces errors into the transcription such as (C) y (d).

sistemas: Google Vision and Ocular (described in
Sección 6.4). Cifra 7 shows the WER on the
Kwak’wala dataset. We see that, although Google
Vision has a much higher first pass error rate
than Ocular, the post-correction model improves
performance over both OCR systems. Nosotros también
note that the relative error reduction is higher for
the Google Vision system (68%) than for Ocular
(41%), likely because the Ocular LM is trained on
the same data as the post-correction model.

Qualitative Analysis
En figura 6, we show ex-
amples of errors fixed as well as errors introduced
by our post-correction model, as compared to the
baseline system. En figura 6 (a) y (b), we see
that although the baseline corrects some of the
errors in the first pass OCR, it also introduces
errors such as extra diacritics and incorrect sub-
stitutions. Using our proposed method leads to
an error-free transcription of these images. Cómo-
alguna vez, En figura 6 (C) y (d), we see that our
method occasionally introduces errors in predic-
ciones. Específicamente, although the model fixes the
first pass errors, it generates words that are con-
siderably different from the target. Such errors
likely occur when the model follows an incorrect
path in the WFSA during lexically aware decod-
En g. Since we are using beam search, the correct
path cannot be recovered if it was pruned at an
earlier timestep.

7 Trabajo relacionado

OCR post-correction is well-studied in the high-
resource setting, particularly for English. Reciente
methods primarily use neural encoder-decoder
modelos (Dong and Smith, 2018; Rigaud et al.,
2019; H¨am¨al¨ainen and Hengchen, 2019). Allá
has been relatively little work on lower-resourced

Cifra 7: Our post-correction model significantly im-
proves recognition accuracy over different first pass
OCR systems that have varied error rates (Google Vi-
sion and Ocular). Results are shown with Kwak’wala.

idiomas. Kolak and Resnik (2005) present
a probabilistic edit distance model for post-
correction on Cebuano and Igbo, and Krishna
et al. (2018) use a sequence-to-sequence model
with a copy mechanism for improved performance
on Romanized Sanskrit OCR.

While existing neural post-correction methods
do not rely on lexical information, some earlier
methods use dictionaries to improve performance.
Por ejemplo, Tong and Evans (1996) and Niklas
(2010) use lexicons in combination with n-gram
context to generate post-correction candidates for
erroneous words. These methods are typically
evaluated on English and assume the presence
of high-coverage lexicons (Schulz and Kuhn,
2017), making them difficult to adapt to endan-
gered languages.

Related to our decoding method are models
that incorporate lexical knowledge into neural
machine translation. Arthur et al. (2016) propose

1297

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
4
2
7
1
9
7
4
7
6
3

/

/
t

yo

a
C
_
a
_
0
0
4
2
7
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

adding a dictionary for translating low-frequency
words and Zhang et al. (2018) improve decoding
by upweighting translations that contain relevant
palabras. Además, there are methods which add
hard lexical constraints by forcing predictions
to contain user-specified words and phrases
(Hokamp and Liu, 2017; Post and Vilar, 2018).

Por último, we note that our proposed approach
combines information from a neural model and
a finite-state machine to leverage the advantages
of both. In a similar direction, Rastogi et al.
(2016) and Lin et al. (2019) design finite state
architectures with paths weighted by contextual
features from an LSTM. These methods use joint
parameterizations of the models and are thus more
complex to train (particularly in the low-resource
configuración) than the joint decoding method we pro-
pose in this paper.

8 Conclusión

Digitization at scale for documents in under-
represented languages is a promising avenue to-
wards tackling one aspect of their marginalization,
the lack of data. With this work, we take a step to-
wards better digitization for extremely data-scarce
escenarios. We develop a semi-supervised method
that combines self-training with lexically aware
decoding, reducing error rates by up to 29% encima
a state-of-the-art OCR post-correction model on
four typologically diverse endangered languages.
In future work, we plan to expand our method
to take advantage of additional outputs of the
language documentation process. Por ejemplo,
documentary linguists typically collect word lists
(which range from lists of common words like the
Swadesh lists (Swadesh, 1955) to domain-specific
vocabularies). Using such word lists within the
lexically aware decoding framework could further
improve performance and enable the application of
our technique to even lower-resourced languages.
the improvements we achieve
through semi-supervised learning are potentially
orthogonal to the improvements Rijhwani et al.
(2020) achieve by incorporating information from
translations of the target text. As future work, nosotros
plan to investigate the combination of these two
approaches in an attempt to utilize all available
sources of information to improve performance.

Además,

Finalmente, while using a character-level n-gram
LM improves performance on unknown words,
it does not explicitly utilize morphological struc-

ture to generate unseen inflections of words. En
the future, we plan to incorporate morphological
analysis during post-correction decoding, cual
will be helpful for morphologically rich endan-
gered languages.

Expresiones de gratitud

Shruti Rijhwani was supported by the Bloomberg
Data Science Ph.D. Fellowship for this work.

This work was also supported by grant PR-
276810-21 (‘‘Unlocking Endangered Language
Resources’’) from the National Endowment for
the Humanities, grant 1761548 (‘‘Discover-
ing and Demonstrating Linguistic Features for
Language Documentation’’) from the National
Science Foundation, National Research Council
Indigenous Language Technology grant ‘‘Kwak’
wala Corpus Collection Project’’, and Govern-
ment of Canada Social Sciences and Humanities
Research Council
Insight Development grant
GR002807 ‘‘K’ank’otłaxants Awi’nagwis (Know-
ing our land)''. Any views, findings, conclusions
or recommendations expressed in this publication
do not necessarily represent those of the National
Endowment for the Humanities.

We would like to thank Jaymyn La Vallee for
assisting with the Kwak’wala dataset annotation
as well as Samridhi Choudhary, Siddharth Dalmia,
Deepak Gopinath, Maria Ryskina, the reviewers,
and the Action Editors for feedback on the paper.

Referencias

Philip Arthur, Graham Neubig, and Satoshi
Nakamura. 2016. Incorporating discrete transla-
tion lexicons into neural machine translation. En
Actas de la 2016 Conferencia sobre el Imperio-
Métodos icales en el procesamiento del lenguaje natural,
pages 1557–1567, austin, Texas. Asociación
para Lingüística Computacional. https://
doi.org/10.18653/v1/D16-1162

Dzmitry Bahdanau, Kyunghyun Cho, y yoshua
bengio. 2015. Traducción automática neuronal por
En
aprender juntos a alinear y traducir.
3rd International Conference on Learning
Representaciones, ICLR 2015.

Taylor Berg-Kirkpatrick, Greg Durrett, and Dan
Klein. 2013. Unsupervised transcription of his-
torical documents. In Proceedings of the 51st

1298

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
4
2
7
1
9
7
4
7
6
3

/

/
t

yo

a
C
_
a
_
0
0
4
2
7
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Annual Meeting of the Association for Compu-
lingüística nacional (Volumen 1: Artículos largos),
pages 207–217, Sofia, Bulgaria. Asociación para
Ligüística computacional.

Conference on Document Analysis and Recog-
nition (ICDAR), volumen 1, pages 161–168.
IEEE. https://doi.org/10.1109/ICDAR
.2017.35

Franz Boas. 1900. Sketch of the Kwakiutl lan-
guage. American Anthropologist, 2(4):708–721.
https://doi.org/10.1525/aa.1900
.2.4.02a00080

Franz Boas. 1921. Ethnology of the Kwakiutl.
https://doi.org/10.1006/csla.1999
.0128

Stanley F. Chen and Joshua Goodman. 1999.
An empirical study of smoothing techniques
for language modeling. Computer Speech &
Idioma, 13(4):359–394.

Trevor Cohn, Cong Duy Vu Hoang, Ekaterina
Vymolova, Kaisheng Yao, Chris Dyer, y
Gholamreza Haffari. 2016. Incorporating struc-
tural alignment biases into an attentional
neural translation model. En procedimientos de
el 2016 Conference of the North American
Chapter of the Association for Computational
Lingüística: Tecnologías del lenguaje humano,
pages 876–885, San Diego, California. asociación-
ción para la Lingüística Computacional. https://
doi.org/10.18653/v1/N16-1102

Andrew M. Dai and Quoc V. Le. 2015. Semi-
supervised sequence learning. En procedimientos
of the 28th International Conference on Neural
Sistemas de procesamiento de información – Volumen 2,
NIPS’15, page 3079–3087, Cambridge, MAMÁ,
EE.UU. CON prensa.

Rui Dong and David Smith. 2018. Multi-input
attention for unsupervised OCR correction. En
Proceedings of the 56th Annual Meeting of
la Asociación de Lingüística Computacional
(Volumen 1: Artículos largos), pages 2363–2372,
Melbourne, Australia. Asociación de Computación-
lingüística nacional. https://doi.org/10
.18653/v1/P18-1220

Matthew Francis-Landau. 2020. MFST: A python
for custom

openfst wrapper with support
semirings and jupyter notebooks.

Yasuhisa Fujii, Karel Driesen, Jonathan Baccash,
Ash Hurst, and Ashok C. Popat. 2017.
Sequence-to-label script identification for mul-
tilingual OCR. En 2017 14th IAPR International

Jiatao Gu, Zhengdong Lu, Hang Li, and Victor
O.K. li. 2016. Incorporating copying mech-
anism in sequence-to-sequence learning. En
Proceedings of the 54th Annual Meeting of
la Asociación de Lingüística Computacional
(Volumen 1: Artículos largos), pages 1631–1640,
Berlina, Alemania. Asociación de Computación-
lingüística nacional.

Mika H¨am¨al¨ainen and Simon Hengchen. 2019.
From the paft to the fiiture: A fully automatic
NMT and word embeddings method for OCR
post-correction. In Proceedings of the Inter-
national Conference on Recent Advances in
Natural Language Processing (RANLP 2019),
pages 431–436, Varna, Bulgaria. INCOMA Ltd.
https://doi.org/10.26615/978-954
-452-056-4 051

Junxian He,

Jiatao Gu,

Jiajun Shen,

y
Marc’Aurelio Ranzato. 2020. Revisiting self-
training for neural sequence generation. 8th
Conferencia Internacional sobre Aprendizaje Repre-
sentaciones, ICLR 2020, Addis Ababa, Ethio-
pia, Abril 26-30, 2020, Conference Track
Actas.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan
h. clark, and Philipp Koehn. 2013. Scalable
modified Kneser-Ney language model estima-
ción. In Proceedings of the 51st Annual Meeting
de la Asociación de Linguis Computacional-
tics (Volumen 2: Artículos breves), pages 690–696,
Sofia, Bulgaria. Asociación de Computación
Lingüística.

Sepp Hochreiter y Jürgen Schmidhuber. 1997.
Memoria larga a corto plazo. Computación neuronal,
9(8):1735–1780. https://doi.org/10.1162
/neco.1997.9.8.1735, Pubmed: 9377276

Chris Hokamp and Qun Liu. 2017. Lexically con-
strained decoding for sequence generation using
grid beam search. In Proceedings of the 55th
Annual Meeting of the Association for Compu-
lingüística nacional (Volumen 1: Artículos largos),
pages 1535–1546, vancouver, Canada. Associ-
ation for Computational Linguistics. https://
doi.org/10.18653/v1/P17-1141

1299

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
4
2
7
1
9
7
4
7
6
3

/

/
t

yo

a
C
_
a
_
0
0
4
2
7
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

R. Reeve Ingle, Yasuhisa Fujii, Thomas Deselaers,
Jonathan Baccash, and Ashok C. Popat. 2019.
A scalable handwritten text recognition system.
arXiv preimpresión arXiv:1904.09150. https://
doi.org/10.1109/ICDAR.2019.00013

Joshi Pratik, Sebastián Santy, Amar Budhiraja,
Kalika Bali, y Monojit Choudhury. 2020.
The state and fate of linguistic diversity and
inclusion in the NLP world. En procedimientos de
la 58ª Reunión Anual de la Asociación de
Ligüística computacional, páginas 6282–6293,
En línea. Asociación de Lin Computacional-
guísticos. https://doi.org/10.18653
/v1/2020.acl-main.560

Ky¯osuke Kindaichi. 1931. Ainu Jojishi Y¯ukara no
Kenky¯u [Research on Ainu Epic Yukar], T¯oky¯o:
T¯oky¯o Bunko.

Reinhard Kneser and Hermann Ney. 1995.
Improved backing-off for m-gram language
modelado. En 1995 International Conference
on Acoustics, Discurso, y procesamiento de señales,
volumen 1, pages 181–184. IEEE.

Okan Kolak and Philip Resnik. 2005. OCR
post-processing for low density languages. En
Proceedings of Human Language Technol-
ogy Conference and Conference on Empirical
Métodos en el procesamiento del lenguaje natural,
pages 867–874, vancouver, British Columbia,
Canada. Asociación
for Computational
Lingüística. https://doi.org/10.3115
/1220575.1220684

Amrith Krishna, Bodhisattwa P. Majumder,
Rajesh Bhat, and Pawan Goyal. 2018. Upcycle
your OCR: Reusing OCRs for post-OCR text
correction in Romanised Sanskrit. En curso-
ings of the 22nd Conference on Computational
Natural Language Learning, pages 345–355,
Bruselas, Bélgica. Asociación de Computación-
lingüística nacional. https://doi.org/10
.18653/v1/K18-1034

Dong-Hyun Lee. 2013. Pseudo-label: The simple
and efficient semi-supervised learning method
for deep neural networks. In Workshop on
Challenges in Representation Learning, ICML.

Chu-Cheng Lin, Hao Zhu, Matthew R. Gormley,
and Jason Eisner. 2019. Neural finite-state

transducers: Beyond rational relations. En profesional-
cesiones de la 2019 Conference of the North
American Chapter of
la Asociación para
Ligüística computacional: Human Language
Technologies, Volumen 1 (Long and Short Pa-
pers), pages 272–283, Mineápolis, Minnesota.
Asociación de Lingüística Computacional.

Haitao Mi, Baskaran Sankaran, Zhiguo Wang,
and Abe Ittycheriah. 2016. Coverage embed-
ding models for neural machine translation. En
Actas de la 2016 Conferencia sobre el Imperio-
Métodos icales en el procesamiento del lenguaje natural,
pages 955–960, austin, Texas. Asociación para
Ligüística computacional.

Mehryar Mohri. 1996. On some applications of
finite-state automata theory to natural language
Procesando. Natural Language Engineering,
2(1):61–80. https://doi.org/10.1017
/S135132499600126X

Mehryar Mohri, Fernando Pereira, y miguel
Riley. 2002. Weighted finite-state transduc-
ers in speech recognition. Computer Speech &
Idioma, 16(1):69–88. https://doi.org
/10.1006/csla.2001.0184

Graham Neubig, Chris Dyer, Yoav Goldberg,
Austin Matthews, Waleed Ammar, Antonios
Anastasopoulos, Miguel Ballesteros, David
Chiang, Daniel Clothiaux, Trevor Cohn, Kevin
Duh, Manaal Faruqui, Cynthia Gan, Dan
Garrette, Yangfeng
Ji, Lingpeng Kong,
Adhiguna Kuncoro, Gaurav Kumar, Chaitanya
Paul Michel, Yusuke Oda,
Malaviya,
Matthew Richardson, Naomi Saphra, Swabha
Swayamdipta, and Pengcheng Yin. 2017.
DyNet: The dynamic neural network toolkit.
arXiv preimpresión arXiv:1701.03980.

Kai Niklas. 2010. Unsupervised post-correction
of OCR errors. Tesis de maestría. Leibniz Uni-
versit¨at Hannover.

Matt Post and David Vilar. 2018. Fast lexically
constrained decoding with dynamic beam
allocation for neural machine translation. En
Actas de la 2018 Conference of the
North American Chapter of the Association
para Lingüística Computacional: Human Lan-
guage Technologies, Volumen 1 (Artículos largos),
pages 1314–1324, Nueva Orleans, Luisiana.

1300

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
4
2
7
1
9
7
4
7
6
3

/

/
t

yo

a
C
_
a
_
0
0
4
2
7
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Asociación de Lingüística Computacional.
https://doi.org/10.18653/v1/N18
-1119

Prajit Ramachandran, Peter Liu, and Quoc Le.
2017. Unsupervised pretraining for sequence
to sequence learning. En Actas de la
2017 Conference on Empirical Methods in
Natural Language Processing, pages 383–391,
Copenhague, Dinamarca. Asociación para Com-
Lingüística putacional. https://doi.org
/10.18653/v1/D17-1039

Pushpendre Rastogi, Ryan Cotterell, and Jason
Eisner. 2016. Weighting finite-state transduc-
tions with neural context. En procedimientos de
el 2016 Conference of the North American
Chapter of the Association for Computational
Lingüística: Tecnologías del lenguaje humano,
pages 623–633, San Diego, California. Associ-
ation for Computational Linguistics. https://
doi.org/10.18653/v1/N16-1076

C. Rigaud, A. Doucet, METRO. Coustaty, y j. Moreux.
2019. ICDAR 2019 competition on post-OCR
text correction. En 2019 International Confer-
ence on Document Analysis and Recognition
(ICDAR), pages 1588–1593. https://doi
.org/10.1109/ICDAR.2019.00255

Shruti Rijhwani, Antonios Anastasopoulos, y
Graham Neubig. 2020. OCR Post Correction for
Endangered Language Texts. En procedimientos
del 2020 Conference on Empirical Meth-
ods in Natural Language Processing (EMNLP),
pages 5931–5942, En línea. Asociación para
Ligüística computacional. https://doi.org
/10.18653/v1/2020.emnlp-main.478

María

Azul,

Ryskina,

taylor
Eduard
Berg-Kirkpatrick, and Matthew R. Gormley.
2021. Comparative error analysis in neural and
finite-state models for unsupervised character-
level transduction. In Proceedings of the 18th
SIGMORPHON Workshop on Computational
Investigación
y
Morphology, pages 198–211, En línea. asociación-
ción para la Lingüística Computacional. https://
doi.org/10.18653/v1/2021.sigmorphon
-1.22

in Phonetics, Phonology,

https://elar.soas.ac.uk/Collection
/MPI186180. Accedido: 2020-02-02.

Carsten Schnober, Steffen Eger, Erik-Lˆan Do
Dinh, and Iryna Gurevych. 2016. Still not there?
Comparing traditional sequence-to-sequence
models to encoder-decoder neural networks on
monotone string translation tasks. En curso-
ings of COLING 2016, the 26th International
Congreso sobre Lingüística Computacional:
Technical Papers, pages 1703–1714, Osaka,
Japón. The COLING 2016 Organizing Committee.

Sarah Schulz and Jonas Kuhn. 2017. Multi-
modular domain-tailored OCR post-correction.
el 2017 Conferencia sobre
En procedimientos de
in Natural Language
Empirical Methods
Procesando, pages 2716–2726, Copenhague,
Dinamarca. Asociación de Lin Computacional-
guísticos. https://doi.org/10.18653
/v1/D17-1288

Abigail See, Peter J. Liu, and Christopher D.
Manning. 2017. Get to the point: Summariza-
tion with pointer-generator networks. En profesional-
ceedings of the 55th Annual Meeting of the
Asociación de Lingüística Computacional
(Volumen 1: Artículos largos), pages 1073–1083,
vancouver, Canada. Asociación de Computación-
lingüística nacional.

Paolo Stomeo. 1980. Racconti greci inediti di
Sternat´ıa, La nuova Ellade, s.I. https://doi
.org/10.1086/464321

Morris Swadesh. 1955. Towards greater accuracy
in lexicostatistic dating. International Journal
of American Linguistics, 21(2):121–137.

Xiang Tong and David A. evans. 1996. A statisti-
cal approach to automatic OCR error correction
in context. In Fourth Workshop on Very Large
corpus.

Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua
Liu, and Hang Li. 2016. Modeling coverage
for neural machine translation. En procedimientos
of the 54th Annual Meeting of the Associa-
ción para la Lingüística Computacional (Volumen 1:
Artículos largos), pages 76–85, Berlina, Alemania.
Asociación de Lingüística Computacional.

Diana Schackow. 2012. Documentation and
grammatical description of Yakkha, Nepal.

David Yarowsky. 1995. Unsupervised word sense
disambiguation rivaling supervised methods.

1301

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
4
2
7
1
9
7
4
7
6
3

/

/
t

yo

a
C
_
a
_
0
0
4
2
7
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

In 33rd Annual Meeting of the Association
para Lingüística Computacional, pages 189–196,
Cambridge, Massachusetts, EE.UU. Asociación
para Lingüística Computacional. https://
doi.org/10.3115/981658.981684

Jingyi Zhang, Masao Utiyama, Eiichro Sumita,
Graham Neubig, and Satoshi Nakamura. 2018.
Guiding neural machine translation with re-
trieved translation pieces. En procedimientos de
el 2018 Conference of the North American
Chapter of the Association for Computational
Lingüística: Tecnologías del lenguaje humano,
Volumen 1 (Artículos largos), pages 1325–1335,

Nueva Orleans, Luisiana. Asociación para
Ligüística computacional. https://doi
.org/10.18653/v1/N18-1120

Xiaojin Zhu and Andrew B. Goldberg. 2009. Intro-
duction to semi-supervised learning. Synthesis
Lectures on Artificial Intelligence and Machine
Aprendiendo, 3(1):1–130.

Barret Zoph, Golnaz Ghiasi, Tsung-Yi Lin, Yin
Cual, Hanxiao Liu, Ekin Dogus Cubuk, y
Quoc Le. 2020. Rethinking pre-training and
self-training. Advances in Neural Information
Sistemas de procesamiento, 33. https://doi.org
/10.2200/S00196ED1V01Y200906AIM006

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu

/
t

a
C
yo
/

yo

a
r
t
i
C
mi

pag
d

F
/

d
oh

i
/

.

1
0
1
1
6
2

/
t

yo

a
C
_
a
_
0
0
4
2
7
1
9
7
4
7
6
3

/

/
t

yo

a
C
_
a
_
0
0
4
2
7
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

1302Lexically Aware Semi-Supervised Learning for OCR Post-Correction image
Lexically Aware Semi-Supervised Learning for OCR Post-Correction image

Descargar PDF