Lexically Aware Semi-Supervised Learning for OCR Post-Correction
Shruti Rijhwani1, Daisy Rosenblum2, Antonios Anastasopoulos3, Graham Neubig1
1Language Technologies Institute, Carnegie Mellon University, USA
2University of British Columbia, Canada
3Department of Computer Science, George Mason University, USA
srijhwan@cs.cmu.edu, daisy.rosenblum@ubc.ca,
antonis@gmu.edu, gneubig@cs.cmu.edu
Abstract
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents. Optical character recognition (OCR) can be used to produce digitized text, and previous work has demonstrated the utility of neural post-correction methods that improve the results of general-purpose OCR systems on recognition of less-well-resourced languages. However, these methods rely on manually curated post-correction data, which are relatively scarce compared to the non-annotated raw images that need to be digitized. In this paper, we present a semi-supervised learning method that makes it possible to utilize these raw images to improve performance, specifically through the use of self-training, a technique where a model is iteratively trained on its own outputs. Additionally, to enforce consistency in the recognized vocabulary, we introduce a lexically aware decoding method that augments the neural post-correction model with a count-based language model constructed from the recognized texts, implemented using weighted finite-state automata (WFSA) for efficient and effective decoding. Results on four endangered languages demonstrate the utility of the proposed method, with relative error reductions of 15%–29%, where we find the combination of self-training and lexically aware decoding essential for achieving consistent improvements.1
1 Introduction
There is a vast amount of textual data available in printed form (Dong and Smith, 2018). In this paper, we address the task of digitizing printed materials that contain text in endangered languages, that is, languages with small populations of first-language speakers and limited acquisition among younger speakers. Printed texts in endangered languages come from various sources, including linguistic documentation and cultural and educational books.

1Data and code are available at https://shrutirij.github.io/ocr-el/.
Extracting text data from these documents is
valuable for a multitude of reasons. Automatic dig-
itization can aid language documentation, preser-
vation, and accessibility efforts by archiving the
texts and making them searchable for language
learners, teachers, and speakers, contributing to
essential resources for community-based language
revitalization. Further, most endangered languages
are under-represented in natural language process-
ing technologies, primarily because there is little
to no data available for training and evaluation
(Joshi et al., 2020). This challenge can be mit-
igated by converting printed materials in these
languages to a machine-readable format.
Optical character recognition (OCR) systems can be used to produce digitized text, and recent work (Rijhwani et al., 2020) has demonstrated
that post-correction improves the performance of
existing general-purpose OCR systems on endan-
gered languages (an example is in Figure 1). Most
state-of-the-art OCR post-correction methods use
neural sequence-to-sequence models and rely on
considerable resources such as a large number
of manual transcriptions (Schnober et al., 2016;
Rigaud et al., 2019) or substantial textual data to
train a language model (Dong and Smith, 2018).
To adapt these methods for the less-well-resourced
endangered languages setting, Rijhwani et al.
(2020) add translations and structural biases to
the model.
Jedoch, even with such methods targeted
to low-resource learning, post-correction perfor-
mance is still dependent on manually curated data,
which are minimally available for most endan-
gered languages. On the other hand, unannotated
raw images that need to be digitized are relatively less scarce; for many endangered languages, hundreds of printed pages exist, with only a small subset manually transcribed. In this paper, we propose a semi-supervised learning method for OCR post-correction that efficiently utilizes these unannotated pages to improve performance.

Figure 1: OCR post-correction on a scanned document that contains text in the endangered language Kwak’wala. The goal of post-correction is to fix the recognition errors made by the first pass OCR system.

The method has two key components. We first present a self-training method for OCR post-correction (Section 4) to create pseudo-training data. A baseline post-correction model is used to correct the initial OCR output on the unannotated pages, and the generated ‘‘post-corrected’’ text is then used as pseudo-training data to improve the post-correction model. The self-training process is repeated to iteratively obtain better predictions on the unannotated pages.

While self-training is a straightforward way to use the unannotated data, incorrect predictions in the pseudo-training data may introduce noise into the model (Zhu and Goldberg, 2009). To counterbalance the influence of this noise, we propose lexically aware decoding (Section 5), an inference strategy that encourages the model to generate predictions that contain ‘‘known’’ words. We use the pseudo-training data to train a count-based language model, represented with a weighted finite-state automaton (WFSA). Our proposed decoding method jointly uses an LSTM decoder and the WFSA to make OCR post-correction predictions.

The intuition behind the joint decoding strategy is simple. As the model iteratively improves with self-training, the quality of the pseudo-training data is also likely to improve and contain an increasing number of correctly predicted words, resulting in a better count-based language model. Consequently, joint decoding reinforces the prediction of more accurate words and mitigates the noise introduced by incorrect words in the pseudo-training data.

We conduct experiments on four endangered languages: Ainu, Griko, Kwak’wala, and Yakkha. Our proposed method reduces the character and word error rates by 15%–29% over a state-of-the-art OCR post-correction method for endangered languages. We find that the combination of self-training and lexically aware decoding is essential for achieving consistent improvements in performance.

2 Problem Formulation

Optical Character Recognition The OCR task involves generating a transcription of the text contained in an image. In this paper, we use existing OCR tools (detailed in Section 6.4) to obtain a first pass transcription for the images in our dataset. The first pass transcription is a text sequence of N characters, denoted as x = [x1, . . . , xN].

OCR Post-Correction Even state-of-the-art OCR models are susceptible to making recognition errors (Dong and Smith, 2018). Errors are particularly frequent in the case of endangered languages because most off-the-shelf OCR tools do not directly support these languages and training a high-performance OCR system is challenging given the small amount of data that is typically available (Rijhwani et al., 2020). We use OCR post-correction to correct these errors and improve the quality of the transcription.

The post-correction model takes the first pass transcription x as input and generates the corrected transcription, a sequence of T characters denoted as y = [y1, . . . , yT]:

y = arg max_y′ pcorr(y′ | x)

3 Base Model

As the base post-correction model, we use the model from Rijhwani et al. (2020): a sequence-to-sequence model that uses an attention-based LSTM encoder-decoder (Bahdanau et al., 2015),
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
T
A
C
l
/
l
A
R
T
ich
C
e
–
P
D
F
/
D
Ö
ich
/
.
1
0
1
1
6
2
/
T
l
A
C
_
A
_
0
0
4
2
7
1
9
7
4
7
6
3
/
/
T
l
A
C
_
A
_
0
0
4
2
7
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
7
S
e
P
e
M
B
e
R
2
0
2
3
with adaptations for low-resource OCR post-
correction. We briefly describe the method here
but refer readers to the original paper for details.
The OCR post-correction model takes the first
pass transcription x as input, with the aim of pre-
dicting an error-free transcription y. Erste, jede
character in the input sequence x is mapped to
a vector representation using character embed-
dings. This forms a sequence of vectors, x =
[x1, . . . , xN ]. The encoder is a character-level bi-
directional LSTM (Hochreiter and Schmidhuber,
1997), which transforms x into a sequence of
hidden state vectors h = [h1, . . . , hN ].2
The model’s decoding process uses an attention
mechanism to provide context from the encoder
hidden states. At each decoding timestep t, Die
attention layer uses h and the decoder state from
the previous timestep, st−1, to produce the context
vector ct. The LSTM decoder, given ct, computes
the output state st and subsequently the probability
distribution yt for generating the next character of
the target sequence y:
P(yt) = softmax(W st + b)    (1)
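To make this decoding step concrete, the sketch below walks through one timestep of an attention-based decoder in plain NumPy, ending with Equation 1. It is an illustrative toy (random parameters, a dot-product attention stand-in, and a fake LSTM state update), not the authors' DyNet implementation; all names and dimensions here are assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def decoder_step(h, s_prev, W_att, W_out, b_out):
    """One illustrative decoding timestep.

    h:      encoder hidden states, shape (N, d)
    s_prev: previous decoder state, shape (d,)
    Returns the attention weights, context vector, new state, and the
    distribution over the character vocabulary (Equation 1).
    """
    # Attention scores, reduced to a bilinear/dot-product form for brevity.
    scores = h @ (W_att @ s_prev)          # (N,)
    alpha = softmax(scores)                # attention over input characters
    c_t = alpha @ h                        # context vector, shape (d,)

    # In the real model, an LSTM cell consumes c_t (and the previously
    # generated character) to produce s_t; we fake it with a fixed update.
    s_t = np.tanh(s_prev + c_t)

    # Equation 1: P(y_t) = softmax(W s_t + b)
    p_yt = softmax(W_out @ s_t + b_out)
    return alpha, c_t, s_t, p_yt

# Toy usage with random parameters (vocabulary of 30 characters, d = 8).
rng = np.random.default_rng(0)
h = rng.normal(size=(12, 8))
alpha, c_t, s_t, p_yt = decoder_step(
    h, rng.normal(size=8),
    rng.normal(size=(8, 8)), rng.normal(size=(30, 8)), np.zeros(30))
assert np.isclose(p_yt.sum(), 1.0)
```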
Rijhwani et al. (2020) adapt the encoder-decoder
model above for low-resource post-correction
by adding pretraining and three structural biases:
• Diagonal Attention Loss: OCR post-
correction is a monotonic sequence-to-
sequence task. Thus, the attention weights
are expected to be higher closer to the
diagonal—adding attention elements off the
diagonal to the training loss encourages mo-
notonic attention (Cohn et al., 2016).
• Copy Mechanism: The copy mechanism
enables the model to choose between gen-
erating a character based on the decoder state
(Gleichung 1) or copying a character directly
from the input sequence x by sampling from
the attention distribution (Gu et al., 2016; See
et al., 2017).
2Rijhwani et al. (2020) incorporate translations into the
model with a multi-source encoder. We omit this from our
formulation, considering applicability to texts without avail-
able translations. However, adding an encoder into our
framework remains straightforward and can be used if
translations exist.
• Coverage: The coverage vector keeps track
of attention weights from previous timesteps.
It is used as additional information when
computing ct and is added to the training loss
to discourage the model from repeatedly at-
tending to the same character (Mi et al., 2016; Tu et al., 2016). Simplified sketches of the diagonal attention and coverage penalties are given below.
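The two penalty terms above are described only in prose; the following sketch shows one simplified way they could be computed from an attention matrix. The band width, the rescaled diagonal, and the clipped coverage penalty are our assumptions; the exact formulations follow Cohn et al. (2016) and Tu et al. (2016), and the coverage vector is additionally used when computing the context vector in the full model.

```python
import numpy as np

def diagonal_attention_penalty(attn, band=2):
    """Sum of attention mass falling outside a band around the diagonal.

    attn[t, n] is the weight on input character n at output timestep t.
    The diagonal is rescaled so non-square matrices are handled sensibly.
    """
    T, N = attn.shape
    penalty = 0.0
    for t in range(T):
        center = t * (N - 1) / max(T - 1, 1)   # expected (diagonal) position
        for n in range(N):
            if abs(n - center) > band:
                penalty += attn[t, n]
    return penalty

def coverage_penalty(attn):
    """Penalize repeatedly attending to the same input character:
    cumulative attention beyond 1.0 at any input position is penalized."""
    total_per_input = attn.sum(axis=0)
    return np.maximum(total_per_input - 1.0, 0.0).sum()

# Toy attention matrix: 5 output steps over 6 input characters.
rng = np.random.default_rng(1)
attn = rng.random((5, 6))
attn = attn / attn.sum(axis=1, keepdims=True)
extra_loss = diagonal_attention_penalty(attn) + coverage_penalty(attn)
```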
The model is trained in a supervised manner
with a small number of manual transcriptions:
The training data includes pairs of first-pass OCR
text with its corresponding error-free transcription.
The post-correction training loss function (denoted as L) is a combination of cross-entropy loss
along with the diagonal attention loss and the cov-
erage loss from the structural biases. Inference with
a trained model is performed using beam search.
In the following sections, we use the method
described above as a base model for our pro-
posed semi-supervised learning technique for
OCR post-correction. Given the minimal man-
ually transcribed data we have in endangered lan-
guages, our approach aims to efficiently use the
relatively larger number of pages without gold tran-
scriptions to improve performance. To this end,
we introduce two methodological improvements:
(1) self-training and (2) lexically aware decoding.
4 Self-Training
Self-training is a semi-supervised learning
method, where a trained model is used to make
predictions on unlabeled data, and the model is
then retrained on its own predictions (Zhu and
Goldberg, 2009).
Consider that we have a set of images with
manually created transcriptions and a set of images
without gold transcriptions. We can obtain a first
pass transcription for the text contained in the
Bilder (both sets) with existing OCR tools.
More formally, we have a gold-transcribed dataset D = {⟨x(i), y(i)⟩}_{i=1}^{d}, where x(i) is the first pass transcription and y(i) is the error-free manual transcription of the ith training instance.3 We also have a dataset for which only the first pass OCR is available (i.e., no manual transcriptions),
U = {x(j)}_{j=1}^{u}. For most cases in the endangered languages setting, the set without gold transcriptions is much larger, that is, u ≫ d.

3In our dataset, the source and target data instances are either at the line-level or the sentence-level (see Section 6.1).
Since self-training requires a baseline model to
get an initial set of predictions on U , we first train
the base model described in Section 3 on the gold-
transcribed set D. Let the trained base model be
fθ. Nächste, we use the predictions on U from fθ to
self-train the model. We follow the self-training
strategy recommended in He et al. (2020), which
involves two steps: ‘‘pseudo-training’’ and ‘‘fine-
tuning’’. We describe each step of the self-training
procedure in detail below:
1. Apply the initial OCR post-correction model
fθ to each instance in the set U to obtain
predictions using beam search inference.
For an instance x, let the prediction be fθ(x).
2. Create a pseudo-annotated dataset with the
predictions from step 1. Let this be S = {⟨x, fθ(x)⟩ | x ∈ U}.
3. Train the model fθ on sets U and S. This is the pseudo-training step. Here, we first train the encoder and the decoder components with a language modeling objective, and then train the end-to-end post-correction model. The procedure is as follows:
(A) Train the encoder with a character-level
language modeling (LM) objective on U .
As discussed in Section 3, the encoder
component of the model is an LSTM
that operates at the character-level. We
pseudo-train this LSTM with a language
model objective on each text sequence
x ∈ U. That is, at each timestep t, the LSTM is trained to predict the next character in the input sequence. Given a sequence of characters x = [x1, . . . , xN], the training objective maximizes ∏_{t=1}^{N} P(xt | x1, . . . , xt−1). This is the standard LM objective function and has been proven helpful for pretraining LSTMs to improve hidden representations (Dai and Le, 2015; Ramachandran et al., 2017).
(B) Train the decoder LSTM with the LM
objective described above, using the
baseline model's predictions {fθ(x) | x ∈ U}.
(C) Train the sequence-to-sequence model
on the pseudo-annotated dataset S with
the post-correction loss function L from
Abschnitt 3.
4. Given the pseudo-trained model fθ, fine-tune the model on the gold-transcribed dataset D, with the loss function L.

5. Repeat step 1 to step 4 until a specified stopping criterion is met (e.g., no improvement in the validation set performance or reaching the maximum permitted iterations). A schematic of this loop is sketched below.
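The sketch below restates steps 1–5 as a minimal Python skeleton. The model interface (beam_decode, pretrain_encoder_lm, pretrain_decoder_lm, train, validate) and the stub used to exercise it are hypothetical placeholders, not the actual training code.

```python
def self_train(model, gold_data, unannotated_x, max_iters=5, patience=2):
    """Schematic of the self-training loop (steps 1-5).

    gold_data: list of (first_pass, gold) pairs; unannotated_x: first-pass strings.
    """
    best_wer, stale = float("inf"), 0
    for _ in range(max_iters):
        # Step 1: predict on the unannotated set with beam search.
        predictions = [model.beam_decode(x) for x in unannotated_x]
        # Step 2: build the pseudo-annotated set S.
        pseudo = list(zip(unannotated_x, predictions))
        # Steps 3(A)/3(B): pseudo-train encoder and decoder with a character LM objective.
        model.pretrain_encoder_lm(unannotated_x)
        model.pretrain_decoder_lm(predictions)
        # Step 3(C): train the sequence-to-sequence model on S with the loss L.
        model.train(pseudo)
        # Step 4: fine-tune on the gold-transcribed set D.
        model.train(gold_data)
        # Step 5: stop when validation WER no longer improves.
        wer = model.validate()
        if wer < best_wer:
            best_wer, stale = wer, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return model

class StubModel:
    """Minimal stand-in so the loop runs end to end; replace with the real model."""
    def __init__(self):
        self.rounds = 0
    def beam_decode(self, x):
        return x                      # pretend the correction equals the input
    def pretrain_encoder_lm(self, xs):
        pass
    def pretrain_decoder_lm(self, ys):
        pass
    def train(self, pairs):
        pass
    def validate(self):
        self.rounds += 1
        return 10.0 / self.rounds     # pretend WER keeps improving

self_train(StubModel(), gold_data=[("teh", "the")], unannotated_x=["abc", "def"])
```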
As indicated above, self-training is a straight-
forward semi-supervised technique to leverage
documents without gold transcriptions to improve
OCR post-correction performance. We note that
some self-training methods (Yarowsky, 1995; Lee et al., 2013; Zoph et al., 2020, inter alia) replace steps 4 and 5 with a single step that trains fθ on S ∪ D. However, this led to slightly worse performance in our preliminary experiments. We also observed that pseudo-training the LSTMs with an LM objective (steps 3(A) and 3(B) above) is necessary for good performance and that applying the self-training steps on fθ from the previous iteration led to better results than re-initializing the model.4 Further, as recommended in He et al. (2020) to improve self-training for neural sequence generation, we add a dropout layer into the base model at the encoder and decoder hidden states during pseudo-training and fine-tuning (steps 3 and 4).
5 Lexically Aware Decoding
Although self-training is a simple approach that
leads to improvements in post-correction per-
formance without additional manual annotation,
incorrect predictions in the pseudo-annotated data
may introduce noise into the model, potentially
reinforcing the errors in the next self-training
4In preliminary experiments, we also tried using S ∪ D
in step 3(C). However, the post-correction performance was
approximately the same as using only the set S.
iteration (Zhu and Goldberg, 2009). Such noise is
more likely to occur in the endangered languages
setting, where the base model is trained on min-
imal data and thus sometimes generates errone-
ous predictions.
While some self-training methods use confi-
dence scores to remove noisy predictions (such as
Yarowsky, 1995), these are typically designed for
classification tasks. Designing such heuristics is
challenging for OCR post-correction because the
predictions are generated at the character-level;
specific characters may be incorrect, but discard-
ing the entire predicted sequence (i.e., the line or sentence) is inefficient, particularly in a low-
resource scenario. To mitigate these issues, we pro-
pose lexically aware decoding, an inference strategy
based on our observations of the challenges asso-
ciated with the OCR post-correction task.
More specifically, our preliminary experiments
with self-training indicated that the errors made
by the model are typically inconsistent. For a
particular word, some instances may be correctly
predicted by the model. For the instances of the
word that are incorrect, we observe that they
are likely to be erroneous in different ways, i.e.,
different subsets of characters in the word are
incorrectly predicted. This is expected since the
same word can appear in varied contexts, or the
first pass OCR for the word can differ. Our empir-
ical observations on the pseudo-annotated dataset
S showed that, since the errors are inconsistent,
the correct form of the word is more frequent
than incorrect forms. Lexically aware decoding
is designed to influence the OCR post-correction
model to generate words that frequently occur in
the set S, in the expectation that these are correct
word forms.
We first describe the construction of a model
that accounts for word frequency in the predictions
along with a character n-gram model to enable the
prediction of unseen words. Dann, we present a
joint decoding method that uses the frequency-
based models in combination with the LSTM
decoder for improved OCR post-correction.
5.1 Count-Based Language Model
From the self-training method in Section 4, we have a pseudo-annotated dataset S = {⟨x, fθ(x)⟩ | x ∈ U}, where fθ(x) is the model's prediction for input sequence x. We train a count-based word-level unigram language model (LM) on {fθ(x) | x ∈ U}. The LM is built by computing
frequency-based probabilities for each word found
in the predictions.
We also have to reserve some probability mass in the LM to account for unknown words (words unseen in the predictions). We use modified Kneser-Ney smoothing to derive the unknown word (‘‘⟨unk⟩’’) probability. For a unigram LM, the smoothing process is similar to absolute discounting. However, we use the discount values based on the modified Kneser-Ney method, which are derived from word counts in the dataset, as opposed to using a fixed discount value (Kneser and Ney, 1995; Chen and Goodman, 1999). We denote the probability from the smoothed LM for a known word w as pword(w) and the unknown word probability as pword(⟨unk⟩).
A count-based unigram LM is a simple model but is suitable given our empirical observations on word-level errors (described earlier in this section) because (1) it explicitly models word frequency, (2) it is straightforward to update as the pseudo-annotated dataset improves over self-training iterations, and (3) it can be expressed as a weighted finite-state automaton which, as we discuss next, has several properties useful for our decoding method.
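As a rough illustration of this step, the sketch below builds a count-based word unigram LM and reserves probability mass for unknown words. It uses a single absolute discount for simplicity, whereas the paper derives count-dependent discounts with modified Kneser-Ney smoothing.

```python
from collections import Counter

def build_unigram_lm(predicted_lines, discount=0.75):
    """Count-based word unigram LM with probability mass reserved for <unk>.

    Simplified: one absolute discount per word type stands in for the
    count-dependent discounts of modified Kneser-Ney used in the paper.
    (The discount must be smaller than the smallest word count.)
    """
    counts = Counter(w for line in predicted_lines for w in line.split())
    total = sum(counts.values())
    p_word = {w: (c - discount) / total for w, c in counts.items()}
    p_unk = discount * len(counts) / total      # reserved unknown-word mass
    return p_word, p_unk

p_word, p_unk = build_unigram_lm(["the dog ran", "the door opened", "the dog barked"])
assert abs(sum(p_word.values()) + p_unk - 1.0) < 1e-9
```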
5.2 Weighted Finite State Automaton
A weighted finite-state automaton (WFSA) is a set
of states and transitions between the states. Each transition accepts a particular symbol as input and has a weight associated with it. The symbols come from a finite alphabet Σ. A sequence of consecutive transitions is referred to as a ‘‘path’’, and the label of a path is the concatenation of all symbols consumed by its constituent transitions. The WFSA has a start state and a set of final states. A successful path is a path from the start state to a final state, and a sequence of symbols is ‘‘accepted’’ by the WFSA if there exists a successful path that consumes this sequence (Mohri et al., 2002).
Since we are focused on decoding and only need the best scoring path for any given sequence (i.e., Viterbi search), we consider the weights over the tropical semiring. That is, the weight of a path is the sum of its transition weights, and the score of a sequence of symbols is the minimum weight of all the successful paths that accept that sequence.
Figure 2: The (A) WFSA and (B) minimized WFSA we construct, for a hypothetical language model with a two word vocabulary: P(dog) = 0.75; P(door) = 0.2; P(⟨unk⟩) = 0.05; transition weights are negative log probabilities. In (B), for simplicity, we show only the known word states after determinization and minimization.
Decoding with the post-correction model is at the character-level (Equation 1), so in order to leverage word frequency in the decoding process, we convert the count-based word-level LM described in Section 5.1 to a WFSA representation that consumes and scores sequences at the character-level.
The WFSA is constructed to accept the words
known to the LM by consuming each character in
the word (in sequence) as input. The score of the
path that accepts a known word w is the negative
log of its probability from the LM: −log pword(w).
A simple example is shown in Figure 2(A).
The WFSA, as described above, can only accept
a single word. However, the input and correspond-
ing predictions of the post-correction model are
sequences of words, typically lines or sentences.
To enable the WFSA to accept such sequences,
we add transitions that accept a set B of word
boundary symbols (whitespace, punctuation, and end-of-sequence) from the states at the end of the known words (e.g., states 1 and 2 in Figure 2(A)) back to the start state. Once in the start state, the
model can begin consuming characters from the
next word.
Further, we modify the WFSA such that the start
state is also the only final (accepting) state since
the predicted sequence is considered complete
only when the model predicts an end-of-sequence
symbol after the last character.
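A minimal sketch of this construction is given below: the known words are laid out as a character trie whose arcs carry tropical-semiring weights, with boundary symbols returning to the start state. Unlike the paper's OpenFST-based WFSA, the word cost here sits on the last character (no weight pushing), and the sketch assumes no word is a proper prefix of another; all names here are illustrative.

```python
import math

def build_known_word_wfsa(p_word, boundary=(" ", ".", ",")):
    """Lay out the known-word LM as a character trie with weighted arcs.

    States are integers; 0 is the start (and only final) state. Each arc is
    (state, char) -> (next_state, weight) in the tropical semiring. The whole
    word cost -log p(w) is placed on the word's last character, whereas the
    paper's determinized/minimized WFSA pushes weights toward the start state.
    Assumes no word is a proper prefix of another (the real WFSA handles that).
    """
    arcs, next_state = {}, 1
    for word, p in p_word.items():
        state = 0
        for i, ch in enumerate(word):
            key = (state, ch)
            if key not in arcs:
                arcs[key] = (next_state, 0.0)
                next_state += 1
            target, _ = arcs[key]
            if i == len(word) - 1:
                arcs[key] = (target, -math.log(p))  # word cost on final character
            state = target
        for b in boundary:                          # boundary symbol returns to start
            arcs[(state, b)] = (0, 0.0)
    return arcs

# The two-word example of Figure 2: P(dog) = 0.75, P(door) = 0.2.
arcs = build_known_word_wfsa({"dog": 0.75, "door": 0.2})
state, cost = 0, 0.0
for ch in "dog ":
    state, w = arcs[(state, ch)]
    cost += w
assert state == 0 and abs(cost + math.log(0.75)) < 1e-9
```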
Character LM for Unknown Words To enable the prediction of words unknown to the count-based LM, we include an unknown word state in the WFSA, as shown in Figure 2(A). We add an ε-transition (a transition that consumes no input), with an associated cost −log pword(⟨unk⟩) (the probability mass reserved for unknown words in Section 5.1), to enter the unknown word state from the start state. The model remains in the unknown state, accepting symbols that form an unseen word, until a word boundary symbol from the set B is consumed to return to the start state.

We design the unknown word state to accept any combination of the symbols in Σ, thereby permitting the prediction of words unseen by the word-level LM. To score each character consumed at the unknown word state, we use a character-level n-gram language model.5 We denote the probabilities from this character n-gram LM as pchar. The probability distribution is estimated with modified Kneser-Ney smoothing on character n-grams from unique word forms in the set {fθ(x) | x ∈ U}. We use unique word forms because unknown words are likely rare, and using count-based word forms would undesirably shift the probability mass towards more frequent words.

5We use n = 6 in this paper. We experimented with different values of n in early experiments but found that n = 6 gave the best results for all languages in the dataset.

5.3 Efficient Scoring with the WFSA

The constructed WFSA has states to score character sequences that form known words and an unknown word state that relies on a character n-gram LM to score unknown sequences.
During inference, we independently score the
next character through the known word model
and the unknown word model and then choose the
best scoring path. This formulation has two advan-
tages: (1) separate scoring allows us to compactly
represent the WFSA states for known words and
(2) instead of representing the character n-gram
LM directly in the WFSA, leading to the number
of states exponentially increasing with n, we can
use highly optimized LM toolkits such as KenLM
(Heafield et al., 2013) for scoring unknown words.
Known Word Model Consider the WFSA with
only known word states. We apply standard al-
gorithms for determinization and minimization on
these states, which leads to an efficient and com-
pact representation of the count-based language
model (Mohri, 1996). As shown in Figure 2(B), the
resultant minimized WFSA has several properties
useful for our decoding method, discussed below.
Determinization ensures that each state has
at most one outgoing transition that consumes a
given input symbol, and minimization eliminates
redundant states and transitions, reducing the time
and space needed to process an input sequence.
Further, minimization includes pushing the
transition weights towards the start state of the
WFSA as much as possible (Mohri et al., 2002).
This lends itself well to our method since inference
in the OCR post-correction model is performed
with beam search; if the cost of a path is es-
tablished closer to the start state, unfavorable
hypotheses can be pruned at an earlier timestep,
which allows us to avoid errors more effec-
tively within an approximate search algorithm like
beam search.
Lastly, since each state in the WFSA has at
most one outgoing transition for each symbol,
the transition scores can be precomputed and
stored as a matrix, allowing efficient retrieval
during decoding.
At decoding timestep t, let the previous timestep
score from the known word model be known(yt−1)
and the current WFSA state be st−1. The score for
predicting the next character yt is the weight of
the transition from state st−1 that consumes yt in
the minimized WFSA (see Figure 2(B)). Thus,
known(yt) = known(yt−1) + scorewfsa(yt | st−1)
where known(y0) = 0. If yt does not continue
the path of any known word, then scorewfsa(yt)
is inf.
Unknown Word Model We use the probability
pchar from the character n-gram language model
to score unknown words. In general, at decoding timestep t, the unknown model score for yt
will be:
unk(yt) = unk(yt−1) − log pchar(yt | yt−n, . . . , yt−1)
However, if yt−1 ∈ B (i.e., the previous word is
complete) or t = 0, the WFSA is currently in the
start state. To begin an unknown word, we also
need to add the weight of entering the unknown
word state to unk(yt), that is, −log pword(⟨unk⟩).
Best Scoring Path The scores are in the tropical semiring (negative log probabilities). At timestep t, the best score for yt from the lexical models is:

scorelex(yt) = min(known(yt), unk(yt))    (2)
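The bookkeeping behind Equation 2 can be sketched as follows; the character n-gram probability is assumed to be supplied externally (for example, by a KenLM query), and the handling of a dead known-word path is simplified relative to the full beam search.

```python
import math

def score_lex_step(prev_known, prev_unk, y_t, wfsa_arcs, state,
                   p_char, p_unk_entry, at_word_start):
    """One character of the lexical score in Equation 2 (tropical semiring).

    prev_known / prev_unk: running negative-log costs of the known-word and
    unknown-word paths for the current word. p_char is the probability of y_t
    under the character n-gram LM, and p_unk_entry is pword(<unk>), paid once
    when an unknown word is started.
    """
    # Known-word path: follow the WFSA arc for y_t if it exists, otherwise
    # the known path dies (infinite cost); keeping the old state is a
    # simplification for this sketch.
    if (state, y_t) in wfsa_arcs:
        next_state, weight = wfsa_arcs[(state, y_t)]
        known = prev_known + weight
    else:
        next_state, known = state, math.inf

    # Unknown-word path: character n-gram cost plus the <unk> entry cost.
    unk = prev_unk - math.log(p_char)
    if at_word_start:
        unk -= math.log(p_unk_entry)

    return known, unk, next_state, min(known, unk)

# Toy usage: a one-arc automaton for the word "a" with probability 0.5.
arcs = {(0, "a"): (1, -math.log(0.5))}
known, unk, state, best = score_lex_step(0.0, 0.0, "a", arcs, 0,
                                         p_char=0.1, p_unk_entry=0.05,
                                         at_word_start=True)
assert best == known  # the known-word path is cheaper here
```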
During decoding, we keep track of both the
known and unknown model scores for the current
word being generated in the hypothesis. When the word is completed (when yt ∈ B), both the known and unknown word models return to the start state of the WFSA (see Figure 2). Since
the two paths are in the same state and are thus
indistinguishable with respect to future predictions
in the hypothesis, we choose the best scoring path
to continue decoding. This is known as hypoth-
esis recombination.
The WFSA framework, therefore, allows us to effi-
ciently represent the word-level LM in a manner
that scores symbols at the character-level and
leverage a character n-gram model to score un-
known words. This enables joint inference with
the character-level LSTM decoder in the OCR
post-correction model, as discussed below.
5.4 Joint Decoding with the LSTM
At decoding timestep t, let plstm(yt) be the prob-
ability of generating a character yt based on the
LSTM decoder's hidden state (Equation 1). We
also compute scorelex(yt), which is a negative log
probability, as defined in Equation 2. The final
probability of predicting yt is obtained through
linear interpolation between these two scores,6
weighted by a hyperparameter λ:
P(yt) = (1 − λ) · plstm(yt) + λ · plex(yt)
(3)
where plex(yt) = exp (−scorelex(yt)).
This joint decoding strategy is applied when
performing inference with beam search using a
trained OCR post-correction model. When used
in combination with self-training, the predictions
made by the model improve as we repeat the self-
training process, iteratively improving the count-
based LM and resulting in a better distribution
of plex(yt).
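A small sketch of Equation 3 follows, with the lexical scores passed in as negative log probabilities; the interpolated values are used to rank candidate characters during beam search rather than as a normalized distribution, and the toy numbers are illustrative only.

```python
import numpy as np

def joint_char_distribution(p_lstm, score_lex, lam=0.3):
    """Equation 3: interpolate the LSTM distribution with the lexical score.

    p_lstm: next-character probabilities from the decoder.
    score_lex: negative log probabilities from the WFSA / character LM, one
    entry per candidate character, so p_lex = exp(-score_lex).
    lam is the interpolation weight tuned on the validation set.
    """
    p_lex = np.exp(-np.asarray(score_lex))
    return (1.0 - lam) * np.asarray(p_lstm) + lam * p_lex

# Toy example with a three-character vocabulary.
p = joint_char_distribution([0.6, 0.3, 0.1], [0.51, 2.30, 6.91], lam=0.5)
best = int(np.argmax(p))
```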
6 Experiments

In this section, we present experiments with our
semi-supervised post-correction method on four
typologically diverse endangered languages.
6.1 Datasets
We use the OCR post-correction dataset from
Rijhwani et al. (2020), which contains transcribed
documents in three endangered languages: Ainu,
Griko, and Yakkha. Additionally, in this paper,
we create a similar dataset in the endangered
language Kwak’wala. We describe the datasets
below, including the sizes of the gold transcribed
and unannotated sets we use for semi-supervised
training:
Ainu (ain) is a severely endangered language
from northern Japan. The dataset contains pages
from a book of Ainu epic poetry (Kindaichi, 1931).
The Ainu text is written in the Latin script. The
dataset contains 816 manually transcribed lines as
well as 7,646 lines without gold transcriptions.
Griko (grk), an endangered Greek dialect spoken in southern Italy, is written with a combination of the Latin and Greek alphabets. The document in the dataset is a book of Griko folk tales (Stomeo, 1980). There are 807 and 3,084 sentences with and without gold transcriptions, respectively.
Yakkha (ybh) is an endangered language spoken in Nepal and is written in the Devanagari script. The dataset contains transcriptions of three children's books (Schackow, 2012). In total, there are 159 manually transcribed sentences and no unannotated lines in the dataset. Therefore, as the unannotated set, we use the first pass OCR on the validation and test sets in a transductive learning setting (≈ 30 sentences; see Section 6.2 for data splits).

6We leave other interpolation techniques like log-linear interpolation and more complex combinations of scores (such as the WFSA-based reranking and rescoring methods proposed by Ryskina et al. [2021]) as potential future work.
Kwak’wala (kwk) is spoken on Northern Vancouver Island, nearby small islands, and the opposing mainland. The language is severely endangered, with estimates of ≈150 first-language speakers, all over the age of 70 years. The Kwak’wala language includes 42 consonantal phonemes (twice as many as English) and a wide range of allophonic vowels. Several writing systems exist and community preference varies between two orthographies: the U’mista and Liq’wala systems.
However, much of the written documentation for Kwak’wala is in another orthography that was developed by anthropologist Franz Boas. The Boas orthography (Boas, 1900) was used in the extensive documentation of the Kwak’wala language and its speakers produced by Boas in collaboration with native-speaker George Hunt. The Boas writing system uses Latin script characters as well as diacritics and digraphs to represent phonemic differences. Although the Boas orthography is not widely used today, the cultural and linguistic materials previously written by Boas are of tremendous value to community-based researchers. However, they are minimally accessible since they currently exist only as non-searchable scanned images.
In consultation with members of language revi-
talization projects in three Kwakiutl communities
(Tsulquate, Fort Rupert, Quatsino), we focus on
digitizing these significant cultural resources. We
create a dataset with pages from the ‘‘Ethnology
of the Kwakiutl’’ (Boas, 1921), containing 262
gold-transcribed lines and 2,255 unannotated lines.
6.2 Experimental Setup
Data Splits We follow Rijhwani et al. (2020)
and perform 10-fold cross-validation for all ex-
periments. For each language, the gold-transcribed
data is split into 10 segments, and for each cross-
validation fold, eight segments are used for train-
ing, one for validation, and one for testing.
Metrics We evaluate our systems in terms of
character error rate (CER) and word error rate
(WER), both standard metrics for measuring
OCR and OCR post-correction performance
(Berg-Kirkpatrick et al., 2013; Schulz and Kuhn,
2017). CER is the character-level edit distance
between the predicted text and the corresponding
gold transcription, divided by the total number
of characters in the gold transcription. WER is
similar but is calculated at the word-level. For
readability, we report CER and WER as percent-
ages for all experiments.
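For reference, the two metrics as defined above can be computed with a standard edit-distance routine; this is a generic sketch, not the evaluation script used in the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    return 100.0 * edit_distance(list(ref), list(hyp)) / len(ref)

def wer(ref, hyp):
    return 100.0 * edit_distance(ref.split(), hyp.split()) / len(ref.split())

assert cer("post", "past") == 25.0
assert abs(wer("the dog ran", "the dug ran") - 100.0 / 3) < 1e-9
```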
Methods In our experiments, we compare the performance of the following methods:
• FIRST-PASS: To obtain the first pass OCR
transcription, we experiment with two ex-
isting OCR systems: Google Vision (Fujii
et al., 2017) and Ocular (Berg-Kirkpatrick
et al., 2013).
For each language, we choose the best per-
forming first pass system,
the details of
which are in Section 6.4. We use Ocular
for Kwak’wala and Google Vision for Ainu,
Griko, and Yakkha.
• BASE: The current state-of-the-art in OCR
post-correction for endangered language
texts (Rijhwani et al., 2020; described in
Section 3).
• SEMI-SUPERVISED: Our proposed method as
described in Section 4 and Section 5.
Implementation The neural post-correction
models are implemented using the DyNet neural
network toolkit (Neubig et al., 2017). The WFSA
is implemented using the MFST Python wrapper
on OpenFST (Francis-Landau, 2020), and we use
the KenLM toolkit (Heafield et al., 2013) to train
and query the character n-gram language model.
Following Rijhwani et al. (2020), results reported
are the average of five randomly seeded runs (i.e.,
five runs for each of the 10 cross-validation folds).
6.3 Main Results
Tisch 1 shows the performance of the baselines
and our proposed semi-supervised approaches for
the four languages in the dataset. For all languages,
using semi-supervised learning leads to substantial
reductions in both CER and WER.
We note that we did a hyperparameter search over the number of self-training iterations and the weight of the WFSA λ, and Table 1 presents the best models based on the validation set WER. Extensive analysis of these factors is in Section 6.6.

First, we note that the BASE post-correction method improves error rates over the first pass for all languages. With our proposed semi-supervised learning method, combining self-training with lexically aware decoding leads to the best performance across all the languages, with error rate reductions in the range of 15%–29%.

This is especially noticeable in Ainu, where using either self-training or lexical decoding independently results in worse performance than the BASE system, but jointly using them improves the CER by 21%. For the other languages, the independent components improve over the base model but less so than their combination. This indicates the complementary nature of the two components: The language model used for lexically aware decoding is improved by self-training. In turn, it reinforces correctly predicted words to counteract the influence of incorrect pseudo-annotated instances.

6.4 First Pass OCR Systems

We experiment with two existing OCR systems to obtain a first pass transcription on our dataset. The first of these is the Google Vision system (Fujii et al., 2017; Ingle et al., 2019). This off-the-shelf OCR model supports 60 languages in 27 scripts—these are primarily higher-resourced languages and do not include our target endangered languages.

The second system is Ocular (Berg-Kirkpatrick et al., 2013). Ocular uses a generative model to transcribe scanned documents: The model generates the image by learning the font of the document. Ocular relies on a character n-gram language model trained on the target language. We initialize the LM with the small number of gold-transcribed pages in our dataset. For this, we use the 10-fold cross-validation setup described in Section 6.2: We use the training segments to train the Ocular LM and the test segment to evaluate OCR performance. The font model has parameters to learn the shape of each character in the LM vocabulary. After initialization, the parameters are updated in an unsupervised manner with EM until convergence.
Model                      % Character Error Rate             % Word Error Rate
                           ain     grk     ybh     kwk        ain     grk     ybh     kwk
FIRST-PASS                 1.34    3.27    8.90    7.90       6.27    15.63   31.64   38.22
BASE                       0.80    1.70    8.44    4.97       5.19    7.51    21.33   27.65
SEMI-SUPERVISED
  Self-Training            0.82    1.45    7.20    4.28       5.31    6.47    19.13   25.09
  Lexical Decoding         0.81    1.51    7.56    4.00       5.18    6.60    18.09   23.98
  Both                     0.63    1.37    5.98    3.82       4.43    6.36    16.65   22.61
Error Reduction
  (BASE−Both)/BASE         21%     19%     29%     23%        15%     15%     22%     18%

Table 1: Our semi-supervised approach improves performance over the baselines (10-fold cross-validation averaged over five randomly seeded runs). ‘‘Self-Training’’ and ‘‘Lexical Decoding’’ refer to experiments where we use these methods independently. ‘‘Both’’ refers to their combination. We highlight the best model for each language.
                           % Character Error Rate             % Word Error Rate
OCR System                 ain     grk     ybh     kwk        ain     grk     ybh     kwk
Ocular                     10.49   4.58    75.60   7.90       47.47   15.71   99.37   38.22
Google Vision              1.34    3.27    8.90    21.12      6.27    15.63   31.64   82.08

Table 2: First pass OCR system performance. If the language's script is not covered by Google Vision (as for Kwak’wala), then Ocular results in better recognition. Otherwise, Google Vision OCR is usually significantly better.
Results are presented in Table 2. We note that although the Google OCR system is not trained on our target languages, it is trained on large amounts of data in high-resource languages that share writing systems with Ainu, Griko, and Yakkha (Latin, Greek, and Devanagari scripts) and thus can recognize characters in these scripts with reasonable accuracy (see Rijhwani et al. [2020] for a more detailed analysis). On the other hand, the performance is much worse on Kwak’wala since the system has not been trained on the Boas orthography, which is unique to the Kwak’wala language. The Boas orthography uses several Latin script characters, which the system is able to recognize, but it also includes characters unique to the writing system that are incorrectly transcribed by the OCR model.

We find that the performance on Kwak’wala is considerably better with the Ocular system because the LM is trained on Kwak’wala text. Thus, unlike Google Vision, the model vocabulary contains the Boas writing system's alphabet. On the other hand, Ocular's performance on Ainu and Griko is worse than Google Vision, likely due to the minimal data available for training it. Moreover, we observe that performance is correlated with the word overlap between the test data and the data used for training the LM, demonstrating Ocular's reliance on a strong language model—the word overlap is 73% for Griko, 56% for Kwak’wala, and 48% for Ainu.
Finally, we find that Ocular does not perform well on the Yakkha dataset. This is because the design of Ocular's font model does not work with how the Devanagari script is written. More specifically, when a vowel diacritic is applied to a consonant, the characters are combined into a single glyph. In Unicode, this is represented by two characters, the consonant and the dependent vowel sign, where the dotted circle shown with an isolated vowel sign is the character combination marker in the Unicode Standard.7

However, since Ocular's font model operates at the character-level, it tries to generate the images of these two characters separately. Generating the vowel diacritic on its own is not meaningful: the dotted circle never appears in the input image because it is supposed to be combined. Thus, the font model is unable to converge as it cannot handle character combinations when generating the image.
7https://www.unicode.org/versions/Unicode13.0.0/ch02.pdf.
Model     Known Word Model    Unknown Word Model    % Character Error Rate         % Word Error Rate
                                                    ain    grk    ybh    kwk       ain    grk    ybh     kwk
CHARLM    Character n-gram    (not needed)          0.64   1.43   6.22   3.85      4.50   6.44   16.78   22.90
WORDLM    Word-level          Character uniform     0.64   1.42   6.12   3.95      4.50   6.39   16.71   23.11
OURS      Word-level          Character n-gram      0.63   1.37   5.98   3.82      4.43   6.36   16.65   22.61

Table 3: A more informed unknown word model (character n-gram) in combination with the word-level known word model consistently performs better than the alternatives for all four languages in our dataset.
6.5 Comparing Language Models
Our proposed decoding method uses a count-based
word-level LM in combination with a character
n-gram LM to compute plex for joint decoding with
the LSTM decoder (Equation 3). In this section,
we substitute this model with two other variants
of count-based LMs to compute plex and evaluate
their performance:
• CHARLM: We use a character 6-gram lan-
guage model trained on the model predictions
from self-training {fθ(x) | x ∈ U}, esti-
mated with modified Kneser-Ney smoothing.
• WORDLM: We use the word-level LM de-
scribed in Section 5.1, but do not use a
character n-gram model for unknown words.
Instead, we score unknown words with a sim-
ple uniform probability over all characters in
the vocabulary.
We tune λ on the validation set for each model
independently and report results with the best set-
ting in Table 3. Using either CHARLM or WORDLM
for lexically aware decoding improves the er-
ror rates with respect to the BASE model. The
word-level model performs better for all languages
except Kwak’wala, likely due to the large per-
centage of unknown words in this language. We
also see that our proposed method, which lever-
ages a count-based word-level LM for known
words combined with a character-level LM for
scoring unknown words, results in the best per-
formance overall.
Although not observed in our dataset, we note that some printed materials have a high degree of spelling variation or contain texts for which word tokenization is difficult. In such cases, the word-level model may not be as effective, but CHARLM can still be used with the proposed lexically aware decoding framework to obtain improved performance over the baseline method.

Table 4: Our method improves over the base model on words that are both known and unknown to the WFSA. We show the fraction of known test words, and the fraction of correctly predicted known and unknown words.
6.6 Analysis

We analyze specific components of our model to understand the advantages of our proposed approach.
Known vs. Unknown Words We first identify
the source of the improvements that our approach
makes over the baseline. Table 4 presents the
fraction of correctly predicted words, split on
whether these words are ‘‘known’’ to the WFSA
(i.e., in the vocabulary of the word-level LM) or
‘‘unknown’’. Intuitively, we expect that decoding
with the WFSA will improve prediction on the
known words.
Compared to the baseline, our method improves
on words known to the WFSA, moving from
91% Zu 95% accuracy on average. Our method
also improves unknown word prediction over the
baseline from an average accuracy of 40% Zu 53%.
In cases like Kwak’wala, where, due to the rich morphology of the language, more than 40% of the
test words are unseen, including an unknown word model in the WFSA is particularly important.

WFSA Weight One of the important hyperparameters of our lexically aware method is the weight that we place on the WFSA score during inference (λ in Equation 3). Specifically, in the case of Griko, we find that the value of this hyperparameter can significantly affect performance. As shown in Figure 3, high weights of λ (i.e., more weight on the WFSA) lead to suboptimal WER, while lower λ leads to much better performance.

Figure 3: The weight of the WFST during joint decoding can affect word error rate (sometimes significantly, as in Griko; top). All other hyperparameters are kept equal and correspond to the best systems in each language.

This hyperparameter is less important in the other three languages, leading to smaller variations in performance. As an example, we depict the effect on Yakkha in Figure 3, where increasing λ does not affect performance as much as in Griko.

Self-Training Iterations The evolution of WER across 5 self-training iterations for Ainu and Yakkha is shown in Figure 4. In particular for Ainu, we see that combination with lexically aware decoding is crucial for the success of self-training. For Yakkha, self-training does improve performance independently but is more effective when lexically aware decoding is used (error rates on Griko and Kwak’wala follow a similar trend).

Figure 4: Integrating lexically aware decoding through interpolation with a WFSA (red lines) aids self-training in improving WER across iterations. Black dashed lines correspond to self-training without lexical decoding.

Dataset Size We study the effect of varying the amount of gold-transcribed and unannotated data used for training. The WER when varying the size of the Griko datasets is shown in Figure 5 (the size of each set is varied while keeping the other set at its full size). We see that reducing the amount of gold-transcribed data worsens WER significantly. On the other hand, reducing the unannotated data has a smaller effect: even with a little unannotated data, our method improves over the BASE model.

Figure 5: Even a small amount of unannotated data is useful for our semi-supervised method, improving WER over BASE (WER=7.51) in (A). Varying the size of gold-annotated data has a stronger effect on post-correction performance in (B). Results are shown with Griko.

Figure 6: Our post-correction model fixes many of the first pass OCR errors that the base model does not fix, such as (A) and (B). In rare cases, our method introduces errors into the transcription, such as (C) and (D).

Error Rate in the First Pass OCR To evaluate how the error rate in the first pass OCR transcription affects subsequent post-correction, we measure the performance of our proposed method when applied to first pass outputs from two OCR
Systeme: Google Vision and Ocular (described in
Section 6.4). Figure 7 shows the WER on the
Kwak’wala dataset. We see that, although Google
Vision has a much higher first pass error rate
than Ocular, the post-correction model improves
performance over both OCR systems. We also
note that the relative error reduction is higher for
the Google Vision system (68%) than for Ocular
(41%), likely because the Ocular LM is trained on
the same data as the post-correction model.
Qualitative Analysis
In Figure 6, we show ex-
amples of errors fixed as well as errors introduced
by our post-correction model, as compared to the
baseline system. In Figure 6 (A) and (B), we see
that although the baseline corrects some of the
errors in the first pass OCR, it also introduces
errors such as extra diacritics and incorrect sub-
stitutions. Using our proposed method leads to
an error-free transcription of these images. However, in Figure 6 (C) and (D), we see that our method occasionally introduces errors in predictions. Specifically, although the model fixes the
first pass errors, it generates words that are con-
siderably different from the target. Such errors
likely occur when the model follows an incorrect
path in the WFSA during lexically aware decod-
ing. Since we are using beam search, the correct
path cannot be recovered if it was pruned at an
earlier timestep.
7 Related Work
OCR post-correction is well-studied in the high-
resource setting, particularly for English. Recent
methods primarily use neural encoder-decoder
models (Dong and Smith, 2018; Rigaud et al., 2019; Hämäläinen and Hengchen, 2019). There has been relatively little work on lower-resourced languages.

Figure 7: Our post-correction model significantly improves recognition accuracy over different first pass OCR systems that have varied error rates (Google Vision and Ocular). Results are shown with Kwak’wala.

Kolak and Resnik (2005) present
a probabilistic edit distance model for post-
correction on Cebuano and Igbo, and Krishna
et al. (2018) use a sequence-to-sequence model
with a copy mechanism for improved performance
on Romanized Sanskrit OCR.
While existing neural post-correction methods
do not rely on lexical information, some earlier
methods use dictionaries to improve performance.
For example, Tong and Evans (1996) and Niklas
(2010) use lexicons in combination with n-gram
context to generate post-correction candidates for
erroneous words. These methods are typically
evaluated on English and assume the presence
of high-coverage lexicons (Schulz and Kuhn,
2017), making them difficult to adapt to endan-
gered languages.
Related to our decoding method are models
that incorporate lexical knowledge into neural
machine translation. Arthur et al. (2016) propose
adding a dictionary for translating low-frequency
words and Zhang et al. (2018) improve decoding
by upweighting translations that contain relevant
words. Additionally, there are methods which add
hard lexical constraints by forcing predictions
to contain user-specified words and phrases
(Hokamp and Liu, 2017; Post and Vilar, 2018).
Lastly, we note that our proposed approach
combines information from a neural model and
a finite-state machine to leverage the advantages
of both. In a similar direction, Rastogi et al.
(2016) and Lin et al. (2019) design finite state
architectures with paths weighted by contextual
features from an LSTM. These methods use joint
parameterizations of the models and are thus more
complex to train (particularly in the low-resource
setting) than the joint decoding method we pro-
pose in this paper.
8 Conclusion
Digitization at scale for documents in under-
represented languages is a promising avenue to-
wards tackling one aspect of their marginalization,
the lack of data. With this work, we take a step to-
wards better digitization for extremely data-scarce
scenarios. We develop a semi-supervised method
that combines self-training with lexically aware
decoding, reducing error rates by up to 29% over
a state-of-the-art OCR post-correction model on
four typologically diverse endangered languages.
In future work, we plan to expand our method
to take advantage of additional outputs of the
language documentation process. For example,
documentary linguists typically collect word lists
(which range from lists of common words like the
Swadesh lists (Swadesh, 1955) to domain-specific
vocabularies). Using such word lists within the
lexically aware decoding framework could further
improve performance and enable the application of
our technique to even lower-resourced languages.
Additionally, the improvements we achieve through semi-supervised learning are potentially orthogonal to the improvements Rijhwani et al. (2020) achieve by incorporating information from translations of the target text. As future work, we plan to investigate the combination of these two approaches in an attempt to utilize all available sources of information to improve performance.
Finally, while using a character-level n-gram
LM improves performance on unknown words,
it does not explicitly utilize morphological struc-
ture to generate unseen inflections of words. In
the future, we plan to incorporate morphological
analysis during post-correction decoding, which
will be helpful for morphologically rich endan-
gered languages.
Acknowledgments
Shruti Rijhwani was supported by the Bloomberg
Data Science Ph.D. Fellowship for this work.
This work was also supported by grant PR-
276810-21 (‘‘Unlocking Endangered Language
Resources’’) from the National Endowment for
the Humanities, grant 1761548 (‘‘Discover-
ing and Demonstrating Linguistic Features for
Language Documentation’’) from the National
Science Foundation, National Research Council
Indigenous Language Technology grant ‘‘Kwak’
wala Corpus Collection Project’’, and Govern-
ment of Canada Social Sciences and Humanities
Research Council
Insight Development grant
GR002807 ‘‘K’ank’otłaxants Awi’nagwis (Know-
ing our land)’’. Any views, findings, conclusions
or recommendations expressed in this publication
do not necessarily represent those of the National
Endowment for the Humanities.
We would like to thank Jaymyn La Vallee for
assisting with the Kwak’wala dataset annotation
as well as Samridhi Choudhary, Siddharth Dalmia,
Deepak Gopinath, Maria Ryskina, the reviewers,
and the Action Editors for feedback on the paper.
References
Philip Arthur, Graham Neubig, and Satoshi
Nakamura. 2016. Incorporating discrete transla-
tion lexicons into neural machine translation. In
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1557–1567, Austin, Texas. Association for Computational Linguistics. https://doi.org/10.18653/v1/D16-1162
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua
Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015.
Taylor Berg-Kirkpatrick, Greg Durrett, and Dan
Klein. 2013. Unsupervised transcription of his-
torical documents. In Proceedings of the 51st
Annual Meeting of the Association for Compu-
tational Linguistics (Volume 1: Long Papers),
pages 207–217, Sofia, Bulgaria. Association for
Computational Linguistics.
Franz Boas. 1900. Sketch of the Kwakiutl lan-
guage. American Anthropologist, 2(4):708–721.
https://doi.org/10.1525/aa.1900
.2.4.02a00080
Franz Boas. 1921. Ethnology of the Kwakiutl.
Stanley F. Chen and Joshua Goodman. 1999.
An empirical study of smoothing techniques
for language modeling. Computer Speech &
Language, 13(4):359–394. https://doi.org
/10.1006/csla.1999.0128
Trevor Cohn, Cong Duy Vu Hoang, Ekaterina
Vymolova, Kaisheng Yao, Chris Dyer, Und
Gholamreza Haffari. 2016. Incorporating struc-
tural alignment biases into an attentional
neural translation model. In Proceedings of
the 2016 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
pages 876–885, San Diego, Kalifornien. Associa-
tion for Computational Linguistics. https://
doi.org/10.18653/v1/N16-1102
Andrew M. Dai and Quoc V. Le. 2015. Semi-
supervised sequence learning. In Proceedings
of the 28th International Conference on Neural
Information Processing Systems – Volume 2,
NIPS’15, pages 3079–3087, Cambridge, MA,
USA. MIT Press.
Rui Dong and David Smith. 2018. Multi-input
attention for unsupervised OCR correction. In
Proceedings of the 56th Annual Meeting of
the Association for Computational Linguistics
(Volume 1: Long Papers), pages 2363–2372,
Melbourne, Australia. Association for Compu-
tational Linguistics. https://doi.org/10
.18653/v1/P18-1220
Matthew Francis-Landau. 2020. MFST: A python
openfst wrapper with support for custom
semirings and jupyter notebooks.
Yasuhisa Fujii, Karel Driesen, Jonathan Baccash,
Ash Hurst, and Ashok C. Popat. 2017.
Sequence-to-label script identification for mul-
tilingual OCR. In 2017 14th IAPR International
Conference on Document Analysis and Recog-
nition (ICDAR), Volume 1, pages 161–168.
IEEE. https://doi.org/10.1109/ICDAR
.2017.35
Jiatao Gu, Zhengdong Lu, Hang Li, and Victor
O.K. Li. 2016. Incorporating copying mech-
anism in sequence-to-sequence learning. In
Proceedings of the 54th Annual Meeting of
the Association for Computational Linguistics
(Volume 1: Long Papers), pages 1631–1640,
Berlin, Germany. Association for Computa-
tional Linguistics.
Mika Hämäläinen and Simon Hengchen. 2019.
From the paft to the fiiture: A fully automatic
NMT and word embeddings method for OCR
post-correction. In Proceedings of the Inter-
national Conference on Recent Advances in
Natural Language Processing (RANLP 2019),
pages 431–436, Varna, Bulgaria. INCOMA Ltd.
https://doi.org/10.26615/978-954
-452-056-4_051
Junxian He, Jiatao Gu, Jiajun Shen, and
Marc’Aurelio Ranzato. 2020. Revisiting self-
training for neural sequence generation. In 8th
International Conference on Learning Repre-
sentations, ICLR 2020, Addis Ababa, Ethio-
pia, April 26-30, 2020, Conference Track
Proceedings.
Kenneth Heafield, Ivan Pouzyrevsky, Jonathan
H. Clark, and Philipp Koehn. 2013. Scalable
modified Kneser-Ney language model estima-
tion. In Proceedings of the 51st Annual Meeting
of the Association for Computational Linguis-
tics (Volume 2: Short Papers), pages 690–696,
Sofia, Bulgaria. Association for Computational
Linguistik.
Sepp Hochreiter and Jürgen Schmidhuber. 1997.
Long short-term memory. Neural Computation,
9(8):1735–1780. https://doi.org/10.1162
/neco.1997.9.8.1735, Pubmed: 9377276
Chris Hokamp and Qun Liu. 2017. Lexically con-
strained decoding for sequence generation using
grid beam search. In Proceedings of the 55th
Annual Meeting of the Association for Compu-
tational Linguistics (Volume 1: Long Papers),
pages 1535–1546, Vancouver, Canada. Associ-
ation for Computational Linguistics. https://
doi.org/10.18653/v1/P17-1141
R. Reeve Ingle, Yasuhisa Fujii, Thomas Deselaers,
Jonathan Baccash, and Ashok C. Popat. 2019.
A scalable handwritten text recognition system.
arXiv preprint arXiv:1904.09150. https://
doi.org/10.1109/ICDAR.2019.00013
Pratik Joshi, Sebastin Santy, Amar Budhiraja,
Kalika Bali, and Monojit Choudhury. 2020.
The state and fate of linguistic diversity and
inclusion in the NLP world. In Proceedings of
the 58th Annual Meeting of the Association for
Computational Linguistics, pages 6282–6293,
Online. Association for Computational Lin-
guistics. https://doi.org/10.18653
/v1/2020.acl-main.560
Kyōsuke Kindaichi. 1931. Ainu Jojishi Yūkara no
Kenkyū [Research on Ainu Epic Yukar], Tōkyō:
Tōkyō Bunko.
Reinhard Kneser and Hermann Ney. 1995.
Improved backing-off for m-gram language
modeling. In 1995 International Conference
on Acoustics, Speech, and Signal Processing,
Volume 1, pages 181–184. IEEE.
Okan Kolak and Philip Resnik. 2005. OCR
post-processing for low density languages. In
Proceedings of Human Language Technol-
ogy Conference and Conference on Empirical
Methods in Natural Language Processing,
pages 867–874, Vancouver, British Columbia,
Canada. Association for Computational
Linguistics. https://doi.org/10.3115
/1220575.1220684
Amrith Krishna, Bodhisattwa P. Majumder,
Rajesh Bhat, and Pawan Goyal. 2018. Upcycle
your OCR: Reusing OCRs for post-OCR text
correction in Romanised Sanskrit. In Proceed-
ings of the 22nd Conference on Computational
Natural Language Learning, pages 345–355,
Brussels, Belgium. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/K18-1034
Dong-Hyun Lee. 2013. Pseudo-label: The simple
and efficient semi-supervised learning method
for deep neural networks. In Workshop on
Challenges in Representation Learning, ICML.
Chu-Cheng Lin, Hao Zhu, Matthew R. Gormley,
and Jason Eisner. 2019. Neural finite-state
transducers: Beyond rational relations. In Pro-
ceedings of the 2019 Conference of the North
American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Pa-
pers), pages 272–283, Minneapolis, Minnesota.
Association for Computational Linguistics.
Haitao Mi, Baskaran Sankaran, Zhiguo Wang,
and Abe Ittycheriah. 2016. Coverage embed-
ding models for neural machine translation. In
Proceedings of the 2016 Conference on Empir-
ical Methods in Natural Language Processing,
pages 955–960, Austin, Texas. Association for
Computational Linguistics.
Mehryar Mohri. 1996. On some applications of
finite-state automata theory to natural language
Verarbeitung. Natural Language Engineering,
2(1):61–80. https://doi.org/10.1017
/S135132499600126X
Mehryar Mohri, Fernando Pereira, and Michael
Riley. 2002. Weighted finite-state transduc-
ers in speech recognition. Computer Speech &
Language, 16(1):69–88. https://doi.org
/10.1006/csla.2001.0184
Graham Neubig, Chris Dyer, Yoav Goldberg,
Austin Matthews, Waleed Ammar, Antonios
Anastasopoulos, Miguel Ballesteros, David
Chiang, Daniel Clothiaux, Trevor Cohn, Kevin
Duh, Manaal Faruqui, Cynthia Gan, Dan
Garrette, Yangfeng Ji, Lingpeng Kong,
Adhiguna Kuncoro, Gaurav Kumar, Chaitanya
Malaviya, Paul Michel, Yusuke Oda,
Matthew Richardson, Naomi Saphra, Swabha
Swayamdipta, and Pengcheng Yin. 2017.
DyNet: The dynamic neural network toolkit.
arXiv preprint arXiv:1701.03980.
Kai Niklas. 2010. Unsupervised post-correction
of OCR errors. Master’s thesis, Leibniz Uni-
versität Hannover.
Matt Post and David Vilar. 2018. Fast lexically
constrained decoding with dynamic beam
allocation for neural machine translation. In
Proceedings of the 2018 Conference of the
North American Chapter of the Association
for Computational Linguistics: Human Lan-
guage Technologies, Volume 1 (Long Papers),
pages 1314–1324, New Orleans, Louisiana.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/N18
-1119
Prajit Ramachandran, Peter Liu, and Quoc Le.
2017. Unsupervised pretraining for sequence
to sequence learning. In Proceedings of the
2017 Conference on Empirical Methods in
Natural Language Processing, pages 383–391,
Copenhagen, Denmark. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/D17-1039
Pushpendre Rastogi, Ryan Cotterell, and Jason
Eisner. 2016. Weighting finite-state transduc-
tions with neural context. In Proceedings of
the 2016 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
pages 623–633, San Diego, Kalifornien. Associ-
ation for Computational Linguistics. https://
doi.org/10.18653/v1/N16-1076
C. Rigaud, A. Doucet, M. Coustaty, and J. Moreux.
2019. ICDAR 2019 competition on post-OCR
text correction. In 2019 International Confer-
ence on Document Analysis and Recognition
(ICDAR), pages 1588–1593. https://doi
.org/10.1109/ICDAR.2019.00255
Shruti Rijhwani, Antonios Anastasopoulos, Und
Graham Neubig. 2020. OCR Post Correction for
Endangered Language Texts. In Proceedings
of the 2020 Conference on Empirical Meth-
ods in Natural Language Processing (EMNLP),
pages 5931–5942, Online. Association for
Computational Linguistics. https://doi.org
/10.18653/v1/2020.emnlp-main.478
Maria Ryskina, Eduard Hovy, Taylor
Berg-Kirkpatrick, and Matthew R. Gormley.
2021. Comparative error analysis in neural and
finite-state models for unsupervised character-
level transduction. In Proceedings of the 18th
SIGMORPHON Workshop on Computational
Research in Phonetics, Phonology, and
Morphology, pages 198–211, Online. Associa-
tion for Computational Linguistics. https://
doi.org/10.18653/v1/2021.sigmorphon
-1.22
Diana Schackow. 2012. Documentation and
grammatical description of Yakkha, Nepal.
https://elar.soas.ac.uk/Collection
/MPI186180. Accessed: 2020-02-02.
Carsten Schnober, Steffen Eger, Erik-Lân Do
Dinh, and Iryna Gurevych. 2016. Still not there?
Comparing traditional sequence-to-sequence
models to encoder-decoder neural networks on
monotone string translation tasks. In Proceed-
ings of COLING 2016, the 26th International
Conference on Computational Linguistics:
Technical Papers, pages 1703–1714, Osaka,
Japan. The COLING 2016 Organizing Committee.
Sarah Schulz and Jonas Kuhn. 2017. Multi-
modular domain-tailored OCR post-correction.
In Proceedings of the 2017 Conference on
Empirical Methods in Natural Language
Processing, pages 2716–2726, Copenhagen,
Denmark. Association for Computational Lin-
guistics. https://doi.org/10.18653
/v1/D17-1288
Abigail See, Peter J. Liu, and Christopher D.
Manning. 2017. Get to the point: Summariza-
tion with pointer-generator networks. In Pro-
ceedings of the 55th Annual Meeting of the
Association for Computational Linguistics
(Volume 1: Long Papers), pages 1073–1083,
Vancouver, Canada. Association for Computa-
tional Linguistics.
Paolo Stomeo. 1980. Racconti greci inediti di
Sternatía, La nuova Ellade, s.I.
Morris Swadesh. 1955. Towards greater accuracy
in lexicostatistic dating. International Journal
of American Linguistics, 21(2):121–137.
https://doi.org/10.1086/464321
Xiang Tong and David A. Evans. 1996. A statisti-
cal approach to automatic OCR error correction
in context. In Fourth Workshop on Very Large
Corpora.
Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua
Liu, and Hang Li. 2016. Modeling coverage
for neural machine translation. In Proceedings
of the 54th Annual Meeting of the Associa-
tion for Computational Linguistics (Volume 1:
Long Papers), pages 76–85, Berlin, Germany.
Association for Computational Linguistics.
David Yarowsky. 1995. Unsupervised word sense
disambiguation rivaling supervised methods.
In 33rd Annual Meeting of the Association
for Computational Linguistics, pages 189–196,
Cambridge, Massachusetts, USA. Association
for Computational Linguistics. https://
doi.org/10.3115/981658.981684
Jingyi Zhang, Masao Utiyama, Eiichro Sumita,
Graham Neubig, and Satoshi Nakamura. 2018.
Guiding neural machine translation with re-
trieved translation pieces. In Proceedings of
the 2018 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
Volume 1 (Long Papers), pages 1325–1335,
New Orleans, Louisiana. Association for
Computational Linguistics. https://doi
.org/10.18653/v1/N18-1120
Xiaojin Zhu and Andrew B. Goldberg. 2009. Intro-
duction to semi-supervised learning. Synthesis
Lectures on Artificial Intelligence and Machine
Learning, 3(1):1–130. https://doi.org
/10.2200/S00196ED1V01Y200906AIM006
Barret Zoph, Golnaz Ghiasi, Tsung-Yi Lin, Yin
Cui, Hanxiao Liu, Ekin Dogus Cubuk, and
Quoc Le. 2020. Rethinking pre-training and
self-training. Advances in Neural Information
Processing Systems, 33.