Lexically Aware Semi-Supervised Learning for OCR Post-Correction
Shruti Rijhwani1, Daisy Rosenblum2, Antonios Anastasopoulos3, Graham Neubig1
1Language Technologies Institute, 卡内基梅隆大学, 美国
2University of British Columbia, 加拿大
3计算机科学系, George Mason University, 美国
srijhwan@cs.cmu.edu, daisy.rosenblum@ubc.ca,
antonis@gmu.edu, gneubig@cs.cmu.edu
抽象的
Much of the existing linguistic data in many
languages of the world is locked away in non-
digitized books and documents. Optical char-
acter recognition (OCR) can be used to produce
digitized text, and previous work has demon-
strated the utility of neural post-correction
methods that improve the results of general-
purpose OCR systems on recognition of less-
well-resourced languages. 然而,
这些
methods rely on manually curated post-
correction data, which are relatively scarce
compared to the non-annotated raw images
that need to be digitized. 在本文中, 我们
present a semi-supervised learning method
that makes it possible to utilize these raw
images to improve performance, 具体来说
through the use of self-training, a technique
where a model is iteratively trained on its own
outputs. 此外, to enforce consistency
in the recognized vocabulary, we introduce
a lexically aware decoding method that aug-
ments the neural post-correction model with
a count-based language model constructed
from the recognized texts, implemented us-
ing weighted finite-state automata (WFSA)
for efficient and effective decoding. 结果
on four endangered languages demonstrate
the utility of the proposed method, with rela-
tive error reductions of 15%–29%, where we
find the combination of self-training and lex-
ically aware decoding essential for achieving
consistent improvements.1
1
介绍
There is a vast amount of textual data available
in printed form (Dong and Smith, 2018). 在这个
纸, we address the task of digitizing printed
materials that contain text in endangered lan-
guages, 那是, languages with small populations
1Data and code are available at https://shrutirij
.github.io/ocr-el/.
of first-language speakers and limited acquisi-
tion among younger speakers. Printed texts in
endangered languages come from various sources,
including linguistic documentation and cultural
and educational books.
Extracting text data from these documents is
valuable for a multitude of reasons. Automatic dig-
itization can aid language documentation, preser-
休假, and accessibility efforts by archiving the
texts and making them searchable for language
learners, 教师, and speakers, contributing to
essential resources for community-based language
revitalization. 更远, most endangered languages
are under-represented in natural language process-
ing technologies, primarily because there is little
to no data available for training and evaluation
(Joshi et al., 2020). This challenge can be mit-
igated by converting printed materials in these
languages to a machine-readable format.
Optical character recognition (OCR) 系统
can be used to produce digitized text, 和最近的
工作 (Rijhwani et al., 2020) has demonstrated
that post-correction improves the performance of
existing general-purpose OCR systems on endan-
gered languages (an example is in Figure 1). 最多
state-of-the-art OCR post-correction methods use
neural sequence-to-sequence models and rely on
considerable resources such as a large number
of manual transcriptions (Schnober et al., 2016;
Rigaud et al., 2019) or substantial textual data to
train a language model (Dong and Smith, 2018).
To adapt these methods for the less-well-resourced
endangered languages setting, Rijhwani et al.
(2020) add translations and structural biases to
该模型.
然而, even with such methods targeted
to low-resource learning, post-correction perfor-
mance is still dependent on manually curated data,
which are minimally available for most endan-
gered languages. 另一方面, unannotated
1285
计算语言学协会会刊, 卷. 9, PP. 1285–1302, 2021. https://doi.org/10.1162/tacl 00427
动作编辑器: Yang Liu. 提交批次: 6/2021; 修改批次: 8/2021; 已发表 11/2021.
C(西德:2) 2021 计算语言学协会. 根据 CC-BY 分发 4.0 执照.
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
4
2
7
1
9
7
4
7
6
3
/
/
t
我
A
C
_
A
_
0
0
4
2
7
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
increasing number of correctly predicted words,
resulting in a better count-based language model.
最后, joint decoding reinforces the pre-
diction of more accurate words and mitigates
the noise introduced by incorrect words in the
pseudo-training data.
We conduct experiments on four endangered
语言: Ainu, Griko, Kwak’wala, and Yakkha.
Our proposed method reduces the character and
word error rates by 15%–29% over a state-of-
the-art OCR post-correction method for endan-
gered languages. We find that the combination
of self-training and lexically aware decoding is
essential for achieving consistent improvements
in performance.
2 Problem Formulation
Optical Character Recognition The OCR task
involves generating a transcription of the text
contained in an image. 在本文中, 我们用
existing OCR tools (detailed in Section 6.4) 到
obtain a first pass transcription for the images
in our dataset. The first pass transcription is
a text sequence of N characters, denoted as
x = [x1, . . . , xN ].
OCR Post-Correction Even state-of-the-art
OCR models are susceptible to making recogni-
tion errors (Dong and Smith, 2018). Errors are
particularly frequent in the case of endangered
languages because most off-the-shelf OCR tools
do not directly support these languages and train-
ing a high-performance OCR system is challeng-
ing given the small amount of data that
是
typically available (Rijhwani et al., 2020). 我们
use OCR post-correction to correct these errors
and improve the quality of the transcription.
The post-correction model takes the first pass
transcription x as input and generates the cor-
rected transcription, a sequence of T characters
denoted as y = [y1, . . . , yT ]:
y = arg max
y(西德:2)
pcorr(y(西德:2)|X)
3 Base Model
As the base post-correction model, 我们使用
model from Rijhwani et al. (2020): a sequence-
to-sequence model that uses an attention-based
LSTM encoder-decoder (Bahdanau et al., 2015),
数字 1: OCR post-correction on a scanned docu-
ment that contains text in the endangered language
Kwak’wala. The goal of post-correction is to fix the
recognition errors made by the first pass OCR system.
raw images that need to be digitized are relatively
less scarce; for many endangered languages, hun-
dreds of printed pages exist, with only a small
subset manually transcribed. 在本文中, 我们
propose a semi-supervised learning method for
OCR post-correction that efficiently utilizes these
unannotated pages to improve performance.
The method has two key components. We first
present a self-training method for OCR post-
correction (部分 4) to create pseudo-training
数据. A baseline post-correction model is used to
correct the initial OCR output on the unannotated
页面, and the generated ‘‘post-corrected’’ text is
then used as pseudo-training data to improve the
post-correction model. The self-training process
is repeated to iteratively obtain better predictions
on the unannotated pages.
While self-training is a straightforward way to
use the unannotated data, incorrect predictions
in the pseudo-training data may introduce noise
into the model (Zhu and Goldberg, 2009). 到
counterbalance the influence of this noise, 我们
propose lexically aware decoding (部分 5),
an inference strategy that encourages the model
to generate predictions that contain ‘‘known’’
字. We use the pseudo-training data to train
a count-based language model, represented with
a weighted finite-state automaton (WFSA). 我们的
proposed decoding method jointly uses an LSTM
decoder and the WFSA to make OCR post-
correction predictions.
The intuition behind the joint decoding strategy
is simple. As the model iteratively improves with
self-training, the quality of the pseudo-training
data is also likely to improve and contain an
1286
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
4
2
7
1
9
7
4
7
6
3
/
/
t
我
A
C
_
A
_
0
0
4
2
7
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
with adaptations for low-resource OCR post-
correction. We briefly describe the method here
but refer readers to the original paper for details.
The OCR post-correction model takes the first
pass transcription x as input, with the aim of pre-
dicting an error-free transcription y. 第一的, each
character in the input sequence x is mapped to
a vector representation using character embed-
丁斯. This forms a sequence of vectors, x =
[x1, . . . , xN ]. The encoder is a character-level bi-
directional LSTM (Hochreiter and Schmidhuber,
1997), which transforms x into a sequence of
hidden state vectors h = [h1, . . . , hN ].2
The model’s decoding process uses an attention
mechanism to provide context from the encoder
hidden states. At each decoding timestep t, 这
attention layer uses h and the decoder state from
the previous timestep, st−1, to produce the context
vector ct. The LSTM decoder, given ct, computes
the output state st and subsequently the probability
distribution yt for generating the next character of
the target sequence y:
p (yt) = softmax (Wst + 乙)
(1)
Rijhwani et al. (2020) adapt the encoder-decoder
model above for low-resource post-correction
by adding pretraining and three structural biases:
• Diagonal Attention Loss: OCR post-
correction is a monotonic sequence-to-
sequence task. 因此, the attention weights
are expected to be higher closer to the
diagonal—adding attention elements off the
diagonal to the training loss encourages mo-
notonic attention (Cohn et al., 2016).
• Copy Mechanism: The copy mechanism
enables the model to choose between gen-
erating a character based on the decoder state
(方程 1) or copying a character directly
from the input sequence x by sampling from
the attention distribution (Gu et al., 2016; 看
等人。, 2017).
2Rijhwani et al. (2020) incorporate translations into the
model with a multi-source encoder. We omit this from our
formulation, considering applicability to texts without avail-
able translations. 然而, adding an encoder into our
framework remains straightforward and can be used if
translations exist.
• Coverage: The coverage vector keeps track
of attention weights from previous timesteps.
It is used as additional information when
computing ct and is added to the training loss
to discourage the model from repeatedly at-
tending to the same character (Mi et al., 2016;
Tu et al., 2016).
The model is trained in a supervised manner
with a small number of manual transcriptions:
The training data includes pairs of first-pass OCR
text with its corresponding error-free transcription.
The post-correction training loss function (的-
noted as L) is a combination of cross-entropy loss
along with the diagonal attention loss and the cov-
erage loss from the structural biases. Inference with
a trained model is performed using beam search.
在以下部分中, we use the method
described above as a base model for our pro-
posed semi-supervised learning technique for
OCR post-correction. Given the minimal man-
ually transcribed data we have in endangered lan-
guages, our approach aims to efficiently use the
relatively larger number of pages without gold tran-
scriptions to improve performance. 为此,
we introduce two methodological improvements:
(1) self-training and (2) lexically aware decoding.
4 Self-Training
A
Self-training is
semi-supervised learning
方法, where a trained model is used to make
predictions on unlabeled data, and the model is
then retrained on its own predictions (Zhu and
Goldberg, 2009).
Consider that we have a set of images with
manually created transcriptions and a set of images
without gold transcriptions. We can obtain a first
pass transcription for the text contained in the
图片 (both sets) with existing OCR tools.
更正式一点, we have a gold-transcribed da-
taset D = {(西德:3)X(我), y(我)(西德:4)}d
我=1, where x(我) is the first
pass transcription and y(我) is the error-free man-
ual transcription of the ith training instance.3 We
also have a dataset for which only the first pass
OCR is available (IE。, no manual transcriptions),
3In our dataset, the source and target data instances are
either at the line-level or the sentence-level (参见章节 6.1).
1287
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
4
2
7
1
9
7
4
7
6
3
/
/
t
我
A
C
_
A
_
0
0
4
2
7
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
U = {X(j)}你
gered languages setting,
transcriptions is much larger, 那是, 你 (西德:5) d.
j=1. For most cases in the endan-
the set without gold
Since self-training requires a baseline model to
get an initial set of predictions on U , we first train
the base model described in Section 3 on the gold-
transcribed set D. Let the trained base model be
fθ. 下一个, we use the predictions on U from fθ to
self-train the model. We follow the self-training
strategy recommended in He et al. (2020), 哪个
involves two steps: ‘‘pseudo-training’’ and ‘‘fine-
tuning’’. We describe each step of the self-training
procedure in detail below:
1. Apply the initial OCR post-correction model
fθ to each instance in the set U to obtain
predictions using beam search inference.
For an instance x, let the prediction be fθ(X).
2. Create a pseudo-annotated dataset with the
predictions from step 1. Let this be S =
{(西德:3)X, fθ(X)(西德:4) | x ∈ U }.
3. Train the model fθ on sets U and S.
This is the pseudo-training step. 这里, 我们
first train the encoder and the decoder com-
ponents with a language modeling objective,
and then train the end-to-end post-correction
模型. The procedure is as follows:
(A) Train the encoder with a character-level
语言建模 (LM) objective on U .
As discussed in Section 3, the encoder
component of the model is an LSTM
that operates at the character-level. 我们
pseudo-train this LSTM with a language
model objective on each text sequence
x ∈ U .
那是, at each timestep t, the LSTM is
trained to predict the next character in
the input sequence. Given a sequence of
characters x = [x1, . . . , xN ], the train-
氮
t=1 P (xt |
ing objective maximizes
x1, . . . , xt−1).
This is the standard LM objective func-
tion and has been proven helpful for
pretraining LSTMs to improve hidden
陈述 (Dai and Le, 2015;
Ramachandran et al., 2017).
(西德:2)
(乙) Train the decoder LSTM with the LM
objective described above, using the
baseline model’s predictions {fθ(X) |
x ∈ U }.
(C) Train the sequence-to-sequence model
on the pseudo-annotated dataset S with
the post-correction loss function L from
部分 3.
4. Given the pseudo-trained model fθ, fine-tune
the model on the gold-transcribed dataset D,
with the loss function L.
5. Repeat step 1 to step 4 until a specified
stopping criterion (例如, no improvement in
the validation set performance or reaching
the maximum permitted iterations).
As indicated above, self-training is a straight-
forward semi-supervised technique to leverage
documents without gold transcriptions to improve
OCR post-correction performance. We note that
some self-training methods (Yarowsky, 1995; 李
等人。, 2013; Zoph et al., 2020, inter alia) 代替
脚步 4 和 5 with a single step that trains fθ on
S ∪ D. 然而, this led to slightly worse perfor-
mance in our preliminary experiments. 我们也
observed that pseudo-training the LSTMs with an
LM objective (脚步 3(A) 和 3(乙) 多于) is neces-
sary for good performance and that applying the
self-training steps on fθ from the previous iteration
led to better results than re-initializing the model.4
更远, as recommended in He et al. (2020) 到
improve self-training for neural sequence genera-
的, we add a dropout layer into the base model
at the encoder and decoder hidden states during
pseudo-training and fine-tuning (脚步 3 和 4).
5 Lexically Aware Decoding
Although self-training is a simple approach that
leads to improvements in post-correction per-
formance without additional manual annotation,
incorrect predictions in the pseudo-annotated data
may introduce noise into the model, 潜在地
reinforcing the errors in the next self-training
4In preliminary experiments, we also tried using S ∪ D
in step 3(C). 然而, the post-correction performance was
approximately the same as using only the set S.
1288
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
4
2
7
1
9
7
4
7
6
3
/
/
t
我
A
C
_
A
_
0
0
4
2
7
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
iteration (Zhu and Goldberg, 2009). Such noise is
more likely to occur in the endangered languages
环境, where the base model is trained on min-
imal data and thus sometimes generates errone-
ous predictions.
While some self-training methods use confi-
dence scores to remove noisy predictions (例如
Yarowsky, 1995), these are typically designed for
classification tasks. Designing such heuristics is
challenging for OCR post-correction because the
predictions are generated at the character-level;
specific characters may be incorrect, but discard-
ing the entire predicted sequence (IE。, the line or
句子) is inefficient, particularly in a low-
resource scenario. To mitigate these issues, we pro-
pose lexically aware decoding, an inference strategy
based on our observations of the challenges asso-
ciated with the OCR post-correction task.
进一步来说, our preliminary experiments
with self-training indicated that the errors made
by the model are typically inconsistent. 为一个
particular word, some instances may be correctly
predicted by the model. For the instances of the
word that are incorrect, we observe that they
are likely to be erroneous in different ways, IE。,
different subsets of characters in the word are
incorrectly predicted. This is expected since the
same word can appear in varied contexts, 或者
first pass OCR for the word can differ. Our empir-
ical observations on the pseudo-annotated dataset
S showed that, since the errors are inconsistent,
the correct form of the word is more frequent
than incorrect forms. Lexically aware decoding
is designed to influence the OCR post-correction
model to generate words that frequently occur in
the set S, in the expectation that these are correct
word forms.
We first describe the construction of a model
that accounts for word frequency in the predictions
along with a character n-gram model to enable the
prediction of unseen words. 然后, we present a
joint decoding method that uses the frequency-
based models in combination with the LSTM
decoder for improved OCR post-correction.
5.1 Count-Based Language Model
From the self-training method in Section 4, 我们
have a pseudo-annotated dataset S = {(西德:3)X, fθ(X)(西德:4) |
x ∈ U }, where fθ(X) is the model’s prediction
for input sequence x. We train a count-based
word-level unigram language model (LM) 在
{fθ(X) | x ∈ U }. The LM is built by computing
frequency-based probabilities for each word found
in the predictions.
We also have to reserve some probability mass
in the LM to account for unknown words (字
unseen in the predictions). We use modified
Kneser-Ney smoothing to derive the unknown
word (‘‘
a unigram LM, the smoothing process is simi-
lar to absolute discounting. 然而, 我们使用
discount values based on the modified Kneser-
Ney method, which are derived from word counts
in the dataset, as opposed to using a fixed dis-
count value (Kneser and Ney, 1995; Chen and
古德曼, 1999). We denote the probability from
the smoothed LM for a known word w as
pword(w) and the unknown word probability as
pword(
A count-based unigram LM is a simple model
but is suitable given our empirical observations
on word-level errors (described earlier in this
部分) 因为 (1) it explicitly models word
频率, (2) 它
is straightforward to update
as the pseudo annotated dataset improves over
self-training iterations, 和 (3) it can be expressed
as a weighted finite-state automaton which, 和我们一样
discuss next, has several properties useful for our
decoding method.
5.2 Weighted Finite State Automaton
A weighted finite-state automaton (WFSA) is a set
of states and transitions between the states. 每个
transition accepts a particular symbol as input and
has a weight associated with it. The symbols come
from a finite alphabet Σ. A sequence of consecu-
tive transitions is referred to as a ‘‘path’’, 和
label of a path is the concatenation of all sym-
bols consumed by its constituent transitions. 这
WFSA has a start state and a set of final states. A
successful path is a path from the start state to a fi-
nal state, and a sequence of symbols is ‘‘accepted’’
by the WFSA if there exists a successful path that
consumes this sequence (Mohri et al., 2002).
Since we are focused on decoding and only need
the best scoring path for any given sequence (IE。,
Viterbi search), we consider the weights over the
tropical semiring. 那是, the weight of a path is
the sum of its transition weights, and the score of
a sequence of symbols is the minimum weight of
all the successful paths that accept that sequence.
1289
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
4
2
7
1
9
7
4
7
6
3
/
/
t
我
A
C
_
A
_
0
0
4
2
7
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
4
2
7
1
9
7
4
7
6
3
/
/
t
我
A
C
_
A
_
0
0
4
2
7
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
数字 2: 这 (A) WFSA and (乙) minimized WFSA we construct, for a hypothetical language model with a two
word vocabulary: 磷 (狗) = 0.75; 磷 (门) = 0.2; 磷 (
probabilities. 在 (乙), for simplicity, we show only the known word states after determinization and minimization.
Decoding with the post-correction model is at
the character-level (方程 1), so in order to
leverage word frequency in the decoding process,
we convert the count-based word-level LM de-
scribed in Section 5.1 to a WFSA representation
that consumes and scores sequences at
这
character-level.
The WFSA is constructed to accept the words
known to the LM by consuming each character in
这个单词 (in sequence) as input. The score of the
path that accepts a known word w is the negative
log of its probability from the LM: −log pword(w).
A simple example is shown in Figure 2(A).
The WFSA, as described above, can only accept
a single word. 然而, the input and correspond-
ing predictions of the post-correction model are
sequences of words, typically lines or sentences.
To enable the WFSA to accept such sequences,
we add transitions that accept a set B of word
boundary symbols (whitespace, punctuation, 和
end-of-sequence) from the states at the end of the
known words (例如, 状态 1 和 2 图中 2(A))
back to the start state. Once in the start state, 这
model can begin consuming characters from the
next word.
更远, we modify the WFSA such that the start
state is also the only final (accepting) state since
the predicted sequence is considered complete
only when the model predicts an end-of-sequence
symbol after the last character.
based LM, we include an unknown word state in
the WFSA, 如图 2(A). We add an
(西德:2)-过渡 (a transition that consumes no input),
with an associated cost −log pword(
the probability mass reserved for unknown words
in Section 5.1) to enter the unknown word state
from the start state. The model remains in the
unknown state, accepting symbols that form an
unseen word, until a word boundary symbol from
the set B is consumed to return to the start state.
We design the unknown word state to accept
any combination of the symbols in Σ, 从而
permitting the prediction of words unseen by the
word-level LM. To score each character consumed
at the unknown word state, we use a character-
level n-gram language model.5 We denote the
probabilities from this character n-gram LM as
pchar. The probability distribution is estimated
with modified Kneser-Ney smoothing on char-
acter n-grams from unique word forms in the
放 {fθ(X) | x ∈ U }. We use unique word
forms because unknown words are likely rare, 和
using count-based word forms would undesir-
ably shift the probability mass towards more fre-
quent words.
5.3 Efficient Scoring with the WFSA
The constructed WFSA has states to score char-
acter sequences that form known words and an
Character LM for Unknown Words To enable
the prediction of words unknown to the count-
5We use n = 6 在本文中. We experimented with
different values of n in early experiments but found that
n = 6 gave the best results for all languages in the dataset.
1290
unknown word state that relies on a character
n-gram LM to score unknown sequences.
During inference, we independently score the
next character through the known word model
and the unknown word model and then choose the
best scoring path. This formulation has two advan-
塔盖斯: (1) separate scoring allows us to compactly
represent the WFSA states for known words and
(2) instead of representing the character n-gram
LM directly in the WFSA, leading to the number
of states exponentially increasing with n, 我们可以
use highly optimized LM toolkits such as KenLM
(Heafield et al., 2013) for scoring unknown words.
Known Word Model Consider the WFSA with
only known word states. We apply standard al-
gorithms for determinization and minimization on
these states, which leads to an efficient and com-
pact representation of the count-based language
模型 (Mohri, 1996). 如图 2(乙), 这
resultant minimized WFSA has several properties
useful for our decoding method, 下面讨论.
Determinization ensures that each state has
at most one outgoing transition that consumes a
given input symbol, and minimization eliminates
redundant states and transitions, reducing the time
and space needed to process an input sequence.
更远, minimization includes pushing the
transition weights towards the start state of the
WFSA as much as possible (Mohri et al., 2002).
This lends itself well to our method since inference
in the OCR post-correction model is performed
with beam search; if the cost of a path is es-
tablished closer to the start state, unfavorable
hypotheses can be pruned at an earlier timestep,
which allows us to avoid errors more effec-
tively within an approximate search algorithm like
beam search.
最后, since each state in the WFSA has at
most one outgoing transition for each symbol,
the transition scores can be precomputed and
stored as a matrix, allowing efficient retrieval
during decoding.
At decoding timestep t, let the previous timestep
score from the known word model be known(yt−1)
and the current WFSA state be st−1. The score for
predicting the next character yt is the weight of
the transition from state st−1 that consumes yt in
the minimized WFSA (见图 2(乙)). 因此,
已知的(yt) = known(yt−1) + scorewfsa(yt | st−1)
where known(y0) = 0. If yt does not continue
the path of any known word, then scorewfsa(yt)
is inf.
Unknown Word Model We use the probability
pchar from the character n-gram language model
to score unknown words. 一般来说, at decod-
ing timestep t, the unknown model score for yt
will be:
unk(yt) = unk(yt−1) − log pchar(yt | yt−n, . . . , yt−1)
然而, if yt−1 ∈ B (IE。, the previous word is
完全的) or t = 0, the WFSA is currently in the
start state. To begin an unknown word, we also
need to add the weight of entering the unknown
word state to unk(yt), 那是, − log pword(
Best Scoring Path The scores are in the tropical
semiring (negative log probabilities). At timestep
t, the best score for yt from the lexical models is:
scorelex(yt) = min(已知的(yt), unk(yt))
(2)
During decoding, we keep track of both the
known and unknown model scores for the current
word being generated in the hypothesis. 什么时候
the word is completed (when yt ∈ B), 两者都
known and unknown word models return to the
start state of the WFSA (见图 2). 自从
the two paths are in the same state and are thus
indistinguishable with respect to future predictions
in the hypothesis, we choose the best scoring path
to continue decoding. This is known as hypoth-
esis recombination.
The WFSA framework, 因此, allows us to effi-
ciently represent the word-level LM in a manner
that scores symbols at the character-level and
leverage a character n-gram model to score un-
known words. This enables joint inference with
the character-level LSTM decoder in the OCR
post-correction model, as discussed below.
5.4 Joint Decoding with the LSTM
At decoding timestep t, let plstm(yt) be the prob-
ability of generating a character yt based on the
LSTM decoder’s hidden state (方程 1). 我们
also compute scorelex(yt), which is a negative log
probability, as defined in Equation 2. The final
probability of predicting yt is obtained through
1291
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
4
2
7
1
9
7
4
7
6
3
/
/
t
我
A
C
_
A
_
0
0
4
2
7
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
linear interpolation between these two scores,6
weighted by a hyperparameter λ:
p(yt) = (1 − λ) · plstm(yt) + λ · plex(yt)
(3)
where plex(yt) = exp (−scorelex(yt)).
This joint decoding strategy is applied when
performing inference with beam search using a
trained OCR post-correction model. When used
in combination with self-training, the predictions
made by the model improve as we repeat the self-
training process, iteratively improving the count-
based LM and resulting in a better distribution
of plex(yt).
6 实验
在这个部分, we present experiments with our
semi-supervised post-correction method on four
typologically diverse endangered languages.
6.1 数据集
We use the OCR post-correction dataset from
Rijhwani et al. (2020), which contains transcribed
documents in three endangered languages: Ainu,
Griko, and Yakkha. 此外, 在本文中,
we create a similar dataset in the endangered
language Kwak’wala. We describe the datasets
以下, including the sizes of the gold transcribed
and unannotated sets we use for semi-supervised
训练:
Ainu (ain) is a severely endangered language
from northern Japan. The dataset contains pages
from a book of Ainu epic poetry (Kindaichi, 1931).
The Ainu text is written in the Latin script. 这
dataset contains 816 manually transcribed lines as
well as 7,646 lines without gold transcriptions.
Griko (grk), an endangered Greek dialect spo-
ken in southern Italy, is written with a combination
of Latin and Greek alphabet. The document in
the dataset is a book of Griko folk tales (Stomeo,
1980). 有 807 和 3,084 sentences with and
without gold transcriptions, 分别.
Yakkha (ybh) is an endangered language spo-
ken in Nepal and is written in the Devanagari
script. The dataset contains transcriptions of three
children’s books (Schackow, 2012). 总共, 那里
是 159 manually transcribed sentences and no
6We leave other interpolation techniques like log-linear
interpolation and more complex combinations of scores
(such as the WFSA-based reranking and rescoring methods
proposed by Ryskina et al. [2021]) as potential future work.
unannotated lines in the dataset. 所以, 作为
unannotated set, we use the first pass OCR on
the validation and test sets in a transductive learn-
ing setting (≈ 30 句子: 参见章节 6.2 为了
data splits).
Kwak’wala (kwk) is spoken on Northern Van-
couver Island, nearby small
islands, 和
opposing mainland. The language is severely en-
dangered, with estimates of ≈150 first-language
speakers, all over the age of 70 年. The Kwak’
wala language includes 42 consonantal pho-
nemes (twice as many as English) and a wide
range of allophonic vowels. Several writing sys-
tems exist and community preference varies
between two orthographies:
the U’mista and
Liq’wala systems.
然而, much of the written documentation
for Kwak’wala is in another orthography that was
developed by anthropologist Franz Boas. 这
Boas orthography (Boas, 1900) was used in the
extensive documentation of the Kwak’wala lan-
guage and its speakers produced by Boas in col-
laboration with native-speaker George Hunt. 这
Boas writing system uses Latin script characters
as well as diacritics and digraphs to represent
phonemic differences. Although the Boas orthog-
raphy is not widely used today, the cultural and
linguistic materials previously written by Boas
are of tremendous value to community-based re-
searchers. 然而, they are minimally accessible
since they currently exist only as non-searchable
scanned images.
In consultation with members of language revi-
talization projects in three Kwakiutl communities
(Tsulquate, Fort Rupert, Quatsino), we focus on
digitizing these significant cultural resources. 我们
create a dataset with pages from the ‘‘Ethnology
of the Kwakiutl’’ (Boas, 1921), containing 262
gold-transcribed lines and 2,255 unannotated lines.
6.2 实验装置
Data Splits We follow Rijhwani et al. (2020)
and perform 10-fold cross-validation for all ex-
实验. For each language, the gold-transcribed
data is split into 10 细分市场, and for each cross-
validation fold, eight segments are used for train-
英, one for validation, and one for testing.
Metrics We evaluate our systems in terms of
character error rate (CER) and word error rate
(WER), both standard metrics for measuring
OCR and OCR post-correction performance
1292
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
4
2
7
1
9
7
4
7
6
3
/
/
t
我
A
C
_
A
_
0
0
4
2
7
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
(Berg-Kirkpatrick et al., 2013; Schulz and Kuhn,
2017). CER is the character-level edit distance
between the predicted text and the corresponding
gold transcription, divided by the total number
of characters in the gold transcription. WER is
similar but is calculated at the word-level. 为了
readability, we report CER and WER as percent-
ages for all experiments.
方法
performance of the following methods:
In our experiments, we compare the
• FIRST-PASS: To obtain the first pass OCR
transcription, we experiment with two ex-
isting OCR systems: Google Vision (Fujii
等人。, 2017) and Ocular (Berg-Kirkpatrick
等人。, 2013).
For each language, we choose the best per-
forming first pass system,
the details of
which are in Section 6.4. We use Ocular
for Kwak’wala and Google Vision for Ainu,
Griko, and Yakkha.
• BASE: The current state-of-the-art in OCR
post-correction for endangered language
文本 (Rijhwani et al., 2020; described in
部分 3).
• SEMI-SUPERVISED: Our proposed method as
节中描述 4 and Section 5.
Implementation The neural post-correction
models are implemented using the DyNet neural
network toolkit (Neubig et al., 2017). The WFSA
is implemented using the MFST Python wrapper
on OpenFST (Francis-Landau, 2020), and we use
the KenLM toolkit (Heafield et al., 2013) to train
and query the character n-gram language model.
Following Rijhwani et al. (2020), results reported
are the average of five randomly seeded runs (IE。,
five runs for each of the 10 cross-validation folds).
6.3 Main Results
桌子 1 shows the performance of the baselines
and our proposed semi-supervised approaches for
the four languages in the dataset. For all languages,
using semi-supervised learning leads to substantial
reductions in both CER and WER.
best models based on the validation set WER. Ex-
tensive analysis of these factors is in Section 6.6.
第一的, we note that the BASE post-correction
method improves error rates over the first pass for
all languages. With our proposed semi-supervised
learning method, combining self-training with
lexically aware decoding leads to the best per-
formance across all the languages, with error rate
reductions in the range of 15%–29%.
This is especially noticeable in Ainu, 在哪里
using either self-training or lexical decoding in-
dependently results in worse performance than
the BASE system, but jointly using them improves
the CER by 21%. For the other languages, 这
independent components improve over the base
model but less so than their combination. 这
indicates the complementary nature of the two
成分: The language model used for lex-
ically aware decoding is improved by self-training.
反过来, it reinforces correctly predicted words
to counteract the influence of incorrect pseudo-
annotated instances.
6.4 First Pass OCR Systems
We experiment with two existing OCR systems
to obtain a first pass transcription on our dataset.
The first of these is the Google Vision system
(Fujii et al., 2017; Ingle et al., 2019). This off-
the-shelf OCR model supports 60 languages in
27 scripts—these are primarily higher-resourced
languages and do not include our target endan-
gered languages.
The second system is Ocular (Berg-Kirkpatrick
等人。, 2013). Ocular uses a generative model to
transcribe scanned documents: The model gen-
erates the image by learning the font of the
文档. Ocular relies on a character n-gram
language model trained on the target language.
We initialize the LM with the small number of
gold-transcribed pages in our dataset. 为了这, 我们
use the 10-fold cross-validation setup described in
部分 6.2: We use the training segments to train
the Ocular LM and the test segment to evaluate
OCR performance. The font model has parame-
ters to learn the shape of each character in the LM
词汇. After initialization, the parameters
are updated in an unsupervised manner with EM
until convergence.
We note that we did a hyperparameter search
over the number of self-training iterations and the
weight of the WFSA λ, 和表 1 presents the
Results are presented in Table 2. We note that
although the Google OCR system is not trained on
our target languages, it is trained on large amounts
1293
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
4
2
7
1
9
7
4
7
6
3
/
/
t
我
A
C
_
A
_
0
0
4
2
7
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
模型
FIRST-PASS
BASE
SEMI-SUPERVISED
Self-Training
Lexical Decoding
两个都
Error Reduction
(西德:3)
BASE−Both
BASE
% Character Error Rate
% Word Error Rate
ain
1.34
0.80
grk
3.27
1.70
ybh
8.90
8.44
kwk
7.90
4.97
ain
6.27
5.19
grk
15.63
7.51
ybh
31.64
21.33
kwk
38.22
27.65
1.45
1.51
6.47
5.31
7.20
0.82
6.60
0.81
5.18
7.56
6.36
0.63 1.37 5.98 3.82 4.43
21% 19% 29% 23% 15% 15%
4.00
4.28
23.98
18.09
19.13
25.09
16.65 22.61
18%
22%
(西德:4)
桌子 1: Our semi-supervised approach improves performance over the baselines (10-fold
cross-validation averaged over five randomly seeded runs). ‘‘Self-Training’’ and ‘‘Lexical
Decoding’’ refer to experiments where we use these methods independently. ‘‘Both’’ refers to
their combination. We highlight the best model for each language.
% Character Error Rate
OCR System
Ocular
Google Vision
ain
10.49
1.34
grk
4.58
3.27
ybh
75.60
8.90
kwk
7.90
21.12
ain
47.47
6.27
% Word Error Rate
ybh
99.37
31.64
grk
15.71
15.63
kwk
38.22
82.08
桌子 2: First pass OCR system performance. If the language’s script is not covered by Google Vision
(as for Kwak’wala), then Ocular results in better recognition. 否则, Google Vision OCR is usually
significantly better.
of data in high-resource languages that share writ-
ing systems with Ainu, Griko, and Yakkha (Latin,
Greek, Devanagari scripts) and thus can recognize
characters in these scripts with reasonable accu-
活泼的 (see Rijhwani et al. [2020] for a more detailed
分析). 另一方面, the performance is
much worse on Kwak’wala since the system has
not been trained on the Boas orthography, 哪个
is unique to the Kwak’wala language. The Boas
orthography uses several Latin script characters,
which the system is able to recognize, 但它也
includes characters unique to the writing system
that are incorrectly transcribed by the OCR model.
We find that the performance on Kwak’wala
is considerably better with the Ocular system
because the LM is trained on Kwak’wala text.
因此, unlike Google Vision, the model vocabu-
lary contains the Boas writing system’s alphabet.
另一方面, Ocular’s performance on Ainu
and Griko is worse than Google Vision, likely
due to the minimal data available for training it.
而且, we observe that performance is cor-
related with the word overlap between test data
and the data used for training the LM, demon-
strating Ocular’s reliance on a strong language
model—the word overlap is 73% for Griko, 56%
for Kwak’wala, 和 48% for Ainu.
最后, we find that Ocular does not perform
well on the Yakkha dataset. This is because the
design of Ocular’s font model does not work
with how the Devanagari script is written. 更多的
具体来说, when a vowel diacritic is applied
the characters are combined:
to a consonant,
. In Unicode, this is repre-
=
+
例如,
’’, 在哪里
sented by two characters ‘‘
the dotted circle is the character combination
marker in the Unicode Standard.7
’’ and ‘‘
然而, since Ocular’s font model operates at
the character-level, it tries to generate the images
of these two characters separately. Generating the
’’ on its own is not meaningful: 这
diacritic ‘‘
dotted circle never appears in the input image
because it is supposed to be combined. 因此, 这
font model is unable to converge as it cannot
handle character combinations when generating
图像.
7https://www.unicode.org/versions/Unicode
13.0.0/ch02.pdf.
1294
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
4
2
7
1
9
7
4
7
6
3
/
/
t
我
A
C
_
A
_
0
0
4
2
7
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
Known Word Unknown Word
模型
CHARLM
WORDLM
OURS
ain
模型
(not needed)
0.64
Character uniform 0.64
Character n-gram 0.63
% Character Error Rate
% Word Error Rate
grk
1.43
1.42
1.37
ybh
6.22
6.12
5.98
kwk
3.85
3.95
3.82
ain
4.50
4.50
4.43
grk
6.44
6.39
6.36
ybh
16.78
16.71
16.65
kwk
22.90
23.11
22.61
桌子 3: A more informed unknown word model (character n-gram) in combination with the word-level
known word model consistently performs better than the alternatives for all four languages in our
dataset.
6.5 Comparing Language Models
Our proposed decoding method uses a count-based
word-level LM in combination with a character
n-gram LM to compute plex for joint decoding with
the LSTM decoder (方程 3). 在这个部分,
we substitute this model with two other variants
of count-based LMs to compute plex and evaluate
their performance:
Lang.
Code
ain
grk
ybh
kwk
Average
LM
Known
Unknown
Coverage BASE OURS BASE OURS
0.25
0.95
0.71
0.89
0.59
0.90
0.89
0.58
0.91 0.95 0.40 0.53
0.97
0.94
0.68
0.59
0.80
0.08
0.51
0.51
0.50
0.98
0.96
0.95
0.92
• CHARLM: We use a character 6-gram lan-
guage model trained on the model predictions
from self-training {fθ(X) | x ∈ U }, esti-
mated with modified Kneser-Ney smoothing.
• WORDLM: We use the word-level LM de-
scribed in Section 5.1, but do not use a
character n-gram model for unknown words.
反而, we score unknown words with a sim-
ple uniform probability over all characters in
the vocabulary.
We tune λ on the validation set for each model
independently and report results with the best set-
ting in Table 3. Using either CHARLM or WORDLM
for lexically aware decoding improves the er-
ror rates with respect to the BASE model. 这
word-level model performs better for all languages
except Kwak’wala, likely due to the large per-
centage of unknown words in this language. 我们
also see that our proposed method, which lever-
ages a count-based word-level LM for known
words combined with a character-level LM for
scoring unknown words, results in the best per-
formance overall.
Although not observed in our dataset, we note
that some printed materials have a high degree
of spelling variation or contain texts for which
word tokenization is difficult. 在这种情况下, 这
word-level model may not be as effective, 但
CHARLM can still be used with the proposed
桌子 4: Our method improves over the base
model on words that are both known and unknown
to the WFSA. We show the fraction of known test
字, and the fraction of correctly predicted
known and unknown words.
lexically aware decoding framework to obtain
improved performance over the baseline method.
6.6 分析
We analyze specific components of our model
to understand the advantages of our proposed
方法.
Known vs. Unknown Words We first identify
the source of the improvements that our approach
makes over the baseline. 桌子 4 presents the
fraction of correctly predicted words, split on
whether these words are ‘‘known’’ to the WFSA
(IE。, in the vocabulary of the word-level LM) 或者
‘‘unknown’’. 直观地, we expect that decoding
with the WFSA will improve prediction on the
known words.
Compared to the baseline, our method improves
on words known to the WFSA, moving from
91% 到 95% accuracy on average. Our method
also improves unknown word prediction over the
baseline from an average accuracy of 40% 到 53%.
In cases like Kwak’wala, 在哪里, due to the rich
morphology of the language, 多于 40% 的
1295
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
4
2
7
1
9
7
4
7
6
3
/
/
t
我
A
C
_
A
_
0
0
4
2
7
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
4
2
7
1
9
7
4
7
6
3
/
/
t
我
A
C
_
A
_
0
0
4
2
7
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
数字 3: The weight of the WFST during joint decod-
ing can affect word error rate (sometimes significantly,
as in Griko; 顶部). All other hyperparameters are kept
equal and correspond to the best systems in each
语言.
数字 5: Even a small amount of unannotated data
is useful for our semi-supervised method, improv-
ing WER over BASE (WER=7.51) 在 (A). Varying the
size of gold-annotated data has a stronger effect on
post-correction performance in (乙). Results are shown
with Griko.
This hyperparameter is less important in the
other three languages, leading to smaller variations
in performance. 举个例子, we depict the
effect on Yakkha in Figure 3, where increasing λ
does not affect performance as much as in Griko.
Self-Training Iterations The
evolution of
WER across 5 self-training iterations for Ainu
and Yakkha is shown in Figure 4. Particularly
for Ainu, we see that combination with lexically
aware decoding is crucial for the success of
self-training does
self-training. For Yakkha,
improve performance independently but is more
effective when lexically aware decoding is used
(error rates on Griko and Kwak’wala follow a
similar trend).
Dataset Size We study the effect of varying the
amount of gold-transcribed and unannotated data
used for training. The WER when varying the size
of the Griko datasets is shown in Figure 5 (the size
of each set is varied while keeping the other set at
its full size). We see that reducing the amount of
gold-transcribed data worsens WER significantly.
另一方面, reducing the unannotated data
has a smaller effect: even with a little unannotated
数据, our method improves over the BASE model.
Error Rate in the First Pass OCR To evaluate
how the error rate in the first pass OCR tran-
scription affects subsequent post-correction, 我们
measure the performance of our proposed method
when applied to first pass outputs from two OCR
数字 4: Integrating lexically aware decoding through
interpolation with a WFSA (red lines) aids self-training
in improving WER across iterations. Black dashed lines
correspond to self-training without lexical decoding.
test words are unseen, including an unknown word
model in the WFSA is particularly important.
WFSA Weight One of the important hyperpa-
rameters of our lexically aware method is the
weight that we place on the WFSA score dur-
ing inference (λ in Equation 3). 具体来说,
in the case of Griko, we find that the value of
this hyperparameter can significantly affect per-
formance. 如图 3, high weights
of λ (IE。, more weight on the WFSA) lead to
suboptimal WER, while lower λ leads to much
better performance.
1296
数字 6: Our post-correction model fixes many of the first pass OCR errors that the base model does not fix such
作为 (A) 和 (乙). In rare cases, our method introduces errors into the transcription such as (C) 和 (d).
系统: Google Vision and Ocular (described in
部分 6.4). 数字 7 shows the WER on the
Kwak’wala dataset. We see that, although Google
Vision has a much higher first pass error rate
than Ocular, the post-correction model improves
performance over both OCR systems. 我们也
note that the relative error reduction is higher for
the Google Vision system (68%) than for Ocular
(41%), likely because the Ocular LM is trained on
the same data as the post-correction model.
Qualitative Analysis
图中 6, we show ex-
amples of errors fixed as well as errors introduced
by our post-correction model, as compared to the
baseline system. 图中 6 (A) 和 (乙), we see
that although the baseline corrects some of the
errors in the first pass OCR, it also introduces
errors such as extra diacritics and incorrect sub-
机构. Using our proposed method leads to
an error-free transcription of these images. 如何-
曾经, 图中 6 (C) 和 (d), we see that our
method occasionally introduces errors in predic-
系统蒸发散. 具体来说, although the model fixes the
first pass errors, it generates words that are con-
siderably different from the target. Such errors
likely occur when the model follows an incorrect
path in the WFSA during lexically aware decod-
英. Since we are using beam search, the correct
path cannot be recovered if it was pruned at an
earlier timestep.
7 相关工作
OCR post-correction is well-studied in the high-
resource setting, particularly for English. Recent
methods primarily use neural encoder-decoder
型号 (Dong and Smith, 2018; Rigaud et al.,
2019; H¨am¨al¨ainen and Hengchen, 2019). 那里
has been relatively little work on lower-resourced
数字 7: Our post-correction model significantly im-
proves recognition accuracy over different first pass
OCR systems that have varied error rates (Google Vi-
sion and Ocular). Results are shown with Kwak’wala.
语言. Kolak and Resnik (2005) 展示
a probabilistic edit distance model for post-
correction on Cebuano and Igbo, and Krishna
等人. (2018) use a sequence-to-sequence model
with a copy mechanism for improved performance
on Romanized Sanskrit OCR.
While existing neural post-correction methods
do not rely on lexical information, some earlier
methods use dictionaries to improve performance.
例如, Tong and Evans (1996) and Niklas
(2010) use lexicons in combination with n-gram
context to generate post-correction candidates for
erroneous words. These methods are typically
evaluated on English and assume the presence
of high-coverage lexicons (Schulz and Kuhn,
2017), making them difficult to adapt to endan-
gered languages.
Related to our decoding method are models
that incorporate lexical knowledge into neural
machine translation. Arthur et al. (2016) propose
1297
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
4
2
7
1
9
7
4
7
6
3
/
/
t
我
A
C
_
A
_
0
0
4
2
7
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
adding a dictionary for translating low-frequency
words and Zhang et al. (2018) improve decoding
by upweighting translations that contain relevant
字. 此外, there are methods which add
hard lexical constraints by forcing predictions
to contain user-specified words and phrases
(Hokamp and Liu, 2017; Post and Vilar, 2018).
最后, we note that our proposed approach
combines information from a neural model and
a finite-state machine to leverage the advantages
of both. In a similar direction, Rastogi et al.
(2016) and Lin et al. (2019) design finite state
architectures with paths weighted by contextual
features from an LSTM. These methods use joint
parameterizations of the models and are thus more
complex to train (particularly in the low-resource
环境) than the joint decoding method we pro-
pose in this paper.
8 结论
Digitization at scale for documents in under-
represented languages is a promising avenue to-
wards tackling one aspect of their marginalization,
the lack of data. With this work, we take a step to-
wards better digitization for extremely data-scarce
scenarios. We develop a semi-supervised method
that combines self-training with lexically aware
decoding, reducing error rates by up to 29% 超过
a state-of-the-art OCR post-correction model on
four typologically diverse endangered languages.
In future work, we plan to expand our method
to take advantage of additional outputs of the
language documentation process. 例如,
documentary linguists typically collect word lists
(which range from lists of common words like the
Swadesh lists (Swadesh, 1955) to domain-specific
vocabularies). Using such word lists within the
lexically aware decoding framework could further
improve performance and enable the application of
our technique to even lower-resourced languages.
the improvements we achieve
through semi-supervised learning are potentially
orthogonal to the improvements Rijhwani et al.
(2020) achieve by incorporating information from
translations of the target text. As future work, 我们
plan to investigate the combination of these two
approaches in an attempt to utilize all available
sources of information to improve performance.
此外,
最后, while using a character-level n-gram
LM improves performance on unknown words,
it does not explicitly utilize morphological struc-
ture to generate unseen inflections of words. 在
未来, we plan to incorporate morphological
analysis during post-correction decoding, 哪个
will be helpful for morphologically rich endan-
gered languages.
致谢
Shruti Rijhwani was supported by the Bloomberg
Data Science Ph.D. Fellowship for this work.
This work was also supported by grant PR-
276810-21 (‘‘Unlocking Endangered Language
Resources’’) from the National Endowment for
人文学科, grant 1761548 (‘‘Discover-
ing and Demonstrating Linguistic Features for
Language Documentation’’) from the National
Science Foundation, National Research Council
Indigenous Language Technology grant ‘‘Kwak’
wala Corpus Collection Project’’, and Govern-
ment of Canada Social Sciences and Humanities
Research Council
Insight Development grant
GR002807 ‘‘K’ank’otłaxants Awi’nagwis (Know-
ing our land)’’. Any views, 发现, 结论
or recommendations expressed in this publication
do not necessarily represent those of the National
人文学科捐赠.
We would like to thank Jaymyn La Vallee for
assisting with the Kwak’wala dataset annotation
as well as Samridhi Choudhary, Siddharth Dalmia,
Deepak Gopinath, Maria Ryskina, the reviewers,
and the Action Editors for feedback on the paper.
参考
Philip Arthur, Graham Neubig, and Satoshi
Nakamura. 2016. Incorporating discrete transla-
tion lexicons into neural machine translation. 在
诉讼程序 2016 Conference on Empir-
ical Methods in Natural Language Processing,
pages 1557–1567, Austin, 德克萨斯州. 协会
for Computational Linguistics. https://
doi.org/10.18653/v1/D16-1162
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua
本吉奥. 2015. Neural machine translation by
在
jointly learning to align and translate.
3rd International Conference on Learning
Representations, ICLR 2015.
Taylor Berg-Kirkpatrick, Greg Durrett, and Dan
克莱因. 2013. Unsupervised transcription of his-
torical documents. In Proceedings of the 51st
1298
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
4
2
7
1
9
7
4
7
6
3
/
/
t
我
A
C
_
A
_
0
0
4
2
7
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
Annual Meeting of the Association for Compu-
tational Linguistics (体积 1: Long Papers),
pages 207–217, Sofia, Bulgaria. 协会
计算语言学.
Conference on Document Analysis and Recog-
尼尼申 (ICDAR), 体积 1, pages 161–168.
IEEE. https://doi.org/10.1109/ICDAR
.2017.35
Franz Boas. 1900. Sketch of the Kwakiutl lan-
规格. American Anthropologist, 2(4):708–721.
https://doi.org/10.1525/aa.1900
.2.4.02a00080
Franz Boas. 1921. Ethnology of the Kwakiutl.
https://doi.org/10.1006/csla.1999
.0128
Stanley F. Chen and Joshua Goodman. 1999.
An empirical study of smoothing techniques
for language modeling. Computer Speech &
语言, 13(4):359–394.
Trevor Cohn, Cong Duy Vu Hoang, Ekaterina
Vymolova, Kaisheng Yao, Chris Dyer, 和
Gholamreza Haffari. 2016. Incorporating struc-
tural alignment biases into an attentional
neural translation model. 在诉讼程序中
这 2016 Conference of the North American
Chapter of the Association for Computational
语言学: 人类语言技术,
pages 876–885, 圣地亚哥, 加利福尼亚州. Associa-
tion for Computational Linguistics. https://
doi.org/10.18653/v1/N16-1102
安德鲁·M. Dai and Quoc V. Le. 2015. Semi-
supervised sequence learning. In Proceedings
of the 28th International Conference on Neural
Information Processing Systems – 体积 2,
NIPS’15, page 3079–3087, 剑桥, 嘛,
美国. 与新闻界.
Rui Dong and David Smith. 2018. Multi-input
attention for unsupervised OCR correction. 在
Proceedings of the 56th Annual Meeting of
the Association for Computational Linguistics
(体积 1: Long Papers), pages 2363–2372,
墨尔本, 澳大利亚. Association for Compu-
tational Linguistics. https://doi.org/10
.18653/v1/P18-1220
Matthew Francis-Landau. 2020. MFST: A python
for custom
openfst wrapper with support
semirings and jupyter notebooks.
Yasuhisa Fujii, Karel Driesen, Jonathan Baccash,
Ash Hurst, and Ashok C. Popat. 2017.
Sequence-to-label script identification for mul-
tilingual OCR. 在 2017 14th IAPR International
Jiatao Gu, Zhengdong Lu, Hang Li, and Victor
O.K. 李. 2016. Incorporating copying mech-
anism in sequence-to-sequence learning. 在
Proceedings of the 54th Annual Meeting of
the Association for Computational Linguistics
(体积 1: Long Papers), pages 1631–1640,
柏林, 德国. Association for Computa-
tional Linguistics.
Mika H¨am¨al¨ainen and Simon Hengchen. 2019.
From the paft to the fiiture: A fully automatic
NMT and word embeddings method for OCR
post-correction. In Proceedings of the Inter-
national Conference on Recent Advances in
自然语言处理 (RANLP 2019),
pages 431–436, Varna, Bulgaria. INCOMA Ltd.
https://doi.org/10.26615/978-954
-452-056-4 051
Junxian He,
Jiatao Gu,
Jiajun Shen,
和
Marc’Aurelio Ranzato. 2020. Revisiting self-
training for neural sequence generation. 8th
International Conference on Learning Repre-
句子, ICLR 2020, 亚的斯亚贝巴(埃塞俄比亚首都, 埃塞俄比亚-
pia, 四月 26-30, 2020, Conference Track
会议记录.
Kenneth Heafield, Ivan Pouzyrevsky, 乔纳森
H. 克拉克, and Philipp Koehn. 2013. Scalable
modified Kneser-Ney language model estima-
的. In Proceedings of the 51st Annual Meeting
of the Association for Computational Linguis-
抽动症 (体积 2: Short Papers), pages 690–696,
Sofia, Bulgaria. Association for Computational
语言学.
Sepp Hochreiter and J¨urgen Schmidhuber. 1997.
Long short-term memory. 神经计算,
9(8):1735–1780. https://doi.org/10.1162
/neco.1997.9.8.1735, Pubmed: 9377276
Chris Hokamp and Qun Liu. 2017. Lexically con-
strained decoding for sequence generation using
grid beam search. In Proceedings of the 55th
Annual Meeting of the Association for Compu-
tational Linguistics (体积 1: Long Papers),
pages 1535–1546, Vancouver, 加拿大. Associ-
ation for Computational Linguistics. https://
doi.org/10.18653/v1/P17-1141
1299
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
4
2
7
1
9
7
4
7
6
3
/
/
t
我
A
C
_
A
_
0
0
4
2
7
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
右. Reeve Ingle, Yasuhisa Fujii, Thomas Deselaers,
Jonathan Baccash, and Ashok C. Popat. 2019.
A scalable handwritten text recognition system.
arXiv 预印本 arXiv:1904.09150. https://
doi.org/10.1109/ICDAR.2019.00013
Pratik Joshi, Sebastin Santy, Amar Budhiraja,
Kalika Bali, and Monojit Choudhury. 2020.
The state and fate of linguistic diversity and
inclusion in the NLP world. 在诉讼程序中
the 58th Annual Meeting of the Association for
计算语言学, pages 6282–6293,
在线的. Association for Computational Lin-
语言学. https://doi.org/10.18653
/v1/2020.acl-main.560
Ky¯osuke Kindaichi. 1931. Ainu Jojishi Y¯ukara no
Kenky¯u [Research on Ainu Epic Yukar], T¯oky¯o:
T¯oky¯o Bunko.
Reinhard Kneser and Hermann Ney. 1995.
Improved backing-off for m-gram language
造型. 在 1995 International Conference
on Acoustics, Speech, and Signal Processing,
体积 1, pages 181–184. IEEE.
Okan Kolak and Philip Resnik. 2005. OCR
post-processing for low density languages. 在
Proceedings of Human Language Technol-
ogy Conference and Conference on Empirical
Methods in Natural Language Processing,
pages 867–874, Vancouver, British Columbia,
加拿大. 协会
for Computational
语言学. https://doi.org/10.3115
/1220575.1220684
Amrith Krishna, Bodhisattwa P. Majumder,
Rajesh Bhat, and Pawan Goyal. 2018. Upcycle
your OCR: Reusing OCRs for post-OCR text
correction in Romanised Sanskrit. In Proceed-
ings of the 22nd Conference on Computational
Natural Language Learning, pages 345–355,
布鲁塞尔, 比利时. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/K18-1034
Dong-Hyun Lee. 2013. Pseudo-label: The simple
and efficient semi-supervised learning method
for deep neural networks. In Workshop on
Challenges in Representation Learning, ICML.
Chu-Cheng Lin, Hao Zhu, Matthew R. Gormley,
and Jason Eisner. 2019. Neural finite-state
transducers: Beyond rational relations. In Pro-
ceedings of the 2019 Conference of the North
American Chapter of
the Association for
计算语言学: Human Language
Technologies, 体积 1 (Long and Short Pa-
pers), pages 272–283, 明尼阿波利斯, Minnesota.
计算语言学协会.
Haitao Mi, Baskaran Sankaran, Zhiguo Wang,
and Abe Ittycheriah. 2016. Coverage embed-
ding models for neural machine translation. 在
诉讼程序 2016 Conference on Empir-
ical Methods in Natural Language Processing,
pages 955–960, Austin, 德克萨斯州. 协会
计算语言学.
Mehryar Mohri. 1996. On some applications of
finite-state automata theory to natural language
加工. Natural Language Engineering,
2(1):61–80. https://doi.org/10.1017
/S135132499600126X
Mehryar Mohri, Fernando Pereira, 和迈克尔
Riley. 2002. Weighted finite-state transduc-
ers in speech recognition. Computer Speech &
语言, 16(1):69–88. https://doi.org
/10.1006/csla.2001.0184
Graham Neubig, Chris Dyer, Yoav Goldberg,
Austin Matthews, Waleed Ammar, Antonios
Anastasopoulos, Miguel Ballesteros, 大卫
蒋, Daniel Clothiaux, Trevor Cohn, Kevin
Duh, Manaal Faruqui, Cynthia Gan, Dan
Garrette, Yangfeng
吉, Lingpeng Kong,
Adhiguna Kuncoro, Gaurav Kumar, Chaitanya
Paul Michel, Yusuke Oda,
Malaviya,
Matthew Richardson, Naomi Saphra, Swabha
Swayamdipta, and Pengcheng Yin. 2017.
DyNet: The dynamic neural network toolkit.
arXiv 预印本 arXiv:1701.03980.
Kai Niklas. 2010. Unsupervised post-correction
of OCR errors. Master’s thesis. Leibniz Uni-
versit¨at Hannover.
Matt Post and David Vilar. 2018. Fast lexically
constrained decoding with dynamic beam
allocation for neural machine translation. 在
诉讼程序 2018 Conference of the
North American Chapter of the Association
for Computational Linguistics: Human Lan-
guage Technologies, 体积 1 (Long Papers),
pages 1314–1324, New Orleans, Louisiana.
1300
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
4
2
7
1
9
7
4
7
6
3
/
/
t
我
A
C
_
A
_
0
0
4
2
7
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
计算语言学协会.
https://doi.org/10.18653/v1/N18
-1119
Prajit Ramachandran, Peter Liu, and Quoc Le.
2017. Unsupervised pretraining for sequence
to sequence learning. 在诉讼程序中
2017 实证方法会议
自然语言处理, pages 383–391,
哥本哈根, 丹麦. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/D17-1039
Pushpendre Rastogi, Ryan Cotterell, and Jason
艾斯纳. 2016. Weighting finite-state transduc-
tions with neural context. 在诉讼程序中
这 2016 Conference of the North American
Chapter of the Association for Computational
语言学: 人类语言技术,
pages 623–633, 圣地亚哥, 加利福尼亚州. Associ-
ation for Computational Linguistics. https://
doi.org/10.18653/v1/N16-1076
C. Rigaud, A. Doucet, 中号. Coustaty, 和 J. Moreux.
2019. ICDAR 2019 competition on post-OCR
text correction. 在 2019 International Confer-
ence on Document Analysis and Recognition
(ICDAR), pages 1588–1593. https://土井
.org/10.1109/ICDAR.2019.00255
Shruti Rijhwani, Antonios Anastasopoulos, 和
Graham Neubig. 2020. OCR Post Correction for
Endangered Language Texts. In Proceedings
的 2020 Conference on Empirical Meth-
ods in Natural Language Processing (EMNLP),
pages 5931–5942, 在线的. 协会
计算语言学. https://doi.org
/10.18653/v1/2020.emnlp-main.478
Maria
蓝色的,
Ryskina,
泰勒
Eduard
Berg-Kirkpatrick, and Matthew R. Gormley.
2021. Comparative error analysis in neural and
finite-state models for unsupervised character-
level transduction. In Proceedings of the 18th
SIGMORPHON Workshop on Computational
研究
和
Morphology, pages 198–211, 在线的. Associa-
tion for Computational Linguistics. https://
doi.org/10.18653/v1/2021.sigmorphon
-1.22
in Phonetics, Phonology,
https://elar.soas.ac.uk/Collection
/MPI186180. Accessed: 2020-02-02.
Carsten Schnober, Steffen Eger, Erik-Lˆan Do
Dinh, and Iryna Gurevych. 2016. Still not there?
Comparing traditional sequence-to-sequence
models to encoder-decoder neural networks on
monotone string translation tasks. In Proceed-
ings of COLING 2016, the 26th International
Conference on Computational Linguistics:
技术论文, pages 1703–1714, 大阪,
日本. The COLING 2016 Organizing Committee.
Sarah Schulz and Jonas Kuhn. 2017. 多-
modular domain-tailored OCR post-correction.
这 2017 会议
在诉讼程序中
in Natural Language
Empirical Methods
加工, pages 2716–2726, 哥本哈根,
丹麦. Association for Computational Lin-
语言学. https://doi.org/10.18653
/v1/D17-1288
Abigail See, 彼得·J. 刘, and Christopher D.
曼宁. 2017. Get to the point: Summariza-
tion with pointer-generator networks. In Pro-
ceedings of the 55th Annual Meeting of the
计算语言学协会
(体积 1: Long Papers), pages 1073–1083,
Vancouver, 加拿大. Association for Computa-
tional Linguistics.
Paolo Stomeo. 1980. Racconti greci inediti di
Sternat´ıa, La nuova Ellade, s.I. https://土井
.org/10.1086/464321
Morris Swadesh. 1955. Towards greater accuracy
in lexicostatistic dating. International Journal
of American Linguistics, 21(2):121–137.
Xiang Tong and David A. 埃文斯. 1996. A statisti-
cal approach to automatic OCR error correction
in context. In Fourth Workshop on Very Large
语料库.
Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua
刘, and Hang Li. 2016. Modeling coverage
for neural machine translation. In Proceedings
of the 54th Annual Meeting of the Associa-
tion for Computational Linguistics (体积 1:
Long Papers), pages 76–85, 柏林, 德国.
计算语言学协会.
Diana Schackow. 2012. Documentation and
grammatical description of Yakkha, Nepal.
David Yarowsky. 1995. Unsupervised word sense
disambiguation rivaling supervised methods.
1301
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
4
2
7
1
9
7
4
7
6
3
/
/
t
我
A
C
_
A
_
0
0
4
2
7
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
In 33rd Annual Meeting of the Association
for Computational Linguistics, pages 189–196,
剑桥, 马萨诸塞州, 美国. 协会
for Computational Linguistics. https://
doi.org/10.3115/981658.981684
Jingyi Zhang, Masao Utiyama, Eiichro Sumita,
Graham Neubig, and Satoshi Nakamura. 2018.
Guiding neural machine translation with re-
trieved translation pieces. 在诉讼程序中
这 2018 Conference of the North American
Chapter of the Association for Computational
语言学: 人类语言技术,
体积 1 (Long Papers), pages 1325–1335,
New Orleans, Louisiana. 协会
计算语言学. https://土井
.org/10.18653/v1/N18-1120
Xiaojin Zhu and Andrew B. Goldberg. 2009. 介绍-
duction to semi-supervised learning. Synthesis
Lectures on Artificial Intelligence and Machine
学习, 3(1):1–130.
Barret Zoph, Golnaz Ghiasi, Tsung-Yi Lin, Yin
Cui, Hanxiao Liu, Ekin Dogus Cubuk, 和
Quoc Le. 2020. Rethinking pre-training and
self-training. Advances in Neural Information
Processing Systems, 33. https://doi.org
/10.2200/S00196ED1V01Y200906AIM006
我
D
哦
w
n
哦
A
d
e
d
F
r
哦
米
H
t
t
p
:
/
/
d
我
r
e
C
t
.
米
我
t
.
e
d
你
/
t
A
C
我
/
我
A
r
t
我
C
e
–
p
d
F
/
d
哦
我
/
.
1
0
1
1
6
2
/
t
我
A
C
_
A
_
0
0
4
2
7
1
9
7
4
7
6
3
/
/
t
我
A
C
_
A
_
0
0
4
2
7
p
d
.
F
乙
y
G
你
e
s
t
t
哦
n
0
7
S
e
p
e
米
乙
e
r
2
0
2
3
1302