Approximating Probabilistic Models as

Approximating Probabilistic Models as
Weighted Finite Automata

Ananda Theertha Suresh
Google Research
theertha@google.com

Brian Roark
Google Research
roark@google.com

Michael Riley
Google Research
riley@google.com

Vlad Schogol
Google Research
vlads@google.com

Weighted finite automata (WFAs) are often used to represent probabilistic models, 例如
n-gram language models, because among other things, they are efficient for recognition tasks
in time and space. The probabilistic source to be represented as a WFA, 然而, may come in
多种形式. Given a generic probabilistic model over sequences, we propose an algorithm to
approximate it as a WFA such that the Kullback-Leibler divergence between the source model
and the WFA target model is minimized. The proposed algorithm involves a counting step and a
difference of convex optimization step, both of which can be performed efficiently. We demonstrate
the usefulness of our approach on various tasks, including distilling n-gram models from neural
型号, building compact language models, and building open-vocabulary character models. 这
algorithms used for these experiments are available in an open-source software library.

1. 介绍

Given a sequence of symbols x1, x2, . . . , xn−1, where symbols are drawn from the alpha-
bet Σ, a probabilistic model S assigns probability to the next symbol xn ∈ Σ by

ps[xn|xn−1 . . . x1]

Some of the results in this article were previously presented in Suresh et al. (2019).

提交材料已收到: 13 一月 2020; 收到修订版: 11 九月 2020; 接受出版:
6 一月 2021.

https://doi.org/10.1162/COLI 00401

© 2021 计算语言学协会
根据知识共享署名-非商业性-禁止衍生品发布 4.0 国际的
(CC BY-NC-ND 4.0) 执照

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
C



/

A
r
t

C
e

p
d

F
/

/

/

/

4
7
2
2
2
1
1
9
3
0
9
8
8
/
C


_
A
_
0
0
4
0
1
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

计算语言学

体积 47, 数字 2

Such a model might be Markovian, 在哪里

ps[xn|xn−1 . . . x1] = ps[xn|xn−1 . . . xn−k+1]

such as a k-gram language model (LM) (Chen and Goodman 1998) or it might be
non-Markovian such as a long short-term memory (LSTM) neural network LM
(Sundermeyer, Schl ¨uter, and Ney 2012). Our goal is to approximate a probabilistic
model as a weighted finite automaton (WFA) such that the weight assigned by the
WFA is close to the probability assigned by the source model. Specifically, we will seek
to minimize the Kullback-Leibler (吉隆坡) divergence between the source S and the target
WFA model.

Representing the target model as a WFA has many advantages, including efficient
使用, compact representation, interpretability, and composability. WFA models have
been used in many applications including speech recognition (Mohri, 佩雷拉, 和
Riley 2008), speech synthesis (Ebden and Sproat 2015), optical character recognition
(Breuel 2008), machine translation (Iglesias et al. 2011), computational biology (Durbin
等人. 1998), and image processing (Albert and Kari 2009). One particular problem of
interest is language modeling for on-device (虚拟的) keyboard decoding (Ouyang et al.
2017), where WFA models are widely used because of space and time constraints. 一
example use of methods we present in this article comes from this domain, involving
approximation of a neural LM. Storing training data from the actual domain of use
(individual keyboard activity) in a centralized server and training k-gram or WFA
models directly may not be feasible because of privacy constraints (Hard et al. 2018).
To circumvent this, an LSTM model can be trained by federated learning (Koneˇcn `y
等人. 2016; McMahan et al. 2017; Hard et al. 2018), converted to a WFA at the server,
and then used for fast on-device inference. This not only may improve performance
over training the models just on out-of-domain publicly available data, but also benefits
from the additional privacy provided by federated learning. Chen et al. (2019) use the
methods presented in this article for this very purpose.

There are multiple reasons why one may choose to approximate a source model
with a WFA. One may have strong constraints on system latency, such as the virtual
keyboard example above. 或者, a specialized application may require a distri-
bution over just a subset of possible strings, but must estimate this distribution from a
more general model—see the example below regarding utterances for setting an alarm.
To address the broadest range of use cases, we aim to provide methods that permit
large classes of source models and target WFA topologies. We explore several distinct
scenarios experimentally in Section 5.

Our methods allow failure transitions (Aho and Corasick 1975; Mohri 1997) 在里面
target WFA, which are taken only when no immediate match is possible at a given state,
for compactness. 例如, in the WFA representation of a backoff k-gram model,
failure transitions can compactly implement the backoff (Katz 1987; Chen and Goodman
1998; Allauzen, Mohri, and Roark 2003; Novak, Minematsu, and Hirose 2013; Hellsten
等人. 2017). The inclusion of failure transitions complicates our analysis and algorithms
but is highly desirable in applications such as keyboard decoding. 更远, to avoid
redundancy that leads to inefficiency, we assume the target model is deterministic, 哪个
requires that at each state there is at most one transition labeled with a given symbol.

The approximation problem can be divided into two steps: (1) select an unweighted
automaton A that will serve as the topology of the target automaton and (2) 重量
the automaton A to form our weighted approximation ˆA. The main goal of this article

222

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
C



/

A
r
t

C
e

p
d

F
/

/

/

/

4
7
2
2
2
1
1
9
3
0
9
8
8
/
C


_
A
_
0
0
4
0
1
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

Suresh et al.

Approximating Probabilistic Models as WFA

数字 1
An unweighted automaton that specifies what one might say to set an alarm. The initial state is
the bold circle and the final state is the double circle. By convention, we terminate all accepted
strings with the symbol $. is the latter determination of the automaton’s weighting in the approximation. If the topology is not known, we suggest a few techniques for inferring topology later in the Introduction. We will now give some simple topology examples to illustrate the approximation idea. 在部分 5 we will give larger-scale examples. Consider the unweighted automa- ton A in Figure 1 that was designed for what one might say to set an alarm. To use this in an application such as speech recognition, we would want to weight the automaton with some reasonable probabilities for the alternatives. 例如, people may be more likely to set their alarms for six or seven than four. In the absence of data specifically for this scenario, we can fall back on some available background LM M, trained on a large suitable corpus. 尤其, we can use the conditional distribution pM[x1 . . . xn|x1 . . . xn ∈ L(A)] = pM[x1 . . . xn]1×1…xn∈L(A) (西德:80) x1…xn∈L(A) pM[x1 . . . xn] (1) 在哪里 1 is the indicator function and L(A) is the regular language accepted by the automaton A, as our source distribution S. We then use the unweighted automaton A as our target topology. If M is represented as a WFA, our approximation will in general give a different so- lution than forming the finite-state intersection with A and weight-pushing to normalize the result (Mohri, 佩雷拉, and Riley 2008; Mohri 2009). Our approximation has the same states as A whereas weight-pushed M ∩ A has O(|中号||A|) 状态. 此外, 重量- pushed M ∩ A is an exact WFA representation of the distribution in Equation (1). As stated earlier, some applications may simply require smaller models or those with lower latency of inference, and in such scenarios the specific target topology 223 l 从http下载 : / / 直接的 . 米特 . 呃呃 / c o l i / 拉蒂斯 – df / / / / 4 7 2 2 2 1 1 9 3 0 9 8 8 / c o l i _ a _ 0 0 4 0 1 压力 . 来宾来访 0 7 九月 2 0 2 3 01set3alarm2my4toforalarm5twelveeleventennineeightsevensixfivefourthreetwoone7$6o’clock$ Computational Linguistics Volume 47, 数字 2 (A) (乙) 数字 2 Topology examples derived from the corpus aab. States are labeled with the context that is remembered, ∧ denotes the initial context, (西德:15) the empty context, $ the final context (和
terminates accepted strings), and matches any symbol in a context. (A) 3-gram topology: failure
transitions, labeled with ϕ, implement backoff from histories xy to y to (西德:15). (乙) skip-gram
topology: failure transitions implement backoff instead from histories xy to x .

may be unknown. 在这种情况下, one choice is to build a k-gram deterministic finite
automaton (DFA) topology from a corpus drawn from S (Allauzen, Mohri, and Roark
2003). This could be from an existing corpus or from random samples drawn from S.
数字 2(A) shows a trigram topology for the very simple corpus aab. 数字 2(乙) 节目
an alternative topology that allows skip-grams. Both of these representations make use
of failure transitions. These allow modeling strings unseen in the corpus (例如, abab) 在一个
compact way by failing or backing-off to states that correspond to lower-order histories.
Such models can be made more elaborate if some transitions represent classes, 例如
names or numbers, that are themselves represented by sub-automata. As mentioned
previously, we will mostly assume we have a topology either pre-specified or inferred
by some means and focus on how to weight that topology to best approximate the
source distribution. We hence focus our evaluation on intrinsic measures of model
quality such as perplexity or bits-per-character.1

1 Model size or efficiency of inference may be a common motivation for the model approximations we

展示, which may suggest a demonstration of the speed/accuracy tradeoff for some downstream use of
the models. 然而, as stated earlier in this Introduction, there are mulitple reasons why an
approximation may be needed, and our goal in this article is to establish that our methods provide a
well-motivated approach should an approximation be required for any reason.

224

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
C



/

A
r
t

C
e

p
d

F
/

/

/

/

4
7
2
2
2
1
1
9
3
0
9
8
8
/
C


_
A
_
0
0
4
0
1
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

^^aaεφaaa_aφa_bb$$φabbφabφ$φ$^^aaεφaaa^_φ_aa_bb$$abba_φφaφab$φφ$φb$ Suresh et al. Approximating Probabilistic Models as WFA This article expands upon an earlier, shorter version (Suresh et al. 2019) by also pro- viding, beyond the additional motivating examples that have already been presented in this Introduction: an extended related work section and references; expansion of the theoretical analysis from three lemmas without proofs (omitted for space) to five lemmas (and a corollary) with full proofs; inclusion of additional algorithms for count- 英 (for general WFA source and target in addition to ϕ-WFA); inclusion of additional experiments, including some illustrating the exact KL-minimization methods available for WFA sources of different classes; and documentation of the open-source library that provides the full functionality presented here. The article is organized as follows. 在部分 2 we review previous work in this area. 在部分 3 we give the theoretical formulation of the problem and the minimum KL divergence approximation. 在部分 4 we present algorithms to compute that solution. One algorithm is for the case that the source itself is finite-state. A second algorithm is for the case when it is not and involves a sampling approach. 在部分 5 we show experiments using the approximation. 在部分 6 we briefly describe the open-source software used in our experiments. 最后, in Section 7 we discuss the results and offer conclusions. 2. Related Work In this section we will review methods both for inferring unweighted finite-state models from data and estimating the weight distribution as well in the weighted case. We start with the unweighted case. There is a long history of unweighted finite-state model inference (Parekh and Honavar 2000; Cicchello and Kremer 2003). 金子 (1967) showed that an arbitrary reg- ular set L cannot be learned, identified in the limit, strictly from the presentation of a sequence of positive examples that eventually includes each string in L. This has led to several alternative lines of attack. One approach is to include the negative examples in the sequence. Given such a complete sample, there are polynomial-time algorithms that identify a regular set in the limit (金子 1978). 例如, a prefix tree of the positive examples can be built and then states can be merged so long as they do not cause a negative example to be accepted (Oncina and Garcia 1992; Dupont 1996). Another approach is to train a recurrent neural network (RNN) on the positive and negative examples and then extract a finite automaton by quantizing the continuous state space of the RNN (Giles et al. 1992; Jacobsson 2005). A second approach is to assume a teacher is available that determines not only if a string is a positive or negative example but also if the language of the current hypothe- sized automaton equals L or, 如果不, provides a counterexample. In this case the minimal m-state DFA corresponding to L can be learned in time polynomial in m (Angluin 1987). 韦斯, Goldberg, and Yahav (2018) apply this method for DFA extraction from an RNN. A third approach is to assume a probability distribution over the (positive only) 样品. With some reasonable restrictions on the distribution, such as that the proba- bilities are generated from a weighted automaton A with L = L(A), then L is identifiable in the limit with “high probability” (Angluin 1988; Pitt 1989). There have been a variety of approaches for estimating weighted automata. A variant of the prefix tree construction can be used that merges states with sufficiently similar suffix distributions, estimated from source frequencies (Carrasco and Oncina 1994, 1999). Approaches that produce (possibly highly) non-deterministic results in- clude the Expectation-Maximization (EM) algorithm (Dempster, Laird, and Rubin 1977) 225 l 从http下载 : / / 直接的 . 米特 . 呃呃 / c o l i / 拉蒂斯 – df / / / / 4 7 2 2 2 1 1 9 3 0 9 8 8 / c o l i _ a _ 0 0 4 0 1 压力 . 来宾来访 0 7 九月 2 0 2 3 Computational Linguistics Volume 47, 数字 2 applied to fully connected Hidden Markov models or spectral methods applied to au- tomata (Balle and Mohri 2012; Balle et al. 2014). 艾斯纳 (2001) describes an algorithm for estimating probabilities in a finite-state transducer from data using EM-based methods. 韦斯, Goldberg, and Yahav (2019) and Okudono et al. (2020) provide adaptations to the Weiss, Goldberg, and Yahav (2018) DFA extraction algorithm to yield weighted automata. For approximating neural network (NN) models as WFAs, Deoras et al. (2011) used an RNN LM to generate samples that they then used to train a k-gram LM. Arisoy et al. (2014) used deep neural network (DNN) models of different orders to successively build and prune a k-gram LM with each new order constrained by the previous order. Adel et al. (2014) also trained DNNs of different orders, built a k-gram LM on the same data to obtain a topology, and then transferred the DNN probabilities of each order onto that k-gram topology. Ti ˜no and Vojtek (1997) quantized the continuous state space of an RNN and then estimated the transition probabilities from the RNN. Lecorv´e and Motlicek (2012) quantized the hidden states in an LSTM to form a finite-state model and then used an entropy criterion to back off to low-order k-grams to limit the number of transitions. See Section 4.3 for a more detailed comparison with the most closely related methods, once the details of our algorithms have been provided. Our article is distinguished in several respects from previous work. 第一的, our general approach does not depend on the form of the source distribution although we specialize our algorithms for (已知的) finite-state sources with an efficient direct construction and for other sources with an efficient sampling approach. 第二, our targets are a wide class of deterministic automata with failure transitions. These are con- siderably more general than k-gram models but retain the efficiency of determinism and the compactness failure transitions allow, which is especially important in applications with large alphabets like language modeling. 第三, we show that our approximation searches for the minimal KL divergence between the source and target distributions, given a fixed target topology provided by the application or some earlier computation. 3. Theoretical Analysis 3.1 Probabilistic Models Let Σ be a finite alphabet. Let xn 1. A probabilistic model p over Σ is a probabilistic distribution over the next symbol xn, given the previous symbols xn−1, such that2 i ∈ Σ∗ denote the string xixi+1 . . . xn and xn (西德:44) xn p(xn = x|xn−1) = 1 和 p(xn = x|xn−1) 0, ∀x ∈ Σ (西德:88) x∈Σ Without loss of generality, we assume that the model maintains an internal state q and updates it after observing the next symbol.3 Furthermore, the probability of the subsequent state just depends on the state q p (西德:0)xn i+1|希(西德:1) = p (西德:0)xn i+1|q(希)(西德:1) 2 We define x0 (西德:44) (西德:15), the empty string, and adopt p((西德:15)) = 0. 3 In the most general case, q(xn ) = xn. 226 l 从http下载 : / / 直接的 . 米特 . 呃呃 / c o l i / 拉蒂斯 – df / / / / 4 7 2 2 2 1 1 9 3 0 9 8 8 / c o l i _ a _ 0 0 4 0 1 压力 . 来宾来访 0 7 九月 2 0 2 3 Suresh et al. Approximating Probabilistic Models as WFA for all i, n, 希, xn i+1, where q(希) is the state that the model has reached after observing sequence xi. Let Q(p) be the set of possible states. Let the language L(p) ⊆ Σ∗ defined by the distribution p be L(p) (西德:44) {xn ∈ Σ∗ : p(xn) > 0 and xn = $ and xi (西德:54)= $, ∀ i < n} (2) The symbol $ is used as a stopping criterion. Further, for all xn ∈ Σ∗ such that xn−1 = $, p(xn|xn−1) = 0. The KL divergence between the source model ps and the target model pa is given by D(ps||pa) = (cid:88) xn∈Σ∗ ps(xn) log ps(xn) pa(xn) (3) where for notational simplicity, we adopt the notion 0/0 = 1 and 0 log(0/0) = 0 throughout the article. Note that for the KL divergence to be finite, we need L(ps) ⊆ L(pa). We will assume throughout that the source entropy H(ps) = − (cid:80) xn ps(xn) log ps(xn) is finite.4 We first reduce the KL divergence between two models as follows (cf. Carrasco 1997; Cortes et al. 2008). In the following, let q∗ denote the states of the probability distribution p∗. Lemma 1 If L(ps) ⊆ L(pa), then D(ps||pa) = (cid:88) (cid:88) (cid:88) qa∈Qa qs∈Qs x∈Σ γ(qs, qa) ps(x|qs) log ps(x|qs) pa(x|qa) ∞ (cid:88) (cid:88) γ(qs, qa) = ps(xi) i=0 xi:qs(xi )=qs,qa(xi )=qa where Proof. D(ps||pa) = = = = ∞ (cid:88) (cid:88) n=1 xn ∞ (cid:88) (cid:88) n=1 xn ps(xn) log ps(xn) pa(xn) ps(xn) n (cid:88) i=1 log ps(xi|xi−1) pa(xi|xi−1) ∞ (cid:88) n (cid:88) (cid:88) n=1 i=1 xn ∞ (cid:88) n (cid:88) (cid:88) n=1 i=1 xn ps(xn) log ps(xi|xi−1) pa(xi|xi−1) ps(xi−1)ps(xi|xi−1)ps(xn i+1|xi) log ps(xi|xi−1) pa(xi|xi−1) 4 If |Q(ps )| is finite, it can be shown that H(ps ) is necessarily finite. (4) (5) 227 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 7 2 2 2 1 1 9 3 0 9 8 8 / c o l i _ a _ 0 0 4 0 1 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 47, Number 2 = = ∞ (cid:88) (cid:88) i=1 xi−1 ∞ (cid:88) (cid:88) i=1 xi−1 ps(xi−1) ps(xi−1) (cid:88) xi (cid:88) xi ps(xi|xi−1) log ps(xi|xi−1) pa(xi|xi−1) · (cid:88) (cid:88) n≥i xn i+1 ps(xn i+1|xi) ps(xi|xi−1) log ps(xi|xi−1) pa(xi|xi−1) By definition, the probability of the next symbol conditioned on the past just depends on the state. Hence grouping terms corresponding to same states both in s and t yields ∞ (cid:88) (cid:88) i=1 xi−1 ps(xi−1) (cid:88) xi ps(xi|xi−1) log ps(xi|xi−1) pa(xi|xi−1) = = = ∞ (cid:88) (cid:88) i=1 xi−1 ps(xi−1) (cid:88) xi ps(xi|qs(xi−1)) log ps(xi|qs(xi−1)) pa(xi|qa(xi−1)) (cid:88) (cid:88) ∞ (cid:88) (cid:88) qa∈Qa qs∈Qs i=1 xi−1:qs(xi−1 )=qs,qa(xi−1 )=qa ps(xi−1) (cid:88) xi ps(xi|qs) log ps(xi|qs) pa(xi|qa) (cid:88) (cid:88) (cid:88) qa∈Qa qs∈Qs xi γ(qs, qa) ps(xi|qs) log ps(xi|qs) pa(xi|qa) Replacing xi by x yields the lemma. (cid:3) The quantity γ(qs, qa) counts each string xi that reaches both state qs in Qs and state qa in Qa weighted by its probability according to ps. Equivalently, it counts each string xi reaching state (qs, qa) in Qs × Qa (an interpretation we develop in Section 4.1.1).5 Note that γ does not depend on distribution pa. The following corollary is useful for finding the probabilistic model pa that has the minimal KL divergence from a model ps. Corollary 1 Let P be a set of probabilistic models pa for which L(ps) ⊆ L(pa). Then argmin pa∈P D(ps||pa) = argmax pa∈P (cid:88) (cid:88) qa∈Qa x∈Σ c(x, qa) log pa(x|qa) where c(x, qa) = (cid:88) qs∈Qs γ(qs, qa) ps(x|qs) (6) 5 γ is not (necessarily) a probability distribution over Qs × Qa. In fact, γ(qs, qa ) can be greater than 1. 228 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 7 2 2 2 1 1 9 3 0 9 8 8 / c o l i _ a _ 0 0 4 0 1 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Suresh et al. Approximating Probabilistic Models as WFA Proof. By Lemma 1 argmin pa∈P D(ps||pa) = argmin pa∈P = argmin pa∈P qs∈Qs x∈Σ qa∈Qa   (cid:88) (cid:88) (cid:88)  qa∈Qa qs∈Qs x∈Σ (cid:88) (cid:88) (cid:88) γ(qs, qa) ps(x|qs) log ps(x|qs) pa(x|qa) γ(qs, qa) ps(x|qs) log ps(x|qs) (cid:88) (cid:88) (cid:88) − γ(qs, qa) ps(x|qs) log pa(x|qa) (7)    qa∈Qa (cid:88) qs∈Qs (cid:88) x∈Σ c(x, qa) log pa(x|qa) = argmax pa∈P qa∈Qa x∈Σ since the first term in Equation (7) does not depend on pa. (cid:3) The quantity c(xi, qa) counts each string xi that reaches a state (qs, qa) in Qs × {qa} by xi−1 1 weighted by its probability according to ps (cf. Equation (18)). 3.2 Weighted Finite Automata A weighted finite automaton A = (Σ, Q, E, i, f ) over R+ is given by a finite alphabet Σ, a finite set of states Q, a finite set of transitions E ⊆ Q × Σ × R+ × Q, an initial state i ∈ Q, and a final state f ∈ Q. A transition e = (p[e], (cid:96)[e], w[e], n[e]) ∈ E represents a move from a previous (or source) state p[e] to the next (or destination) state n[e] with the label (cid:96)[e] and weight w[e]. The transitions with previous state q are denoted by E[q] and the labels of those transitions as L[q]. A deterministic WFA has at most one transition with a given label leaving each state. An unweighted (finite) automaton is a WFA that satisfies w[e] = 1, ∀e ∈ E. A probabilistic (or stochastic) WFA satisfies (cid:88) e∈E[q] w[e] = 1 and w[e] ≥ 0, ∀q ∈ Q − {f } Transitions e1 and e2 are consecutive if n[ei] = p[ei+1]. A path π = e1 · · · en ∈ E∗ is a finite sequence of consecutive transitions. The previous state of a path we denote by p[π] and the next state by n[π]. The label of a path is the concatenation of its transition labels (cid:96)[π] = (cid:96)[e1] · · · (cid:96)[en]. The weight of a path is obtained by multiplying its transition weights w[π] = w[e1] × · · · × w[en]. For a non-empty path, the i-th transition is denoted by πi. P(q, q(cid:48)) denotes the set of all paths in A from state q to q(cid:48). We extend this to sets in the obvious way: P(q, R) denotes the set of all paths from state q to q(cid:48) ∈ R and so forth. A path π is successful if it is in P(i, f ) and in that case the automaton is said to accept the input string α = (cid:96)[π]. The language accepted by an automaton A is the regular set L(A) = {α ∈ Σ∗ : α = (cid:96)[π], π ∈ P(i, f )}. The weight of α ∈ L(A) assigned by the automaton is A(α) = Σπ∈P(i,f ): (cid:96)[π]=αw[π]. Similar to Equation (2), we assume a symbol $ ∈ Σ such that L(A) ⊆ {xn ∈ Σ∗ : xn = $ and xi (cid:54)= $, ∀ i < n} Thus all successful paths are terminated by the symbol $. 229 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 7 2 2 2 1 1 9 3 0 9 8 8 / c o l i _ a _ 0 0 4 0 1 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 47, Number 2 For a symbol x ∈ Σ and a state q ∈ Q of a deterministic, probabilistic WFA A, define a distribution pa(x|q) (cid:44) w if (q, x, w, q(cid:48)) ∈ E and pa(x|q) (cid:44) 0 otherwise. Then pa is a probabilistic model over Σ as defined in the previous section. If A = (Σ, Q, E, i, f ) is an unweighted deterministic automaton, we denote by P (A) the set of all probabilistic models pa representable as a weighted WFA ˆA = (Σ, Q, ˆE, i, f ) with the same topology as A where ˆE = {(q, x, pa(x|q), q(cid:48)) : (q, x, 1, q(cid:48)) ∈ E}. Given an unweighted deterministic automaton A, our goal is to find the target dis- tribution pa ∈ P (A) that has the minimum KL divergence from our source probability model ps. Lemma 2 If L(ps) ⊆ L(A), then where Proof. From Corollary 1 argmin pa∈P(A) D(ps||pa) = ˜p(x|qa) (cid:44) c(x, qa) c(qa) (8) c(qa) = (cid:88) x∈Σ c(x, qa) argmin pa∈P(A) D(ps||pa) = argmax pa∈P(A) = argmax pa∈P(A) (cid:88) (cid:88) qa∈Qa (cid:88) x∈Σ (cid:88) qa∈Qa x∈Σ c(x, qa) log pa(x|qa) c(qa) ˜p(x|qa) log pa(x|qa) = argmin pa∈P(A) (cid:88) (cid:110) − (cid:88) c(qa) qa∈Qa x∈Σ ˜p(x|qa) log pa(x|qa) (cid:111) The quantity in braces is minimized when pa(x|qa) = ˜p(x|qa) because it is the cross entropy between the two distributions. Since L(ps) ⊆ L(A), it follows that ˜p ∈ P (A). (cid:3) 3.3 Weighted Finite Automata with Failure Transitions A weighted finite automaton with failure transitions (ϕ-WFA) A = (Σ, Q, E, i, f ) is a WFA extended to allow a transition to have a special failure label denoted by ϕ. Then E ⊆ Q × (Σ ∪ {ϕ}) × R+ × Q. A ϕ transition does not add to a path label; it consumes no input.6 However, it is followed only when the input cannot be read immediately. Specifically, a path e1 · · · en in a ϕ-WFA is disallowed if it contains a subpath ei · · · ej such that (cid:96)[ek] = ϕ for all k, i ≤ k < j, and there is another transition e ∈ E such that p[ei] = p[e] and (cid:96)[ej] = (cid:96)[e] ∈ Σ (see Figure 3). Since the label x = l[ej] can be read on e, we do not follow the failure transitions to read it on ej as well. 6 In other words, a ϕ label, like an (cid:15) label, acts as an identity element in string concatenation (Allauzen and Riley 2018). 230 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 7 2 2 2 1 1 9 3 0 9 8 8 / c o l i _ a _ 0 0 4 0 1 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Suresh et al. Approximating Probabilistic Models as WFA Figure 3 The (dashed red) path ei = (qi, ϕ, ωi, qi+1) to ej = (qj, x, ωj, qj+1) is disallowed since x can be read already on e = (qi, x, ω, q). We use P∗(q, q(cid:48)) ⊆ P(q, q(cid:48)) to denote the set of (not dis-) allowed paths from state q to q(cid:48) in a ϕ-WFA. This again extends to sets in the obvious way. A path π is successful in a ϕ-WFA if π ∈ P∗(i, f ) and only in that case is the input string α = (cid:96)[π] accepted. The language accepted by the ϕ-automaton A is the set L(A) = {α ∈ Σ∗ : α = (cid:96)[π], π ∈ P∗(i, f )}. The weight of α ∈ Σ∗ assigned by the automaton is A(α) = Σπ∈P∗(i,f ): (cid:96)[π]=αw[π]. We assume each string in L(A) is terminated by the symbol $ as before. We also assume there are no ϕ-labeled cycles and there is at most one exiting failure transition per state. We express the ϕ-extended transitions leaving q as (cid:110) E∗[q] = (q, x, ω, q(cid:48)) : (cid:111) π ∈ P∗(q, Q), x = (cid:96)[π] = (cid:96)[π|π|] ∈ Σ, ω = w[π], q(cid:48) = n[π] This is a set of (possibly new) transitions (q, x, ω, q(cid:48)), one for each allowed path from previous state q to next state q(cid:48) with optional leading failure transitions and a final x-labeled transition. Denote the labels of E∗[q] by L∗[q]. Note the WFA (Σ, Q, E∗, i, f ) accepts the same strings with the same weights as ϕ-WFA A and thus L(A) is regular (Allauzen and Riley 2018). A probabilistic (or stochastic) ϕ-WFA satisfies (cid:88) e∈E∗[q] w[e] = 1 and w[e] ≥ 0, ∀q ∈ Q − {f } A deterministic ϕ-WFA is backoff-complete if a failure transition from state q to q(cid:48) implies L[q] ∩ Σ ⊆ L[q(cid:48)] ∩ Σ. Further, if ϕ /∈ L[q(cid:48)], then the containment is strict: L[q] ∩ Σ ⊂ L[q(cid:48)] ∩ Σ. In other words, if a symbol can be read immediately from a state q it can also be read from a state failing (backing-off ) from q and if q(cid:48) does not have a backoff arc, then at least one additional label can be read from q(cid:48) that cannot be read from q. For example, both topologies depicted in Figure 2 have this property. We restrict our target automata to have a topology with the backoff-complete property since it will simplify our analysis, make our algorithms efficient, and is commonly found in applications. For example, backoff n-gram models, such as the Katz model, are ϕ-cycle-free, backoff- complete ϕ-WFAs (Katz 1987; Chen and Goodman 1998). For a symbol x ∈ Σ and a state q ∈ Q of a deterministic, probabilistic ϕ-WFA A, a is a prob- a at a state q a (x|q) (cid:44) 0 otherwise. Then p∗ define p∗ abilistic model over Σ as defined in Section 3.1. Note the distribution p∗ a (x|q) (cid:44) w if (q, x, w, q(cid:48)) ∈ E∗[q] and p∗ 231 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 4 7 2 2 2 1 1 9 3 0 9 8 8 / c o l i _ a _ 0 0 4 0 1 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 qiqx/ωqi+1φ/ωiqjφ/ωi+1...φ/ωj-1qj+1x/ωj Computational Linguistics Volume 47, Number 2 is defined over the ϕ−extended transitions E∗[q] where pa in the previous section is defined over the transitions E[q]. It is convenient to define a companion distribution pa ∈ a as follows:7 Given a symbol x ∈ Σ ∪ {ϕ} and state q ∈ Q, define pa(x|q) (cid:44) P (A) to p∗ a (x|q) when x ∈ L[q] ∩ Σ, pa(ϕ|q) (cid:44) 1 − (cid:80) a (x|q), and pa(x|q) (cid:44) 0 otherwise. x∈L[q]∩Σ p∗ p∗ The companion distribution is thus defined solely over the transitions E[q]. When A = (Σ, Q, E, i, f ) is an unweighted deterministic, backoff-complete ϕ-WFA, a representable as a weighted we denote by P ∗(A) the set of all probabilistic models p∗ ϕ-WFA ˆA = (Σ, Q, ˆE, i, f ) of same topology as A with ˆE ={(q, x, pa(x|q), q(cid:48)) : (q, x, 1, q(cid:48)) ∈ E, x ∈ Σ} ∪ {(q, ϕ, α(q, q(cid:48)), q(cid:48)) : (q, ϕ, 1, q(cid:48)) ∈ E} where pa ∈ P (A) is the companion distribution to p∗ the weight of the failure transition from state q to q(cid:48) with a and α(q, q(cid:48)) = pa(ϕ|q)/d(q, q(cid:48)) is d(q, q(cid:48)) = 1 − (cid:88) pa(x|q(cid:48)) x∈L[q]∩Σ (9) Note that we have specified the weights on the automaton that represents p∗ a ∈ P ∗(A) entirely in terms of the companion distribution pa ∈ P (A). The failure transi- tion weight α is determined from the requirement that (cid:80) x E∗[q] = 1 (Katz 1987, Equation (16)). Thanks to the backoff-complete property, this transition weight depends only on its previous state q and its next state q(cid:48) and not all the states in the failure path from q. This is a key reason for assuming that property; otherwise our algorithms could not benefit from this locality. Conversely, each distribution pa ∈ P (A) can be associated with a distribution p∗ a ∈ P ∗(A) given a deterministic, backoff-complete ϕ-WFA A. First extend α(q, q(cid:48)) to any failure path as follows. Denote a failure path from state q to q(cid:48) by πϕ(q, q(cid:48)). Then define α(q, q(cid:48)) = (cid:89) e∈πϕ(q,q(cid:48) ) pa(ϕ|p[e]) d(p[e], n[e]) (10) where this quantity is taken to be 1 when the failure path is empty (q = q(cid:48)). Finally define (cid:40) p∗ a (x|q) = α(q, qx) pa(x|qx), x ∈ L∗[q] otherwise 0, (11) where for x ∈ L∗[q], qx signifies the first state q(cid:48) on a ϕ-labeled path in A from state q for which x ∈ L[q(cid:48)]. For Equation (10) to be well-defined, we need d(p[e], n[e]) > 0. To ensure this condi-

的, we restrict P (A) to contain distributions such that pa(X|q) (西德:15) for each x ∈ L[q].8

Given an unweighted deterministic, backoff-complete, automaton A, our goal is to
a ∈ P ∗(A) that has the minimum KL divergence from our

find the target distribution p∗
source probability model ps.

7 The meaning of P (A) when A is ϕ-WFA is to interpret it as a WFA with the failure labels as regular

symbols.

8 For brevity, we do not include (西德:15) in the notation of P (A).

232

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
C



/

A
r
t

C
e

p
d

F
/

/

/

/

4
7
2
2
2
1
1
9
3
0
9
8
8
/
C


_
A
_
0
0
4
0
1
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

Suresh et al.

Approximating Probabilistic Models as WFA

Lemma 3
Assume L(ps) ⊆ L(A). Let p∗ = ˜p(X|qa) from Lemma 2. If L(p∗) ⊆ L(A) and p∗ ∈ P ∗(A)
然后

p∗ = argmin
a ∈P∗(A)
p∗

D(ps||p∗
A )

Proof. This follows immediately from Lemma 2.

(西德:3)

The requirement that the p∗ of Lemma 3 is in P ∗(A) will be true if, 例如, 这
target has no failure transitions or if the source and target are both ϕ-WFAs with the
same topology and failure transitions. 一般来说, this requirement cannot be assured.
While membership in P (A) principally requires the weights of the transitions leaving
a state are non-negative and sum to 1, membership in P ∗(A) imposes additional
constraints due to the failure transitions, indicated in Equation (11).

像这样, we restate our goal in terms of the companion distribution pa ∈ P (A)
a ∈ P ∗(A) 直接地. Let Bn(q) be the set of

rather than its corresponding distribution p∗
states in A that back off to state q in n failure transitions and let B(q) = (西德:83)|Qa|

n=0 Bn(q).

Lemma 4
If L(ps) ⊆ L(A) 然后

argmin
a ∈P∗(A)
p∗

D(ps||p∗

A ) = argmax
pa∈P(A)

(西德:88)

(西德:26) (西德:88)

q∈Qa

x∈L[q]

C(X, q) log pa(X|q) -

C(ϕ, q0) log d(q0, q)

(西德:27)

(西德:88)

q0∈B1(q)

在哪里

C(X, q) =

(西德:88)

C(X, qa)1q=qx
A ,

x ∈ Σ

C(ϕ, q) =

qa∈B(q)
(西德:88)

(西德:88)

qa∈B(q)

x∈Σ

C(X, qa)1x/∈L[q]

(12)

(13)

and do not depend on pa.

Proof. From Corollary 1, 方程 (11) and the previously shown 1:1 correspondence
between each distribution p∗

a ∈ P ∗(A) and its companion distribution pa ∈ P (A)

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
C



/

A
r
t

C
e

p
d

F
/

/

/

/

4
7
2
2
2
1
1
9
3
0
9
8
8
/
C


_
A
_
0
0
4
0
1
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

argmin
a ∈P∗(A)
p∗

D(ps||p∗

A ) = argmin
a ∈P∗(A)
p∗

(西德:88)

(西德:88)

C(X, qa) log p∗

A (X|qa)

qa∈Qa
(西德:88)

x∈L∗[qa]
(西德:88)

C(X, qa) log α(qa, qx

A )pa(X|qx
A )

= argmax
pa∈P(A)

= argmax
pa∈P(A)

= argmax
pa∈P(A)

qa∈Qa

x∈L∗[qa]

(西德:88)

(西德:88)

C(X, qa) 日志

(西德:89)

x∈L∗[qa]

qa∈Qa
(西德:26)

Ax + Aϕ − Ad

(西德:27)

e∈πϕ(qa,qx
A )

pa(ϕ|p[e])
d(p[e], n[e])

pa(X|qx
A )

(14)

233

计算语言学

体积 47, 数字 2

where we distribute the factors inside the logarithm in Equation (14) as follows:

(西德:88)

(西德:88)

Ax =

C(X, qa) log pa(X|qx
A )

qa∈Qa
(西德:88)

x∈L∗[qa]
(西德:88)

(西德:88)

C(X, qa)1q=qx

a log pa(X|q)

(15)

q∈Qa
(西德:88)

qa∈B(q)
(西德:88)

x∈L[q]∩Σ

C(X, q) log pa(X|q)

q∈Qa

x∈L[q]∩Σ

=

=

方程 (15) follows from q = qx

a implying qa ∈ B(q).

(西德:88)

(西德:88)

Aϕ =

C(X, qa) 日志

(西德:89)

pa(ϕ|p[e])

qa∈Qa
(西德:88)

x∈L∗[qa]
(西德:88)

C(X, qa)

e∈πϕ(qa,qx
A )
(西德:88)

log pa(ϕ|p[e])

qa∈Qa
(西德:88)

x∈L∗[qa]
(西德:88)

(西德:88)

e∈πϕ(qa,qx
A )
(西德:88)

C(X, qa)

1q=p[e] log pa(ϕ|q)

q∈Qa
(西德:88)

qa∈B(q)
(西德:88)

x∈L∗[qa]
(西德:88)

e∈πϕ(qa,qx
A )

C(X, qa)1x/∈L[q] log pa(ϕ|q)

qa∈B(q)

x∈Σ

C(ϕ, q) log pa(ϕ|q)

q∈Qa
(西德:88)

q∈Qa

=

=

=

=

方程 (16) follows from e ∈ πϕ(qa, qx

A ) implying x /∈ p[e].

(西德:88)

(西德:88)

Ad =

C(X, qa) 日志

(西德:89)

d(p[e], n[e])

qa∈Qa
(西德:88)

x∈L∗[qa]
(西德:88)

C(X, qa)

e∈πϕ(qa,qx
A )
(西德:88)

log d(p[e], n[e])

=

=

=

=

qa∈Qa
(西德:88)

x∈L∗[qa]
(西德:88)

(西德:88)

e∈πϕ(qa,qx
A )
(西德:88)

C(X, qa)

q∈Qa
(西德:88)

qa∈B(q0 )
(西德:88)

q0∈B1(q)
(西德:88)

x∈L∗[qa]
(西德:88)

q∈Qa
(西德:88)

qa∈B(q0 )
(西德:88)

q∈Qa

q0∈B1(q)

q0∈B1(q)

x∈Σ

C(ϕ, q0) log d(q0, q)

(西德:88)

e∈πϕ(qa,qx
A )

1q0=p[e] log d(q0, q)

C(X, qa)1x/∈L[q0] log d(q0, q)

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
C



/

A
r
t

C
e

p
d

F
/

(16)

/

/

/

4
7
2
2
2
1
1
9
3
0
9
8
8
/
C


_
A
_
0
0
4
0
1
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

Substituting these results into Equation (14) proves the lemma.

(西德:3)

If there are no failure transitions in the target automaton, then C(X, q) = c(X, q), if x
is in Σ, 并且是 0 否则. 在这种情况下, the statement of Lemma 4 simplifies to that of
Corollary 1.

234

Suresh et al.

Approximating Probabilistic Models as WFA

很遗憾, we do not have a closed-form solution, analogous to Lemma 2 或者
Lemma 3, for the general ϕ-WFA case. Instead we will present numerical optimization
algorithms to find the KL divergence minimum in the next section. This is aided by
the observation that the quantity in braces in the statement of Lemma 4 depends on
the distribution pa only at state q. Thus the KL divergence minimum can be found by
maximizing that quantity independently for each state.

4. Algorithms

Approximating a probabilistic source algorithmically as a WFA requires two steps: (1)
compute the quantity c(X, qa) found in Corollary 1 and Lemma 2 or C(X, q) in Lemma 4
和 (2) use this quantity to find the minimum KL divergence solution. The first step,
which we will refer to as counting, is covered in the next section and the KL divergence
minimization step is covered afterwards, followed by an explicit comparison of the
presented algorithms with closely related prior work.

4.1 Counting

How the counts are computed will depend on the form of the source and target models.
We break this down into several cases.

4.1.1 WFA Source and Target. When the source and target models are represented as WFAs
we compute c(X, qa) from Lemma 2. From Equation (6) this can be written as

C(X, qa) =

(西德:88)

qs∈Qs

C(qs, qa)ps(X|qs)

(17)

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
C



/

A
r
t

C
e

p
d

F
/

/

/

/

4
7
2
2
2
1
1
9
3
0
9
8
8
/
C

在哪里


(西德:88)

(西德:88)

C(qs, qa) =

ps(希)

i=0

希:qs(希 )=qs,qa(希 )=qa

The quantity γ(qs, qa) can be computed as

C(qs, qa) =

(西德:88)

w[圆周率]

π∈PS∩A((是,ia ),(qs,qa ))


_
A
_
0
0
4
0
1
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

where S ∩ A is the weighted finite-state intersection of automata S and A (Mohri 2009).
The above summation over this intersection is the (generalized) shortest distance from
the initial state to a specified state computed over the positive real semiring (Mohri
2002; Allauzen and Riley 2018). Algorithms to efficiently compute the intersection and
shortest distance on WFAs are available in OpenFst (Allauzen et al. 2007), an open-
source WFA library.

Then from Equation (17) we can form the sum

C(X, qa) =

(西德:88)

C(qs, qa) w

((qs,qa ),X,w,(q(西德:48)

s ,q(西德:48)

A ))∈ES∩A

(18)

235

计算语言学

体积 47, 数字 2

方程 (18) is the weighted count of the paths in S ∩ A that begin at the initial state
and end in any transition leaving a state (qs, qa) labeled with x.

The worst-case time complexity for the counting step is dominated by the short-
est distance algorithm on the intersection S ∩ A. The shortest distance computation
is a meta-algorithm that depends on the queue discipline selected (Mohri 2002). 如果
s is the maximum number of times a state in the intersection is inserted into the
shortest distance queue, C the maximum cost of a queue operation, and D the max-
imum out-degree in S and A (both assumed deterministic), then the algorithm runs
in O(s(D + C)|QS||QA|) (Mohri 2002; Allauzen and Riley 2018). The worst-case space
complexity is in O(D|QS||QA|), determined by the intersection size.

4.1.2 ϕ-WFA Source and Target. When the source and target models are represented as
ϕ-WFAs we compute C(X, qa) from Lemma 4. From Equation (12) and the previous case
this can be written as

C(X, q) =

(西德:88)

(西德:88)

qa∈B(q)

qs∈Qs

C(qs, qa)ps(X|qs)1q=qx
A ,

x ∈ Σ

(19)

To compute this quantity we first form S ∩ A using an efficient ϕ-WFA intersection that
compactly retains failure transitions in the result as described in Allauzen and Riley
(2018). 方程 (19) is the weighted count of the paths in S ∩ A allowed by the failure
transitions that begin at the initial state and end in any transition leaving a state (qs, q)
labeled with x.

We can simplify this computation by the following transformation. First we con-
vert S ∩ A to an equivalent WFA by replacing each failure transition with an epsilon
transition and introducing a negatively weighted transition to compensate for formerly
disallowed paths (Allauzen and Riley 2018). The result is then promoted to a transducer
T with the output label used to keep track of the previous state in A of the compensated

S ∩ A

时间

数字 4
A ϕ-WFA is transformed into an equivalent WFA by replacing each failure transition by an
(西德:15)-过渡. To compensate for the formerly disallowed paths, 新的 (dashed red) negatively
weighted transitions are added. The result is promoted to a transducer T with the output label
used to keep track of the previous state in A of the compensated positive transition.

236

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
C



/

A
r
t

C
e

p
d

F
/

/

/

/

4
7
2
2
2
1
1
9
3
0
9
8
8
/
C


_
A
_
0
0
4
0
1
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

(qs,qa)x/ω(qs’,qa’)φ/αx/ν(qs,qa)X:qa/ω(qs’,qa’)ε:-/αx:qa’/-α νx:qa’/ν

Suresh et al.

Approximating Probabilistic Models as WFA

positive transition (见图 4).9 Algorithms to efficiently compute the intersection
and shortest distance on ϕ-WFAs are available in the OpenGrm libraries (Allauzen and
Riley 2018).
然后

C(X, q) =

(西德:88)

γT(qs, qa)w,

x ∈ Σ

(20)

((qs,qa ),X,q,w,(q(西德:48)

s ,q(西德:48)

A ))∈ET

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
C



/

A
r
t

C
e

p
d

F
/

/

/

/

4
7
2
2
2
1
1
9
3
0
9
8
8
/
C


_
A
_
0
0
4
0
1
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

where e = (p[e], il[e], ol[e], w[e], n[e]) is a transition in T and γT(qs, q) is the shortest dis-
tance from the initial state to (qs, qa) in T computed over the real semiring as described in
Allauzen and Riley (2018). 方程 (20) is the weighted count of all paths in S ∩ A that
begin at the initial state and end in any transition leaving a state (qs, q) labeled with x
minus the weighted count of those paths that are disallowed by the failure transitions.
最后, we compute C(ϕ, q) as follows. The count mass entering a state q must equal

the count mass leaving a state

(西德:88)

C(X, qa) =

(西德:88)

C(X(西德:48), q)

(qa,X,1,q)∈EA

(q,X(西德:48),1,q(西德:48)

A )∈EA
(西德:88)

=

(q,X(西德:48),1,q(西德:48)

A )∈EA,X(西德:48)∈Σ

C(X(西德:48), q) + C(ϕ, q)

因此

C(ϕ, q) =

(西德:88)

C(X, qa) -

(西德:88)

C(X(西德:48), q)

(qa,X,1,q)∈EA

(q,X(西德:48),1,q(西德:48)

A )∈EA,X(西德:48)∈Σ

This quantity can be computed iteratively in the topological order of states with respect
to the ϕ-labeled transitions.

The worst-case time and space complexity for the counting step for ϕ-WFAs is the

same as for WFAs (Allauzen and Riley 2018).

4.1.3 Arbitrary Source and ϕ-WFA Target. 在某些情况下, the source is a distribution with
possibly infinite states, 例如, LSTMs. For these sources, computing C(X, q) 能
be computationally intractable as Equation (19) requires a summation over all possible
states in the source machine, Qs. We propose to use a sampling approach to approximate
C(X, q) for these cases. Let x(1), X(2), . . . , X(氮) be independent random samples from ps.
Instead of C(X, q), we propose to use

ˆC(X, q) =

(西德:88)

(西德:88)

qa∈B(q)

qs∈Qs

ˆγ(qs, qa)ps(X|qs)1q=qx
A ,

x ∈ Σ

9 The construction illustrated in Figure 4 is sufficient when S ∩ A is acyclic. In the cyclic case a slightly

modified construction is needed to ensure convergence in the shortest distance calculation (Allauzen and
Riley 2018).

237

计算语言学

体积 47, 数字 2

在哪里

ˆγ(qs, qa) = 1


(西德:88)

(西德:88)

j=1

i≥1

1qs(希(j))=qs,qa(希(j))=qa

Observe that in expectation,

乙[ ˆγ(qs, qa)] = 1


(西德:88)

(西德:88)

j=1

i≥1

乙[1qs(希(j))=qs,qa(希(j))=qa]

(西德:88)

=

(西德:88)

ps(希)

i≥1

希:qs(希 )=qs,qa(希 )=qa

= γ(qs, qa),

and hence ˆγ(qs, qa) is an unbiased, asymptotically consistent estimator of γ(qs, qa). 给定
ˆC(X, q), we compute C(ϕ, q) similarly to the previous section. 如果 (西德:96) is the expected number
of symbols per sample, then the computational complexity of counting in expectation is
in O(氮(西德:96)|Σ|).

4.2 KL Divergence Minimization
4.2.1 WFA Target. When the target topology is a deterministic WFA, we use c(X, qa) 从
the previous section and Lemma 2 to immediately find the minimum KL divergence
解决方案.

4.2.2 ϕ-WFA Target. When the target topology is a deterministic, backoff-complete
ϕ-WFA, Lemma 3 can be applied in some circumstances to find the minimum KL diver-
gence solution but not in general. 然而, as noted before, the quantity in braces in
the statement of Lemma 4 depends on the distribution pa only at state q so the minimum
KL divergence D(ps||p∗
A ) can be found by maximizing that quantity independently for
each state.

(西德:44) pa(X|q) for x ∈ L[q] and let y (西德:44) [yx]x∈L[q].10 Then our goal

Fix a state q and let yx

reduces to

最大精量
y

(西德:88)

x∈L[q]

C(X, q) log yx −

(西德:88)

C(ϕ, q0) 日志 (西德:0)1 -

(西德:88)

(西德:1)

yx

(21)

q0∈B1(q)

x∈L[q0]∩Σ

subject to the constraints yx ≥ (西德:15) for x ∈ L[q] 和 (西德:80)
x∈L[q]

yx = 1.

This is a difference of two concave functions in y since log(F (y)) is concave for any
linear function f (y), C(X, q), C(ϕ, q0) are always non-negative, and the sum of concave
functions is also concave. We give a DC programming solution to this optimization
(Horst and Thoai 1999). Let

Ω = {y : ∀x, yx ≥ (西德:15),

(西德:88)

x∈L(q)

yx ≤ 1}

10 We fix some total order on Σ ∪ {ϕ} so that y is well-defined.

238

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
C



/

A
r
t

C
e

p
d

F
/

/

/

/

4
7
2
2
2
1
1
9
3
0
9
8
8
/
C


_
A
_
0
0
4
0
1
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

Suresh et al.

Approximating Probabilistic Models as WFA

and let u(y) = (西德:80)
x∈L[q] C(X, q) log yx and v(y) = (西德:80)
Then the optimization problem can be written as

q0∈B1(q) C(ϕ, q0) 日志(1 - (西德:80)

x∈L[q0]∩Σ yx).

你(y) − v(y)

max
y∈Ω

The DC programming solution for such a problem uses an iterative procedure that
linearizes the subtrahend in the concave difference about the current estimate and then
solves the resulting concave objective for the next estimate (Horst and Thoai 1999), 那
是,

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
C



/

A
r
t

C
e

p
d

F
/

(22)

(23)

/

/

/

4
7
2
2
2
1
1
9
3
0
9
8
8
/
C


_
A
_
0
0
4
0
1
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

yn+1 = argmax

你(y) − y · (西德:79)v(yn)

y∈Ω

Substituting u and (西德:79)v gives

yn+1 = argmax

(西德:26)

(西德:88)

y∈Ω

x∈L[q]

C(X, q) log yx + yxf (X, q, yn)

(西德:27)

在哪里

F (X, q, yn) =

(西德:88)

q0∈B1(q)

C(ϕ, q0)1x∈L[q0]∩Σ
1 - (西德:80)
X(西德:48)∈L[q0]∩Σ yn
X(西德:48)

Observe that 1 - (西德:80)
yn ∈ Ω.

X(西德:48)∈L[q0]∩Σ yn

X(西德:48) (西德:15) as the automaton is backoff-complete and

Let C(q) be defined as:

C(q) =

(西德:88)

C(X(西德:48), q)

X(西德:48)∈L[q]

The following lemma provides the solution to the optimization problem in Equa-

的 (22) that leads to a stationary point of the objective.

Lemma 5
Solution to Equation (22) is given by

yn+1
x = max

(西德:18)

C(X, q)
λ − f (X, q, yn)

(西德:19)

, (西德:15)

(西德:104)
maxx∈L[q] F (X, q, yn) + C(X, q), maxx∈L[q] F (X, q, yn) + C(q)

1-|L[q]|(西德:15)

where λ ∈
(西德:80)
x yn+1
x = 1.

(24)

(西德:105)

这样

Proof. With KKT multipliers, the optimization problem can be written as

max
y,λ,µx:µx≤0

(西德:26)

(西德:88)

x∈L[q]

C(X, q) log yx + yx f (X, q, yn)

(西德:27)

+ λ(西德:0)1 -

(西德:88)

(西德:1) +

yx

(西德:88)

µx((西德:15) − yx)

x∈L[q]

x∈L[q]

239

计算语言学

体积 47, 数字 2

We divide the proof into two cases depending on the value of C(X, q). Let C(X, q) (西德:54)= 0.
Differentiating this equation with respect to yx and equating to zero, we get

yn+1
x =

C(X, q)
λ + µx − f (X, q, yn)

此外, by the KKT condition, µx((西德:15) − yn+1
x = (西德:15) and if µx is zero, then yn+1
yn+1
for yn+1
X
can be re-expressed as Equation (24). If C(X, q) = 0, then we get

) = 0. 因此, µx is only non-zero if
λ−f (X,q,yn ) . 此外, because for all x, µx ≤ 0,
to be positive, we need λ ≥ maxx f (X, q, yn). 因此, the above two conditions

x = C(X,q)

X

F (X, q, yn) = λ + µx and µx((西德:15) − yn+1

X

) = 0

and the solution is given by yn+1
x = (西德:15) and µx = f (X, q, yn) − λ. Since µx cannot be pos-
itive, we have f (X, q, yn) ≤ λ for all x. 因此, irrespective of the value of C(X, q), 这
solution is given by Equation (24).

The above analysis restricts λ ≥ maxx f (X, q, yn). If λ < f (x, q, yn) + C(x, q), then x < 1. Hence λ needs to 1−|L[q]|(cid:15) , then (cid:80) x > 1 and if λ > maxx f (X, q, yn) + C(q)
yn+1
lie in

x yn+1

(西德:20)

max
x∈L[q]

F (X, q, yn) + C(X, q), max
x∈L[q]

F (X, q, yn) +

(西德:21)

C(q)
1 - |L[q]|(西德:15)

to ensure that (西德:80)

x yn+1

x = 1.

(西德:3)

Algorithm KL-MINIMIZATION

符号:

yx = pa(X|q) for x ∈ L(q)


• C(X, q) from Equations (12) 和 (13)
• C(q) = (西德:80)
X(西德:48)∈L[q] C(X(西德:48), q)

F (X, q, yn) from Equation (23)

k = |L[q]|
lb = maxx∈L[q] F (X, q, yn) + C(X, q)



• ub = maxx∈L[q] F (X, q, yn) + C(q)
1−k(西德:15)

(西德:15) = lower bound on yx

Trivial case: If C(q) = 0, output y given by yx = 1/k for all x.
Initialization: Initialize:

y0
x =

C(X, q)
C(q)

(1 − k(西德:15)) + (西德:15).

Iteration: Until convergence do:

yn+1
x = max

(西德:18)

C(X, q)
λ − f (X, q, yn)

(西德:19)

, (西德:15)

,

where λ ∈ [lb, ub] is chosen (in a binary search) to ensure (西德:80)

x∈L(q) yx = 1.

数字 5
KL-MINIMIZATION algorithm.

240

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
C



/

A
r
t

C
e

p
d

F
/

/

/

/

4
7
2
2
2
1
1
9
3
0
9
8
8
/
C


_
A
_
0
0
4
0
1
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

Suresh et al.

Approximating Probabilistic Models as WFA

由此, we form the KL-MINIMIZATION algorithm in Figure 5. Observe that if
all the counts are zero, then for any y, 你(y) − v(y) = 0 and any solution is an optimal
solution and the algorithm returns a uniform distribution over labels. 在其他情况下, 我们
initialize the model based on counts such that y0 ∈ Ω. We then repeat the DC program-
ming algorithm iteratively until convergence. Because Ω is a convex compact set and
functions u, v, 和 (西德:79)v are continuous and differentiable in Ω, the KL-MINIMIZATION
converges to a stationary point (Sriperumbudur and Lanckriet 2009, Theorem 4). 为了
each state q, the computational complexity of KL-MINIMIZATION is in O(|乙[q]|) 每
iteration. 因此, if the maximum number of iterations per each state is s, the overall
computational complexity of KL-MINIMIZATION is in O(s|乙|).

4.3 讨论

Our method for approximating from source ϕ-WFAs, specifically from backoff k-gram
型号, shares some similarities with entropy pruning (Stolcke 2000). Both can be used
to approximate onto a more compact k-gram topology and both use a KL-divergence
criterion to do so. They differ in that entropy pruning, by sequentially removing tran-
地点, selects the target topology and in doing so uses a greedy rather than global
entropy reduction strategy. We will empirically compare these methods in Section 5.1
for this use case.

Perhaps the closest to our work on approximating arbitrary sources is that of Deoras
等人. (2011), where they used an RNN LM to generate samples that they then used to
train a k-gram LM. 相比之下, we use samples to approximate the joint state probability
C(qs, qa). 如果 (西德:96) is the average number of words per sentence and N is the total number
of sampled sentences, then the time complexity of Deoras et al. (2011) is O(氮(西德:96)|Σ|)
as it takes O(|Σ|) time to generate a sample from the neural model for every word.
This is the same as that of our algorithm in Section 4.1.3. 然而, our approach has
several advantages. 第一的, for a given topology, our algorithm provably finds the best
KL minimized solution, whereas their approach is optimal only when the size of the
k-gram model tends to infinity. 第二, as we show in our experiments, the proposed
algorithm performs better compared with that of Deoras et al. (2011).

Arisoy et al. (2014) and Adel et al. (2014) both trained DNN models of different
orders as the means to derive probabilities for a k-gram model, rather than focusing on
model approximation as the above-mentioned methods do. 更远, the methods are
applicable to k-gram models specifically, not the larger class of target WFA to which our
methods apply.

5. 实验

We now provide experimental evidence of the theory’s validity and show its usefulness
in various applications. For the ease of notation, we use WFA-APPROX to denote the
exact counting algorithm described in Section 4.1.2 followed by the KL-MINIMIZATION
algorithm of Section 4.2. 相似地, we use WFA-SAMPLEAPPROX(氮) to denote the
sampled counting described in Section 4.1.3 with N sampled sentences followed by KL-
MINIMIZATION.

We first give experimental evidence that supports the theory in Section 5.1. 我们
then show how to approximate neural models as WFAs in Section 5.2. We also use the
proposed method to provide lower bounds on the perplexity given a target topology
in Section 5.3. 最后, we present experiments in Section 5.4 addressing scenarios with
source WFAs, hence not requiring sampling. 第一的, motivated by low-memory applica-

241

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
C



/

A
r
t

C
e

p
d

F
/

/

/

/

4
7
2
2
2
1
1
9
3
0
9
8
8
/
C


_
A
_
0
0
4
0
1
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

计算语言学

体积 47, 数字 2

tions such as (虚拟的) keyboard decoding (Ouyang et al. 2017), we use our approach
to create compact language models in Section 5.4.1. 然后, in Section 5.4.2, we use our
approach to create compact open-vocabulary character language models from count-
thresholded n-grams.

For most experiments (unless otherwise noted) 我们使用 1996 CSR Hub4 Lan-
guage Model data, LDC98T31 (English Broadcast News data). We use the processed
form of the corpus and further process it to downcase all the words and remove
punctuation. The resulting data set has 132M words in the training set, 20M words in the
test set, and has 240K unique words. For all the experiments that use word models, 我们
create a vocabulary of approximately 32K words consisting of all words that appeared
多于 50 times in the training corpus. Using this vocabulary, we create a trigram
Katz model and prune it to contain 2M n-grams using entropy pruning (Stolcke 2000).
We use this pruned model as a baseline in all our word-based experiments. We use
Katz smoothing because it is amenable to pruning (Chelba et al. 2010). The perplexity of
this model on the test set is 144.4.11 All algorithms were implemented using the open-
source OpenFst and OpenGrm n-gram and stochastic automata (SFst) libraries12 with the
last library including these implementations (Allauzen et al. 2007; Roark et al. 2012;
Allauzen and Riley 2018).

5.1 Empirical Evidence of Theory

Recall that our goal is to find the distribution on a target DFA topology that minimizes
the KL divergence to the source distribution. 然而, as stated in Section 4.2, 如果
target topology has failure transitions, the optimization objective is not convex so the
stationary point solution may not be the global optimum. We now show that the model
indeed converges to a good solution in various cases empirically.

Idempotency. When the target topology is the same as the source topology, we show that
the performance of the approximated model matches the source model. Let ps be the
pruned Katz word model described above. We approximate ps onto the same topol-
ogy using WFA-APPROX and WFA-SAMPLEAPPROX(·) and then compute perplexity
on the test corpus. The results are presented in Figure 6. The test perplexity of the
WFA-APPROX model matches that of the source model and the performance of the
WFA-SAMPLEAPPROX(氮) model approaches that of the source model as the number of
samples N increases.

Comparison to Greedy Pruning. Recall that entropy pruning (Stolcke 2000) greedily re-
moves n-grams such that the KL divergence to the original model ps is small. Let pgreedy
be the resulting model and Agreedy be the topology of pgreedy. If the KL-MINIMIZATION
converges to a good solution, then approximating ps onto Agreedy would give a model
that is at least as good as pgreedy. We show that this is indeed the case; 实际上, approximat-
ing ps onto Agreedy performs better than pgreedy. As before, let ps again be the 2M n-gram
Katz model described above. We prune it to have 1M n-grams and obtain pgreedy, 哪个
has a test perplexity of 157.4. We then approximate the source model ps on Agreedy, 这

11 For all perplexity measurements, we treat the unknown word as a single token instead of a class. 到

compute the perplexity with the unknown token being treated as class, multiply the perplexity by k0.0115,
where k is the number of tokens in the unknown class and 0.0115 is the out-of-vocabulary rate in the test
数据集.

12 These libraries are available at www.openfst.org and www.opengrm.org.

242

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
C



/

A
r
t

C
e

p
d

F
/

/

/

/

4
7
2
2
2
1
1
9
3
0
9
8
8
/
C


_
A
_
0
0
4
0
1
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

Suresh et al.

Approximating Probabilistic Models as WFA

数字 6
Idempotency: Test set perplexity for Katz baseline and approximations of that baseline trained on
the same data. The Katz baseline and Katz WFA-APPROX plots are identical (and thus overlap),
while WFA-SAMPLEAPPROX(氮) plot converges to the baseline.

桌子 1
Test perplexity of greedy pruning and approximated models.

Model size Greedy pruning Approximated model

250K
500K
1中号
1.5中号

205.7
177.3
157.4
149.0

198.3
173.0
155.7
148.4

topology of pgreedy. The resulting model has a test perplexity of 155.7, which is smaller
than the test perplexity of pgreedy. This shows that the approximation algorithm indeed
finds a good solution. We repeat the experiment with different pruning thresholds and
observe that the approximation algorithm provides a good solution across the resulting
model sizes. The results are in Table 1.

5.2 Neural Models to WFA Conversion

Since neural models such as LSTMs give improved performance over n-gram models,
we investigate whether an LSTM distilled onto a WFA model can obtain better per-
formance than the baseline WFA trained directly from Katz smoothing. As stated in
the Introduction, this could then be used together with federated learning for fast and
private on-device inference, which was done in Chen et al. (2019) using the methods
presented here.

To explore this, we train an LSTM language model on the training data. The model
有 2 LSTM layers with 1,024 states and an embedding size of 1,024. The resulting model
has a test perplexity of 60.5. We approximate this model as a WFA in three ways.

243

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
C



/

A
r
t

C
e

p
d

F
/

/

/

/

4
7
2
2
2
1
1
9
3
0
9
8
8
/
C


_
A
_
0
0
4
0
1
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

llll5e+051e+062e+065e+06140145150155# of samples (氮)test perplexitylKatz baselineWFA−ApproxWFA−SampleApprox(氮)

计算语言学

体积 47, 数字 2

数字 7
Approximated Neural Models for the English Broadcast News corpus. Test set perplexity for Katz
baseline and LSTM models approximated in three ways. One uses LSTM samples to build a new
Katz model (WFA-SAMPLEKATZ(氮)). The remaining two use our approximation algorithm
(WFA-SAMPLEAPPROX(氮)), but with different topologies. One topology (KT) is known, 使用
the baseline Katz topology, 和一个 (UT) is unknown, using the samples drawn for
WFA-SAMPLEAPPROX(氮) to also infer the topology.

The first way is to construct a Katz n-gram model on N LSTM samples and entropy-
prune to 2M n-grams, which we denote by WFA-SAMPLEKATZ(氮). This approach is
very similar to that of Deoras et al. (2011), except that we use Katz smoothing instead
of Kneser-Ney smoothing. We used Katz due to the fact that, as stated earlier, Kneser-
Ney models are not amenable to pruning (Chelba et al. 2010). The second way is to
approximate onto the baseline Katz 2M n-gram topology described above using WFA-
SAMPLEAPPROX(氮). We refer to this experiment as WFA-SAMPLEAPPROX(氮) (KT),
where KT stands for known topology. The results are shown in Figure 7. The WFA-
SAMPLEKATZ(氮) models perform significantly worse than the baseline Katz model
even at 32M samples, while the WFA-SAMPLEAPPROX(氮) models have better perplex-
ity than the baseline Katz model with as little as 1M samples. With 32M samples this
way of approximating the LSTM model as a WFA is 3.6 better in perplexity than the
baseline Katz.

最后, if the WFA topology is unknown we use the samples obtained in WFA-
SAMPLEAPPROX(·) to create a Katz model entropy-pruned to 2M n-grams. We refer
to this experiment as WFA-SAMPLEAPPROX(氮) (UT), where UT stands for unknown
topology. The results are shown in Figure 7. The approximated models obtained by
WFA-SAMPLEAPPROX(氮) (UT) perform better than WFA-SAMPLEKATZ(氮). 然而,
they do not perform as well as WFA-SAMPLEAPPROX(氮) (KT), the models obtained
with the known topology derived from the training data. 然而, with enough sam-
普莱斯, their performance is similar to that of the original Katz model.

We then compare our approximation method with other methods that approximate
DNNs with n-gram LMs (Arisoy et al. 2014; Adel et al. 2014). Both these methods require
training DNNs of different orders and we used DNNs with two layers with 1,024 节点
and an embedding size of 1,024. The results are in Table 2. The proposed algorithms
WFA-SAMPLEAPPROX(·) perform better than the existing approaches.

244

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
C



/

A
r
t

C
e

p
d

F
/

/

/

/

4
7
2
2
2
1
1
9
3
0
9
8
8
/
C


_
A
_
0
0
4
0
1
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

lllllll5e+051e+062e+065e+061e+072e+075e+07140150160170# of samples (氮)test perplexitylKatz baselineWFA−SampleKatz(氮)WFA−SampleApprox(氮) (KT)WFA−SampleApprox(氮) (UT)

Suresh et al.

Approximating Probabilistic Models as WFA

桌子 2
Test perplexity of k-gram models obtained by different approximation methods for the
English Broadcast News corpus. We use 32M samples for both WFA-SAMPLEKATZ(·) 和
WFA-SAMPLEAPPROX(·).

模型

Test perplexity

Katz
WFA-SAMPLEKATZ(·)
Arisoy et al. (2014)
Adel et al. (2014)
WFA-SAMPLEAPPROX(·) (KT)
WFA-SAMPLEAPPROX(·) (UT)

144.4
151.8
159.6
146.7
140.8
143.9

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
C



/

A
r
t

C
e

p
d

F
/

/

/

/

4
7
2
2
2
1
1
9
3
0
9
8
8
/
C

数字 8
Approximated Neural Models for the Polish (Europarl) 语料库. Test set perplexity for Katz baseline
and LSTM models approximated in three ways. One uses LSTM samples to build a new Katz
模型 (WFA-SAMPLEKATZ(氮)). The remaining two use our approximation algorithm
(WFA-SAMPLEAPPROX(氮)), but with different topologies. One topology (KT) is known, 使用
the baseline Katz topology, 和一个 (UT) is unknown, using the samples drawn for
WFA-SAMPLEAPPROX(氮) to also infer the topology.

Experiment on Polish. To repeat the above experiment on a different language and corpus,
we turn to the Polish language section13 of the Europarl corpus (科恩 2005). 我们
chose Polish for this follow-up experiment due to the fact that it belongs to a different
language family than English, is relatively highly inflected (unlike English), 并且是
found in a publicly available corpus. We use the processed form of the corpus and
further process it to downcase all the words and remove punctuation. The resulting
data set has approximately 13M words in the training set and 210K words in the test
放. The selected vocabulary has approximately 30K words, consisting of all words that
appeared more than 20 times in the training corpus. Using this vocabulary, we create a
trigram Katz model and prune it to contain 2M n-grams using entropy pruning (Stolcke
2000). We use this pruned model as a baseline. The results are in Figure 8. The trend


_
A
_
0
0
4
0
1
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

13 https://www.statmt.org/europarl/v7/pl-en.tgz.

245

llll5e+051e+062e+065e+06180200220240# of samples (氮)test perplexitylKatz baselineWFA−SampleKatz(氮)WFA−SampleApprox(氮) (KT)WFA−SampleApprox(氮) (UT)

计算语言学

体积 47, 数字 2

桌子 3
Test perplexity of k-gram models obtained by different approaches for the Polish (Europarl)
语料库. We use 32M samples for both WFA-SAMPLEKATZ(·) and WFA-SAMPLEAPPROX(·).

模型

Test perplexity

Katz
WFA-SAMPLEKATZ(·)
Arisoy et al. (2014)
Adel et al. (2014)
WFA-SAMPLEAPPROX(·) (KT)
WFA-SAMPLEAPPROX(·) (UT)

185.5
189.3
197.8
187.1
181.4
177.0

is similar to that of the English Broadcast News corpus with the proposed algorithm
WFA-SAMPLEAPPROX(·) performing better than the other methods. We also compare
the proposed algorithms with other neural approximation algorithms. The comparison
results are shown in Table 3.

5.3 Lower Bounds on Perplexity

Best Approximation for Target Topology. The neural model in Section 5.2 has a perplexity
的 60.5, but the best perplexity for the approximated model is 140.8. Is there a better
approximation algorithm for the given target topology? We place bounds on that next.
Let T be the set of test sentences. The test-set log-perplexity of a model p can be

written as

1
|时间|

(西德:88)

x∗∈T

日志

1
p(x∗)

=

(西德:88)

x∗

ˆpt(x∗) 日志

1
p(x∗)

where ˆpt is the empirical distribution of test sentences. Observe that the best model with
topology A can be computed as

p(西德:48)
a = argmin
pa∈P(A)

(西德:88)

x∗

ˆpt(x∗) 日志

1
pa(x∗)

which is the model with topology A that has minimal KL divergence from the test
distribution ˆpt. This can be computed using WFA-APPROX. If we use this approach
on the English Broadcast News test set with the 2M n-gram Katz model, the resulting
model has perplexity of 121.1, showing that, under the assumption that the algorithm
finds the global KL divergence minimum, the test perplexity with this topology cannot
be improved beyond 121.1, irrespective of the method.

Approximation onto Best Trigram Topology. If we approximate the LSTM onto the best
trigram topology, how well does it perform over the test data? 为了测试这个, we build a
trigram model from the test data and approximate the LSTM on the trigram topology.
This approximated model has 11M n-grams and a perplexity of 81. This shows that for
large data sets, the largest shortfall of n-gram models in the approximation is due to the
n-gram topology.

246

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
C



/

A
r
t

C
e

p
d

F
/

/

/

/

4
7
2
2
2
1
1
9
3
0
9
8
8
/
C


_
A
_
0
0
4
0
1
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

Suresh et al.

Approximating Probabilistic Models as WFA

5.4 WFA Sources

While distillation of neural models is an important use case for our algorithms, 有
other scenarios where approximations will be derived from WFA sources. 在这种情况下,
no sampling is required and we can use the algorithms presented in Sections 4.1.1 或者
4.1.2, 即, WFA-APPROX. 例如, it is not uncommon for applications to place
maximum size restrictions on models, so that existing WFA models are too large and
must be reduced in size prior to use. 部分 5.4.1 presents a couple of experiments
focused on character language model size reduction that were motivated by such
size restrictions. Another scenario with WFA sources is when, for privacy reasons, 不
language model training corpus is available in a domain, but only minimum-count-
thresholded (IE。, high frequency) word n-gram counts are provided. 在部分 5.4.2 我们
examine the estimation of open-vocabulary character language models from such data.

5.4.1 Creating Compact Language Models

Creating Compact Models for Infrequent Words. In low-memory applications such as on-
device keyboard decoding (Ouyang et al. 2017), it is often useful to have a character-
level WFA representation of a large set of vocabulary words that act only as unigrams,
例如, those words beyond the 32K words of our trigram model. We explore how
to compactly represent such a unigram-only model.

To demonstrate our approach, we take all the words in the training set (without a
count cutoff) and build a character-level deterministic WFA of those words weighted by
their unigram probabilities. This is represented as a tree rooted at the initial state (a trie).
This automaton has 820K transitions. Storing this many transitions can be prohibitive;
we can reduce the size in two steps.

The first step is to minimize this WFA using weighted minimization (Mohri 2000) 到
produce pchar, which has a topology Achar. Although pchar is already much smaller (it has
378K transitions, A 54% reduction), we can go further by approximating onto the min-
imal deterministic unweighted automaton, Minimize(Achar). This gives us a model with
only 283K transitions, a further 25% reduction. Because Minimize(Achar) accepts exactly
the same words as Achar, we are not corrupting our model by adding or removing any
vocabulary items. 反而, we find an estimate that is as close as possible to the original,
but that is constrained to the minimal deterministic representation that preserves the
词汇.

To evaluate this approach, we randomly select a 20K sentence subset of the original
test set, and represent each selected string as a character-level sequence. We evaluate us-
ing cross entropy in bits-per-character, common for character-level models. The result-
ing cross entropy for pchar is 1.557 bits-per-character. 通过对比, the cross entropy
for pchar approximated onto Minimize(Achar) 是 1.560 bits-per-character. In exchange for
this small accuracy loss we are rewarded with a model that is 25% 较小. Wolf-Sonkin
等人. (2019) used the methods presented here to augment the vocabulary of an on-device
keyboard to deal with issues related to a lack of standard orthography.

Creating Compact WFA Language Models. Motivated by the previous experiment, 我们
also consider applying (unweighted) minimization to Agreedy, the word-based trigram
topology that we pruned to 1M n-grams described earlier. 表中 4 we show that
applying minimization to Agreedy and then approximating onto the resulting topology
leads to a reduction of 7% in the number of transitions needed to represent the model.
然而, the test perplexity also increases some. To control for this, we prune the

247

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
C



/

A
r
t

C
e

p
d

F
/

/

/

/

4
7
2
2
2
1
1
9
3
0
9
8
8
/
C


_
A
_
0
0
4
0
1
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

计算语言学

体积 47, 数字 2

桌子 4
Test perplexity of models when approximated onto smaller topologies.

Topology

Test perplexity

# Transitions

Agreedy
Minimize(Agreedy)

greedy

A(西德:48)
Minimize(A(西德:48)

greedy)

155.7
156.4

154.1
154.9

1.13中号
1.05中号

1.22中号
1.13中号

original model to a 1.08M n-gram topology A(西德:48)
greedy instead of the 1M as before and
apply the same procedure to obtain an approximation on Minimize(A(西德:48)
greedy). We achieve
A 0.4% perplexity reduction compared to the approximation on Agreedy with very nearly
the same number of transitions.

5.4.2 Count Thresholded Data for Privacy. One increasingly common scenario that can
benefit from these algorithms is modeling from frequency thresholded substring counts
rather than raw text. 例如, word n-grams and their frequencies may be pro-
vided from certain domains of interest only when they occur within at least k separate
文件. With a sufficiently large k (说 50), no n-gram can be traced to a specific
文档, thus providing privacy in the aggregation. This is known as k-anonymity
(Samarati 2001).

然而, for any given domain, there are many kinds of models that one may want
to build depending on the task, some of which may be trickier to estimate from such a
collection of word n-gram counts than with standard approaches for estimation from a
given corpus. 例如, character n-gram models can be of high utility for tasks like
language identification, and have the benefit of a relatively small memory footprint and
low latency in use. 然而, character n-gram models might be harder to learn from a
k-anonymized corpus.

Here we will compare open-vocabulary character language models, which accept
all strings in Σ∗ for a character vocabulary Σ, trained in several ways. Each approach
relies on the training corpus and 32k vocabulary, with every out-of-vocabulary word
replaced by a single OOV symbol (西德:70). 此外, for each approach we add 50 到
the unigram character count of any printable ASCII character, so that even those that
are unobserved in the words of our 32k vocabulary have some observations. Our three
approaches are:

1.

Baseline corpus trained models: We count character 5-grams from the
k-anonymized corpus, then remove all n-grams that include the (西德:70) 象征
(in any position) prior to smoothing and normalization. Here we present
both Kneser-Ney and Witten Bell smoothed models, as both are popular
for character n-gram models.

2. Word trigram sampled model: First we count word trigrams in the

k-anonymized corpus and discard any n-gram with the (西德:70) 象征 (in any
位置) prior to smoothing and normalization. We then sample one
million strings from a Katz smoothed model and build a character 5-gram
model from these strings. We also use this as our target topology for the
next approach.

248

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
C



/

A
r
t

C
e

p
d

F
/

/

/

/

4
7
2
2
2
1
1
9
3
0
9
8
8
/
C


_
A
_
0
0
4
0
1
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

Suresh et al.

Approximating Probabilistic Models as WFA

3. Word trigram KL minimization estimation: We create a source model by

converting the 2M n-gram word trigram model into an open vocabulary
模型. We do this using a specialized construction related to the
construction presented in Section 6 of Chen et al. (2019), briefly described
以下, that converts the word model into a character sequence model. 作为
this model is still closed vocabulary (见下文), we additionally smooth
the unigram distribution with a character trigram model trained from the
words in the symbol table (and including the 50 extra counts for every
printable ASCII character as with the other methods). From this source
模型, we estimate a model on the sampled character 5-gram topology
from the previous approach, using our KL minimization algorithm.

Converting Word n-gram Source to Character Sequence Source Model. 简单地说, for every
state q in the n-gram automaton, the set of words labeling transitions leaving q are
represented as a trie of characters including a final end-of-word symbol. Each resulting
transition labeled with the end-of-word symbol represents the last transition for that
particular word spelled out by that sequence of transitions, hence is assigned the same
next state as the original word transition. If q has a backoff transition pointing to
its backoff state q(西德:48), then each new internal state in the character trie backs off to the
corresponding state in the character trie leaving q(西德:48). The presence of the corresponding
state in the character trie leaving q(西德:48) is guaranteed because the n-gram automaton is
backoff-complete, as discussed in Section 3.3.

As stated above, this construction converts from word sequences to character se-
序列, but will only accept character sequences consisting of strings of in-vocabulary
字, 那是, this is still closed vocabulary. To make it open vocabulary, we further back
off the character trie leaving the unigram state to a character n-gram model estimated
from the symbol table (and additional ASCII character observations). This is done using
a very similar construction to that described above. The resulting model is used as the
source model for our KL minimization algorithm, to estimate a distribution over the
sampled character 5-gram topology.

We encode the test set as a sequence of characters, without using the symbol table
because our models are intended to be open vocabulary. Following typical practice for
open-vocabulary settings, we evaluate with bits-per-character. The results are presented
表中 5. Here we achieve slightly lower bits-per-character than even what we get

桌子 5
Comparison of character 5-gram models derived from either the original corpus or a word
trigram model. Size of the models is presented in terms of the number of character n-grams, 这
numbers of states and transitions in the automaton representation, and the file size in MB. 这
two corpus estimated models have the same topology, hence the same size; as do the two word
trigram estimated models.

来源

n-grams
(x1000)

状态
(x1000)

语料库

336

Word trigram

277

60

56

transitions MB

Estimation

bits/char

(x1000)

381

322

6.5

5.6

Kneser-Ney
Witten-Bell (WB)

Sampled (WB)
KL min

2.04
2.01

2.36
1.99

249

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
C



/

A
r
t

C
e

p
d

F
/

/

/

/

4
7
2
2
2
1
1
9
3
0
9
8
8
/
C


_
A
_
0
0
4
0
1
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

计算语言学

体积 47, 数字 2

straight from the corpus, perhaps due to better regularization of the word-based model
than with either Witten-Bell or Kneser-Ney on the character n-grams.

6. Software Library

All of the algorithms presented here are available in the OpenGrm libraries at http://
www.opengrm.org under the topic: SFST Library: operations to normalize, sample, combine,
and approximate stochastic finite-state transducers. We illustrate the use of this library by
showing how to implement some of the experiments in the previous section.

6.1 Example Data and Models

Instead of Broadcast News, we will use the text of Oscar Wilde’s The Importance of Being
Earnest for our tutorial example. This is a small tutorial corpus that we make available
在http://sfst.opengrm.org.

This corpus of approximately 1,688 sentences and 18,000-words has been upper-
cased and had punctuation removed. 第一个 850 sentences were used as training data
和剩余的 838 sentences used as test data. 从这些, we produce two 1,000-
word vocabulary Katz bigram models, the ∼ 6k n-gram earnest train.mod and the ∼ 4k
n-gram earnest test.mod. We also used relative-entropy pruning to create the ∼ 2k
n-gram earnest train.pru. The data, the steps to generate these models, and how to
compute their perplexity using OpenGrm NGram are all fully detailed in the QuickTour
topic at http://sfst.opengrm.org.

6.2 Computing the Approximation

The following step shows how to compute the approximation of a ϕ-WFA model onto
a ϕ-WFA topology. In the example below, the first argument, earnest train.mod, 是个
source model, and the second argument, earnest train.pru, provides the topology.
The result is a ϕ-WFA whose perplexity can be measured as before.

$ sfstapprox -phi label=0 earnest train.mod earnest train.pru \ >earnest train.approx An alternative, equivalent way to perform this approximation is to break it into two steps, with the counting and normalization (KL divergence minimization) done separately. $ sfstcount -phi label=0 earnest train.mod earnest train.pru \

>earnest train.approx cnts

$ sfstnormalize -method=kl min -phi label=0 \

earnest train.approx cnts >earnest train.approx

We can now use these utilities to run some experiments analogous to our larger
Broadcast News experiments by using different target topologies. 结果显示
表中 6. We see in the idempotency experiment that the perplexity of the approxi-
mation on the same topology matches the source. In the greedy-pruning experiment,
the approximation onto the greedy-pruned topology yields a better perplexity than the
greedily-pruned model. 最后, the approximation onto the test-set bigram topology
gives a better perplexity than the training-set model since we include all the relevant
二元组.

250

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
C



/

A
r
t

C
e

p
d

F
/

/

/

/

4
7
2
2
2
1
1
9
3
0
9
8
8
/
C


_
A
_
0
0
4
0
1
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

Suresh et al.

Approximating Probabilistic Models as WFA

桌子 6
Perplexities for example experiments.

实验

来源
模型

Topology

Approx.
Perplexity

Idempotency
Comparison to greedy pruning
Approx. onto best bigram topology

earnest train.mod
earnest train.mod
earnest train.mod

earnest train.mod
earnest train.pru
earnest test.pru

73.41
83.12
69.68

桌子 7
Available Operations in the OpenGrm SFst Library.

Operation

sfstapprox

sfstcount

sfstintersect

描述

Approximates a stochastic ϕ-WFA
wrt a backoff-complete ϕ-WFA.
Counts from a stochastic ϕ-WFA
wrt a backoff-complete ϕ-WFA.

Intersects two ϕ-WFAs.

sfstnormalize -method=global

Globally normalizes a ϕ-WFA.

sfstnormalize -method=kl min

sfstnormalize -method=local

Normalizes a count ϕ-WFA using
KL divergence minimization.

Locally normalizes a ϕ-WFA.

sfstnormalize -method=phi

ϕ-normalizes a ϕ-WFA.

sfstperplexity

sfstrandgen

Computes perplexity of a

stochastic ϕ-WFA.

Randomly generates paths from a

stochastic ϕ-WFA.

sfstshortestdistance

Computes the shortest distance

sfsttrim

on a ϕ-WFA.

Removes useless states and
transitions in a ϕ-WFA.

6.3 Available Operations

桌子 7 lists the available command-line operations in the OpenGrm SFst library. 我们
show command-line utilities here; there are corresponding C++ library functions that
can be called from within a program; 参见http://sfst.opengrm.org.

7. Summary and Discussion

在本文中, we have presented an algorithm for minimizing the KL-divergence
between a probabilistic source model over sequences and a WFA target model. 那
our algorithm is general enough to permit source models of arbitrary form (例如, RNNs)
and deriving an appropriate WFA topology—something we do not really touch on in
this article—is particularly important.

251

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
C



/

A
r
t

C
e

p
d

F
/

/

/

/

4
7
2
2
2
1
1
9
3
0
9
8
8
/
C


_
A
_
0
0
4
0
1
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

计算语言学

体积 47, 数字 2

参考
Adel, Heike, Katrin Kirchhoff, Ngoc Thang
Vu, Dominic Telaar, and Tanja Schultz.
2014. Comparing approaches to convert
recurrent neural networks into backoff
language models for efficient decoding. 在
Fifteenth Annual Conference of the
International Speech Communication
协会 (Interspeech), pages 651–655.
Aho, Alfred V. and Margaret J. Corasick.

1975. Efficient string matching: An aid to
bibliographic search. Communications of the
ACM, 18(6):333–340. https://doi.org/10
.1145/360825.360855

阿尔伯特, J ¨urgen and Jarkko Kari. 2009. Digital

image compression. 在米. Droste, 瓦.
Kuich, 和H. Vogler, 编辑, Handbook
of Weighted Automata, 施普林格,
pages 453–479. https://doi.org/10.1007
/978-3-642-01492-5 11

Allauzen, Cyril, Mehryar Mohri, and Brian
Roark. 2003. Generalized algorithms for
constructing language models. 在
Proceedings of ACL, pages 40–47.
https://doi.org/10.3115/1075096
.1075102

Allauzen, Cyril, Michael Riley, Johan

Schalkwyk, Wojciech Skut, and Mehryar
Mohri. 2007. OpenFst Library.
http://www.openfst.org.

Allauzen, Cyril and Michael D. Riley. 2018.
Algorithms for weighted finite automata
with failure transitions. 在国际
Conference on Implementation and
Application of Automata, pages 46–58.
https://doi.org/10.1007/978-3-319
-94812-6 5

Angluin, Dana. 1987. Learning regular sets

from queries and counterexamples.
Information and Computation, 75(2):87–106.
https://doi.org/10.1016/0890-5401
(87)90052-6

Angluin, Dana. 1988. Identifying languages

from stochastic examples. 技术报告
YALEU /DCS /RR-614, Yale University.

Arisoy, Ebru, Stanley F. 陈, Bhuvana

Ramabhadran, and Abhinav Sethy. 2014.
Converting neural network language
models into back-off language models for
efficient decoding in automatic speech
认出. IEEE/ACM Transactions on
声音的, Speech and Language Processing
(TASLP), 22(1):184–192. https://doi.org
/10.1109/TASLP.2013.2286919

Balle, Borja, Xavier Carreras, Franco M.
Luque, and Ariadna Quattoni. 2014.
Spectral learning of weighted automata.
Machine Learning, 96(1-2):33–63.
https://doi.org/10.1007/s10994-013
-5416-X

252

Balle, Borja and Mehryar Mohri. 2012.

Spectral learning of general weighted
automata via constrained matrix
completion. In Advances in Neural
Information Processing Systems,
pages 2159–2167.

Breuel, 托马斯·M. 2008. The OCRopus open

source OCR system. 在诉讼程序中
IS&T/SPIE 20th Annual Symposium.
https://doi.org/10.1117/12.783598

Carrasco, Rafael C. 1994. Accurate

computation of the relative entropy
between stochastic regular grammars.
RAIRO-Theoretical Informatics and
应用领域, 31(5):437–444.
https://doi.org/10.1051/ita
/1997310504371

Carrasco, Rafael C. and Jos´e Oncina. 1994.

Learning stochastic regular grammars by
means of a state merging method. 在
International Colloquium on Grammatical
推理, pages 139–152. https://doi.org
/10.1007/3-540-58473-0 144

Carrasco, Rafael C. and Jos´e Oncina. 1999.

Learning deterministic regular grammars
from stochastic samples in polynomial
时间. RAIRO-Theoretical Informatics and
应用领域, 33(1):1–19. https://doi.org
/10.1051/ita:1999102

Chelba, Ciprian, Thorsten Brants, 将要

Neveitt, and Peng Xu. 2010. Study on
interaction between entropy pruning and
Kneser-Ney smoothing. In Eleventh Annual
Conference of the International Speech
Communication Association (Interspeech),
pages 2422–2425.

陈, Mingqing, Ananda Theertha Suresh,
Rajiv Mathews, Adeline Wong, Franc¸oise
Beaufays, Cyril Allauzen, 和迈克尔
Riley. 2019. Federated learning of N-gram
language models. In Proceedings of the 23rd
Conference on Computational Natural
Language Learning (CoNLL), pages 121–130.
https://doi.org/10.18653/v1/K19-1012

陈, Stanley and Joshua Goodman. 1998.

An empirical study of smoothing
techniques for language modeling.
TR-10-98, 哈佛大学.

Cicchello, Orlando and Stefan C. Kremer.

2003. Inducing grammars from sparse data
套: A survey of algorithms and results.
Journal of Machine Learning Research,
4(Oct):603–632.

科尔特斯, Corinna, Mehryar Mohri, Ashish

Rastogi, and Michael Riley. 2008. 上
computation of the relative entropy of
probabilistic automata. International Journal
of Foundations of Computer Science,
19(01):219–242. https://doi.org/10.1142
/S0129054108005644

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
C



/

A
r
t

C
e

p
d

F
/

/

/

/

4
7
2
2
2
1
1
9
3
0
9
8
8
/
C


_
A
_
0
0
4
0
1
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

Suresh et al.

Approximating Probabilistic Models as WFA

Dempster, Arthur P., Nan M. Laird, 和
Donald B. 鲁宾. 1977. Maximum
likelihood from incomplete data via the
EM algorithm. Journal of the Royal Statistical
社会. Series B (Methodological), 39(1):1–38.
https://doi.org/10.1111/j.2517-6161
.1977.tb01600.x

Deoras, Anoop, Tom´aˇs Mikolov, Stefan

Kombrink, Martin Karafi´at, and Sanjeev
Khudanpur. 2011. Variational
approximation of long-span language
models for LVCSR. 在 2011 IEEE
International Conference on Acoustics, Speech
and Signal Processing (ICASSP),
pages 5532–5535. https://doi.org/10
.1109/ICASSP.2011.5947612

Dupont, 皮埃尔. 1996. Incremental regular
inference. In International Colloquium on
Grammatical Inference, pages 222–237.
https://doi.org/10.1007/BFb0033357

Durbin, 理查德, Sean R. Eddy, Anders

Krogh, and Graeme J. Mitchison. 1998.
Biological Sequence Analysis: Probabilistic
Models of Proteins and Nucleic Acids,
剑桥大学出版社. https://
doi.org/10.1017/CBO9780511790492
Ebden, Peter and Richard Sproat. 2015. 这
Kestrel TTS text normalization system.
Natural Language Engineering,
21(3):333–353. https://doi.org/10
.1017/S1351324914000175

艾斯纳, Jason. 2001. Expectation semirings:

Flexible EM for learning finite-state
transducers. In Proceedings of the ESSLLI
Workshop on Finite-state Methods in NLP,
pages 1–5.

贾尔斯, C. 李, Clifford B. 磨坊主, Dong Chen,
Hsing-Hen Chen, Guo-Zheng Sun, 和
Yee-Chun Lee. 1992. Learning and
extracting finite state automata with
second-order recurrent neural networks.
神经计算, 4(3):393–405.
https://doi.org/10.1162/neco.1992.4
.3.393

Hellsten, Lars, Brian Roark, Prasoon Goyal,
Cyril Allauzen, Franc¸oise Beaufays, 汤姆
Ouyang, Michael Riley, and David Rybach.
2017. Transliterated mobile keyboard input
via weighted finite-state transducers. 在
FSMNLP 2017, pages 10–19. https://
doi.org/10.18653/v1/W17-4002

Horst, Reiner and Nguyen V. Thoai. 1999.
DC programming: Overview. 杂志
Optimization Theory and Applications,
103(1):1–43. https://doi.org/10.1023
/A:1021765131316

Iglesias, Gonzalo, Cyril Allauzen, 威廉
Byrne, Adri`a de Gispert, 和迈克尔
Riley. 2011. Hierarchical phrase-based
translation representations. In EMNLP
2011, pages 1373–1383.

Jacobsson, Henrik. 2005. Rule extraction
from recurrent neural networks: A
taxonomy and review. 神经计算,
17(6):1223–1263. https://doi.org/10
.1162/0899766053630350

Katz, Slava M. 1987. Estimation of

probabilities from sparse data for the
language model component of a speech
recognizer. IEEE Transactions on Acoustic,
Speech, and Signal Processing, 35(3):400–401.
https://doi.org/10.1109/TASSP.1987
.1165125

科恩, Philipp. 2005. Europarl: A parallel

corpus for statistical machine translation.
In MT Summit, 5:79–86.

Koneˇcn `y, Jakub, H. Brendan McMahan,
Felix X. 于, Peter Richt´arik, Ananda
Theertha Suresh, and Dave Bacon. 2016.
Federated learning: Strategies for
improving communication efficiency. arXiv
preprint arXiv:1610.05492.

Lecorv´e, Gw´enol´e and Petr Motlicek. 2012.
Conversion of recurrent neural network
language models to weighted finite state
transducers for automatic speech
认出. In Thirteenth Annual Conference
of the International Speech Communication
协会 (Interspeech), pages 1668–1671.

金子, 乙. 标记. 1967. Language identification

McMahan, Brendan, Eider Moore,

in the limit. Information and Control,
10(5):447–474. https://doi.org/10.1016
/S0019-9958(67)91165-5

金子, 乙. 标记. 1978. Complexity of

automaton identification from given data.
Information and Control, 37(3):302–320.
https://doi.org/10.1016/S0019
-9958(78)90562-4

Hard, 安德鲁, Kanishka Rao, Rajiv

Mathews, Franc¸oise Beaufays, Sean
Augenstein, Hubert Eichner, Chlo´e
Kiddon, and Daniel Ramage. 2018.
Federated learning for mobile keyboard
prediction. arXiv 预印本 arXiv:1811.03604.

Daniel Ramage, Seth Hampson, 和
Blaise Aguera y Arcas. 2017.
Communication-efficient learning of deep
networks from decentralized data. 在
Artificial Intelligence and Statistics,
pages 1273–1282.

Mohri, Mehryar. 1997. String-matching with
automata. Nordic Journal of Computing,
4(2):217–213.

Mohri, Mehryar. 2000. Minimization

algorithms for sequential transducers.
Theoretical Computer Science,
234(1-2):177–201. https://doi.org
/10.1016/S0304-3975(98)00115-7

253

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
C



/

A
r
t

C
e

p
d

F
/

/

/

/

4
7
2
2
2
1
1
9
3
0
9
8
8
/
C


_
A
_
0
0
4
0
1
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

计算语言学

体积 47, 数字 2

Mohri, Mehryar. 2002. Semiring frameworks

and algorithms for shortest-distance
问题. Journal of Automata, Languages
and Combinatorics, 7(3):321–350.

Mohri, Mehryar. 2009. Weighted automata
算法. 在米. Droste, 瓦. Kuich, 和
H. Vogler, 编辑, Handbook of Weighted
Automata. 施普林格, pages 213–254.
https://doi.org/10.1007/978-3-642
-01492-5 6

Mohri, Mehryar, Fernando C. 氮. 佩雷拉, 和
Michael Riley. 2008. Speech recognition
with weighted finite-state transducers. 在
J. Benesty, 中号. 中号. Sondhi, 和 Y. A. 黄,
编辑, Handbook on Speech Processing and
Speech Communication, 施普林格,
pages 559–584. https://doi.org/10
.1007/978-3-540-49127-9 28

Novak, Josef R., Nobuaki Minematsu, 和
Keikichi Hirose. 2013. Failure transitions
for joint n-gram models and G2P
转换. In Fourteenth Annual Conference
of the International Speech Communication
协会 (Interspeech), pages 1821–1825.

Okudono, Takamasa, Masaki Waga, Taro
Sekiyama, and Ichiro Hasuo. 2020.
Weighted automata extraction from
recurrent neural networks via regression
on state spaces. In Proceedings of the AAAI
Conference on Artificial Intelligence,
34:5306–5314. https://doi.org/10
.1609/aaai.v34i04.5977

Oncina, Jos´e and Pedro Garcia. 1992.
Identifying regular languages in
polynomial time. In Advances in Structural
and Syntactic Pattern Recognition. 世界
Scientific, pages 99–108. https://doi.org
/10.1142/9789812797919 0007

Ouyang, 汤姆, David Rybach, Franc¸oise
Beaufays, and Michael Riley. 2017.
Mobile keyboard input decoding with
finite-state transducers. arXiv 预印本
arXiv:1704.03987.

Parekh, Rajesh and Vasant Honavar. 2000.

Grammar inference, automata induction,
and language acquisition. 在R中. 戴尔, H.
Moisl, 和H. Somers, 编辑, Handbook of
自然语言处理, pages 727–764.

Pitt, Leonard. 1989. Inductive inference,

DFAs, and computational complexity. 在
International Workshop on Analogical and
Inductive Inference, pages 18–44.
https://doi.org/10.1007/3-540
-51734-0 50

Roark, Brian, Richard Sproat, Cyril Allauzen,
Michael Riley, Jeffrey Sorensen, and Terry
Tai. 2012. The OpenGrm open-source
finite-state grammar software libraries.
Proceedings of the ACL 2012 系统
Demonstrations, pages 61–66.

254

Samarati, Pierangela. 2001. Protecting
respondents’ identities in microdata
发布. IEEE Transactions on Knowledge and
Data Engineering, 13(6):1010–1027.
https://doi.org/10.1109/69
.971193

Sriperumbudur, Bharath K. and Gert R. G.

Lanckriet. 2009. On the convergence of the
concave-convex procedure. In Proceedings
of the 22nd International Conference on
Neural Information Processing Systems,
pages 1759–1767.

Stolcke, Andreas. 2000. Entropy-based
pruning of backoff language models.
arXiv preprint cs/0006025.

Sundermeyer, 马丁, Ralf Schl ¨uter, 和
Hermann Ney. 2012. LSTM neural
networks for language modeling.
Thirteenth Annual Conference of the
International Speech Communication
协会, pages 194–197.

Suresh, Ananda Theertha, Brian Roark,

Michael Riley, and Vlad Schogol. 2019.
Distilling weighted finite automata from
arbitrary probabilistic models. 在
Proceedings of the 14th International
Conference on Finite-State Methods and
自然语言处理 (FSMNLP),
pages 87–97. https://doi.org/10
.18653/v1/W19-3112

Ti ˜no, Peter and Vladimir Vojtek. 1997.
Extracting stochastic machines from
recurrent neural networks trained on
complex symbolic sequences. 在
第一国际会议录
Conference on Knowledge-Based Intelligent
Electronic Systems, 体积 2,
pages 551–558, IEEE.

韦斯, Gail, Yoav Goldberg, and Eran Yahav.
2018. Extracting automata from recurrent
neural networks using queries and
counterexamples. 在国际
Conference on Machine Learning,
pages 5244–5253.

韦斯, Gail, Yoav Goldberg, and Eran Yahav.
2019. Learning deterministic weighted
automata with queries and
counterexamples. Advances in Neural
Information Processing Systems,
pages 8560–8571.

Wolf-Sonkin, 劳伦斯, Vlad Schogol, Brian
Roark, and Michael Riley. 2019. Latin script
keyboards for South Asian languages with
finite-state normalization. 在诉讼程序中
the 14th International Conference on
Finite-State Methods and Natural Language
加工 (FSMNLP), pages 108–117.
https://doi.org/10.18653/v1/W19
-3114

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
C



/

A
r
t

C
e

p
d

F
/

/

/

/

4
7
2
2
2
1
1
9
3
0
9
8
8
/
C


_
A
_
0
0
4
0
1
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3
下载pdf