Approximating Probabilistic Models as
Weighted Finite Automata

Ananda Theertha Suresh
Google Research
theertha@google.com

Brian Roark
Google Research
roark@google.com

Michael Riley
Google Research
riley@google.com

Vlad Schogol
Google Research
vlads@google.com

Weighted finite automata (WFAs) are often used to represent probabilistic models, such as n-gram language models, because, among other things, they are efficient for recognition tasks in both time and space. The probabilistic source to be represented as a WFA, however, may come in many forms. Given a generic probabilistic model over sequences, we propose an algorithm to approximate it as a WFA such that the Kullback-Leibler divergence between the source model and the WFA target model is minimized. The proposed algorithm involves a counting step and a difference-of-convex optimization step, both of which can be performed efficiently. We demonstrate the usefulness of our approach on various tasks, including distilling n-gram models from neural models, building compact language models, and building open-vocabulary character models. The algorithms used for these experiments are available in an open-source software library.

1. Introduction

Given a sequence of symbols x_1, x_2, . . . , x_{n-1}, where symbols are drawn from the alphabet Σ, a probabilistic model S assigns probability to the next symbol x_n ∈ Σ by

$$p_S[x_n \mid x_{n-1} \ldots x_1]$$

Some of the results in this article were previously presented in Suresh et al. (2019).

Submission received: 13 January 2020; revised version received: 11 September 2020; accepted for publication: 6 January 2021.

https://doi.org/10.1162/coli_a_00401

© 2021 Association for Computational Linguistics
Published under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International
(CC BY-NC-ND 4.0) license


Such a model might be Markovian, where

$$p_s[x_n \mid x_{n-1} \ldots x_1] = p_s[x_n \mid x_{n-1} \ldots x_{n-k+1}]$$

such as a k-gram language model (LM) (Chen and Goodman 1998), or it might be non-Markovian, such as a long short-term memory (LSTM) neural network LM (Sundermeyer, Schlüter, and Ney 2012). Our goal is to approximate a probabilistic model as a weighted finite automaton (WFA) such that the weight assigned by the WFA is close to the probability assigned by the source model. Specifically, we will seek to minimize the Kullback-Leibler (KL) divergence between the source S and the target WFA model.

Representing the target model as a WFA has many advantages, including efficient use, compact representation, interpretability, and composability. WFA models have been used in many applications including speech recognition (Mohri, Pereira, and Riley 2008), speech synthesis (Ebden and Sproat 2015), optical character recognition (Breuel 2008), machine translation (Iglesias et al. 2011), computational biology (Durbin et al. 1998), and image processing (Albert and Kari 2009). One particular problem of interest is language modeling for on-device (virtual) keyboard decoding (Ouyang et al. 2017), where WFA models are widely used because of space and time constraints. One example use of methods we present in this article comes from this domain, involving approximation of a neural LM. Storing training data from the actual domain of use (individual keyboard activity) in a centralized server and training k-gram or WFA models directly may not be feasible because of privacy constraints (Hard et al. 2018). To circumvent this, an LSTM model can be trained by federated learning (Konečný et al. 2016; McMahan et al. 2017; Hard et al. 2018), converted to a WFA at the server, and then used for fast on-device inference. This not only may improve performance over training the models just on out-of-domain publicly available data, but also benefits from the additional privacy provided by federated learning. Chen et al. (2019) use the methods presented in this article for this very purpose.

There are multiple reasons why one may choose to approximate a source model with a WFA. One may have strong constraints on system latency, such as the virtual keyboard example above. Alternatively, a specialized application may require a distribution over just a subset of possible strings, but must estimate this distribution from a more general model—see the example below regarding utterances for setting an alarm. To address the broadest range of use cases, we aim to provide methods that permit large classes of source models and target WFA topologies. We explore several distinct scenarios experimentally in Section 5.

Our methods allow failure transitions (Aho and Corasick 1975; Mohri 1997) in the target WFA, which are taken only when no immediate match is possible at a given state, for compactness. For example, in the WFA representation of a backoff k-gram model, failure transitions can compactly implement the backoff (Katz 1987; Chen and Goodman 1998; Allauzen, Mohri, and Roark 2003; Novak, Minematsu, and Hirose 2013; Hellsten et al. 2017). The inclusion of failure transitions complicates our analysis and algorithms but is highly desirable in applications such as keyboard decoding. Further, to avoid redundancy that leads to inefficiency, we assume the target model is deterministic, which requires that at each state there is at most one transition labeled with a given symbol.

The approximation problem can be divided into two steps: (1) select an unweighted
automaton A that will serve as the topology of the target automaton and (2) weight
the automaton A to form our weighted approximation ˆA. The main goal of this article

is the latter determination of the automaton's weighting in the approximation. If the topology is not known, we suggest a few techniques for inferring topology later in the Introduction.

Figure 1
An unweighted automaton that specifies what one might say to set an alarm. The initial state is the bold circle and the final state is the double circle. By convention, we terminate all accepted strings with the symbol $.

We will now give some simple topology examples to illustrate the approximation idea. In Section 5 we will give larger-scale examples. Consider the unweighted automaton A in Figure 1 that was designed for what one might say to set an alarm. To use this in an application such as speech recognition, we would want to weight the automaton with some reasonable probabilities for the alternatives. For example, people may be more likely to set their alarms for six or seven than four. In the absence of data specifically for this scenario, we can fall back on some available background LM M, trained on a large suitable corpus. In particular, we can use the conditional distribution

$$p_M[x_1 \ldots x_n \mid x_1 \ldots x_n \in L(A)] = \frac{p_M[x_1 \ldots x_n]\, 1_{x_1 \ldots x_n \in L(A)}}{\sum_{x_1 \ldots x_n \in L(A)} p_M[x_1 \ldots x_n]} \qquad (1)$$

where 1 is the indicator function and L(A) is the regular language accepted by the automaton A, as our source distribution S. We then use the unweighted automaton A as our target topology.

If M is represented as a WFA, our approximation will in general give a different solution than forming the finite-state intersection with A and weight-pushing to normalize the result (Mohri, Pereira, and Riley 2008; Mohri 2009). Our approximation has the same states as A whereas weight-pushed M ∩ A has O(|M||A|) states. Moreover, weight-pushed M ∩ A is an exact WFA representation of the distribution in Equation (1).
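To make Equation (1) concrete, the following toy sketch restricts a small background distribution to the strings accepted by an automaton and renormalizes by brute-force enumeration. The sentences, probabilities, and accepted set below are invented for illustration; a real system would use WFA operations rather than explicit enumeration.

def restrict_to_language(p_m, accepted):
    """p_m: dict mapping strings (tuples of symbols) to probabilities.
    accepted: set of strings in L(A).  Returns the conditional
    distribution p_M[x | x in L(A)] of Equation (1)."""
    mass = sum(p for x, p in p_m.items() if x in accepted)
    return {x: p / mass for x, p in p_m.items() if x in accepted}

# A tiny background LM over three $-terminated "sentences".
p_m = {
    ("set", "alarm", "for", "six", "$"): 0.5,
    ("set", "alarm", "for", "four", "$"): 0.2,
    ("play", "music", "$"): 0.3,
}
# Suppose A accepts only the two alarm-setting sentences.
accepted = {("set", "alarm", "for", "six", "$"),
            ("set", "alarm", "for", "four", "$")}

p_s = restrict_to_language(p_m, accepted)
print(p_s)   # assigns 0.5/0.7 and 0.2/0.7 to the two accepted strings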
Figure 2
Topology examples derived from the corpus aab. States are labeled with the context that is remembered, ∧ denotes the initial context, ε the empty context, $ the final context (and terminates accepted strings), and _ matches any symbol in a context. (a) 3-gram topology: failure transitions, labeled with ϕ, implement backoff from histories xy to y to ε. (b) skip-gram topology: failure transitions implement backoff instead from histories xy to x_.

As stated earlier, some applications may simply require smaller models or those with lower latency of inference, and in such scenarios the specific target topology

may be unknown. In such cases, one choice is to build a k-gram deterministic finite automaton (DFA) topology from a corpus drawn from S (Allauzen, Mohri, and Roark 2003). This could be from an existing corpus or from random samples drawn from S. Figure 2(a) shows a trigram topology for the very simple corpus aab. Figure 2(b) shows an alternative topology that allows skip-grams. Both of these representations make use of failure transitions. These allow modeling strings unseen in the corpus (e.g., abab) in a compact way by failing or backing-off to states that correspond to lower-order histories. Such models can be made more elaborate if some transitions represent classes, such as names or numbers, that are themselves represented by sub-automata. As mentioned previously, we will mostly assume we have a topology either pre-specified or inferred by some means and focus on how to weight that topology to best approximate the source distribution. We hence focus our evaluation on intrinsic measures of model quality such as perplexity or bits-per-character.1

1 Model size or efficiency of inference may be a common motivation for the model approximations we present, which may suggest a demonstration of the speed/accuracy tradeoff for some downstream use of the models. However, as stated earlier in this Introduction, there are multiple reasons why an approximation may be needed, and our goal in this article is to establish that our methods provide a well-motivated approach should an approximation be required for any reason.
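As an illustration of the k-gram topology construction described above (cf. Figure 2(a)), the following sketch builds the unweighted trigram states, arcs, and backoff (failure) arcs from a toy corpus. It is a simplified stand-in for the OpenGrm construction, not the library's actual API; state and symbol conventions ("^", "$") are chosen only for this example.

def kgram_topology(corpus, k=3):
    """Build an unweighted k-gram topology with backoff (failure) arcs.
    States are histories of up to k-1 symbols; every observed n-gram is
    also recorded at the lower orders so the result is backoff-complete,
    and every non-final state backs off to its longest proper suffix."""
    states, arcs = {()}, set()
    for sent in corpus:
        syms = ["^"] + list(sent) + ["$"]
        for i in range(1, len(syms)):
            x = syms[i]
            full_hist = tuple(syms[max(0, i - k + 1):i])
            for j in range(len(full_hist) + 1):        # all orders of history
                hist = full_hist[j:]
                nxt = ("$",) if x == "$" else (hist + (x,))[-(k - 1):]
                states.update([hist, nxt])
                arcs.add((hist, x, nxt))
    phi_arcs = {s: s[1:] for s in states if len(s) > 0 and s != ("$",)}
    return states, arcs, phi_arcs

states, arcs, phi = kgram_topology([["a", "a", "b"]])
# e.g. ('^', 'a') --a--> ('a', 'a')  and  ('^', 'a') --phi--> ('a',)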

This article expands upon an earlier, shorter version (Suresh et al. 2019) by also providing, beyond the additional motivating examples that have already been presented in this Introduction: an extended related work section and references; expansion of the theoretical analysis from three lemmas without proofs (omitted for space) to five lemmas (and a corollary) with full proofs; inclusion of additional algorithms for counting (for general WFA source and target in addition to ϕ-WFA); inclusion of additional experiments, including some illustrating the exact KL-minimization methods available for WFA sources of different classes; and documentation of the open-source library that provides the full functionality presented here.

The article is organized as follows. In Section 2 we review previous work in this area. In Section 3 we give the theoretical formulation of the problem and the minimum KL divergence approximation. In Section 4 we present algorithms to compute that solution. One algorithm is for the case that the source itself is finite-state. A second algorithm is for the case when it is not and involves a sampling approach. In Section 5 we show experiments using the approximation. In Section 6 we briefly describe the open-source software used in our experiments. Finally, in Section 7 we discuss the results and offer conclusions.

2. Related Work

In this section we will review methods both for inferring unweighted finite-state models from data and for estimating the weight distribution in the weighted case. We start with the unweighted case.

There is a long history of unweighted finite-state model inference (Parekh and Honavar 2000; Cicchello and Kremer 2003). Gold (1967) showed that an arbitrary regular set L cannot be learned, identified in the limit, strictly from the presentation of a sequence of positive examples that eventually includes each string in L. This has led to several alternative lines of attack. One approach is to include the negative examples in the sequence. Given such a complete sample, there are polynomial-time algorithms that identify a regular set in the limit (Gold 1978). For example, a prefix tree of the positive examples can be built and then states can be merged so long as they do not cause a negative example to be accepted (Oncina and Garcia 1992; Dupont 1996). Another approach is to train a recurrent neural network (RNN) on the positive and negative examples and then extract a finite automaton by quantizing the continuous state space of the RNN (Giles et al. 1992; Jacobsson 2005).

A second approach is to assume a teacher is available that determines not only if a string is a positive or negative example but also if the language of the current hypothesized automaton equals L or, if not, provides a counterexample. In this case the minimal m-state DFA corresponding to L can be learned in time polynomial in m (Angluin 1987). Weiss, Goldberg, and Yahav (2018) apply this method for DFA extraction from an RNN.

A third approach is to assume a probability distribution over the (positive only) samples. With some reasonable restrictions on the distribution, such as that the probabilities are generated from a weighted automaton A with L = L(A), then L is identifiable in the limit with "high probability" (Angluin 1988; Pitt 1989).

There have been a variety of approaches for estimating weighted automata.
A variant of the prefix tree construction can be used that merges states with sufficiently similar suffix distributions, estimated from source frequencies (Carrasco and Oncina 1994, 1999). Approaches that produce (possibly highly) non-deterministic results include the Expectation-Maximization (EM) algorithm (Dempster, Laird, and Rubin 1977) applied to fully connected Hidden Markov models or spectral methods applied to automata (Balle and Mohri 2012; Balle et al. 2014). Eisner (2001) describes an algorithm for estimating probabilities in a finite-state transducer from data using EM-based methods. Weiss, Goldberg, and Yahav (2019) and Okudono et al. (2020) provide adaptations to the Weiss, Goldberg, and Yahav (2018) DFA extraction algorithm to yield weighted automata.

For approximating neural network (NN) models as WFAs, Deoras et al. (2011) used an RNN LM to generate samples that they then used to train a k-gram LM. Arisoy et al. (2014) used deep neural network (DNN) models of different orders to successively build and prune a k-gram LM with each new order constrained by the previous order. Adel et al. (2014) also trained DNNs of different orders, built a k-gram LM on the same data to obtain a topology, and then transferred the DNN probabilities of each order onto that k-gram topology. Tiňo and Vojtek (1997) quantized the continuous state space of an RNN and then estimated the transition probabilities from the RNN. Lecorvé and Motlicek (2012) quantized the hidden states in an LSTM to form a finite-state model and then used an entropy criterion to back off to low-order k-grams to limit the number of transitions. See Section 4.3 for a more detailed comparison with the most closely related methods, once the details of our algorithms have been provided.

Our article is distinguished in several respects from previous work. First, our general approach does not depend on the form of the source distribution, although we specialize our algorithms for (known) finite-state sources with an efficient direct construction and for other sources with an efficient sampling approach. Second, our targets are a wide class of deterministic automata with failure transitions. These are considerably more general than k-gram models but retain the efficiency of determinism and the compactness failure transitions allow, which is especially important in applications with large alphabets like language modeling. Third, we show that our approximation searches for the minimal KL divergence between the source and target distributions, given a fixed target topology provided by the application or some earlier computation.

3. Theoretical Analysis

3.1 Probabilistic Models

Let Σ be a finite alphabet. Let x_i^n ∈ Σ* denote the string x_i x_{i+1} . . . x_n and x^n ≜ x_1^n. A probabilistic model p over Σ is a probability distribution over the next symbol x_n, given the previous symbols x^{n-1}, such that²
$$\sum_{x \in \Sigma} p(x_n = x \mid x^{n-1}) = 1 \quad\text{and}\quad p(x_n = x \mid x^{n-1}) \ge 0, \ \forall x \in \Sigma$$

Without loss of generality, we assume that the model maintains an internal state q and updates it after observing the next symbol.³ Furthermore, the probability of the subsequent symbols just depends on the state q:

$$p(x_{i+1}^n \mid x^i) = p(x_{i+1}^n \mid q(x^i))$$

for all i, n, x^i, x_{i+1}^n, where q(x^i) is the state that the model has reached after observing sequence x^i. Let Q(p) be the set of possible states. Let the language L(p) ⊆ Σ* defined by the distribution p be

$$L(p) \triangleq \{x^n \in \Sigma^* : p(x^n) > 0 \text{ and } x_n = \$ \text{ and } x_i \ne \$, \ \forall\, i < n\} \qquad (2)$$

The symbol $ is used as a stopping criterion. Further, for all x^n ∈ Σ* such that x_{n-1} = $, p(x_n | x^{n-1}) = 0.

The KL divergence between the source model p_s and the target model p_a is given by

$$D(p_s \| p_a) = \sum_{x^n \in \Sigma^*} p_s(x^n) \log \frac{p_s(x^n)}{p_a(x^n)} \qquad (3)$$

where for notational simplicity, we adopt the convention 0/0 = 1 and 0 log(0/0) = 0 throughout the article. Note that for the KL divergence to be finite, we need L(p_s) ⊆ L(p_a). We will assume throughout that the source entropy H(p_s) = −Σ_{x^n} p_s(x^n) log p_s(x^n) is finite.⁴

We first reduce the KL divergence between two models as follows (cf. Carrasco 1997; Cortes et al. 2008). In the following, let q_* denote the states of the probability distribution p_*.

Lemma 1
If L(p_s) ⊆ L(p_a), then

$$D(p_s \| p_a) = \sum_{q_a \in Q_a} \sum_{q_s \in Q_s} \sum_{x \in \Sigma} \gamma(q_s, q_a)\, p_s(x \mid q_s) \log \frac{p_s(x \mid q_s)}{p_a(x \mid q_a)} \qquad (4)$$

where

$$\gamma(q_s, q_a) = \sum_{i=0}^{\infty} \sum_{x^i : q_s(x^i) = q_s,\, q_a(x^i) = q_a} p_s(x^i) \qquad (5)$$

Proof.

$$\begin{aligned}
D(p_s \| p_a) &= \sum_{n=1}^{\infty} \sum_{x^n} p_s(x^n) \log \frac{p_s(x^n)}{p_a(x^n)} \\
&= \sum_{n=1}^{\infty} \sum_{x^n} p_s(x^n) \sum_{i=1}^{n} \log \frac{p_s(x_i \mid x^{i-1})}{p_a(x_i \mid x^{i-1})} \\
&= \sum_{n=1}^{\infty} \sum_{i=1}^{n} \sum_{x^n} p_s(x^{i-1})\, p_s(x_i \mid x^{i-1})\, p_s(x_{i+1}^n \mid x^i) \log \frac{p_s(x_i \mid x^{i-1})}{p_a(x_i \mid x^{i-1})} \\
&= \sum_{i=1}^{\infty} \sum_{x^{i-1}} p_s(x^{i-1}) \sum_{x_i} p_s(x_i \mid x^{i-1}) \log \frac{p_s(x_i \mid x^{i-1})}{p_a(x_i \mid x^{i-1})} \cdot \sum_{n \ge i} \sum_{x_{i+1}^n} p_s(x_{i+1}^n \mid x^i) \\
&= \sum_{i=1}^{\infty} \sum_{x^{i-1}} p_s(x^{i-1}) \sum_{x_i} p_s(x_i \mid x^{i-1}) \log \frac{p_s(x_i \mid x^{i-1})}{p_a(x_i \mid x^{i-1})}
\end{aligned}$$

By definition, the probability of the next symbol conditioned on the past just depends on the state. Hence grouping terms corresponding to the same states both in s and a yields

$$\begin{aligned}
\sum_{i=1}^{\infty} \sum_{x^{i-1}} p_s(x^{i-1}) \sum_{x_i} p_s(x_i \mid x^{i-1}) \log \frac{p_s(x_i \mid x^{i-1})}{p_a(x_i \mid x^{i-1})}
&= \sum_{i=1}^{\infty} \sum_{x^{i-1}} p_s(x^{i-1}) \sum_{x_i} p_s(x_i \mid q_s(x^{i-1})) \log \frac{p_s(x_i \mid q_s(x^{i-1}))}{p_a(x_i \mid q_a(x^{i-1}))} \\
&= \sum_{q_a \in Q_a} \sum_{q_s \in Q_s} \sum_{i=1}^{\infty} \sum_{x^{i-1} : q_s(x^{i-1}) = q_s,\, q_a(x^{i-1}) = q_a} p_s(x^{i-1}) \sum_{x_i} p_s(x_i \mid q_s) \log \frac{p_s(x_i \mid q_s)}{p_a(x_i \mid q_a)} \\
&= \sum_{q_a \in Q_a} \sum_{q_s \in Q_s} \sum_{x_i} \gamma(q_s, q_a)\, p_s(x_i \mid q_s) \log \frac{p_s(x_i \mid q_s)}{p_a(x_i \mid q_a)}
\end{aligned}$$

Replacing x_i by x yields the lemma. □

2 We define x^0 ≜ ε, the empty string, and adopt p(ε) = 0.
3 In the most general case, q(x^n) = x^n.
4 If |Q(p_s)| is finite, it can be shown that H(p_s) is necessarily finite.
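The decomposition in Lemma 1 can be checked numerically on a toy case. The sketch below compares a truncated evaluation of Equation (3) with the state-level form of Equation (4) for two memoryless (single-state) models; the specific probabilities are invented for illustration.

import math
from itertools import product

# Two memoryless ("one-state") models over {a, b, $}; strings end at the
# first $.  These stand in for p_s and p_a.
ps = {"a": 0.4, "b": 0.3, "$": 0.3}
pa = {"a": 0.5, "b": 0.2, "$": 0.3}

def string_prob(p, s):
    prob = 1.0
    for x in s:
        prob *= p[x]
    return prob

# Direct (truncated) evaluation of Equation (3).
kl_direct = 0.0
for n in range(1, 16):                       # strings of length n ending in $
    for body in product("ab", repeat=n - 1):
        s = "".join(body) + "$"
        p, q = string_prob(ps, s), string_prob(pa, s)
        kl_direct += p * math.log(p / q)

# Lemma 1 with a single state pair: gamma is the expected number of
# prefixes, sum_{i>=0} P(no $ in first i symbols) = 1 / ps["$"].
gamma = 1.0 / ps["$"]
kl_lemma = gamma * sum(ps[x] * math.log(ps[x] / pa[x]) for x in ps)

print(kl_direct, kl_lemma)   # the truncated sum approaches the closed form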
The quantity γ(q_s, q_a) counts each string x^i that reaches both state q_s in Q_s and state q_a in Q_a, weighted by its probability according to p_s. Equivalently, it counts each string x^i reaching state (q_s, q_a) in Q_s × Q_a (an interpretation we develop in Section 4.1.1).⁵ Note that γ does not depend on the distribution p_a.

The following corollary is useful for finding the probabilistic model p_a that has the minimal KL divergence from a model p_s.

Corollary 1
Let P be a set of probabilistic models p_a for which L(p_s) ⊆ L(p_a). Then

$$\operatorname*{argmin}_{p_a \in P} D(p_s \| p_a) = \operatorname*{argmax}_{p_a \in P} \sum_{q_a \in Q_a} \sum_{x \in \Sigma} c(x, q_a) \log p_a(x \mid q_a)$$

where

$$c(x, q_a) = \sum_{q_s \in Q_s} \gamma(q_s, q_a)\, p_s(x \mid q_s) \qquad (6)$$

Proof. By Lemma 1

$$\begin{aligned}
\operatorname*{argmin}_{p_a \in P} D(p_s \| p_a)
&= \operatorname*{argmin}_{p_a \in P} \sum_{q_a \in Q_a} \sum_{q_s \in Q_s} \sum_{x \in \Sigma} \gamma(q_s, q_a)\, p_s(x \mid q_s) \log \frac{p_s(x \mid q_s)}{p_a(x \mid q_a)} \\
&= \operatorname*{argmin}_{p_a \in P} \Bigl\{ \sum_{q_a \in Q_a} \sum_{q_s \in Q_s} \sum_{x \in \Sigma} \gamma(q_s, q_a)\, p_s(x \mid q_s) \log p_s(x \mid q_s) - \sum_{q_a \in Q_a} \sum_{q_s \in Q_s} \sum_{x \in \Sigma} \gamma(q_s, q_a)\, p_s(x \mid q_s) \log p_a(x \mid q_a) \Bigr\} \qquad (7) \\
&= \operatorname*{argmax}_{p_a \in P} \sum_{q_a \in Q_a} \sum_{x \in \Sigma} c(x, q_a) \log p_a(x \mid q_a)
\end{aligned}$$

since the first term in Equation (7) does not depend on p_a. □

The quantity c(x, q_a) counts each string x^i that reaches a state (q_s, q_a) in Q_s × {q_a} by x_1^{i-1}, weighted by its probability according to p_s (cf. Equation (18)).

5 γ is not (necessarily) a probability distribution over Q_s × Q_a. In fact, γ(q_s, q_a) can be greater than 1.

3.2 Weighted Finite Automata

A weighted finite automaton A = (Σ, Q, E, i, f) over R_+ is given by a finite alphabet Σ, a finite set of states Q, a finite set of transitions E ⊆ Q × Σ × R_+ × Q, an initial state i ∈ Q, and a final state f ∈ Q. A transition e = (p[e], ℓ[e], w[e], n[e]) ∈ E represents a move from a previous (or source) state p[e] to the next (or destination) state n[e] with the label ℓ[e] and weight w[e]. The transitions with previous state q are denoted by E[q] and the labels of those transitions as L[q]. A deterministic WFA has at most one transition with a given label leaving each state. An unweighted (finite) automaton is a WFA that satisfies w[e] = 1, ∀e ∈ E. A probabilistic (or stochastic) WFA satisfies

$$\sum_{e \in E[q]} w[e] = 1 \quad\text{and}\quad w[e] \ge 0, \ \forall q \in Q - \{f\}$$

Transitions e_1 and e_2 are consecutive if n[e_i] = p[e_{i+1}]. A path π = e_1 · · · e_n ∈ E* is a finite sequence of consecutive transitions. The previous state of a path we denote by p[π] and the next state by n[π]. The label of a path is the concatenation of its transition labels ℓ[π] = ℓ[e_1] · · · ℓ[e_n]. The weight of a path is obtained by multiplying its transition weights w[π] = w[e_1] × · · · × w[e_n]. For a non-empty path, the i-th transition is denoted by π_i. P(q, q′) denotes the set of all paths in A from state q to q′. We extend this to sets in the obvious way: P(q, R) denotes the set of all paths from state q to q′ ∈ R and so forth.

A path π is successful if it is in P(i, f), and in that case the automaton is said to accept the input string α = ℓ[π]. The language accepted by an automaton A is the regular set L(A) = {α ∈ Σ* : α = ℓ[π], π ∈ P(i, f)}. The weight of α ∈ L(A) assigned by the automaton is A(α) = Σ_{π∈P(i,f): ℓ[π]=α} w[π].
Similar to Equation (2), we assume a symbol $ ∈ Σ such that

$$L(A) \subseteq \{x^n \in \Sigma^* : x_n = \$ \text{ and } x_i \ne \$, \ \forall\, i < n\}$$

Thus all successful paths are terminated by the symbol $.

For a symbol x ∈ Σ and a state q ∈ Q of a deterministic, probabilistic WFA A, define a distribution p_a(x|q) ≜ w if (q, x, w, q′) ∈ E and p_a(x|q) ≜ 0 otherwise. Then p_a is a probabilistic model over Σ as defined in the previous section.

If A = (Σ, Q, E, i, f) is an unweighted deterministic automaton, we denote by P(A) the set of all probabilistic models p_a representable as a weighted WFA Â = (Σ, Q, Ê, i, f) with the same topology as A where Ê = {(q, x, p_a(x|q), q′) : (q, x, 1, q′) ∈ E}.

Given an unweighted deterministic automaton A, our goal is to find the target distribution p_a ∈ P(A) that has the minimum KL divergence from our source probability model p_s.

Lemma 2
If L(p_s) ⊆ L(A), then

$$\operatorname*{argmin}_{p_a \in P(A)} D(p_s \| p_a) = \tilde{p}(x \mid q_a) \triangleq \frac{c(x, q_a)}{c(q_a)} \qquad (8)$$

where

$$c(q_a) = \sum_{x \in \Sigma} c(x, q_a)$$

Proof. From Corollary 1

$$\begin{aligned}
\operatorname*{argmin}_{p_a \in P(A)} D(p_s \| p_a)
&= \operatorname*{argmax}_{p_a \in P(A)} \sum_{q_a \in Q_a} \sum_{x \in \Sigma} c(x, q_a) \log p_a(x \mid q_a) \\
&= \operatorname*{argmax}_{p_a \in P(A)} \sum_{q_a \in Q_a} c(q_a) \sum_{x \in \Sigma} \tilde{p}(x \mid q_a) \log p_a(x \mid q_a) \\
&= \operatorname*{argmin}_{p_a \in P(A)} \sum_{q_a \in Q_a} c(q_a) \Bigl\{ -\sum_{x \in \Sigma} \tilde{p}(x \mid q_a) \log p_a(x \mid q_a) \Bigr\}
\end{aligned}$$

The quantity in braces is minimized when p_a(x|q_a) = \tilde{p}(x|q_a) because it is the cross entropy between the two distributions. Since L(p_s) ⊆ L(A), it follows that \tilde{p} ∈ P(A). □

3.3 Weighted Finite Automata with Failure Transitions

A weighted finite automaton with failure transitions (ϕ-WFA) A = (Σ, Q, E, i, f) is a WFA extended to allow a transition to have a special failure label denoted by ϕ. Then E ⊆ Q × (Σ ∪ {ϕ}) × R_+ × Q. A ϕ transition does not add to a path label; it consumes no input.⁶ However, it is followed only when the input cannot be read immediately. Specifically, a path e_1 · · · e_n in a ϕ-WFA is disallowed if it contains a subpath e_i · · · e_j such that ℓ[e_k] = ϕ for all k, i ≤ k < j, and there is another transition e ∈ E such that p[e_i] = p[e] and ℓ[e_j] = ℓ[e] ∈ Σ (see Figure 3). Since the label x = ℓ[e_j] can be read on e, we do not follow the failure transitions to read it on e_j as well.

6 In other words, a ϕ label, like an ε label, acts as an identity element in string concatenation (Allauzen and Riley 2018).

Figure 3
The (dashed red) path e_i = (q_i, ϕ, ω_i, q_{i+1}) to e_j = (q_j, x, ω_j, q_{j+1}) is disallowed since x can be read already on e = (q_i, x, ω, q).

We use P*(q, q′) ⊆ P(q, q′) to denote the set of (not dis-) allowed paths from state q to q′ in a ϕ-WFA. This again extends to sets in the obvious way. A path π is successful in a ϕ-WFA if π ∈ P*(i, f), and only in that case is the input string α = ℓ[π] accepted. The language accepted by the ϕ-automaton A is the set L(A) = {α ∈ Σ* : α = ℓ[π], π ∈ P*(i, f)}. The weight of α ∈ Σ* assigned by the automaton is A(α) = Σ_{π∈P*(i,f): ℓ[π]=α} w[π].
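The following minimal sketch shows how a deterministic ϕ-WFA assigns weight to a string under the disallowed-path rule: a symbol is read directly whenever possible, and failure arcs are taken only otherwise. The two-state model and its weights are invented; the failure weight 0.8 = 0.4/(1 − 0.5) is chosen so that the ϕ-extended distribution at the backing-off state sums to one (cf. the backoff normalization in Equation (9) below).

def phi_wfa_weight(arcs, phi, initial, string):
    """arcs[q] maps symbols to (weight, next_state); phi[q], when
    present, is (weight, backoff_state).  Follow failure arcs only when
    the symbol cannot be read immediately."""
    q, weight = initial, 1.0
    for x in string:
        while x not in arcs[q]:
            if q not in phi:
                return 0.0            # symbol unreadable: string rejected
            w_phi, q = phi[q]
            weight *= w_phi
        w, q = arcs[q][x]
        weight *= w
    return weight

# A two-state toy model: state "h" reads only "a" and backs off to the
# unigram-like state "u"; "f" is the final state reached on "$".
arcs = {
    "h": {"a": (0.6, "h")},
    "u": {"a": (0.5, "h"), "b": (0.3, "u"), "$": (0.2, "f")},
    "f": {},
}
phi = {"h": (0.8, "u")}
print(phi_wfa_weight(arcs, phi, "h", ["a", "b", "$"]))  # 0.6 * (0.8*0.3) * 0.2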
We assume each string in L(A) is terminated by the symbol $ as before. We also assume there are no ϕ-labeled cycles and there is at most one exiting failure transition per state. We express the ϕ-extended transitions leaving q as

$$E^*[q] = \bigl\{(q, x, \omega, q') : \pi \in P^*(q, Q),\ x = \ell[\pi] = \ell[\pi_{|\pi|}] \in \Sigma,\ \omega = w[\pi],\ q' = n[\pi]\bigr\}$$

This is a set of (possibly new) transitions (q, x, ω, q′), one for each allowed path from previous state q to next state q′ with optional leading failure transitions and a final x-labeled transition. Denote the labels of E*[q] by L*[q]. Note the WFA (Σ, Q, E*, i, f) accepts the same strings with the same weights as the ϕ-WFA A and thus L(A) is regular (Allauzen and Riley 2018).

A probabilistic (or stochastic) ϕ-WFA satisfies

$$\sum_{e \in E^*[q]} w[e] = 1 \quad\text{and}\quad w[e] \ge 0, \ \forall q \in Q - \{f\}$$

A deterministic ϕ-WFA is backoff-complete if a failure transition from state q to q′ implies L[q] ∩ Σ ⊆ L[q′] ∩ Σ. Further, if ϕ ∉ L[q′], then the containment is strict: L[q] ∩ Σ ⊂ L[q′] ∩ Σ. In other words, if a symbol can be read immediately from a state q it can also be read from a state failing (backing-off) from q, and if q′ does not have a backoff arc, then at least one additional label can be read from q′ that cannot be read from q. For example, both topologies depicted in Figure 2 have this property. We restrict our target automata to have a topology with the backoff-complete property since it will simplify our analysis, make our algorithms efficient, and is commonly found in applications. For example, backoff n-gram models, such as the Katz model, are ϕ-cycle-free, backoff-complete ϕ-WFAs (Katz 1987; Chen and Goodman 1998).

For a symbol x ∈ Σ and a state q ∈ Q of a deterministic, probabilistic ϕ-WFA A, define p*_a(x|q) ≜ w if (q, x, w, q′) ∈ E*[q] and p*_a(x|q) ≜ 0 otherwise. Then p*_a is a probabilistic model over Σ as defined in Section 3.1. Note the distribution p*_a at a state q is defined over the ϕ-extended transitions E*[q], whereas p_a in the previous section is defined over the transitions E[q]. It is convenient to define a companion distribution p_a ∈ P(A) to p*_a as follows:⁷ Given a symbol x ∈ Σ ∪ {ϕ} and state q ∈ Q, define p_a(x|q) ≜ p*_a(x|q) when x ∈ L[q] ∩ Σ, p_a(ϕ|q) ≜ 1 − Σ_{x∈L[q]∩Σ} p*_a(x|q), and p_a(x|q) ≜ 0 otherwise. The companion distribution is thus defined solely over the transitions E[q].

When A = (Σ, Q, E, i, f) is an unweighted deterministic, backoff-complete ϕ-WFA, we denote by P*(A) the set of all probabilistic models p*_a representable as a weighted ϕ-WFA Â = (Σ, Q, Ê, i, f) of the same topology as A with

$$\hat{E} = \{(q, x, p_a(x \mid q), q') : (q, x, 1, q') \in E,\ x \in \Sigma\} \cup \{(q, \phi, \alpha(q, q'), q') : (q, \phi, 1, q') \in E\}$$

where p_a ∈ P(A) is the companion distribution to p*_a and α(q, q′) = p_a(ϕ|q)/d(q, q′) is the weight of the failure transition from state q to q′ with

$$d(q, q') = 1 - \sum_{x \in L[q] \cap \Sigma} p_a(x \mid q') \qquad (9)$$

Note that we have specified the weights on the automaton that represents p*_a ∈ P*(A) entirely in terms of the companion distribution p_a ∈ P(A).
The failure transition weight α is determined from the requirement that the weights of E*[q] sum to 1 (Katz 1987, Equation (16)). Thanks to the backoff-complete property, this transition weight depends only on its previous state q and its next state q′ and not on all the states in the failure path from q. This is a key reason for assuming that property; otherwise our algorithms could not benefit from this locality.

Conversely, each distribution p_a ∈ P(A) can be associated with a distribution p*_a ∈ P*(A) given a deterministic, backoff-complete ϕ-WFA A. First extend α(q, q′) to any failure path as follows. Denote a failure path from state q to q′ by π_ϕ(q, q′). Then define

$$\alpha(q, q') = \prod_{e \in \pi_\phi(q, q')} \frac{p_a(\phi \mid p[e])}{d(p[e], n[e])} \qquad (10)$$

where this quantity is taken to be 1 when the failure path is empty (q = q′). Finally define

$$p^*_a(x \mid q) = \begin{cases} \alpha(q, q^x)\, p_a(x \mid q^x), & x \in L^*[q] \\ 0, & \text{otherwise} \end{cases} \qquad (11)$$

where for x ∈ L*[q], q^x signifies the first state q′ on a ϕ-labeled path in A from state q for which x ∈ L[q′].

For Equation (10) to be well-defined, we need d(p[e], n[e]) > 0. To ensure this condition, we restrict P(A) to contain distributions such that p_a(x|q) ≥ ε for each x ∈ L[q].⁸

Given an unweighted deterministic, backoff-complete automaton A, our goal is to find the target distribution p*_a ∈ P*(A) that has the minimum KL divergence from our source probability model p_s.
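As a small sketch of Equations (9)–(11), the fragment below recovers p*_a(·|q) from a companion distribution p_a by walking the failure path and accumulating p_a(ϕ|·)/d(·,·). The dictionary-based description of the topology and the toy two-state model are invented for illustration.

def d(pa, labels, q, q_next):
    # Equation (9): mass at the backoff state not already covered at q.
    return 1.0 - sum(pa[q_next][x] for x in labels[q])

def p_star(pa, labels, backoff, q, x):
    """Equations (10)-(11): labels[q] is L[q] minus phi, backoff[q] the
    failure target (if any), and pa[q] the companion distribution."""
    alpha = 1.0                       # empty failure path
    while x not in labels[q]:
        if q not in backoff:
            return 0.0
        q_next = backoff[q]
        alpha *= pa[q]["phi"] / d(pa, labels, q, q_next)
        q = q_next
    return alpha * pa[q][x]

# Two-state toy model: "h" reads only "a" and backs off to "u".
labels = {"h": {"a"}, "u": {"a", "b", "$"}}
backoff = {"h": "u"}
pa = {"h": {"a": 0.6, "phi": 0.4}, "u": {"a": 0.5, "b": 0.3, "$": 0.2}}
print(p_star(pa, labels, backoff, "h", "b"))   # 0.4 / (1 - 0.5) * 0.3 = 0.24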

7 The meaning of P(A) when A is a ϕ-WFA is to interpret it as a WFA with the failure labels as regular symbols.
8 For brevity, we do not include ε in the notation of P(A).


Lemma 3
Assume L(p_s) ⊆ L(A). Let p* = \tilde{p}(x|q_a) from Lemma 2. If L(p*) ⊆ L(A) and p* ∈ P*(A), then

$$p^* = \operatorname*{argmin}_{p^*_a \in P^*(A)} D(p_s \| p^*_a)$$

Proof. This follows immediately from Lemma 2. □

The requirement that the p* of Lemma 3 is in P*(A) will be true if, for example, the target has no failure transitions or if the source and target are both ϕ-WFAs with the same topology and failure transitions. In general, this requirement cannot be assured. While membership in P(A) principally requires that the weights of the transitions leaving a state are non-negative and sum to 1, membership in P*(A) imposes additional constraints due to the failure transitions, indicated in Equation (11).

As such, we restate our goal in terms of the companion distribution p_a ∈ P(A) rather than its corresponding distribution p*_a ∈ P*(A) directly. Let B_n(q) be the set of states in A that back off to state q in n failure transitions and let B(q) = ∪_{n=0}^{|Q_a|} B_n(q).

Lemma 4
If L(p_s) ⊆ L(A), then

$$\operatorname*{argmin}_{p^*_a \in P^*(A)} D(p_s \| p^*_a) = \operatorname*{argmax}_{p_a \in P(A)} \sum_{q \in Q_a} \Bigl\{ \sum_{x \in L[q]} C(x, q) \log p_a(x \mid q) - \sum_{q_0 \in B_1(q)} C(\phi, q_0) \log d(q_0, q) \Bigr\}$$

where

$$C(x, q) = \sum_{q_a \in B(q)} c(x, q_a)\, 1_{q = q_a^x}, \quad x \in \Sigma \qquad (12)$$

$$C(\phi, q) = \sum_{q_a \in B(q)} \sum_{x \in \Sigma} c(x, q_a)\, 1_{x \notin L[q]} \qquad (13)$$

and do not depend on p_a.

Proof. From Corollary 1, Equation (11), and the previously shown 1:1 correspondence between each distribution p*_a ∈ P*(A) and its companion distribution p_a ∈ P(A),

$$\begin{aligned}
\operatorname*{argmin}_{p^*_a \in P^*(A)} D(p_s \| p^*_a)
&= \operatorname*{argmax}_{p^*_a \in P^*(A)} \sum_{q_a \in Q_a} \sum_{x \in L^*[q_a]} c(x, q_a) \log p^*_a(x \mid q_a) \\
&= \operatorname*{argmax}_{p_a \in P(A)} \sum_{q_a \in Q_a} \sum_{x \in L^*[q_a]} c(x, q_a) \log \alpha(q_a, q_a^x)\, p_a(x \mid q_a^x) \\
&= \operatorname*{argmax}_{p_a \in P(A)} \sum_{q_a \in Q_a} \sum_{x \in L^*[q_a]} c(x, q_a) \log \prod_{e \in \pi_\phi(q_a, q_a^x)} \frac{p_a(\phi \mid p[e])}{d(p[e], n[e])}\, p_a(x \mid q_a^x) \\
&= \operatorname*{argmax}_{p_a \in P(A)} \bigl\{ A_x + A_\phi - A_d \bigr\} \qquad (14)
\end{aligned}$$
where we distribute the factors inside the logarithm in Equation (14) como sigue:

(cid:88)

(cid:88)

Ax =

C(X, qa) log pa(X|qx
a )

qa∈Qa
(cid:88)

x∈L∗[qa]
(cid:88)

(cid:88)

C(X, qa)1q=qx

a log pa(X|q)

(15)

q∈Qa
(cid:88)

qa∈B(q)
(cid:88)

x∈L[q]∩Σ

C(X, q) log pa(X|q)

q∈Qa

x∈L[q]∩Σ

=

=

Ecuación (15) follows from q = qx

a implying qa ∈ B(q).

(cid:88)

(cid:88)

Aϕ =

C(X, qa) registro

(cid:89)

pa(ϕ|pag[mi])

qa∈Qa
(cid:88)

x∈L∗[qa]
(cid:88)

C(X, qa)

e∈πϕ(qa,qx
a )
(cid:88)

log pa(ϕ|pag[mi])

qa∈Qa
(cid:88)

x∈L∗[qa]
(cid:88)

(cid:88)

e∈πϕ(qa,qx
a )
(cid:88)

C(X, qa)

1q=p[mi] log pa(ϕ|q)

q∈Qa
(cid:88)

qa∈B(q)
(cid:88)

x∈L∗[qa]
(cid:88)

e∈πϕ(qa,qx
a )

C(X, qa)1x/∈L[q] log pa(ϕ|q)

qa∈B(q)

x∈Σ

C(ϕ, q) log pa(ϕ|q)

q∈Qa
(cid:88)

q∈Qa

=

=

=

=

Ecuación (16) follows from e ∈ πϕ(qa, qx

a ) implying x /∈ p[mi].

(cid:88)

(cid:88)

Ad =

C(X, qa) registro

(cid:89)

d(pag[mi], norte[mi])

qa∈Qa
(cid:88)

x∈L∗[qa]
(cid:88)

C(X, qa)

e∈πϕ(qa,qx
a )
(cid:88)

log d(pag[mi], norte[mi])

=

=

=

=

qa∈Qa
(cid:88)

x∈L∗[qa]
(cid:88)

(cid:88)

e∈πϕ(qa,qx
a )
(cid:88)

C(X, qa)

q∈Qa
(cid:88)

qa∈B(q0 )
(cid:88)

q0∈B1(q)
(cid:88)

x∈L∗[qa]
(cid:88)

q∈Qa
(cid:88)

qa∈B(q0 )
(cid:88)

q∈Qa

q0∈B1(q)

q0∈B1(q)

x∈Σ

C(ϕ, q0) log d(q0, q)

(cid:88)

e∈πϕ(qa,qx
a )

1q0=p[mi] log d(q0, q)

C(X, qa)1x/∈L[q0] log d(q0, q)

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

yo

a
r
t
i
C
mi

pag
d

F
/

(16)

/

/

/

4
7
2
2
2
1
1
9
3
0
9
8
8
/
C
oh

yo
i

_
a
_
0
0
4
0
1
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Substituting these results into Equation (14) proves the lemma.

(cid:3)

If there are no failure transitions in the target automaton, then C(X, q) =c(X, q), if x
is in Σ, and is 0 de lo contrario. En este caso, the statement of Lemma 4 simplifies to that of
Corolario 1.
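For intuition, the sketch below turns plain counts c(x, q_a) into the C(x, q) of Equation (12) and accumulates the mass that feeds C(ϕ, ·) along the failure path as in Equation (13), relying on the backoff-complete property. The dictionary-based representation of the topology is an assumption made only for illustration.

from collections import defaultdict

def failure_target(labels, backoff, q, x):
    """The state q^x: first state on the failure path from q reading x."""
    while x not in labels[q]:
        if q not in backoff:
            return None
        q = backoff[q]
    return q

def phi_counts(c, labels, backoff):
    """c maps (x, q_a) to a count; labels[q] is L[q] minus phi and
    backoff[q] the failure target.  Returns C keyed by (x, q) for
    x in Sigma and by ("phi", q)."""
    C = defaultdict(float)
    for (x, qa), count in c.items():
        q = failure_target(labels, backoff, qa, x)
        if q is not None:
            C[(x, q)] += count                 # Equation (12)
        # Equation (13): the same mass also contributes C(phi, .) at
        # every state on the failure path from qa that cannot read x.
        state = qa
        while x not in labels[state]:
            C[("phi", state)] += count
            if state not in backoff:
                break
            state = backoff[state]
    return C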


Unfortunately, we do not have a closed-form solution, analogous to Lemma 2 or Lemma 3, for the general ϕ-WFA case. Instead we will present numerical optimization algorithms to find the KL divergence minimum in the next section. This is aided by the observation that the quantity in braces in the statement of Lemma 4 depends on the distribution p_a only at state q. Thus the KL divergence minimum can be found by maximizing that quantity independently for each state.

4. Algorithms

Approximating a probabilistic source algorithmically as a WFA requires two steps: (1) compute the quantity c(x, q_a) found in Corollary 1 and Lemma 2 or C(x, q) in Lemma 4 and (2) use this quantity to find the minimum KL divergence solution. The first step, which we will refer to as counting, is covered in the next section, and the KL divergence minimization step is covered afterwards, followed by an explicit comparison of the presented algorithms with closely related prior work.

4.1 Counting

How the counts are computed will depend on the form of the source and target models.
We break this down into several cases.

4.1.1 WFA Source and Target. When the source and target models are represented as WFAs, we compute c(x, q_a) from Lemma 2. From Equation (6) this can be written as

$$c(x, q_a) = \sum_{q_s \in Q_s} \gamma(q_s, q_a)\, p_s(x \mid q_s) \qquad (17)$$

where

$$\gamma(q_s, q_a) = \sum_{i=0}^{\infty} \sum_{x^i : q_s(x^i) = q_s,\, q_a(x^i) = q_a} p_s(x^i)$$

The quantity γ(q_s, q_a) can be computed as

$$\gamma(q_s, q_a) = \sum_{\pi \in P_{S \cap A}((i_s, i_a), (q_s, q_a))} w[\pi]$$
where S ∩ A is the weighted finite-state intersection of automata S and A (Mohri 2009).
The above summation over this intersection is the (generalized) shortest distance from
the initial state to a specified state computed over the positive real semiring (Mohri
2002; Allauzen and Riley 2018). Algorithms to efficiently compute the intersection and
shortest distance on WFAs are available in OpenFst (Allauzen et al. 2007), an open-
source WFA library.

Then from Equation (17) we can form the sum

$$c(x, q_a) = \sum_{((q_s, q_a), x, w, (q_s', q_a')) \in E_{S \cap A}} \gamma(q_s, q_a)\, w \qquad (18)$$

Equation (18) is the weighted count of the paths in S ∩ A that begin at the initial state and end in any transition leaving a state (q_s, q_a) labeled with x.

The worst-case time complexity for the counting step is dominated by the shortest distance algorithm on the intersection S ∩ A. The shortest distance computation is a meta-algorithm that depends on the queue discipline selected (Mohri 2002). If s is the maximum number of times a state in the intersection is inserted into the shortest distance queue, C the maximum cost of a queue operation, and D the maximum out-degree in S and A (both assumed deterministic), then the algorithm runs in O(s(D + C)|Q_S||Q_A|) (Mohri 2002; Allauzen and Riley 2018). The worst-case space complexity is in O(D|Q_S||Q_A|), determined by the intersection size.
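The sketch below mimics this counting step for acyclic, deterministic WFAs represented as plain dictionaries: the forward weights of the implicit intersection S ∩ A play the role of the shortest distance of Equation (17), and Equation (18) is accumulated arc by arc. A production implementation would instead use the OpenFst intersection and shortest-distance operations; the dict-based encoding here is only for illustration.

from collections import defaultdict

def count(src_arcs, src_init, tgt_arcs, tgt_init):
    """gamma and c of Equations (17)-(18) for acyclic, deterministic
    WFAs.  src_arcs[q][x] = (weight, next_state); same for tgt_arcs."""
    # Topologically order the states of the implicit intersection S ∩ A.
    order, seen = [], set()
    def visit(qs, qa):
        if (qs, qa) in seen:
            return
        seen.add((qs, qa))
        for x, (w, qs2) in src_arcs.get(qs, {}).items():
            if x in tgt_arcs.get(qa, {}):
                visit(qs2, tgt_arcs[qa][x][1])
        order.append((qs, qa))
    visit(src_init, tgt_init)
    order.reverse()

    gamma = defaultdict(float)
    gamma[(src_init, tgt_init)] = 1.0
    c = defaultdict(float)
    for qs, qa in order:                      # relax in topological order
        g = gamma[(qs, qa)]
        for x, (w, qs2) in src_arcs.get(qs, {}).items():
            if x not in tgt_arcs.get(qa, {}):
                continue                      # this string falls outside L(A)
            qa2 = tgt_arcs[qa][x][1]
            c[(x, qa)] += g * w               # Equation (18)
            gamma[(qs2, qa2)] += g * w
    return gamma, c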

4.1.2 ϕ-WFA Source and Target. When the source and target models are represented as ϕ-WFAs, we compute C(x, q) from Lemma 4. From Equation (12) and the previous case this can be written as

$$C(x, q) = \sum_{q_a \in B(q)} \sum_{q_s \in Q_s} \gamma(q_s, q_a)\, p_s(x \mid q_s)\, 1_{q = q_a^x}, \quad x \in \Sigma \qquad (19)$$

To compute this quantity we first form S ∩ A using an efficient ϕ-WFA intersection that compactly retains failure transitions in the result, as described in Allauzen and Riley (2018). Equation (19) is the weighted count of the paths in S ∩ A allowed by the failure transitions that begin at the initial state and end in any transition leaving a state (q_s, q) labeled with x.

Figure 4
A ϕ-WFA is transformed into an equivalent WFA by replacing each failure transition by an ε-transition. To compensate for the formerly disallowed paths, new (dashed red) negatively weighted transitions are added. The result is then promoted to a transducer T with the output label used to keep track of the previous state in A of the compensated positive transition.

We can simplify this computation by the following transformation. First we convert S ∩ A to an equivalent WFA by replacing each failure transition with an epsilon transition and introducing a negatively weighted transition to compensate for formerly disallowed paths (Allauzen and Riley 2018). The result is then promoted to a transducer T with the output label used to keep track of the previous state in A of the compensated

positive transition (see Figure 4).⁹ Algorithms to efficiently compute the intersection and shortest distance on ϕ-WFAs are available in the OpenGrm libraries (Allauzen and Riley 2018). Then

$$C(x, q) = \sum_{((q_s, q_a), x, q, w, (q_s', q_a')) \in E_T} \gamma_T(q_s, q_a)\, w, \quad x \in \Sigma \qquad (20)$$
where e = (p[e], il[e], ol[e], w[e], n[e]) is a transition in T and γ_T(q_s, q) is the shortest distance from the initial state to (q_s, q_a) in T computed over the real semiring as described in Allauzen and Riley (2018). Equation (20) is the weighted count of all paths in S ∩ A that begin at the initial state and end in any transition leaving a state (q_s, q) labeled with x, minus the weighted count of those paths that are disallowed by the failure transitions.
Finally, we compute C(ϕ, q) as follows. The count mass entering a state q must equal the count mass leaving the state:

$$\sum_{(q_a, x, 1, q) \in E_A} C(x, q_a) = \sum_{(q, x', 1, q_a') \in E_A} C(x', q) = \sum_{(q, x', 1, q_a') \in E_A,\, x' \in \Sigma} C(x', q) + C(\phi, q)$$

Thus

$$C(\phi, q) = \sum_{(q_a, x, 1, q) \in E_A} C(x, q_a) - \sum_{(q, x', 1, q_a') \in E_A,\, x' \in \Sigma} C(x', q)$$

This quantity can be computed iteratively in the topological order of states with respect to the ϕ-labeled transitions.

The worst-case time and space complexity for the counting step for ϕ-WFAs is the

same as for WFAs (Allauzen and Riley 2018).
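A literal sketch of the conservation relation above, assuming the states are supplied in topological order of the ϕ-transitions; the handling of any extra mass entering the initial state is omitted, and the data layout (arcs_in, labels, backoff) is invented for illustration.

def phi_mass(C, arcs_in, labels, backoff, states_topo):
    """Recover C(phi, q) from the C(x, q) counts via conservation of
    count mass.  C maps (x, q) to a count for x in Sigma; arcs_in[q]
    lists (prev_state, label) pairs for the ordinary arcs entering q;
    backoff maps each state with a failure arc to its target;
    states_topo lists states so every backoff source precedes its
    target."""
    C_phi = {}
    for q in states_topo:
        entering = sum(C.get((x, p), 0.0) for p, x in arcs_in.get(q, []))
        entering += sum(C_phi[p] for p, t in backoff.items() if t == q)
        leaving = sum(C.get((x, q), 0.0) for x in labels[q])
        C_phi[q] = entering - leaving
    return C_phi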

4.1.3 Arbitrary Source and ϕ-WFA Target. In some cases, the source is a distribution with possibly infinitely many states, for example, LSTMs. For these sources, computing C(x, q) can be computationally intractable, as Equation (19) requires a summation over all possible states in the source machine, Q_s. We propose to use a sampling approach to approximate C(x, q) for these cases. Let x(1), x(2), . . . , x(N) be independent random samples from p_s. Instead of C(x, q), we propose to use

$$\hat{C}(x, q) = \sum_{q_a \in B(q)} \sum_{q_s \in Q_s} \hat{\gamma}(q_s, q_a)\, p_s(x \mid q_s)\, 1_{q = q_a^x}, \quad x \in \Sigma$$

9 The construction illustrated in Figure 4 is sufficient when S ∩ A is acyclic. In the cyclic case a slightly

modified construction is needed to ensure convergence in the shortest distance calculation (Allauzen and
Riley 2018).


where

$$\hat{\gamma}(q_s, q_a) = \frac{1}{N} \sum_{j=1}^{N} \sum_{i \ge 1} 1_{q_s(x^i(j)) = q_s,\, q_a(x^i(j)) = q_a}$$

Observe that in expectation,

$$E[\hat{\gamma}(q_s, q_a)] = \frac{1}{N} \sum_{j=1}^{N} \sum_{i \ge 1} E\bigl[1_{q_s(x^i(j)) = q_s,\, q_a(x^i(j)) = q_a}\bigr] = \sum_{i \ge 1} \sum_{x^i : q_s(x^i) = q_s,\, q_a(x^i) = q_a} p_s(x^i) = \gamma(q_s, q_a),$$

and hence \hat{γ}(q_s, q_a) is an unbiased, asymptotically consistent estimator of γ(q_s, q_a). Given \hat{C}(x, q), we compute C(ϕ, q) similarly to the previous section. If ℓ is the expected number of symbols per sample, then the computational complexity of counting in expectation is in O(Nℓ|S|).
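The following sketch mirrors this sampled counting: each sampled sentence is traced through the target ϕ-WFA, and at every position the source's full next-symbol distribution is spread onto the first backoff state that can read each symbol, as in Equation (12). The sample and next_probs callables stand in for an arbitrary source such as an LSTM, and all sampled strings are assumed to lie in L(A); the data layout is invented for illustration.

from collections import defaultdict

def failure_target(labels, backoff, q, x):
    """First state on the failure path from q that can read x (q^x)."""
    while x not in labels[q]:
        if q not in backoff:
            return None
        q = backoff[q]
    return q

def sampled_counts(sample, next_probs, labels, arcs, backoff, initial, n):
    """hat-C(x, q) from n sampled sentences.  sample() returns a
    $-terminated list of symbols drawn from the source; next_probs(p)
    returns the source's next-symbol distribution after prefix p (one
    model evaluation per position).  arcs[q][x] is the target-state
    reached when x is read at q."""
    C = defaultdict(float)
    for _ in range(n):
        seq, prefix, q = sample(), [], initial
        for sym in seq:
            # Spread the model's probability mass over the target states
            # that absorb each possible next symbol (Equation (12)).
            for x, p in next_probs(prefix).items():
                qx = failure_target(labels, backoff, q, x)
                if qx is not None:
                    C[(x, qx)] += p / n
            # Advance deterministically, following failure arcs if needed.
            prefix.append(sym)
            q = arcs[failure_target(labels, backoff, q, sym)][sym]
    return C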

4.2 KL Divergence Minimization

4.2.1 WFA Target. When the target topology is a deterministic WFA, we use c(x, q_a) from the previous section and Lemma 2 to immediately find the minimum KL divergence solution.

4.2.2 ϕ-WFA Target. When the target topology is a deterministic, backoff-complete ϕ-WFA, Lemma 3 can be applied in some circumstances to find the minimum KL divergence solution, but not in general. However, as noted before, the quantity in braces in the statement of Lemma 4 depends on the distribution p_a only at state q, so the minimum KL divergence D(p_s||p*_a) can be found by maximizing that quantity independently for each state.

Fix a state q and let y_x ≜ p_a(x|q) for x ∈ L[q] and let y ≜ [y_x]_{x∈L[q]}.¹⁰ Then our goal reduces to

$$\operatorname*{argmax}_{y} \sum_{x \in L[q]} C(x, q) \log y_x - \sum_{q_0 \in B_1(q)} C(\phi, q_0) \log \Bigl(1 - \sum_{x \in L[q_0] \cap \Sigma} y_x\Bigr) \qquad (21)$$

subject to the constraints y_x ≥ ε for x ∈ L[q] and Σ_{x∈L[q]} y_x = 1.

This is a difference of two concave functions in y, since log(f(y)) is concave for any linear function f(y), C(x, q) and C(ϕ, q_0) are always non-negative, and the sum of concave functions is also concave. We give a DC programming solution to this optimization (Horst and Thoai 1999). Let

$$\Omega = \Bigl\{y : \forall x,\ y_x \ge \epsilon,\ \sum_{x \in L(q)} y_x \le 1\Bigr\}$$

10 We fix some total order on Σ ∪ {ϕ} so that y is well-defined.

and let u(y) = Σ_{x∈L[q]} C(x, q) log y_x and v(y) = Σ_{q_0∈B_1(q)} C(ϕ, q_0) log(1 − Σ_{x∈L[q_0]∩Σ} y_x). Then the optimization problem can be written as

$$\max_{y \in \Omega}\ u(y) - v(y)$$

The DC programming solution for such a problem uses an iterative procedure that linearizes the subtrahend in the concave difference about the current estimate and then solves the resulting concave objective for the next estimate (Horst and Thoai 1999), that is,

$$y^{n+1} = \operatorname*{argmax}_{y \in \Omega}\ u(y) - y \cdot \nabla v(y^n) \qquad (22)$$

Substituting u and ∇v gives

$$y^{n+1} = \operatorname*{argmax}_{y \in \Omega} \Bigl\{ \sum_{x \in L[q]} C(x, q) \log y_x + y_x f(x, q, y^n) \Bigr\}$$

where

$$f(x, q, y^n) = \sum_{q_0 \in B_1(q)} \frac{C(\phi, q_0)\, 1_{x \in L[q_0] \cap \Sigma}}{1 - \sum_{x' \in L[q_0] \cap \Sigma} y^n_{x'}} \qquad (23)$$

Observe that 1 − Σ_{x'∈L[q_0]∩Σ} y^n_{x'} ≥ ε, as the automaton is backoff-complete and y^n ∈ Ω.

Let C(q) be defined as:

$$C(q) = \sum_{x' \in L[q]} C(x', q)$$

The following lemma provides the solution to the optimization problem in Equation (22) that leads to a stationary point of the objective.

Lemma 5
The solution to Equation (22) is given by

$$y^{n+1}_x = \max\Bigl(\frac{C(x, q)}{\lambda - f(x, q, y^n)},\ \epsilon\Bigr) \qquad (24)$$

where λ ∈ \Bigl[\max_{x \in L[q]} f(x, q, y^n) + C(x, q),\ \max_{x \in L[q]} f(x, q, y^n) + \frac{C(q)}{1 - |L[q]|\epsilon}\Bigr] such that Σ_x y^{n+1}_x = 1.
Proof. With KKT multipliers, the optimization problem can be written as

$$\max_{y, \lambda, \mu_x : \mu_x \le 0} \Bigl\{ \sum_{x \in L[q]} C(x, q) \log y_x + y_x f(x, q, y^n) \Bigr\} + \lambda \Bigl(1 - \sum_{x \in L[q]} y_x\Bigr) + \sum_{x \in L[q]} \mu_x (\epsilon - y_x)$$

We divide the proof into two cases depending on the value of C(x, q). Let C(x, q) ≠ 0. Differentiating this equation with respect to y_x and equating to zero, we get

$$y^{n+1}_x = \frac{C(x, q)}{\lambda + \mu_x - f(x, q, y^n)}$$

Furthermore, by the KKT condition, μ_x(ε − y^{n+1}_x) = 0. Hence, μ_x is only non-zero if y^{n+1}_x = ε, and if μ_x is zero, then y^{n+1}_x = C(x, q)/(λ − f(x, q, y^n)). Moreover, because for all x, μ_x ≤ 0, for y^{n+1}_x to be positive, we need λ ≥ max_x f(x, q, y^n). Hence, the above two conditions can be re-expressed as Equation (24). If C(x, q) = 0, then we get

$$f(x, q, y^n) = \lambda + \mu_x \quad\text{and}\quad \mu_x(\epsilon - y^{n+1}_x) = 0$$

and the solution is given by y^{n+1}_x = ε and μ_x = f(x, q, y^n) − λ. Since μ_x cannot be positive, we have f(x, q, y^n) ≤ λ for all x. Hence, irrespective of the value of C(x, q), the solution is given by Equation (24).

The above analysis restricts λ ≥ max_x f(x, q, y^n). If λ < f(x, q, y^n) + C(x, q), then Σ_x y^{n+1}_x > 1, and if λ > max_x f(x, q, y^n) + C(q)/(1 − |L[q]|ε), then Σ_x y^{n+1}_x < 1. Hence λ needs to lie in

$$\Bigl[\max_{x \in L[q]} f(x, q, y^n) + C(x, q),\ \max_{x \in L[q]} f(x, q, y^n) + \frac{C(q)}{1 - |L[q]|\epsilon}\Bigr]$$

to ensure that Σ_x y^{n+1}_x = 1. □

Algorithm KL-MINIMIZATION

Notation:
• y_x = p_a(x|q) for x ∈ L(q)
• C(x, q) from Equations (12) and (13)
• C(q) = Σ_{x'∈L[q]} C(x', q)
• f(x, q, y^n) from Equation (23)
• k = |L[q]|
• lb = max_{x∈L[q]} f(x, q, y^n) + C(x, q)
• ub = max_{x∈L[q]} f(x, q, y^n) + C(q)/(1 − kε)
• ε = lower bound on y_x

Trivial case: If C(q) = 0, output y given by y_x = 1/k for all x.

Initialization: Initialize:

$$y^0_x = \frac{C(x, q)}{C(q)}(1 - k\epsilon) + \epsilon.$$

Iteration: Until convergence do:

$$y^{n+1}_x = \max\Bigl(\frac{C(x, q)}{\lambda - f(x, q, y^n)},\ \epsilon\Bigr),$$

where λ ∈ [lb, ub] is chosen (in a binary search) to ensure Σ_{x∈L(q)} y_x = 1.

Figure 5
KL-MINIMIZATION algorithm.

From this, we form the KL-MINIMIZATION algorithm in Figure 5. Observe that if all the counts are zero, then for any y, u(y) − v(y) = 0, any solution is an optimal solution, and the algorithm returns a uniform distribution over labels. In other cases, we initialize the model based on counts such that y^0 ∈ Ω. We then repeat the DC programming algorithm iteratively until convergence. Because Ω is a convex compact set and the functions u, v, and ∇v are continuous and differentiable in Ω, KL-MINIMIZATION converges to a stationary point (Sriperumbudur and Lanckriet 2009, Theorem 4). For each state q, the computational complexity of KL-MINIMIZATION is in O(|E[q]|) per iteration. Hence, if the maximum number of iterations per state is s, the overall computational complexity of KL-MINIMIZATION is in O(s|E|).
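A compact sketch of the per-state iteration of Figure 5 follows, with the binary search over λ from Lemma 5. The data layout (per-state label sets and backoff counts) is invented for illustration, and lb is read as max_x [f(x, q, yⁿ) + C(x, q)], following the proof of Lemma 5.

def f(x, y, backoff_counts, backoff_labels):
    # Equation (23), evaluated at the current iterate y.
    total = 0.0
    for q0, c_phi in backoff_counts.items():
        if x in backoff_labels[q0]:
            total += c_phi / (1.0 - sum(y[xp] for xp in backoff_labels[q0]))
    return total

def kl_minimize_state(labels_q, cx, backoff_counts, backoff_labels,
                      eps=1e-6, iters=50):
    """labels_q: L[q] (possibly including "phi"); cx: C(x, q);
    backoff_counts[q0] = C(phi, q0) and backoff_labels[q0] = L[q0] & Sigma
    for the states q0 that back off directly to q (assumed to be subsets
    of labels_q by backoff-completeness)."""
    k = len(labels_q)
    c_q = sum(cx.get(x, 0.0) for x in labels_q)
    if c_q == 0.0:                                  # trivial case
        return {x: 1.0 / k for x in labels_q}
    y = {x: cx.get(x, 0.0) / c_q * (1 - k * eps) + eps for x in labels_q}
    for _ in range(iters):
        fx = {x: f(x, y, backoff_counts, backoff_labels) for x in labels_q}
        lo = max(fx[x] + cx.get(x, 0.0) for x in labels_q)
        hi = max(fx.values()) + c_q / (1 - k * eps)
        for _ in range(60):                         # binary search on lambda
            lam = (lo + hi) / 2
            y_new = {x: max(cx.get(x, 0.0) / (lam - fx[x]), eps)
                     for x in labels_q}
            if sum(y_new.values()) > 1.0:
                lo = lam
            else:
                hi = lam
        y = y_new
    return y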

4.3 Discussion

Our method for approximating from source ϕ-WFAs, specifically from backoff k-gram models, shares some similarities with entropy pruning (Stolcke 2000). Both can be used to approximate onto a more compact k-gram topology, and both use a KL-divergence criterion to do so. They differ in that entropy pruning, by sequentially removing transitions, selects the target topology, and in doing so uses a greedy rather than global entropy reduction strategy. We will empirically compare these methods in Section 5.1 for this use case.

Perhaps the closest to our work on approximating arbitrary sources is that of Deoras et al. (2011), where they used an RNN LM to generate samples that they then used to train a k-gram LM. In contrast, we use samples to approximate the joint state probability γ(q_s, q_a). If ℓ is the average number of words per sentence and N is the total number of sampled sentences, then the time complexity of Deoras et al. (2011) is O(Nℓ|S|), as it takes O(|S|) time to generate a sample from the neural model for every word. This is the same as that of our algorithm in Section 4.1.3. However, our approach has several advantages. First, for a given topology, our algorithm provably finds the best KL minimized solution, whereas their approach is optimal only when the size of the k-gram model tends to infinity. Second, as we show in our experiments, the proposed algorithm performs better compared with that of Deoras et al. (2011).

Arisoy et al. (2014) and Adel et al. (2014) both trained DNN models of different orders as the means to derive probabilities for a k-gram model, rather than focusing on model approximation as the above-mentioned methods do. Further, the methods are applicable to k-gram models specifically, not the larger class of target WFA to which our methods apply.

5. Experiments

We now provide experimental evidence of the theory's validity and show its usefulness in various applications. For ease of notation, we use WFA-APPROX to denote the exact counting algorithm described in Section 4.1.2 followed by the KL-MINIMIZATION algorithm of Section 4.2. Similarly, we use WFA-SAMPLEAPPROX(N) to denote the sampled counting described in Section 4.1.3 with N sampled sentences followed by KL-MINIMIZATION.

We first give experimental evidence that supports the theory in Section 5.1. We then show how to approximate neural models as WFAs in Section 5.2. We also use the proposed method to provide lower bounds on the perplexity given a target topology in Section 5.3. Finally, we present experiments in Section 5.4 addressing scenarios with source WFAs, hence not requiring sampling. First, motivated by low-memory applications such as (virtual) keyboard decoding (Ouyang et al. 2017), we use our approach to create compact language models in Section 5.4.1. Then, in Section 5.4.2, we use our approach to create compact open-vocabulary character language models from count-thresholded n-grams.

For most experiments (unless otherwise noted) we use the 1996 CSR Hub4 Language Model data, LDC98T31 (English Broadcast News data). We use the processed form of the corpus and further process it to downcase all the words and remove punctuation. The resulting data set has 132M words in the training set, 20M words in the test set, and has 240K unique words. For all the experiments that use word models, we create a vocabulary of approximately 32K words consisting of all words that appeared more than 50 times in the training corpus. Using this vocabulary, we create a trigram Katz model and prune it to contain 2M n-grams using entropy pruning (Stolcke 2000). We use this pruned model as a baseline in all our word-based experiments. We use Katz smoothing because it is amenable to pruning (Chelba et al. 2010). The perplexity of this model on the test set is 144.4.¹¹ All algorithms were implemented using the open-source OpenFst and OpenGrm n-gram and stochastic automata (SFst) libraries,¹² with the last library including these implementations (Allauzen et al. 2007; Roark et al. 2012; Allauzen and Riley 2018).

5.1 Empirical Evidence of Theory

Recall that our goal is to find the distribution on a target DFA topology that minimizes
the KL divergence to the source distribution. Sin embargo, as stated in Section 4.2, if the
target topology has failure transitions, the optimization objective is not convex so the
stationary point solution may not be the global optimum. We now show that the model
indeed converges to a good solution in various cases empirically.

Idempotency. When the target topology is the same as the source topology, we show that
the performance of the approximated model matches the source model. Let ps be the
pruned Katz word model described above. We approximate ps onto the same topol-
ogy using WFA-APPROX and WFA-SAMPLEAPPROX(·) and then compute perplexity
on the test corpus. The results are presented in Figure 6. The test perplexity of the
WFA-APPROX model matches that of the source model and the performance of the
WFA-SAMPLEAPPROX(N) model approaches that of the source model as the number of
samples N increases.

Comparison to Greedy Pruning. Recall that entropy pruning (Stolcke 2000) greedily re-
moves n-grams such that the KL divergence to the original model ps is small. Let pgreedy
be the resulting model and Agreedy be the topology of pgreedy. If the KL-MINIMIZATION
converges to a good solution, then approximating ps onto Agreedy would give a model
that is at least as good as pgreedy. We show that this is indeed the case; in fact, approximat-
ing ps onto Agreedy performs better than pgreedy. As before, let ps again be the 2M n-gram
Katz model described above. We prune it to have 1M n-grams and obtain pgreedy, which
has a test perplexity of 157.4. We then approximate the source model ps on Agreedy, the

11 For all perplexity measurements, we treat the unknown word as a single token instead of a class. To
compute the perplexity with the unknown token treated as a class, multiply the perplexity by k^0.0115,
where k is the number of tokens in the unknown class and 0.0115 is the out-of-vocabulary rate in the test
data set.

12 These libraries are available at www.openfst.org and www.opengrm.org.

Figure 6
Idempotency: Test set perplexity for the Katz baseline and approximations of that baseline trained on
the same data. The Katz baseline and Katz WFA-APPROX plots are identical (and thus overlap),
while the WFA-SAMPLEAPPROX(N) plot converges to the baseline.

Table 1
Test perplexity of greedy pruning and approximated models.

Model size    Greedy pruning    Approximated model

250k          205.7             198.3
500k          177.3             173.0
1M            157.4             155.7
1.5M          149.0             148.4

topology of pgreedy. The resulting model has a test perplexity of 155.7, which is smaller
than the test perplexity of pgreedy. This shows that the approximation algorithm indeed
finds a good solution. We repeat the experiment with different pruning thresholds and
observe that the approximation algorithm provides a good solution across the resulting
model sizes. The results are in Table 1.

5.2 Neural Models to WFA Conversion

Since neural models such as LSTMs give improved performance over n-gram models,
we investigate whether an LSTM distilled onto a WFA model can obtain better per-
formance than the baseline WFA trained directly from Katz smoothing. As stated in
the Introduction, this could then be used together with federated learning for fast and
private on-device inference, which was done in Chen et al. (2019) using the methods
presented here.

To explore this, we train an LSTM language model on the training data. The model
has 2 LSTM layers with 1,024 states and an embedding size of 1,024. The resulting model
has a test perplexity of 60.5. We approximate this model as a WFA in three ways.

Figure 7
Approximated Neural Models for the English Broadcast News corpus. Test set perplexity for the Katz
baseline and LSTM models approximated in three ways. One uses LSTM samples to build a new
Katz model (WFA-SAMPLEKATZ(N)). The remaining two use our approximation algorithm
(WFA-SAMPLEAPPROX(N)), but with different topologies. One topology (KT) is known, using
the baseline Katz topology, and one (UT) is unknown, using the samples drawn for
WFA-SAMPLEAPPROX(N) to also infer the topology.

The first way is to construct a Katz n-gram model on N LSTM samples and entropy-
prune to 2M n-grams, which we denote by WFA-SAMPLEKATZ(N). This approach is
very similar to that of Deoras et al. (2011), except that we use Katz smoothing instead
of Kneser-Ney smoothing. We used Katz because, as stated earlier, Kneser-
Ney models are not amenable to pruning (Chelba et al. 2010). The second way is to
approximate onto the baseline Katz 2M n-gram topology described above using WFA-
SAMPLEAPPROX(N). We refer to this experiment as WFA-SAMPLEAPPROX(N) (KT),
where KT stands for known topology. The results are shown in Figure 7. The WFA-
SAMPLEKATZ(N) models perform significantly worse than the baseline Katz model
even at 32M samples, while the WFA-SAMPLEAPPROX(N) models have better perplex-
ity than the baseline Katz model with as few as 1M samples. With 32M samples this
way of approximating the LSTM model as a WFA is 3.6 points better in perplexity than the
baseline Katz.

Finalmente, if the WFA topology is unknown we use the samples obtained in WFA-
SAMPLEAPPROX(·) to create a Katz model entropy-pruned to 2M n-grams. We refer
to this experiment as WFA-SAMPLEAPPROX(norte) (UT), where UT stands for unknown
topología. The results are shown in Figure 7. The approximated models obtained by
WFA-SAMPLEAPPROX(norte) (UT) perform better than WFA-SAMPLEKATZ(norte). Sin embargo,
they do not perform as well as WFA-SAMPLEAPPROX(norte) (KT), the models obtained
with the known topology derived from the training data. Sin embargo, with enough sam-
ples, their performance is similar to that of the original Katz model.
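Both sampled approaches consume the same kind of sample corpus drawn from the neural model. As a rough illustration only (not the actual implementation), the sampling can be pictured as ordinary ancestral sampling; in the sketch below, next_prob is a hypothetical stand-in for the LSTM's next-symbol distribution over the vocabulary plus an end-of-sentence marker.

```python
# Minimal sketch of ancestral sampling from a sequence model, assuming a
# hypothetical interface `next_prob(prefix)` that returns a dict mapping each
# symbol (including the end-of-sentence marker) to its probability.
import random

EOS = "</s>"  # hypothetical end-of-sentence marker

def sample_sentence(next_prob, max_len=100):
    """Draw one sentence symbol by symbol from the model."""
    prefix = []
    for _ in range(max_len):
        dist = next_prob(tuple(prefix))            # p[. | prefix]
        symbols, probs = zip(*dist.items())
        symbol = random.choices(symbols, weights=probs, k=1)[0]
        if symbol == EOS:
            break
        prefix.append(symbol)
    return prefix

def sample_corpus(next_prob, n_sentences):
    """Draw N sentences; these feed either Katz training or sampled counting."""
    return [sample_sentence(next_prob) for _ in range(n_sentences)]
```

The resulting sentences are then either used to train a Katz model directly (WFA-SAMPLEKATZ) or fed to the sampled counting step followed by KL-MINIMIZATION (WFA-SAMPLEAPPROX).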

We then compare our approximation method with other methods that approximate
DNNs with n-gram LMs (Arisoy et al. 2014; Adel et al. 2014). Both of these methods require
training DNNs of different orders; we used DNNs with two layers of 1,024 nodes
and an embedding size of 1,024. The results are in Table 2. The proposed algorithms
WFA-SAMPLEAPPROX(·) perform better than the existing approaches.

Table 2
Test perplexity of k-gram models obtained by different approximation methods for the
English Broadcast News corpus. We use 32M samples for both WFA-SAMPLEKATZ(·) and
WFA-SAMPLEAPPROX(·).

Model                          Test perplexity

Katz                           144.4
WFA-SAMPLEKATZ(·)              151.8
Arisoy et al. (2014)           159.6
Adel et al. (2014)             146.7
WFA-SAMPLEAPPROX(·) (KT)       140.8
WFA-SAMPLEAPPROX(·) (UT)       143.9

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

yo

a
r
t
i
C
mi

pag
d

F
/

/

/

/

4
7
2
2
2
1
1
9
3
0
9
8
8
/
C
oh

Figure 8
Approximated Neural Models for the Polish (Europarl) corpus. Test set perplexity for the Katz baseline
and LSTM models approximated in three ways. One uses LSTM samples to build a new Katz
model (WFA-SAMPLEKATZ(N)). The remaining two use our approximation algorithm
(WFA-SAMPLEAPPROX(N)), but with different topologies. One topology (KT) is known, using
the baseline Katz topology, and one (UT) is unknown, using the samples drawn for
WFA-SAMPLEAPPROX(N) to also infer the topology.

Experiment on Polish. To repeat the above experiment on a different language and corpus,
we turn to the Polish language section13 of the Europarl corpus (Koehn 2005). We
chose Polish for this follow-up experiment because it belongs to a different
language family than English, is relatively highly inflected (unlike English), and is
found in a publicly available corpus. We use the processed form of the corpus and
further process it to downcase all the words and remove punctuation. The resulting
data set has approximately 13M words in the training set and 210K words in the test
set. The selected vocabulary has approximately 30K words, consisting of all words that
appeared more than 20 times in the training corpus. Using this vocabulary, we create a
trigram Katz model and prune it to contain 2M n-grams using entropy pruning (Stolcke
2000). We use this pruned model as a baseline. The results are in Figure 8. The trend

yo
i

_
a
_
0
0
4
0
1
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

13 https://www.statmt.org/europarl/v7/pl-en.tgz.

Table 3
Test perplexity of k-gram models obtained by different approaches for the Polish (Europarl)
corpus. We use 32M samples for both WFA-SAMPLEKATZ(·) and WFA-SAMPLEAPPROX(·).

Model                          Test perplexity

Katz                           185.5
WFA-SAMPLEKATZ(·)              189.3
Arisoy et al. (2014)           197.8
Adel et al. (2014)             187.1
WFA-SAMPLEAPPROX(·) (KT)       181.4
WFA-SAMPLEAPPROX(·) (UT)       177.0

is similar to that of the English Broadcast News corpus, with the proposed algorithm
WFA-SAMPLEAPPROX(·) performing better than the other methods. We also compare
the proposed algorithms with other neural approximation algorithms. The comparison
results are shown in Table 3.

5.3 Lower Bounds on Perplexity

Best Approximation for Target Topology. The neural model in Section 5.2 has a perplexity
of 60.5, but the best perplexity for the approximated model is 140.8. Is there a better
approximation algorithm for the given target topology? We place bounds on that next.
Let T be the set of test sentences. The test-set log-perplexity of a model p can be
written as

    (1/|T|) ∑_{x∗∈T} log 1/p(x∗) = ∑_{x∗} p̂t(x∗) log 1/p(x∗)
where p̂t is the empirical distribution of test sentences. Observe that the best model with
topology A can be computed as

    p′A = argmin_{pA∈P(A)} ∑_{x∗} p̂t(x∗) log 1/pA(x∗)

which is the model with topology A that has minimal KL divergence from the test
distribution p̂t. This can be computed using WFA-APPROX. If we use this approach
on the English Broadcast News test set with the 2M n-gram Katz model, the resulting
model has perplexity of 121.1, showing that, under the assumption that the algorithm
finds the global KL divergence minimum, the test perplexity with this topology cannot
be improved beyond 121.1, irrespective of the method.
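As a small aside, the equivalence of the two forms of the objective above is easy to check numerically. The sketch below is illustrative only and not part of the SFst library; prob is a hypothetical callable returning p(x∗) for a test sentence.

```python
# Minimal sketch: compute the test-set log-perplexity both by averaging over
# test sentences and by weighting with the empirical distribution p̂_t.
# Assumptions: sentences are hashable (e.g., strings) and `prob` is a
# hypothetical function returning the model probability of a sentence.
import math
from collections import Counter

def test_log_perplexity(test_sentences, prob):
    """(1/|T|) * sum over x* in T of log 1/p(x*)."""
    total = sum(math.log(1.0 / prob(x)) for x in test_sentences)
    return total / len(test_sentences)

def test_log_perplexity_empirical(test_sentences, prob):
    """Same quantity written as sum over x* of p̂_t(x*) log 1/p(x*)."""
    counts = Counter(test_sentences)
    n = len(test_sentences)
    return sum((c / n) * math.log(1.0 / prob(x)) for x, c in counts.items())
```

Both functions return the same value; the second form makes explicit that minimizing the objective over distributions on a topology A is a KL divergence minimization against p̂t.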

Approximation onto Best Trigram Topology. If we approximate the LSTM onto the best
trigram topology, how well does it perform over the test data? To test this, we build a
trigram model from the test data and approximate the LSTM on the trigram topology.
This approximated model has 11M n-grams and a perplexity of 81. This shows that for
large data sets, the largest shortfall of n-gram models in the approximation is due to the
n-gram topology.

5.4 WFA Sources

While distillation of neural models is an important use case for our algorithms, there are
other scenarios where approximations will be derived from WFA sources. In such cases,
no sampling is required and we can use the algorithms presented in Sections 4.1.1 or
4.1.2, namely, WFA-APPROX. For example, it is not uncommon for applications to place
maximum size restrictions on models, so that existing WFA models are too large and
must be reduced in size prior to use. Section 5.4.1 presents a couple of experiments
focused on character language model size reduction that were motivated by such
size restrictions. Another scenario with WFA sources is when, for privacy reasons, no
language model training corpus is available in a domain, but only minimum-count-
thresholded (i.e., high frequency) word n-gram counts are provided. In Section 5.4.2 we
examine the estimation of open-vocabulary character language models from such data.

5.4.1 Creating Compact Language Models

Creating Compact Models for Infrequent Words. In low-memory applications such as on-
device keyboard decoding (Ouyang et al. 2017), it is often useful to have a character-
level WFA representation of a large set of vocabulary words that act only as unigrams,
for example, those words beyond the 32K words of our trigram model. We explore how
to compactly represent such a unigram-only model.

To demonstrate our approach, we take all the words in the training set (without a
count cutoff) and build a character-level deterministic WFA of those words weighted by
their unigram probabilities. This is represented as a tree rooted at the initial state (a trie).
This automaton has 820K transitions. Storing this many transitions can be prohibitive;
we can reduce the size in two steps.

The first step is to minimize this WFA using weighted minimization (Mohri 2000) to
produce pchar, which has a topology Achar. Although pchar is already much smaller (it has
378K transitions, a 54% reduction), we can go further by approximating onto the min-
imal deterministic unweighted automaton, Minimize(Achar). This gives us a model with
only 283K transitions, a further 25% reduction. Because Minimize(Achar) accepts exactly
the same words as Achar, we are not corrupting our model by adding or removing any
vocabulary items. Instead, we find an estimate that is as close as possible to the original,
but that is constrained to the minimal deterministic representation that preserves the
vocabulary.

To evaluate this approach, we randomly select a 20K sentence subset of the original
test set, and represent each selected string as a character-level sequence. We evaluate us-
ing cross entropy in bits-per-character, common for character-level models. The result-
ing cross entropy for pchar is 1.557 bits-per-character. By comparison, the cross entropy
for pchar approximated onto Minimize(Achar) is 1.560 bits-per-character. In exchange for
this small accuracy loss we are rewarded with a model that is 25% smaller. Wolf-Sonkin
et al. (2019) used the methods presented here to augment the vocabulary of an on-device
keyboard to deal with issues related to a lack of standard orthography.

Creating Compact WFA Language Models. Motivated by the previous experiment, we
also consider applying (unweighted) minimization to Agreedy, the word-based trigram
topology pruned to 1M n-grams described earlier. In Table 4 we show that
applying minimization to Agreedy and then approximating onto the resulting topology
leads to a reduction of 7% in the number of transitions needed to represent the model.
However, the test perplexity also increases somewhat. To control for this, we prune the

Table 4
Test perplexity of models when approximated onto smaller topologies.

Topology                 Test perplexity    # Transitions

Agreedy                  155.7              1.13M
Minimize(Agreedy)        156.4              1.05M
A′greedy                 154.1              1.22M
Minimize(A′greedy)       154.9              1.13M

original model to a 1.08M n-gram topology A′greedy instead of the 1M as before and
apply the same procedure to obtain an approximation on Minimize(A′greedy). We achieve
a 0.4% perplexity reduction compared to the approximation on Agreedy with very nearly
the same number of transitions.

5.4.2 Count Thresholded Data for Privacy. One increasingly common scenario that can
benefit from these algorithms is modeling from frequency thresholded substring counts
rather than raw text. For example, word n-grams and their frequencies may be pro-
vided from certain domains of interest only when they occur within at least k separate
documents. With a sufficiently large k (say 50), no n-gram can be traced to a specific
document, thus providing privacy in the aggregation. This is known as k-anonymity
(Samarati 2001).
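To make this data condition concrete, the following is a small illustrative sketch (not part of our pipeline) of releasing k-anonymous n-gram counts: an n-gram is kept only if it occurs in at least k distinct documents.

```python
# Illustrative sketch of k-anonymous n-gram counting (hypothetical helper, not
# part of the OpenGrm tools): keep an n-gram only if it appears in at least k
# distinct documents, then report its total frequency.
from collections import Counter, defaultdict

def kanon_ngram_counts(documents, n=3, k=50):
    """documents: iterable of token lists. Returns {ngram: total count}."""
    totals = Counter()
    doc_support = defaultdict(set)
    for doc_id, tokens in enumerate(documents):
        for i in range(len(tokens) - n + 1):
            ngram = tuple(tokens[i:i + n])
            totals[ngram] += 1
            doc_support[ngram].add(doc_id)
    return {g: c for g, c in totals.items() if len(doc_support[g]) >= k}
```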

Sin embargo, for any given domain, there are many kinds of models that one may want
to build depending on the task, some of which may be trickier to estimate from such a
collection of word n-gram counts than with standard approaches for estimation from a
given corpus. For example, character n-gram models can be of high utility for tasks like
language identification, and have the benefit of a relatively small memory footprint and
low latency in use. However, character n-gram models might be harder to learn from a
k-anonymized corpus.

Here we will compare open-vocabulary character language models, which accept
all strings in Σ∗ for a character vocabulary Σ, trained in several ways. Each approach
relies on the training corpus and 32k vocabulary, with every out-of-vocabulary word
replaced by a single OOV symbol ⋆. In addition, for each approach we add 50 to
the unigram character count of any printable ASCII character, so that even those that
are unobserved in the words of our 32k vocabulary have some observations. Our three
approaches are:

1. Baseline corpus trained models: We count character 5-grams from the
k-anonymized corpus, then remove all n-grams that include the ⋆ symbol
(in any position) prior to smoothing and normalization. Here we present
both Kneser-Ney and Witten-Bell smoothed models, as both are popular
for character n-gram models.

2. Word trigram sampled model: First we count word trigrams in the
k-anonymized corpus and discard any n-gram with the ⋆ symbol (in any
position) prior to smoothing and normalization. We then sample one
million strings from a Katz smoothed model and build a character 5-gram
model from these strings. We also use this as our target topology for the
next approach.

3. Word trigram KL minimization estimation: We create a source model by
converting the 2M n-gram word trigram model into an open vocabulary
modelo. We do this using a specialized construction related to the
construction presented in Section 6 of Chen et al. (2019), briefly described
below, that converts the word model into a character sequence model. As
this model is still closed vocabulary (see below), we additionally smooth
the unigram distribution with a character trigram model trained from the
words in the symbol table (and including the 50 extra counts for every
printable ASCII character as with the other methods). From this source
modelo, we estimate a model on the sampled character 5-gram topology
from the previous approach, using our KL minimization algorithm.

Converting Word n-gram Source to Character Sequence Source Model. Briefly, for every
state q in the n-gram automaton, the set of words labeling transitions leaving q are
represented as a trie of characters including a final end-of-word symbol. Each resulting
transition labeled with the end-of-word symbol represents the last transition for that
particular word spelled out by that sequence of transitions, hence is assigned the same
next state as the original word transition. If q has a backoff transition pointing to
its backoff state q′, then each new internal state in the character trie backs off to the
corresponding state in the character trie leaving q′. The presence of the corresponding
state in the character trie leaving q′ is guaranteed because the n-gram automaton is
backoff-complete, as discussed in Section 3.3.
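The per-state construction can be pictured with the following schematic sketch, in which nested Python dictionaries stand in for trie states and the end-of-word marker is a hypothetical symbol; the actual construction operates on the ϕ-WFA representation and backoff arcs described above.

```python
# Schematic sketch: build a character trie for the words leaving one n-gram
# state q. Nested dicts stand in for trie states; the end-of-word marker maps
# each completed word to the next state of the original word transition.
EOW = "</w>"  # hypothetical end-of-word symbol

def build_char_trie(word_arcs):
    """word_arcs: dict mapping each word leaving state q to its next state."""
    trie = {}
    for word, next_state in word_arcs.items():
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})   # internal trie state per character
        node[EOW] = next_state               # final arc keeps the original next state
    return trie

# Example: arcs leaving a state over the words "the" and "then".
print(build_char_trie({"the": "q_the", "then": "q_then"}))
```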

As stated above, this construction converts from word sequences to character se-
quences, but will only accept character sequences consisting of strings of in-vocabulary
words, that is, this is still closed vocabulary. To make it open vocabulary, we further back
off the character trie leaving the unigram state to a character n-gram model estimated
from the symbol table (and additional ASCII character observations). This is done using
a very similar construction to that described above. The resulting model is used as the
source model for our KL minimization algorithm, to estimate a distribution over the
sampled character 5-gram topology.

We encode the test set as a sequence of characters, without using the symbol table
because our models are intended to be open vocabulary. Following typical practice for
open-vocabulary settings, we evaluate with bits-per-character. The results are presented
in Table 5. Here we achieve slightly lower bits-per-character than even what we get

Table 5
Comparison of character 5-gram models derived from either the original corpus or a word
trigram model. Size of the models is presented in terms of the number of character n-grams, the
numbers of states and transitions in the automaton representation, and the file size in MB. The
two corpus estimated models have the same topology, hence the same size; as do the two word
trigram estimated models.

Source         n-grams   states    transitions   MB    Estimation         bits/char
               (x1000)   (x1000)   (x1000)

Corpus         336       60        381           6.5   Kneser-Ney         2.04
                                                       Witten-Bell (WB)   2.01

Word trigram   277       56        322           5.6   Sampled (WB)       2.36
                                                       KL min             1.99

straight from the corpus, perhaps due to better regularization of the word-based model
than with either Witten-Bell or Kneser-Ney on the character n-grams.

6. Software Library

All of the algorithms presented here are available in the OpenGrm libraries at http://
www.opengrm.org under the topic: SFST Library: operations to normalize, sample, combine,
and approximate stochastic finite-state transducers. We illustrate the use of this library by
showing how to implement some of the experiments in the previous section.

6.1 Example Data and Models

Instead of Broadcast News, we will use the text of Oscar Wilde’s The Importance of Being
Earnest for our tutorial example. This is a small tutorial corpus that we make available
at http://sfst.opengrm.org.

This corpus of approximately 1,688 sentences and 18,000 words has been upper-
cased and had punctuation removed. The first 850 sentences were used as training data
and the remaining 838 sentences used as test data. From these, we produce two 1,000-
word vocabulary Katz bigram models, the ∼6k n-gram earnest_train.mod and the ∼4k
n-gram earnest_test.mod. We also used relative-entropy pruning to create the ∼2k
n-gram earnest_train.pru. The data, the steps to generate these models, and how to
compute their perplexity using OpenGrm NGram are all fully detailed in the QuickTour
topic at http://sfst.opengrm.org.

6.2 Computing the Approximation

The following step shows how to compute the approximation of a ϕ-WFA model onto
a ϕ-WFA topology. In the example below, the first argument, earnest_train.mod, is the
source model, and the second argument, earnest_train.pru, provides the topology.
The result is a ϕ-WFA whose perplexity can be measured as before.

$ sfstapprox --phi_label=0 earnest_train.mod earnest_train.pru \
    >earnest_train.approx

An alternative, equivalent way to perform this approximation is to break it into two
steps, with the counting and normalization (KL divergence minimization) done
separately.

$ sfstcount --phi_label=0 earnest_train.mod earnest_train.pru \
    >earnest_train.approx_cnts

$ sfstnormalize --method=kl_min --phi_label=0 \
    earnest_train.approx_cnts >earnest_train.approx

We can now use these utilities to run some experiments analogous to our larger
Broadcast News experiments by using different target topologies. The results are shown
in Table 6. We see in the idempotency experiment that the perplexity of the approxi-
mation on the same topology matches the source. In the greedy-pruning experiment,
the approximation onto the greedy-pruned topology yields a better perplexity than the
greedily-pruned model. Finalmente, the approximation onto the test-set bigram topology
gives a better perplexity than the training-set model since we include all the relevant
bigrams.

Table 6
Perplexities for example experiments.

Experiment                           Source Model         Topology             Approx. Perplexity

Idempotency                          earnest_train.mod    earnest_train.mod    73.41
Comparison to greedy pruning         earnest_train.mod    earnest_train.pru    83.12
Approx. onto best bigram topology    earnest_train.mod    earnest_test.pru     69.68

Table 7
Available Operations in the OpenGrm SFst Library.

Operation                        Description

sfstapprox                       Approximates a stochastic ϕ-WFA wrt a backoff-complete ϕ-WFA.
sfstcount                        Counts from a stochastic ϕ-WFA wrt a backoff-complete ϕ-WFA.
sfstintersect                    Intersects two ϕ-WFAs.
sfstnormalize --method=global    Globally normalizes a ϕ-WFA.
sfstnormalize --method=kl_min    Normalizes a count ϕ-WFA using KL divergence minimization.
sfstnormalize --method=local     Locally normalizes a ϕ-WFA.
sfstnormalize --method=phi       ϕ-normalizes a ϕ-WFA.
sfstperplexity                   Computes perplexity of a stochastic ϕ-WFA.
sfstrandgen                      Randomly generates paths from a stochastic ϕ-WFA.
sfstshortestdistance             Computes the shortest distance on a ϕ-WFA.
sfsttrim                         Removes useless states and transitions in a ϕ-WFA.

6.3 Available Operations

Mesa 7 lists the available command-line operations in the OpenGrm SFst library. Nosotros
show command-line utilities here; there are corresponding C++ library functions that
can be called from within a program; see http://sfst.opengrm.org.

7. Summary and Discussion

In this article, we have presented an algorithm for minimizing the KL-divergence
between a probabilistic source model over sequences and a WFA target model. That
our algorithm is general enough to permit source models of arbitrary form (p.ej., RNNs)
and deriving an appropriate WFA topology—something we do not really touch on in
this article—is particularly important.

References
Adel, Heike, Katrin Kirchhoff, Ngoc Thang
Vu, Dominic Telaar, and Tanja Schultz.
2014. Comparing approaches to convert
recurrent neural networks into backoff
language models for efficient decoding. In
Fifteenth Annual Conference of the
International Speech Communication
Association (Interspeech), pages 651–655.
Aho, Alfred V. and Margaret J. Corasick.

1975. Efficient string matching: An aid to
bibliographic search. Communications of the
ACM, 18(6):333–340. https://doi.org/10
.1145/360825.360855

Albert, Jürgen and Jarkko Kari. 2009. Digital
image compression. In M. Droste, W.
Kuich, and H. Vogler, editors, Handbook
of Weighted Automata, Springer,
pages 453–479. https://doi.org/10.1007
/978-3-642-01492-5_11

Allauzen, Cyril, Mehryar Mohri, and Brian
Roark. 2003. Generalized algorithms for
constructing language models. In
Proceedings of ACL, pages 40–47.
https://doi.org/10.3115/1075096
.1075102

Allauzen, Cyril, Michael Riley, Johan

Schalkwyk, Wojciech Skut, and Mehryar
Mohri. 2007. OpenFst Library.
http://www.openfst.org.

Allauzen, Cyril and Michael D. Riley. 2018.
Algorithms for weighted finite automata
with failure transitions. In International
Conference on Implementation and
Application of Automata, pages 46–58.
https://doi.org/10.1007/978-3-319
-94812-6 5

Angluin, Dana. 1987. Learning regular sets

from queries and counterexamples.
Information and Computation, 75(2):87–106.
https://doi.org/10.1016/0890-5401
(87)90052-6

Angluin, Dana. 1988. Identifying languages

from stochastic examples. Technical Report
YALEU/DCS/RR-614, Yale University.

Arisoy, Ebru, Stanley F. Chen, Bhuvana

Ramabhadran, and Abhinav Sethy. 2014.
Converting neural network language
models into back-off language models for
efficient decoding in automatic speech
recognition. IEEE/ACM Transactions on
Audio, Speech and Language Processing
(TASLP), 22(1):184–192. https://doi.org
/10.1109/TASLP.2013.2286919

Balle, Borja, Xavier Carreras, Franco M.
Luque, and Ariadna Quattoni. 2014.
Spectral learning of weighted automata.
Machine Learning, 96(1-2):33–63.
https://doi.org/10.1007/s10994-013
-5416-X

Balle, Borja and Mehryar Mohri. 2012.

Spectral learning of general weighted
automata via constrained matrix
completion. In Advances in Neural
Information Processing Systems,
pages 2159–2167.

Breuel, Thomas M. 2008. The OCRopus open
source OCR system. In Proceedings of
IS&T/SPIE 20th Annual Symposium.
https://doi.org/10.1117/12.783598

Carrasco, Rafael C. 1994. Accurate

computation of the relative entropy
between stochastic regular grammars.
RAIRO-Theoretical Informatics and
Applications, 31(5):437–444.
https://doi.org/10.1051/ita
/1997310504371

Carrasco, Rafael C. and José Oncina. 1994.
Learning stochastic regular grammars by
means of a state merging method. In
International Colloquium on Grammatical
Inference, pages 139–152. https://doi.org
/10.1007/3-540-58473-0_144

Carrasco, Rafael C. and José Oncina. 1999.
Learning deterministic regular grammars
from stochastic samples in polynomial
time. RAIRO-Theoretical Informatics and
Applications, 33(1):1–19. https://doi.org
/10.1051/ita:1999102

Chelba, Ciprian, Thorsten Brants, Will

Neveitt, and Peng Xu. 2010. Study on
interaction between entropy pruning and
Kneser-Ney smoothing. In Eleventh Annual
Conference of the International Speech
Communication Association (Interspeech),
pages 2422–2425.

Chen, Mingqing, Ananda Theertha Suresh,
Rajiv Mathews, Adeline Wong, Françoise
Beaufays, Cyril Allauzen, and Michael
Riley. 2019. Federated learning of N-gram
language models. In Proceedings of the 23rd
Conference on Computational Natural
Language Learning (CoNLL), pages 121–130.
https://doi.org/10.18653/v1/K19-1012

Chen, Stanley and Joshua Goodman. 1998.

An empirical study of smoothing
techniques for language modeling.
TR-10-98, Harvard University.

Cicchello, Orlando and Stefan C. Kremer.

2003. Inducing grammars from sparse data
sets: A survey of algorithms and results.
Journal of Machine Learning Research,
4(Oct):603–632.

Cortes, Corinna, Mehryar Mohri, Ashish

Rastogi, and Michael Riley. 2008. On the
computation of the relative entropy of
probabilistic automata. International Journal
of Foundations of Computer Science,
19(01):219–242. https://doi.org/10.1142
/S0129054108005644

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

yo

a
r
t
i
C
mi

pag
d

F
/

/

/

/

4
7
2
2
2
1
1
9
3
0
9
8
8
/
C
oh

yo
i

_
a
_
0
0
4
0
1
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Suresh et al.

Approximating Probabilistic Models as WFA

Dempster, Arthur P., Nan M. Laird, and
Donald B. Rubin. 1977. Maximum
likelihood from incomplete data via the
EM algorithm. Journal of the Royal Statistical
Society, Series B (Methodological), 39(1):1–38.
https://doi.org/10.1111/j.2517-6161
.1977.tb01600.x

Deoras, Anoop, Tomáš Mikolov, Stefan
Kombrink, Martin Karafiát, and Sanjeev
Khudanpur. 2011. Variational
approximation of long-span language
models for LVCSR. In 2011 IEEE
International Conference on Acoustics, Speech
and Signal Processing (ICASSP),
pages 5532–5535. https://doi.org/10
.1109/ICASSP.2011.5947612

Dupont, Pierre. 1996. Incremental regular
inference. In International Colloquium on
Grammatical Inference, pages 222–237.
https://doi.org/10.1007/BFb0033357

Durbin, Richard, Sean R. Eddy, Anders
Krogh, and Graeme J. Mitchison. 1998.
Biological Sequence Analysis: Probabilistic
Models of Proteins and Nucleic Acids,
Cambridge University Press. https://
doi.org/10.1017/CBO9780511790492
Ebden, Peter and Richard Sproat. 2015. The
Kestrel TTS text normalization system.
Natural Language Engineering,
21(3):333–353. https://doi.org/10
.1017/S1351324914000175

Eisner, Jason. 2001. Expectation semirings:

Flexible EM for learning finite-state
transducers. In Proceedings of the ESSLLI
Workshop on Finite-state Methods in NLP,
pages 1–5.

Giles, C. Lee, Clifford B. Miller, Dong Chen,
Hsing-Hen Chen, Guo-Zheng Sun, and
Yee-Chun Lee. 1992. Learning and
extracting finite state automata with
second-order recurrent neural networks.
Neural Computation, 4(3):393–405.
https://doi.org/10.1162/neco.1992.4
.3.393

Hellsten, Lars, Brian Roark, Prasoon Goyal,
Cyril Allauzen, Françoise Beaufays, Tom
Ouyang, Michael Riley, and David Rybach.
2017. Transliterated mobile keyboard input
via weighted finite-state transducers. In
FSMNLP 2017, pages 10–19. https://
doi.org/10.18653/v1/W17-4002

Horst, Reiner and Nguyen V. Thoai. 1999.
DC programming: Overview. Journal of
Optimization Theory and Applications,
103(1):1–43. https://doi.org/10.1023
/A:1021765131316

Iglesias, Gonzalo, Cyril Allauzen, William
Byrne, Adrià de Gispert, and Michael
Riley. 2011. Hierarchical phrase-based
translation representations. In EMNLP
2011, pages 1373–1383.

Jacobsson, Henrik. 2005. Rule extraction
from recurrent neural networks: A
taxonomy and review. Neural Computation,
17(6):1223–1263. https://doi.org/10
.1162/0899766053630350

Katz, Slava M. 1987. Estimation of
probabilities from sparse data for the
language model component of a speech
recognizer. IEEE Transactions on Acoustics,
Speech, and Signal Processing, 35(3):400–401.
https://doi.org/10.1109/TASSP.1987
.1165125

Koehn, Philipp. 2005. Europarl: A parallel

corpus for statistical machine translation.
In MT Summit, 5:79–86.

Konečný, Jakub, H. Brendan McMahan,
Felix X. Yu, Peter Richtárik, Ananda
Theertha Suresh, and Dave Bacon. 2016.
Federated learning: Strategies for
improving communication efficiency. arXiv
preprint arXiv:1610.05492.

Lecorvé, Gwénolé and Petr Motlicek. 2012.
Conversion of recurrent neural network
language models to weighted finite state
transducers for automatic speech
recognition. In Thirteenth Annual Conference
of the International Speech Communication
Association (Interspeech), pages 1668–1671.

Gold, mi. Marca. 1967. Language identification

McMahan, Brendan, Eider Moore,

in the limit. Information and Control,
10(5):447–474. https://doi.org/10.1016
/S0019-9958(67)91165-5

Gold, mi. Marca. 1978. Complexity of

automaton identification from given data.
Information and Control, 37(3):302–320.
https://doi.org/10.1016/S0019
-9958(78)90562-4

Hard, Andrew, Kanishka Rao, Rajiv

Mathews, Franc¸oise Beaufays, Sean
Augenstein, Hubert Eichner, Chlo´e
Kiddon, and Daniel Ramage. 2018.
Federated learning for mobile keyboard
predicción. arXiv preimpresión arXiv:1811.03604.

Daniel Ramage, Seth Hampson, y
Blaise Aguera y Arcas. 2017.
Communication-efficient learning of deep
networks from decentralized data. En
Artificial Intelligence and Statistics,
pages 1273–1282.

Mohri, Mehryar. 1997. String-matching with
automata. Nordic Journal of Computing,
4(2):217–231.

Mohri, Mehryar. 2000. Minimization

algorithms for sequential transducers.
Theoretical Computer Science,
234(1-2):177–201. https://doi.org
/10.1016/S0304-3975(98)00115-7

Mohri, Mehryar. 2002. Semiring frameworks

and algorithms for shortest-distance
problemas. Journal of Automata, Idiomas
and Combinatorics, 7(3):321–350.

Mohri, Mehryar. 2009. Weighted automata
algorithms. In M. Droste, W. Kuich, and
H. Vogler, editors, Handbook of Weighted
Automata. Springer, pages 213–254.
https://doi.org/10.1007/978-3-642
-01492-5 6

Mohri, Mehryar, Fernando C. N. Pereira, and
Michael Riley. 2008. Speech recognition
with weighted finite-state transducers. In
J. Benesty, M. M. Sondhi, and Y. A. Huang,
editors, Handbook on Speech Processing and
Speech Communication, Springer,
pages 559–584. https://doi.org/10
.1007/978-3-540-49127-9 28

Novak, Josef R., Nobuaki Minematsu, and
Keikichi Hirose. 2013. Failure transitions
for joint n-gram models and G2P
conversion. In Fourteenth Annual Conference
of the International Speech Communication
Association (Interspeech), pages 1821–1825.

Okudono, Takamasa, Masaki Waga, Taro
Sekiyama, and Ichiro Hasuo. 2020.
Weighted automata extraction from
recurrent neural networks via regression
on state spaces. In Proceedings of the AAAI
Conference on Artificial Intelligence,
34:5306–5314. https://doi.org/10
.1609/aaai.v34i04.5977

Oncina, José and Pedro Garcia. 1992.
Identifying regular languages in
polynomial time. In Advances in Structural
and Syntactic Pattern Recognition. World
Scientific, pages 99–108. https://doi.org
/10.1142/9789812797919 0007

Ouyang, Tom, David Rybach, Françoise
Beaufays, and Michael Riley. 2017.
Mobile keyboard input decoding with
finite-state transducers. arXiv preprint
arXiv:1704.03987.

Parekh, Rajesh and Vasant Honavar. 2000.

Grammar inference, automata induction,
and language acquisition. In R. Dale, H.
Moisl, and H. Somers, editors, Handbook of
Natural Language Processing, pages 727–764.

Pitt, Leonard. 1989. Inductive inference,
DFAs, and computational complexity. In
DFAs, and computational complexity. En
International Workshop on Analogical and
Inductive Inference, pages 18–44.
https://doi.org/10.1007/3-540
-51734-0 50

Roark, Brian, Richard Sproat, Cyril Allauzen,
Michael Riley, Jeffrey Sorensen, and Terry
Tai. 2012. The OpenGrm open-source
finite-state grammar software libraries.
Proceedings of the ACL 2012 System
Demonstrations, pages 61–66.

Samarati, Pierangela. 2001. Protecting
respondents’ identities in microdata
release. IEEE Transactions on Knowledge and
Data Engineering, 13(6):1010–1027.
https://doi.org/10.1109/69
.971193

Sriperumbudur, Bharath K. and Gert R. G.
Lanckriet. 2009. On the convergence of the
concave-convex procedure. In Proceedings
of the 22nd International Conference on
Neural Information Processing Systems,
pages 1759–1767.

Stolcke, Andreas. 2000. Entropy-based
pruning of backoff language models.
arXiv preprint cs/0006025.

Sundermeyer, Martin, Ralf Schlüter, and
Hermann Ney. 2012. LSTM neural
networks for language modeling.
Thirteenth Annual Conference of the
International Speech Communication
Association, pages 194–197.

Suresh, Ananda Theertha, Brian Roark,

Michael Riley, and Vlad Schogol. 2019.
Distilling weighted finite automata from
arbitrary probabilistic models. In
Proceedings of the 14th International
Conference on Finite-State Methods and
Natural Language Processing (FSMNLP),
pages 87–97. https://doi.org/10
.18653/v1/W19-3112

Tiño, Peter and Vladimir Vojtek. 1997.
Extracting stochastic machines from
recurrent neural networks trained on
complex symbolic sequences. In
Proceedings of the First International
Conference on Knowledge-Based Intelligent
Electronic Systems, volume 2,
pages 551–558, IEEE.

Weiss, Gail, Yoav Goldberg, and Eran Yahav.
2018. Extracting automata from recurrent
neural networks using queries and
counterexamples. In International
Conference on Machine Learning,
pages 5244–5253.

Weiss, Gail, Yoav Goldberg, and Eran Yahav.
2019. Learning deterministic weighted
automata with queries and
counterexamples. In Advances in Neural
Information Processing Systems,
pages 8560–8571.

Wolf-Sonkin, Lawrence, Vlad Schogol, Brian
Roark, and Michael Riley. 2019. Latin script
keyboards for South Asian languages with
finite-state normalization. In Proceedings of
the 14th International Conference on
Finite-State Methods and Natural Language
Processing (FSMNLP), pages 108–117.
https://doi.org/10.18653/v1/W19
-3114

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

yo

a
r
t
i
C
mi

pag
d

F
/

/

/

/

4
7
2
2
2
1
1
9
3
0
9
8
8
/
C
oh

yo
i

_
a
_
0
0
4
0
1
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
Descargar PDF