LETTER
Communicated by Alberto Fachechi
Full-Span Log-Linear Model and Fast Learning Algorithm
Kazuya Takabatake
k.takabatake@aist.go.jp
Shotaro Akaho
s.akaho@aist.go.jp
Human Informatics and Interaction Research Institute, National Institute
of Advanced Industrial Science and Technology, Tsukuba, 305-8568, Japan
| . . . |Xn−1
The full-span log-linear (FSLL) model introduced in this letter is con-
sidered an nth order Boltzmann machine, where n is the number of all
variables in the target system. Let X = (X0
, . . . , Xn−1) be finite discrete
random variables that can take |X| = |X0
| different values. The
FSLL model has |X| − 1 parameters and can represent arbitrary positive
distributions of X. The FSLL model is a highest-order Boltzmann ma-
chine; nevertheless, we can compute the dual parameter of the model
distribution, which plays important roles in exponential families in
O(|X| log |X|) time. Furthermore, using properties of the dual parameters
of the FSLL model, we can construct an efficient learning algorithm. The
FSLL model is limited to small probabilistic models up to |X| ≈ 225; how-
ever, in this problem domain, the FSLL model flexibly fits various true
distributions underlying the training data without any hyperparameter
tuning. The experiments showed that the FSLL successfully learned six
training data sets such that |X| = 220 within 1 minute with a laptop PC.
1 Introduction
The main purpose of this letter is as follows:
• To introduce the full-span log-linear (FSLL) model and a fast learning
algorithm
• To demonstrate the performance of the FSLL model with experiments
Boltzmann machines (Ackley, Hinton, & Sejnowski, 1985) are multi-
variate probabilistic models that are widely used in the field of machine
learning. Here, we consider a Boltzmann machine with n binary variables
X = (X0
= 0, 1). We address fully connected Boltzmann ma-
chines with no hidden variables and no temperature parameter. A Boltz-
mann machine represents the following distribution pθ , which we refer to
as the model distribution:
, . . . , Xn−1)(Xi
Neural Computation 34, 1398–1424 (2022) © 2022 Massachusetts Institute of Technology.
https://doi.org/10.1162/neco_a_01496
Published under a Creative Commons
Attribution 4.0 International (CC BY 4.0) license.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
3
4
6
1
3
9
8
2
0
2
3
4
6
6
n
e
c
o
_
a
_
0
1
4
9
6
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Full-Span Log-Linear Model and Fast Learning Algorithm
1399
pθ (x) = Z(θ )
−1elθ (x),
∈ {0, 1}, Z(θ ) =
xi
lθ (x) =
(cid:2)
0≤i< j
1400
K. Takabatake and S. Akaho
The FSLL model introduced in this letter can be considered an nth order
Boltzmann machine,2 where n is the number of all variables in the target
system. The FSLL model has |X| − 1 parameters and can represent arbitrary
positive distributions. Since the FSLL model is a highest-order Boltzmann
machine, the learning of FSLL is expected to be very slow. However, we
propose a fast learning algorithm. For example, this algorithm can learn a
joint distribution of 20 binary variables within 1 minute with a laptop PC.
Since the FSLL model has full degrees of freedom, a regularization mech-
anism to avoid overfitting is essential. For this purpose, we used a regu-
larization mechanism based on the minimum description length principle
(Rissanen, 2007).
The remainder of this letter is organized as follows. In section 2, we
present the FSLL model and its fast learning algorithm. In section 3, we
demonstrate the performance of the FSLL model by experiment. In section
4, we discuss the advantages and disadvantages of the FSLL model. In sec-
tion 5, we present the conclusions and extensions of the letter.
2 Full-Span Log-Linear Model
Before introducing the FSLL model, we define the notations used in this let-
ter. A random variable is denoted by a capital letter, such as X, and the value
that X takes is indicated by a lowercase letter, such as x. X also denotes the
set of values that the variable X can take; thus, |X| denotes the number of
values that X can take. (cid:8) f (cid:9)p denotes the expectation of f (X ) with distribu-
tion p(X ), that is, (cid:8) f (cid:9)p =
x∈X p(x) f (x). The differential operator ∂/∂θy is
abbreviated as ∂y. P denotes the set of all distributions of X, and P+ de-
notes all positive distributions of X.
(cid:5)
2.1 Model Distribution. The FSLL model is a multivariate probabilistic
, . . . , Xn−1,
|). The FSLL model has parameters
∈ Xi. The model
model designed for a system that has n discrete finite variables X0
where Xi takes an integer value in [0, |Xi
θ = {θy}, where y = (y0
distribution of the FSLL model is the following pθ :
(cid:2)
, . . . , yn−1) is a vector such that yi
pθ (x) = Z(θ )
−1elθ (x),
xi
∈ Xi
, Z(θ ) =
elθ (x),
lθ (x) =
(cid:2)
y∈X
θy(cid:4)y(x), (cid:4)y(x) =
x∈X
n−1(cid:8)
i=0
yi (xi).
φi
(2.1)
In equation 2.1, {φi
| linearly independent functions of Xi,
yi
which we refer to as the local basis functions, and {(cid:4)y}(y ∈ X ) are |X| func-
tions of X = (X0
, . . . , Xn−1), which we refer to as the (global) basis functions.
∈ Xi) are |Xi
}(yi
2
Furthermore, the FSLL model is not limited to binary variables.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
3
4
6
1
3
9
8
2
0
2
3
4
6
6
n
e
c
o
_
a
_
0
1
4
9
6
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Full-Span Log-Linear Model and Fast Learning Algorithm
1401
Using the following theorem recursively, we can prove that the global basis
functions are linearly independent functions of X.
Theorem 1. If { fi
}( j ∈
J) are linearly independent functions of X1, then { fi(X0)g j(X1)}(i ∈ I, j ∈ J) are
, X1).
linearly independent functions of (X0
}(i ∈ I) are linearly independent functions of X0, and {g j
The proof is provided in the appendix.
In the FSLL model, we determine the local basis functions as follows:
Case |Xi| = 2k:
(cid:9)
H1
= (1), H2k+1 =
H2k H2k
H2k −H2k
| − 1)
0(|Xi
φi
…
|−1(|Xi
| − 1)
|Xi
. . .
. . . φi
⎛
⎜
⎜
⎝
φi
0(0)
…
|−1(0)
|Xi
φi
(cid:10)
,
⎞
⎟
⎟
⎠ = H2k
(2.2)
H2k above is referred to as the Walsh–Hadamard matrix (Pratt, Andrews, &
Kane, 1969).
Else:
φi
0
≡ 1,
j(l) =
φi
(cid:17)
(cid:18)
1
−1
(0 < j = l)
(0 < j (cid:11)= l)
(2.3)
Since (cid:4)
≡ 1, an arbitrary θ
0
fore, we determine as θ
≡ 0.
0
0 gives the same model distribution. There-
2.2 Learning Algorithm. Algorithm 1 presents the outline of the learn-
ing algorithm of the FSLL model. This algorithm is a greedy search to find
a local minimum point of cost(θ ), which monotonically decreases as the it-
eration progresses.
In line 7 of the algorithm, candidate denotes θ (cid:12)
derived from θ by applying
one of the following modifications on the yth component:
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
4
6
1
3
9
8
2
0
2
3
4
6
6
n
e
c
o
_
a
_
0
1
4
9
6
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Candidate derived by appending θy: If θy = 0, then let θ (cid:12)
Candidate derived by adjusting θy: If θy (cid:11)= 0, then let θ (cid:12)
= arg minθy
= arg minθy
y
y
cost(θ ).
cost(θ ).
Candidate derived by removing θy: If θy (cid:11)= 0, then let θ (cid:12)
= 0.
y
2.2.1 Cost Function. We use a cost function based on the minimum
description length principle (Rissanen, 2007). Suppose that we send the
1402
K. Takabatake and S. Akaho
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
4
6
1
3
9
8
2
0
2
3
4
6
6
n
e
c
o
_
a
_
0
1
4
9
6
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
training data to a receiver by transmitting the parameters and compressed
data. Since θ is a sparse vector, we transmit only indexes y, such that θy (cid:11)= 0,
and the value of θy. Moreover, the index y = (y0
, . . . , yn−1) is a sparse vector
because higher-order basis functions are rarely used in the model distribu-
tion due to their expensive cost. Therefore, we transmit only indexes i, such
|) to transmit the sparse vector y.
∈ [1, |Xi
that yi
Then the description length3 to transmit the sparse vector y becomes
(cid:11)= 0, and the values of yi
(cid:2)
i:yi
(cid:11)=0
ln n(cid:19)(cid:20)(cid:21)(cid:22)
to send i∈[0,n)
| − 1)
+ ln(|Xi
(cid:19)
(cid:22)
(cid:20)(cid:21)
|)
∈[1,|Xi
to send yi
=
(cid:2)
i:yi
(cid:11)=0
ln n(|Xi
| − 1),
and the description length to transmit all index vectors y, such that θy (cid:11)= 0,
becomes
(cid:2)
(cid:2)
ln n(|Xi
| − 1).
y:θy(cid:11)=0
i:yi
(cid:11)=0
The minimum description length to transmit k parameters and the com-
pressed data is estimated as (Rissanen, 2007)
3
We use “nat” as the description length unit; thus, we use “ln” instead of “log.”
i:yi
(cid:11)=0
(cid:3)
ln pd
(cid:4)
Full-Span Log-Linear Model and Fast Learning Algorithm
1403
(cid:4)
(cid:3)
ln pθ
−N
+ k
2
pd
ln N,
where k is the number of nonzero parameters and N is the number of sam-
ples in the training data. The total description length to transmit y, θy and
the compressed data becomes
(cid:4)
(cid:3)
ln pθ
−N
+ k
2
pd
ln N +
(cid:2)
(cid:2)
ln n(|Xi
| − 1)
= −N
(cid:4)
(cid:3)
ln pθ
+
pd
i:yi
y:θy(cid:11)=0
⎛
⎝ ln N
2
(cid:2)
y:θy(cid:11)=0
(cid:11)=0
+
(cid:2)
⎞
ln n(|Xi
| − 1)
⎠ .
(2.4)
We divide equation 2.4 by N and add
geometric quantity and obtain the following cost function:
pd
to create information-
cost(θ , N) = KL(pd
(cid:5)pθ ) + r(θ , N),
r(θ , N) =
ry(N) = 1
N
⎛
⎝ ln N
2
(cid:2)
+
i:yi
(cid:11)=0
⎞
ln n(|Xi
| − 1)
⎠ .
(cid:2)
y:θy(cid:11)=0
ry(N),
(2.5)
2.2.2 Fast Algorithm to Compute ¯θ . The following vector ¯θ , referred to as
the dual parameter of θ , plays important roles in the exponential families
(Amari, 2016):
¯θy(θ ) = ∂y ln Z(θ )
(cid:3)
(cid:4)y
=
(cid:4)
pθ
(see appendix for derivation).
(2.6)
We identified an algorithm to compute ¯θ from pθ in O(|X| log |X|) time.
This algorithm borrowed ideas from the multidimensional discrete Fourier
transform (DFT) (Smith, 2010). Here, we consider a two-dimensional
(2D-)DFT4 for |X0
| pixels of data. The 2D-DFT transforms f into F
by the following equation:
| × |X1
F(y0
, y1) =
(cid:2)
x0
,x1
f (x0
, x1)(cid:4)y0y1 (x0
, x1),
(cid:23)
2πi
|X0
| y0x0
(cid:24)
(cid:23)
exp
(cid:24)
.
2πi
|X1
| y1x1
(cid:4)y0y1 (x0
, x1) = exp
4
Not 2D-FFT but 2D-DFT.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
4
6
1
3
9
8
2
0
2
3
4
6
6
n
e
c
o
_
a
_
0
1
4
9
6
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
1404
K. Takabatake and S. Akaho
To derive the value of F(y0
quired; therefore, it appears that O(|X0
F(y0
ing dimension-by-dimension manner:
|) time is re-
||X1
|2) time is required to derive
, y1. However, 2D-DFT is usually realized in the follow-
, y1) for a specific (y0
, y1) for all y0
, y1), O(|X0
|2|X1
F(y0
, y1) =
(cid:9)
(cid:2)
(cid:2)
x1
(cid:19)
x0
(cid:19)
f (x0
, x1) exp
(cid:20)(cid:21)
(cid:23)
2πi
|X0
| y0x0
(cid:24)(cid:10)
(cid:23)
exp
(cid:22)
2πi
|X1
| y1x1
DFT by X0
(cid:20)(cid:21)
DFT by X1
(2.7)
(cid:24)
.
(cid:22)
In equation 2.7, the DFT by X0 for all (y0
time, and the DFT by X1 for all y0
time. Therefore, the entire 2D DFT requires O(|X0
that is smaller than O(|X0
(cid:4)y0y1 (x0
|)
||X1
|2)
||X1
|)) time
||X1
|2). The key here is that the basis function
× X1 requires O(|X0
∈ X1 requires O(|X0
|(|X0
, x1) is a product of two univariate functions as follows:
, x1) ∈ X0
, y1
∈ X0
| + |X1
|2|X1
(cid:4)y0y1 (x0
, x1) = φ0
y0 (x0) = exp
φ0
(cid:23)
y0 (x0)φ1
y1 (x1),
(cid:24)
2πi
|X0
| y0x0
,
y1 (x1) = exp
φ1
(cid:23)
(cid:24)
.
2πi
|X1
| y1x1
We apply this principle to compute ¯θ .
Here, we consider how to derive ¯θ from pθ by the following equation:
(cid:2)
x
(cid:2)
(cid:4)
(cid:3)
(cid:4)y
pθ
=
=
=
pθ (x)(cid:4)y(x)
(cid:8)
pθ (x)
φi
yi (xi)
(cid:9)
(cid:2)
(cid:9)
i
(cid:2)
x
(cid:2)
(cid:9)
. . .
(cid:10)
(cid:10)
(cid:10)
pθ (x)φ0
y0 (x0)
φ1
y1 (x1)
. . .
φn−1
yn−1 (xn−1).
(2.8)
xn−1
x1
x0
We evaluate the right side of equation 2.8 from the innermost parenthesis
to the outermost parenthesis; that is, we determine the following function,
gi : X → R by the following recurrence sequence:
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
4
6
1
3
9
8
2
0
2
3
4
6
6
n
e
c
o
_
a
_
0
1
4
9
6
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
g0(x) = pθ (x),
gi+1(y0
(cid:2)
=
, . . . , yi−1
gi(y0
xi
, yi
, xi+1
, . . . , xn−1)
, . . . , yi−1
, xi
, xi+1
, . . . , xn−1)φi
yi (xi).
(2.9)
Full-Span Log-Linear Model and Fast Learning Algorithm
1405
|) time is required to ob-
Then the equation ¯θy = gn(y) holds. Since O(|X||Xi
tain gi+1 from gi, we can obtain gn from g0 in O(|X|
|) time. In the
case where |Xi
| = k, the computational cost to derive gn from g0 becomes
O(|X|kn). Moreover, considering k as a constant, we obtain the computa-
tional cost as O(|X|n) = O(|X| log |X|).
|Xi
(cid:5)
i
We can use the same algorithm to obtain the vector ¯d from pd. In this
case, let g0 = pd.
| is
2.2.3 Acceleration by Walsh-Hadamard Transform. In cases where |Xi
large, we can accelerate the computation of ¯θ by using Walsh-Hadamard
transform (WHT) (Fino & Algazi, 1976).
Let us recall the 2D-DFT in section 2.2.2. If |X0
| are powers of two,
we can use the fast Fourier transform (FFT) (Smith, 2010) for DFT by X0
and DFT by X1 in equation 2.7 and can reduce the computational cost to
O(|X0
|). We can apply this principle to computing ¯θ .
, xi+1
Here, we fix the values of y0
|−1(only the ith com-
, . . . , yi−1
| log |X0
, . . . , x|Xi
|, |X1
||X1
||X1
ponent is omitted) in equation 2.9. Then we obtain
(cid:2)
gi+1(. . . , yi
, . . .) =
i
xi
gi(. . . , xi
i
, . . .)φi
yi (xi).
Using matrix notation, we obtain
⎛
⎜
⎜
⎜
⎝
⎞
⎟
⎟
⎟
⎠
gi+1(. . . , 0
, . . .)
i
, . . .)
| − 1
i
. . .
...
gi+1(. . . , |Xi
⎛
⎜
⎜
⎝
=
φi
0(0)
...
|−1(0)
|Xi
φi
0(|Xi
| − 1)
φi
...
|−1(|Xi
|Xi
| − 1)
. . . φi
⎞
⎟
⎟
⎠
⎛
⎜
⎜
⎜
⎝
, . . .)
gi(. . . , 0
i
...
| − 1
gi(. . . , |Xi
i
, . . .)
⎞
⎟
⎟
⎟
⎠
.
(2.10)
|
| → R|Xi
We refer to this transform R|Xi
as the local transform. The local trans-
form usually requires O(|Xi
|2) time; however, if the matrix in equation 2.10
is a Walsh-Hadamard matrix (see equation 2.2), using Walsh-Hadamard
transform (WHT) (Fino & Algazi, 1976), we can perform the local transform
in O(|Xi
|) time.5
| log |Xi
Since the number of all combinations of y0
(the ith component is omitted) is |X|/|Xi
, . . . , xn−1
|, we can obtain the function
, . . . , yi−1
, xi+1
5
For k ≤ 4, directly multiplying the Hadamard matrix is faster than the WHT in our
environment; therefore, we use the WHT only in cases where k ≥ 5.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
4
6
1
3
9
8
2
0
2
3
4
6
6
n
e
c
o
_
a
_
0
1
4
9
6
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
1406
K. Takabatake and S. Akaho
(cid:5)
gi+1 from gi in O(|X| log |Xi
O(
|X| log |Xi
|) = O(|X| log |X|) time.
i
|) time. Moreover, gn is obtained from g0 in
2.2.4 O(1) Algorithm to Evaluate Candidate. In line 7 of algorithm 1, all
three types of candidates—appending, adjusting, and removing—are eval-
uated for all y ∈ X. Here, we consider the following equation:
∂y ¯θy =
(cid:4)
(cid:3)
(cid:4)2
y
pθ
= 1 − ¯θ 2
y
− ¯θ 2
y
(see the appendix for the derivation)
(∵ (cid:4)2
y
≡ 1).
(2.11)
(2.12)
Considering ¯θy as a univariate function of θy and equation 2.12 as an ordi-
nary differential equation, we obtain the following general solution:
¯θy(θy) = tanh(θy − c)
(see the appendix for the derivation),
(2.13)
where c is a constant determined by a boundary condition. For example, if
¯θy(θ 0
y and equation 2.13 becomes
y , then c is given by c = θ 0
y ) = ¯θ 0
− tanh
−1 ¯θ 0
y
¯θy(θy) = tanh(θy − θ 0
y
+ tanh
−1 ¯θ 0
y ).
Here, we define the follwing line Ay(θ 0) in P:
Ay(θ 0) = {θ |∀y
(cid:12) (cid:11)= y, θy = θ 0
y
}.
(2.14)
(2.15)
Equation 2.14 demonstrates that if ¯θ 0
point on the line Ay(θ 0) in O(1) time.
Here, the gradient vector of KL(pd
(cid:5)pθ ) is given by
y is known, we can derive any ¯θy at a
∂yKL(pd
(cid:3)
(cid:4)y
¯dy =
(cid:5)pθ ) = ¯θy − ¯dy,
(cid:4)
(see the appendix for the derivation).
(2.16)
pd
Therefore, for θ ∈ Ay(θ 0), we can obtain KL(pd
grating equation 2.16 as follows:
(cid:5)pθ ) − KL(pd
(cid:5)pθ 0 ) by inte-
KL(pd
(cid:5)pθ ) − KL(pd
(cid:5)pθ 0 ) =
(cid:25) θy
θ 0
y
¯θy(u) − ¯dy du
= 1 + ¯dy
2
ln
1 + ¯θ 0
y
1 + ¯θy
+ 1 − ¯dy
2
1 − ¯θ 0
y
1 − ¯θy
ln
(2.17)
(see the appendix for the derivation).
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
4
6
1
3
9
8
2
0
2
3
4
6
6
n
e
c
o
_
a
_
0
1
4
9
6
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Full-Span Log-Linear Model and Fast Learning Algorithm
1407
Then we can evaluate (cid:7)(θ (cid:12)
dates θ (cid:12)
as follows:
) = cost(θ (cid:12)
) − cost(θ 0) for three types of candi-
Appending θy: By equation 2.16, KL(pd
(cid:5)pθ ) is minimized when ¯θy = ¯dy.
Therefore,
(cid:7) = 1 + ¯dy
2
ln
1 + ¯θ 0
y
1 + ¯dy
+ 1 − ¯dy
2
ln
1 − ¯θ 0
y
1 − ¯dy
+ ry.
(2.18)
Adjusting θy: By equation 2.16, KL(pd
(cid:5)pθ ) is minimized when ¯θy = ¯dy.
Therefore,
(cid:7) = 1 + ¯dy
2
ln
1 + ¯θ 0
y
1 + ¯dy
+ 1 − ¯dy
2
ln
1 − ¯θ 0
y
1 − ¯dy
.
Removing θy: θy becomes 0. Therefore,
(cid:7) = 1 + ¯dy
2
1 + ¯θ 0
y
1 + ¯θy(0)
+ 1 − ¯dy
2
ln
1 − ¯θ 0
y
1 − ¯θy(0)
ln
− ry.
Among the three types of candidates—appending, adjusting, and
removing—the group of candidates by appending θy has almost |X| candi-
dates because θy is a very sparse vector. Therefore, it is important to reduce
the computational cost to evaluate candidates by appending. Evaluating (cid:7)
in equation 2.18 is an expensive task for a central processing unit (CPU)
because it involves logarithm computation.6
(cid:7) in equation 2.18 has the following lower bound (cid:7):
(cid:7) ≥ (cid:7)
= −
( ¯θ 0
y
1 − ( ¯θ 0
− ¯dy)2
y )2
+ ry
(see the appendix for the derivation).
(2.19)
Evaluating (cid:7) is much faster than evaluating (cid:7). In the candidate evaluation,
the candidate with the lower cost is the “winner.” If a candidate’s lower
bound is greater than the champion’s cost—the lowest cost ever found—
the candidate has no chance to win; therefore, we can discard the candidate
without precise evaluation of (cid:7).
Algorithm 2 presents the details of line 7 of algorithm 1. In line 5, if
y) ≥ (cid:7)θ 1 , the candidate θ (cid:12)
y) is
is discarded and the evaluation of (cid:7)(θ (cid:12)
(cid:7)(θ (cid:12)
6
Logarithm computation is 30 times slower than addition or multiplication in our
environment.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
4
6
1
3
9
8
2
0
2
3
4
6
6
n
e
c
o
_
a
_
0
1
4
9
6
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
1408
K. Takabatake and S. Akaho
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
4
6
1
3
9
8
2
0
2
3
4
6
6
n
e
c
o
_
a
_
0
1
4
9
6
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
(2.20)
skipped. This skipping effectively reduces the computational cost of evalu-
ating candidates.7
2.2.5 Updating θ and pθ . Let θ 0 denote θ before updating and θ 1 denote
θ after updating in line 10 of algorithm 1. The θ 1 differs from θ 0 only at the
y(cid:12)
th component. Therefore,
lθ 1 (x) = lθ 0 (x) + (θ 1
y(cid:12) − θ 0
y(cid:12) )(cid:4)y(cid:12) (x)
and
pθ 1 (x) = Z(θ 1)
= Z(θ 1)
= Z(θ 0)
Z(θ 1)
−1 exp lθ 1 (x)
(cid:6)
−1 exp
lθ 0 (x) + (θ 1
(cid:6)
pθ 0 (x) exp
(θ 1
y(cid:12) − θ 0
y(cid:12) − θ 0
(cid:7)
y(cid:12) )(cid:4)y(cid:12) (x)
(cid:7)
y(cid:12) )(cid:4)y(cid:12) (x)
.
Summing equation 2.20 for all x ∈ X, we obtain
1 = Z(θ 0)
Z(θ 1)
(cid:2)
x
(cid:6)
pθ 0 (x) exp
(θ 1
y(cid:12) − θ 0
(cid:7)
.
y(cid:12) )(cid:4)y(cid:12) (x)
7
In our environment, this skipping makes the evaluation of candidates more than 10
times faster.
Full-Span Log-Linear Model and Fast Learning Algorithm
1409
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
,
Therefore,
(cid:6)
pθ 0 (x) exp
(θ 1
y(cid:12) − θ 0
(cid:7)
y(cid:12) )(cid:4)y(cid:12) (x)
pθ 1 (x) = Z(θ 0)
Z(θ 1)
(cid:9)
(cid:2)
Z(θ 0)
Z(θ 1)
=
pθ 0 (x) exp
x
(cid:6)
(θ 1
y(cid:12) − θ 0
(cid:7)
y(cid:12) )(cid:4)y(cid:12) (x)
(cid:10)−1
.
(2.21)
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
Algorithm 3 presents the details of lines 10 and 11 in algorithm 1. This
algorithm requires O(|X|) time. It should be noted that the expensive expo-
(cid:27)
y(cid:12) )(cid:4)y(cid:12) (x)
nential computation exp
is not used in the for-loop.
y(cid:12) − θ 0
(θ 1
(cid:26)
2.2.6 Memory Requirements. In the FSLL model, most of the memory con-
sumption is dominated by four large tables for pθ , ¯θ , ¯d, and r, and each stores
|X| floating point numbers. However, θ does not require a large amount of
memory because it is a sparse vector.
For example, if the FSLL model is allowed to use 4 GB of memory, it can
| = 3), and 13
handle up to 26 binary variables, 16 three-valued variables(|Xi
four-valued variables.
2.3 Convergence to Target Distribution. One major point of interest
about algorithm 1 is whether the model distribution converges to a target
distribution at the limit of N → ∞ or not, where N is the number of samples
in the training data. As a result, we can guarantee this convergence.
Let us define the following symbols:
A(θ t ) = ∪yAy(θ t ) // Ay is defined by equation 2.15,
my(θ t, N) = min
θ ∈Ay(θ t )
m(θ t, N) = min
θ ∈A(θ t )
fN(θ ) // minimum in line Ay
fN(θ ) // minimum in all lines Ay
= min
y
my(θ t, N),
(2.22)
f
/
/
/
/
3
4
6
1
3
9
8
2
0
2
3
4
6
6
n
e
c
o
_
a
_
0
1
4
9
6
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
1410
K. Takabatake and S. Akaho
Ymin
= {y|my(θ t, N) = m(θ , N)},
arg my(θ t, N) = {θ |θ ∈ Ay(θ t ), fN(θ ) = my(θ t, N)},
arg m(θ t, N) = {θ |θ ∈ A(θ t ), fN(θ ) = m(θ t, N)},
= ∪
y∈Ymin arg my(θ t, N).
(2.23)
Then, algorithm 1 iteratively selects θ t+1 from arg m(θ t, N).
We first consider the case where the cost function is a continuous
function of θ with no regularization term. Here, if and only if
f (θ ) =
minθ (cid:12)∈A(θ ) f (θ (cid:12)), we say that θ is an axis minimum of f . Then the following
theorem holds (Beck, 2015):
f : R|X| → R be a continuous cost function such that B =
Theorem 2. Let
{θ | f (θ ) ≤ f (θ 0)} is a bounded close set. Then any accumulation point of {θ t}(t ∈
[0, ∞)) is an axis minimum of f .
The proof is provided in the appendix. Here, if f (θ ) has an unique axis
minimum at θ = a, the following corollary is derived from theorem 2:
Corollary 1. Let f be a function satisfying the condition in theorem 2. If f has
a unique axis minimum at θ = θ
min, then θ
min is also the global minimum and
limt→∞ θ t = θ
min.
The proof is provided in the appendix.
Let q be a positive distribution. By corollary 1, in the case where the cost
function is f (θ ) = KL(q(cid:5)pθ ), where q is a positive distribution,8 the equation
limt→∞ KL(q(cid:5)pθ t ) = 0 holds.
Then we extend the cost function to fN(θ ) = f (θ ) + r(θ , N), where f :
R|X| → R is a continuous function having the unique global minimum at
θ = θ
min, and r(θ , N) is a regularization term in equation 2.5. The following
theorem holds:
Theorem 3. Let f be a function satisfying the conition in theorem 2. Then
lim
N→∞
lim
t→∞
f (θ t (N)) = min
θ
f (θ )
holds.
Here, do not confuse limN→∞ limt→∞ f (θ t (N)) with limt→∞ limN→∞
f (θ t (N)).
lim
t→∞ lim
N→∞
f (θ t (N)) = min
θ
f (θ )
8
Positivity is needed to keep B bounded.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
4
6
1
3
9
8
2
0
2
3
4
6
6
n
e
c
o
_
a
_
0
1
4
9
6
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Full-Span Log-Linear Model and Fast Learning Algorithm
1411
is trivial by corollary 1, while
lim
N→∞
lim
t→∞
f (θ t (N)) = min
θ
f (θ )
needs a proof. The proof is provided in the appendix. In the case where
|pθ ),
pd is a positive distribution for sufficiently large N and f (θ ) = KL(pd
theorem 3 guarantees
lim
N→∞
lim
t→∞
KL(pd
(cid:5)pθ t (N)) = 0.
3 Experiments
In this section, to demonstrate the performance of the FSLL model, we com-
pare a full-span log-linear model that we refer to as FL with two Boltzmann
machines that we refer to as BM-DI and BM-PCD.
3.1 Full-Span Log-Linear Model FL. FL is a full-span log-linear model
that has been described in section 2. The model distribution of FL is given
by equation 2.1. The cost function is given by equation 2.5. The learning
algorithm is algorithm 1 described in section 2.2. The threshold to finish the
cost minimization is determined as (cid:8) = 10
−4.
3.2 Boltzmann Machine BM-DI. BM-DI (Boltzmann machine with di-
rect integration) is a fully connected Boltzmann machine having no hidden
variables and no temperature parameter. To examine the ideal performance
of the Boltzmann machine, we do not use the Monte Carlo approximation
in BM-DI to evaluate equation 1.2. The model distribution of BM-DI is given
(cid:5)pθ ) and has no regu-
by equation 1.1. The cost function of BM-DI is KL(pd
larization term.
To minimize the cost function with fewer evaluations of the gradient
vector, we used a pseudo-Newton method, the Broyden-Fletcher-Goldfarb-
Shanno (BFGS) algorithm (Nocedal & Wright, 2006), implemented in the
Java statistical analysis tool (JSAT; Raff, 2017).
3.3 Boltzmann Machine BM-PCD. BM-PCD (Boltzman machine with
persistent contrastive divergence method) is similar to BM-DI; however,
BM-PCD uses the persistent contrastive divergence method (Tieleman,
2008), a popular Monte Carlo method in Boltzmann machine learning. BM-
PCD has some hyperparameters. We tested various combinations of these
hyperparameters and determined them as learning rate = 0.01, number of
Markov chains = 100, and length of Markov chains = 10,000.
3.4 Training Data. We prepared six training data sets. These data sets
are artificial; therefore, their true distributions p∗(X ) are known. Each is an
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
4
6
1
3
9
8
2
0
2
3
4
6
6
n
e
c
o
_
a
_
0
1
4
9
6
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
1412
K. Takabatake and S. Akaho
Figure 1: Graphical structure of Ising5 × 4S/L.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
4
6
1
3
9
8
2
0
2
3
4
6
6
n
e
c
o
_
a
_
0
1
4
9
6
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Figure 2: Graphical structure of BN20-37S/L.
independent and identically distributed (i.i.d.) data set drawn from its true
distribution.
3.4.1 Ising5 × 4S, Ising5 × 4L. These data sets were drawn from the dis-
tribution represented by the 2D Ising model (Newman & Barkema, 1999)
with 5 × 4 nodes (see Figure 1).
Every Xi takes the value 0 or 1. Ising5 × 4S has 1000 samples, while
Ising5 × 4L has 100,000 samples. The true distribution is represented as
⎞
⎞
⎛
⎝ 1
2
(cid:18)
p∗(x) = Z
−1 exp
(cid:17)
=
si
1
−1
xi
xi
= 1
= 0
⎠ , Z =
sis j
(cid:2)
x
exp
(cid:2)
(cid:8)i, j(cid:9)
,
⎛
⎝ 1
2
(cid:2)
(cid:8)i, j(cid:9)
⎠ ,
sis j
where (cid:8)i, j(cid:9) denotes the adjacent variables in Figure 1. Boltzmann machines
can represent p∗.
3.4.2 BN20-37S, BN20-37L. These data sets were drawn from the distri-
bution represented by the Bayesian network with 20 nodes and 37 edges
(see Figure 2).
X0 has no parents, X1 has X0 as a parent, and the other Xi(i ≥ 2) individ-
}. The graphical structure and
, . . . , Xi−1
ually has two parents Yi0
∈ {X0
, Yi1
Full-Span Log-Linear Model and Fast Learning Algorithm
1413
contents of the conditional distribution table of each node are determined
randomly. Every Xi takes the value 0 or 1. BN20-37S has 1000 samples, while
BN20-37L has 100,000 samples. The true distribution is represented as
p∗(x) =
(cid:8)
i
p∗(xi
|yi0
, yi1)
= exp
(cid:10)
ln p∗(xi
|yi0
, yi1)
.
(cid:9)
(cid:2)
i
Since ln p∗(xi
chines can represent p∗.
|yi0
, yi1) are third-order terms, third-order Boltzmann ma-
3.4.3 BN20-54S, BN20-54L. These data sets were drawn from the distri-
bution represented by the following Bayesian network with 20 nodes and
, X1 has par-
54 edges. X0 has no parents, X1 has X0 as a parent, X2 has X0
ents, and the other Xi(i ≥ 3) individually has three parents: Yi0
∈
, Yi1
{X0
}. The graphical structure and contents of the conditional dis-
tribution table of each node are determined randomly. Every Xi takes the
value zero or one. BN20-54S has 1000 samples, and BN20-54L has 100,000
samples. The true distribution is represented as
, . . . , Xi−1
, Yi2
p∗(x) =
(cid:8)
i
p∗(xi
|yi0
, yi1
, yi2)
= exp
(cid:10)
ln p∗(xi
|yi0
, yi1
, yi2)
.
(cid:9)
(cid:2)
i
|yi0
Since ln p∗(xi
machines can represent p∗.
, yi1
, yi2) are fourth-order terms, fourth-order Boltzmann
3.5 Experimental Platform. All experiments were conducted on a lap-
top PC (CPU: Intel Core i7-6700K @4 GHz; memory: 64 GB; operating
system: Windows 10 Pro). All programs were written in and executed on
Java 8.
3.6 Results. Table 1 represents performance comparisons between FL
BM-DI and BM-PCD. We evaluated the accuracy of the learned distribution
by KL(p∗(cid:5)pθ ). Figure 3 illustrates the comparison of KL(p∗(cid:5)pθ ).
For Ising5 × 4S/L, a performance difference between FL and BMs (BM-
DI and BM-PCD) was not remarkable because both FL and BMs could rep-
(cid:5)pθ ) (cid:17) KL(p∗(cid:5)pθ ) implies
resent the true distribution p∗. The fact that KL(pd
that overfitting to pd was successfully suppressed. FL used fewer basis func-
tions than BMs used, which implies that some basis functions of BM were
useless to represent p∗. Regarding the accuracy of the model distribution,
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
4
6
1
3
9
8
2
0
2
3
4
6
6
n
e
c
o
_
a
_
0
1
4
9
6
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
1414
K. Takabatake and S. Akaho
Table 1: Performance Comparison between FL and BM.
Data
Ising5 × 4S
Ising5 × 4L
BN20-37S
BN20-37L
BN20-54S
BN20-54L
Model
KL(pd
(cid:5)pθ )
KL(p∗(cid:5)pθ )
#Basis
Time
2.501 nat
FL
BM-DI
2.424
BM-PCD 2.504
0.476
FL
BM-DI
0.473
BM-PCD 0.528
4.355
FL
4.746
BM-DI
BM-PCD 4.803
0.697
FL
BM-DI
1.422
BM-PCD 1.477
3.288
FL
BM-DI
3.743
BM-PCD 3.826
0.430
FL
BM-DI
1.545
BM-PCD 1.620
0.012 nat
0.087
0.094
0.004
0.002
0.053
0.317
0.863
0.903
0.026
0.750
0.806
0.697
1.301
1.338
0.057
1.166
1.242
31
210
210
37
210
210
39
210
210
105
210
210
41
210
210
192
210
210
5 sec
13
3
9
12
3
5
17
3
12
19
3
5
23
3
23
21
3
Notes: pd: empirical distribution of training data. p∗: true distribution.
#Basis: number of used (θy (cid:11)= 0) basis functions. Time: CPU time for
learning(median of three trials).
BM-PCD is less accurate than FL and BM-DI. This disadvantage becomes
noticeable when the model distribution is close to the true distribution.
Even when large training data are given, some error remains in the model
distribution of BM-PCD (e.g., Ising5 × 4L).
For BN20-37S/L and BN20-54S/L, FL outperformed BMs because only
FL could represent p∗. To fit p∗, FL adaptively selected 39 basis functions for
BN20-37S and 105 basis functions for BN20-37L from |X| − 1 = 220 − 1 basis
functions. This fact implies that FL constructed a more complex model to fit
p∗ as the training data increased. Furthermore, a comparison of KL(p∗(cid:5)pθ )
revealed that the accuracy of the model distribution was remarkably im-
proved as the size of training data increased in FL. In contrast, BMs could
not fit p∗ even if a large training data set was supplied.
Figure 4 illustrates the CPU time to learn the training data sets. BM-PCD
was the fastest, and FL was faster than BM-DI for five out of six training
data sets. The learning time of BM-PCD is constant because we used a fixed
length (10,000) of Markov chains. FL had |X| − 1 = 220 − 1 basis functions,
while BM-DI had n(n + 1)/2 = 210 basis functions. Nevertheless, FL was
faster than BM-DI.
For BN20-54L, FL takes a long time to learn because it uses 192 basis func-
tions to construct the model distribution. Using the 192 basis functions, FL
successfully constructed a model distribution that fit p∗, while BMs failed.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
4
6
1
3
9
8
2
0
2
3
4
6
6
n
e
c
o
_
a
_
0
1
4
9
6
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Full-Span Log-Linear Model and Fast Learning Algorithm
1415
Figure 3: Comparing accuracy of model by KL(p∗(cid:5)pθ ) (lower is better).
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
4
6
1
3
9
8
2
0
2
3
4
6
6
n
e
c
o
_
a
_
0
1
4
9
6
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Figure 4: Comparing CPU time to learn (lower is better).
4 Discussion
The major disadvantage of the FSLL model is that it is not feasible for large
problems due to memory consumption and learning speed. If we use a typ-
ical personal computer, the problem size should be limited as |X| (cid:3) 225.
However, as far as we use the FSLL model in this problem domain, the
model has the following theoretical and practical advantages.
The first advantage is that the FSLL model can represent arbitrary dis-
tributions of X. Furthermore, it is guaranteed that the model distribution
1416
K. Takabatake and S. Akaho
converges to any target distribution as the limit of the training data size is
infinity.
Here, we view learning machines from an information geometry per-
spective. The dimension of the distribution space P+ (all positive distribu-
tions of X) is |X| − 1, and a learning machine having M parameters spans
an M-dimensional manifold in P+ to represent its model distribution (we
refer to this manifold as the model manifold).
A learning machine with M < |X| − 1 parameters cannot represent ar-
bitrary distributions in P. Moreover, if M < |X| − 1, there is no guarantee
that the true distribution is close to the model manifold, and if the model
manifold is remote from the true distribution, the machine’s performance
will be poor. This poor performance is not improved even if infinite training
data are given.
The FSLL model extends the manifold’s dimension to |X| − 1 by intro-
ducing higher-order factors. The model manifold becomes P itself; thus,
there is no more expansion. Therefore, we refer to the model as the full-Span
log-linear model. The main interest of this letter is that as far as the problem
size is |X| (cid:3) 225, the FSLL model becomes a feasible and practical model.
For example, suppose that we construct a full-span model by adding hid-
den nodes into a Boltzmann machine having 20 visible nodes. The number
of parameters of the Boltzmann machine is E + n, where E is the number of
edges and n is the number of nodes. Therefore, it is not practical to construct
the full-span model for 20 visible nodes because it requires ≈220 edges.
The second advantage is that the FSLL model has no hyperparameters;
therefore, no hyperparameter tuning is needed. For example, if we use a
Boltzmann machine with hidden nodes that learns the true distribution
with contrastive divergence methods, we need to determine hyperparam-
eters such as the learning rate, mini-batch size, and the number of hidden
and graphical structure of nodes. The FSLL model, however, automatically
learns the training data without human participation.
5 Conclusion and Extension
Suppose that we let the FSLL model learn training data consisting of 20
binary variables. The dimension of the function space spanned by possi-
ble positive distributions is 220 − 1. The FSLL model has 220 − 1 parameters
and can fit arbitrary positive distributions. The basis functions of the FSLL
model have the following properties:
• Each basis function is a product of univariate functions.
• The basis functions take values 1 or −1.
The proposed learning algorithm exploited these properties and realized
fast learning.
Our experiments demonstrated the following:
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
4
6
1
3
9
8
2
0
2
3
4
6
6
n
e
c
o
_
a
_
0
1
4
9
6
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Full-Span Log-Linear Model and Fast Learning Algorithm
1417
• The FSLL model could learn the training data with 20 binary variables
within 1 minute with a laptop PC.
• The FSLL model successfully learned the true distribution underlying
the training data; even higher-order terms that depend on three or
more variables existed.
• The FSLL model constructed a more complex model to fit the true
distribution as the training data increased; however, the learning time
became longer.
In this letter, we have presented a basic version of the FSLL model; how-
ever, we can extend it as follows (Takabatake & Akaho, 2014, 2015):
• Introducing L1 regularization (Andrew & Gao, 2007)
• Introducing hidden variables
Appendix: Proofs and Derivations of Equations
A.1 Proof of Theorem 1. For brevity and clarity of expression, we use
predicate logic notation here. The statement “{ fi(X0)g j(X1)}(i ∈ I, j ∈ J) are
, X1)” is equivalent to the following
linearly independent functions of (X0
proposition:
∀x0
∀x1
⎡
(cid:2)
⎣
i, j
⎤
ai j fi(x0)g j(x1) = 0
⎦ ⇒ ∀i∀ j
ai j
= 0
!
.
This proposition is proved as follows:
⎤
⎡
ai j fi(x0)g j(x1) = 0
⎦
(cid:10)
(cid:2)
∀x0
∀x1
⎣
i, j
⎡
⇒ ∀x0
⎣∀x1
"
"
⇒ ∀x0
∀ j
(cid:2)
(cid:9)
(cid:2)
(cid:2)
⎡
⎣
j
i
ai j fi(x0)
g j(x1) = 0
##
⎤
⎤
⎦
⎦
ai j fi(x0) = 0
∵ {g j
} are linearly independent
⇒ ∀i∀ j
ai j
!
i
= 0
. ∵ { fi
} are linearly independent.
(cid:4)
A.2 Derivation of Equation 2.6.
¯θy(θ ) = ∂y ln Z
= 1
Z
∂yZ
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
4
6
1
3
9
8
2
0
2
3
4
6
6
n
e
c
o
_
a
_
0
1
4
9
6
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
1418
K. Takabatake and S. Akaho
(cid:2)
∂yelθ (x)
(cid:4)y(x)elθ (x)
x
pθ (x)(cid:4)y(x)
x
(cid:2)
= 1
Z
= 1
Z
(cid:2)
=
(cid:4)
x
(cid:3)
(cid:4)y
.
pθ
=
A.3 Derivation of Equation 2.11.
(cid:4)
(cid:3)
∂y ¯θy = ∂y
(cid:2)
(cid:4)y
pθ
x
(cid:2)
x
(cid:2)
x
(cid:2)
x
(cid:2)
=
=
=
=
=
=
(cid:4)y(x)∂y pθ (x)
(cid:4)y(x)pθ (x)∂y ln pθ (x)
(cid:4)y(x)pθ (x)∂y(lθ (x) − ln Z)
(cid:4)y(x)pθ (x)((cid:4)y(x) − ∂y ln Z)
pθ (x)(cid:4)y(x)2 − ¯θy
(cid:2)
x
pθ (x)(cid:4)y(x)
(cid:4)
x
(cid:3)
(cid:4)2
y
− ¯θ 2
y
.
pθ
A.4 Derivation of Equation 2.13.
(by equation 2.12)
1 =
∂ ¯θy
1 − ¯θ 2
y
(cid:23)
∂y ¯θy
1 + ¯θy
= 1
2
(cid:24)
+
∂y ¯θy
1 − ¯θy
(cid:6)
(cid:7)
∂y ln(1 + ¯θy) − ∂y ln(1 − ¯θy)
.
= 1
2
Integrating equation A.1, we obtain
1
2
ln
1 + ¯θy
1 − ¯θy
(cid:25)
=
1dθy
= θy − c,
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
4
6
1
3
9
8
2
0
2
3
4
6
6
n
e
c
o
_
a
_
0
1
4
9
6
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
(A.1)
(A.2)
Full-Span Log-Linear Model and Fast Learning Algorithm
1419
where c is a constant of integration. Since the left side of equation A.2 equals
tanh
−1( ¯θy), we obtain the equation ¯θy = tanh(θy − c).
A.5 Derivation of Equation 2.16.
∂yKL(pd
(cid:5)pθ ) = ∂y
(cid:3)
(cid:27)
(cid:4)
ln pθ
pd
(cid:26)(cid:3)
(cid:4)
−
pd
(cid:4)
ln pd
(cid:3)
ln pθ
(cid:3)
(cid:4)
lθ − ln Z
pd
= −∂y
= −∂y
(cid:3)
= −
∂ylθ
(cid:4)
(cid:4)y
(cid:3)
= −
pd
= ¯θy − ¯d.
pd
+ ∂y ln Z
(cid:4)
pd
+ ¯θ
A.6 Derivation of Equation 2.17.
(cid:25) θy
(cid:25) θy
¯θy(u) − ¯dydu =
tanh(u − c)du − ¯dy(θy − θ 0
y )
(by equation 2.13)
θ 0
y
θ 0
y
1
2
$
"
=
=
θy
= [ln cosh(u − c)]
u=θ 0
y
−1 ¯θy − tanh
− ¯dy(tanh
(cid:9)
%θy
−1 ¯θ 0
y )
1
2
ln
1 + ¯θy
1 − ¯θy
(cid:10)
− 1
2
ln
1 + ¯θ 0
y
1 − ¯θ 0
y
ln cosh2(u − c)
− ¯dy
u=θ 0
y
#θy
1
1 − tanh2(u − c)
ln
1
2
(cid:9)
u=θ 0
y
(cid:10)
−
¯dy
2
ln
1 + ¯θy
1 − ¯θy
1 − ¯θ 0
y
1 + ¯θ 0
y
= 1
2
ln
1
1 − ¯θ 2
y
− ln
1
1 − ( ¯θ 0
y )2
−
¯dy
2
ln
1 + ¯θy
1 − ¯θy
1 − ¯θ 0
y
1 + ¯θ 0
y
1 + ¯θ 0
y
1 + ¯θy
ln
= 1
2
= − 1 + ¯dy
2
+ 1 + ¯dy
2
= 1 + ¯dy
2
ln
1 − ¯θ 0
y
1 − ¯θy
−
¯dy
2
ln
1 + ¯θy
1 − ¯θy
1 − ¯θ 0
y
1 + ¯θ 0
y
2
ln(1 + ¯θ 0
ln(1 + ¯θy) − 1 − ¯dy
y ) + 1 − ¯dy
2
+ 1 − ¯dy
2
1 + ¯θ 0
y
1 + ¯θy
ln
ln(1 − ¯θy)
ln(1 − ¯θ 0
y )
1 − ¯θ 0
y
1 − ¯θy
.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
4
6
1
3
9
8
2
0
2
3
4
6
6
n
e
c
o
_
a
_
0
1
4
9
6
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
1420
K. Takabatake and S. Akaho
A.7 Derivation of Equation 2.19.
+ ry
(cid:10)
(cid:9)
− ln
1 − ¯dy
1 − ¯θ 0
y
(cid:10)
1 − 1 − ¯dy
1 − ¯θ 0
y
≥ 0, − ln x ≥ 1 − x
1 + ¯θ 0
y
1 + ¯dy
1 − ¯θ 0
y
1 − ¯dy
(cid:9)
ln
+ 1 − ¯dy
2
(cid:10)
+ 1 − ¯dy
2
(cid:7) = 1 + ¯dy
2
= 1 + ¯dy
2
ln
(cid:9)
(cid:9)
≥ 1 + ¯dy
2
∵ 1 + ¯dy
2
= 1 + ¯dy
2
− ln
1 + ¯dy
1 + ¯θ 0
y
(cid:10)
1 − 1 + ¯dy
1 + ¯θ 0
y
1 − ¯dy
2
− 1 − ¯dy
2
≥ 0,
¯θ 0
− ¯dy
y
1 + ¯θ 0
y
(cid:9)
+ 1 − ¯dy
2
¯θ 0
− ¯dy
y
1 − ¯θ 0
y
(cid:10)
+ ry
+ ry
¯θ 0
y
=
− ¯dy
2
¯θ 0
y
=
− ¯dy
2
− 1 − ¯dy
1 − ¯θ 0
y
1 + ¯dy
1 + ¯θ 0
y
(1 + ¯dy)(1 − ¯θ 0
(1 + ¯θ 0
y ) − (1 − ¯dy)(1 + ¯θ 0
y )
y )(1 − ¯θ 0
y )
+ ry
+ ry
+ ry
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
¯θ 0
y
=
− ¯dy
2
2 ¯dy − 2 ¯θ 0
y
1 − ( ¯θ 0
y )2
+ ry
= −
( ¯θ 0
y
1 − ( ¯θ 0
− ¯dy)2
y )2
+ ry
= (cid:7).
A.8 Proof of Theorem 2. Since B is a bounded closed set, {θ t} has one
or more accumulation point(s) in B.
As an assumption of a proof by contradiction, assume that a is an accu-
mulation point of {θ t} and the proposition
∃y∃b ∈ Ay(a),
f (b) < f (a)
holds. Let η j ∈ Ay(θ j ) be the point such that η j
= by. Since f (a) − f (b) >
y
0, f (θ ) is continuous at θ = b, and a is an accumulation point of {θ t}, the
proposition
∃θ
,
j
f (η j ) < f (a) holds (∵ in Figure 5, f (η j ) < f (a) for sufficiently small δ). / e d u n e c o a r t i c e - p d / l f / / / / 3 4 6 1 3 9 8 2 0 2 3 4 6 6 n e c o _ a _ 0 1 4 9 6 p d . / f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Full-Span Log-Linear Model and Fast Learning Algorithm 1421 Figure 5: Existence of θ j such that f (η j ) < f (a). Here, η j ∈ Ay(θ j ), and the inequality f (θ j+1) = min θ ∈∪yAy (θ j ) f (θ ) ≤ f (η j ) holds. Therefore, the inequality f (θ j+1) ≤ f (η j ) < f (a) holds. Since f (θ t ) monotonically decreases as t increases, the proposition ∀t ≥ j + 1, f (θ t ) ≤ f (η j ) < f (a) holds. Since f (θ ) is continuous at θ = a, no sub-sequence of {θ t} can con- verge to a, that is, a is not an accumulation point of {θ t}. This fact contradicts (cid:4) the assumption we made at the beginning of this proof. A.9 Proof of Corollary 1. By theorem 2, any accumulation point of {θ t} is an axis minimum of f ; however, the axis minimum of f is unique; there- fore, θ = a is a unique accumulation point of {θ t}, and θ t → a(t → ∞). (cid:4) A.10 Proof of Theorem 3. Let us define the following symbols: my(θ t, ∞) = min θ ∈Ay(θ t ) f (θ ), arg my(θ t, ∞) = {θ |θ ∈ Ay(θ t ), f (θ ) = my(θ t, ∞)}. Figure 6 illustrates the sectional view of fN(θ ) along the line Ay(θ t ). As fN(θ ) has a gap with depth ry (see equation 2.5) at shown in the figure, θy = 0. Here, we can ignore the gap at θy = 0 for sufficiently large N, that is, the proposition ∃N (cid:12)∀N, N > N
(cid:12) (cid:20)⇒ arg my(θ t, N) = arg my(θ t, ∞)
holds. Moreover, the proposition
∃N
(cid:12)∀N∀y, N > N
(cid:12) (cid:20)⇒ arg my(θ t, N) = arg my(θ t, ∞)
(A.3)
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
–
p
d
/
l
f
/
/
/
/
3
4
6
1
3
9
8
2
0
2
3
4
6
6
n
e
c
o
_
a
_
0
1
4
9
6
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
1422
K. Takabatake and S. Akaho
Figure 6: View of fN(θ).
holds. By equation 2.23, the proposition
∃N
(cid:12)∀N, N > N
(cid:12) (cid:20)⇒ arg m(θ t, N) = arg m(θ t, ∞)
holds. Here, we compare a sequence {θ t (N)} with the cost function fN
and a sequence {θ t (∞)} with the cost function f (= f∞). By corollary 1,
limt→∞ f (θ t (∞)) = minθ f (θ ), that is,
∀(cid:8) > 0∃t
(cid:12)∀t,
t ≥ t
(cid:12) (cid:20)⇒ f (θ t (∞)) < min θ f (θ ) + (cid:8) (A.4) holds. Here, let T(N) be the smallest t such that m(θ t+1, N) (cid:11)= m(θ t+1, ∞). Then limN→∞ T(N) = ∞, that is, ∀t (cid:12)∃N, T(N) ≥ t (cid:12) (A.5) holds. By equations A.4 and A.5, the proposition ∀(cid:8) > 0∃N, ∀t,
t ≥ T(N) (cid:20)⇒ f (θ t (∞)) < min θ f (θ ) + (cid:8) (A.6) holds, and therefore the proposition ∀(cid:8) > 0∃N,
f (θ T(N)(∞)) < min θ f (θ ) + (cid:8) (A.7) also holds. Here, by the definition of T(N), θ T(N)(∞) = θ T(N)(N). Therefore, we can modify equation A.7 as ∀(cid:8) > 0∃N,
f (θ T(N)(N)) < min θ f (θ ) + (cid:8). (A.8) l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / e d u n e c o a r t i c e - p d / l f / / / / 3 4 6 1 3 9 8 2 0 2 3 4 6 6 n e c o _ a _ 0 1 4 9 6 p d . / f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Full-Span Log-Linear Model and Fast Learning Algorithm 1423 Since f (θ t (N)) monotonically decreases as t grows, ∀(cid:8) > 0∃N∀t,
t ≥ T(N) (cid:20)⇒ f (θ t (N)) < min
θ
f (θ ) + (cid:8).
Using the notation of lim, we obtain
lim
N→∞
lim
t→∞
f (θ t (N)) = min
θ
f (θ ).
Acknowledgments
This research is supported by KAKENHI 17H01793.
References
Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltz-
mann machines. Cognitive Science, 9(1), 147–169. 10.1207/s15516709cog0901_7
Amari, S. (2016). Information geometry and its applications. Berlin: Springer.
Andrew, G., & Gao, J. (2007). Scalable training of L1-regularized log-linear models.
In Proceedings of the 24th International Conference on Machine Learning (pp. 33–40).
New York: ACM.
Beck, A. (2015). On the convergence of alternating minimization for convex program-
ming with applications to iteratively reweighted least squares and decomposition
schemes. SIAM Journal on Optimization, 25(1), 185–209. 10.1137/13094829X
Dennis Jr., J. E., & Moré, J. J. (1977). Quasi-Newton methods, motivation and theory.
SIAM Review, 19(1), 46–89. 10.1137/1019005
Fino, B. J., & Algazi, V. R. (1976). Unified matrix treatment of the fast Walsh-
Hadamard transform. IEEE Transactions on Computers, 25(11), 1142–1146. 10.1109/
TC.1976.1674569
Newman, M., & Barkema, G. (1999). Monte Carlo methods in statistical physic. New
York: Oxford University Press.
Nocedal, J., & Wright, S. (2006). Numerical optimization. New York: Springer Science
& Business Media.
Pratt, W. K., Andrews, H. C., & Kane, J. (1969). Hadamard transform image coding.
In Proceedings of the IEEE, 57(1), 58–68. 10.1109/PROC.1969.6869
Raff, E. (2017). JSAT: Java statistical analysis tool, a library for machine learning.
Journal of Machine Learning Research, 18(23), 1–5.
Rissanen, J. (2007). Information and complexity in statistical modeling. Berlin: Springer.
Sejnowski, T. J. (1986). Higher-order Boltzmann machines. In AIP Conference Pro-
ceedings (vol. 151, pp. 398–403). Melville, NY: American Institute for Physics.
10.1063/1.36246
Smith, W. W. (2010). Handbook of real-time fast Fourier transforms. Piscataway, NJ: IEEE.
Takabatake, K., & Akaho, S. (2014). Basis functions for fast learning of log-linear
models. IEICE Technical Report, 114(306), 307–312. (in Japanese)
(cid:4)
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
4
6
1
3
9
8
2
0
2
3
4
6
6
n
e
c
o
_
a
_
0
1
4
9
6
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
1424
K. Takabatake and S. Akaho
Takabatake, K., & Akaho, S. (2015). Full-span log-linear model with L1 regularization
and its performance. IEICE Technical Report, 115(323), 153–157. (in Japanese)
Tieleman, T. (2008). Training restricted Boltzmann machines using approximations
to the likelihood gradient. In Proceedings of the 25th International Conference on Ma-
chine Learning (pp. 1064–1071). New York: ACM.
Received August 25, 2021; accepted January 22, 2022.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
4
6
1
3
9
8
2
0
2
3
4
6
6
n
e
c
o
_
a
_
0
1
4
9
6
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Download pdf