LETTER

Communicated by Alberto Fachechi

Full-Span Log-Linear Model and Fast Learning Algorithm

Kazuya Takabatake
k.takabatake@aist.go.jp
Shotaro Akaho
s.akaho@aist.go.jp
Human Informatics and Interaction Research Institute, National Institute
of Advanced Industrial Science and Technology, Tsukuba, 305-8568, Japan

The full-span log-linear (FSLL) model introduced in this letter is considered an nth order Boltzmann machine, where n is the number of all variables in the target system. Let X = (X0, . . . , Xn−1) be finite discrete random variables that can take |X| = |X0| · · · |Xn−1| different values. The FSLL model has |X| − 1 parameters and can represent arbitrary positive distributions of X. The FSLL model is a highest-order Boltzmann machine; nevertheless, we can compute the dual parameters of the model distribution, which play important roles in exponential families, in O(|X| log |X|) time. Furthermore, using properties of the dual parameters of the FSLL model, we can construct an efficient learning algorithm. The FSLL model is limited to small probabilistic models up to |X| ≈ 2^25; however, in this problem domain, the FSLL model flexibly fits various true distributions underlying the training data without any hyperparameter tuning. The experiments showed that the FSLL model successfully learned six training data sets such that |X| = 2^20 within 1 minute with a laptop PC.

1 Introduction

The main purpose of this letter is as follows:

• To introduce the full-span log-linear (FSLL) model and a fast learning algorithm

• To demonstrate the performance of the FSLL model with experiments

Boltzmann machines (Ackley, Hinton, & Sejnowski, 1985) are multi-
variate probabilistic models that are widely used in the field of machine
learning. Here, we consider a Boltzmann machine with n binary variables
X = (X0, . . . , Xn−1) (Xi ∈ {0, 1}). We address fully connected Boltzmann machines with no hidden variables and no temperature parameter. A Boltzmann machine represents the following distribution pθ, which we refer to as the model distribution:



pθ(x) = Z(θ)^{−1} e^{lθ(x)},   xi ∈ {0, 1},   Z(θ) = Σ_x e^{lθ(x)},

lθ(x) = Σ_{0≤i<j<n} θij xi xj + Σ_{0≤i<n} θi xi.   (1.1)
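For concreteness, the following minimal sketch evaluates the model distribution of equation 1.1 by brute-force enumeration. It is an illustrative Python snippet, not the implementation used in this letter (which was written in Java), and it is only feasible for small n.

```python
import itertools
import math

def boltzmann_distribution(theta_pair, theta_single):
    """Brute-force evaluation of equation 1.1 for a small Boltzmann machine.

    theta_pair[(i, j)] (i < j) are pairwise parameters, theta_single[i] are biases.
    Returns a dict mapping each binary configuration x to p_theta(x).
    """
    n = len(theta_single)
    weights = {}
    for x in itertools.product((0, 1), repeat=n):
        l = sum(theta_pair.get((i, j), 0.0) * x[i] * x[j]
                for i in range(n) for j in range(i + 1, n))
        l += sum(theta_single[i] * x[i] for i in range(n))
        weights[x] = math.exp(l)
    Z = sum(weights.values())          # partition function Z(theta)
    return {x: w / Z for x, w in weights.items()}

# Example: 3 variables, one interaction and one bias.
p = boltzmann_distribution({(0, 1): 1.0}, [0.0, 0.0, -0.5])
assert abs(sum(p.values()) - 1.0) < 1e-12
```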


The FSLL model introduced in this letter can be considered an nth order
Boltzmann machine,2 where n is the number of all variables in the target
system. The FSLL model has |X| − 1 parameters and can represent arbitrary
positive distributions. Since the FSLL model is a highest-order Boltzmann
machine, the learning of FSLL is expected to be very slow. However, we
propose a fast learning algorithm. For example, this algorithm can learn a
joint distribution of 20 binary variables within 1 minute with a laptop PC.
Since the FSLL model has full degrees of freedom, a regularization mech-
anism to avoid overfitting is essential. For this purpose, we used a regu-
larization mechanism based on the minimum description length principle
(Rissanen, 2007).

The remainder of this letter is organized as follows. In section 2, we
present the FSLL model and its fast learning algorithm. In section 3, we
demonstrate the performance of the FSLL model by experiment. In section
4, we discuss the advantages and disadvantages of the FSLL model. In sec-
tion 5, we present the conclusions and extensions of the letter.

2 Full-Span Log-Linear Model

Before introducing the FSLL model, we define the notations used in this let-
ter. A random variable is denoted by a capital letter, such as X, and the value
that X takes is indicated by a lowercase letter, such as x. X also denotes the
set of values that the variable X can take; thus, |X| denotes the number of
values that X can take. ⟨f⟩_p denotes the expectation of f(X) with distribution p(X), that is, ⟨f⟩_p = Σ_{x∈X} p(x) f(x). The differential operator ∂/∂θy is abbreviated as ∂y. P denotes the set of all distributions of X, and P+ denotes the set of all positive distributions of X.

2.1 Model Distribution. The FSLL model is a multivariate probabilistic model designed for a system that has n discrete finite variables X0, . . . , Xn−1, where Xi takes an integer value in [0, |Xi|). The FSLL model has parameters θ = {θy}, where y = (y0, . . . , yn−1) is a vector such that yi ∈ Xi. The model distribution of the FSLL model is the following pθ:

pθ(x) = Z(θ)^{−1} e^{lθ(x)},   Z(θ) = Σ_{x∈X} e^{lθ(x)},

lθ(x) = Σ_{y∈X} θy Φy(x),   Φy(x) = Π_{i=0}^{n−1} φ^i_{yi}(xi).   (2.1)

In equation 2.1, {φ^i_{yi}} (yi ∈ Xi) are |Xi| linearly independent functions of Xi, which we refer to as the local basis functions, and {Φy} (y ∈ X) are |X| functions of X = (X0, . . . , Xn−1), which we refer to as the (global) basis functions.

2 Furthermore, the FSLL model is not limited to binary variables.
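As a concrete illustration of equation 2.1, the global basis functions and the model distribution can be evaluated by brute force for a tiny system. The following Python sketch is only an illustration (not the authors' Java implementation); the local basis tables phi[i][y_i][x_i] are assumed to be supplied by the caller.

```python
import itertools
import math

def global_basis(phi, y, x):
    """Phi_y(x) = prod_i phi^i_{y_i}(x_i); phi[i] is an |X_i| x |X_i| table."""
    prod = 1.0
    for i, (yi, xi) in enumerate(zip(y, x)):
        prod *= phi[i][yi][xi]
    return prod

def model_distribution(phi, theta):
    """Brute-force p_theta of equation 2.1; theta maps index vectors y to theta_y."""
    sizes = [len(t) for t in phi]
    states = list(itertools.product(*[range(k) for k in sizes]))
    weights = []
    for x in states:
        l = sum(ty * global_basis(phi, y, x) for y, ty in theta.items())
        weights.append(math.exp(l))
    Z = sum(weights)
    return {x: w / Z for x, w in zip(states, weights)}

# Two binary variables with the +1/-1 local basis phi_0 = (1, 1), phi_1 = (1, -1).
phi = [[[1, 1], [1, -1]], [[1, 1], [1, -1]]]
p = model_distribution(phi, {(1, 1): 0.7})   # a single second-order parameter
assert abs(sum(p.values()) - 1.0) < 1e-12
```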


Using the following theorem recursively, we can prove that the global basis
functions are linearly independent functions of X.
Theorem 1. If {fi}(i ∈ I) are linearly independent functions of X0, and {gj}(j ∈ J) are linearly independent functions of X1, then {fi(X0)gj(X1)}(i ∈ I, j ∈ J) are linearly independent functions of (X0, X1).

The proof is provided in the appendix.
In the FSLL model, we determine the local basis functions as follows:

Case |Xi| = 2^k:

H_1 = (1),   H_{2^{k+1}} = [ H_{2^k}  H_{2^k} ; H_{2^k}  −H_{2^k} ],

[ φ^i_{yi}(xi) ]_{yi, xi = 0, . . . , |Xi|−1} = H_{2^k};   (2.2)

that is, the |Xi| × |Xi| matrix whose (yi, xi) entry is φ^i_{yi}(xi) is set to H_{2^k}. H_{2^k} above is referred to as the Walsh-Hadamard matrix (Pratt, Andrews, & Kane, 1969).

Else:

φ^i_0 ≡ 1,   φ^i_j(l) = { 1 (0 < j = l),  −1 (0 < j ≠ l) }.   (2.3)

Since Φ_0 ≡ 1, an arbitrary θ_0 gives the same model distribution. Therefore, we determine θ_0 ≡ 0.

2.2 Learning Algorithm. Algorithm 1 presents the outline of the learning algorithm of the FSLL model. This algorithm is a greedy search to find a local minimum point of cost(θ), which monotonically decreases as the iteration progresses. In line 7 of the algorithm, candidate denotes a θ′ derived from θ by applying one of the following modifications on the yth component:

Candidate derived by appending θy: If θy = 0, then let θ′_y = arg min_{θy} cost(θ).
Candidate derived by adjusting θy: If θy ≠ 0, then let θ′_y = arg min_{θy} cost(θ).
Candidate derived by removing θy: If θy ≠ 0, then let θ′_y = 0.

2.2.1 Cost Function. We use a cost function based on the minimum description length principle (Rissanen, 2007). Suppose that we send the training data to a receiver by transmitting the parameters and compressed data. Since θ is a sparse vector, we transmit only the indexes y such that θy ≠ 0, and the value of θy. Moreover, the index y = (y0, . . . , yn−1) is a sparse vector because higher-order basis functions are rarely used in the model distribution due to their expensive cost. Therefore, to transmit the sparse vector y, we transmit only the indexes i such that yi ≠ 0, and the values of yi ∈ [1, |Xi|). Then the description length^3 to transmit the sparse vector y becomes

Σ_{i:yi≠0} ( ln n [to send i ∈ [0, n)] + ln(|Xi| − 1) [to send yi ∈ [1, |Xi|)] ) = Σ_{i:yi≠0} ln n(|Xi| − 1),

and the description length to transmit all index vectors y such that θy ≠ 0 becomes

Σ_{y:θy≠0} Σ_{i:yi≠0} ln n(|Xi| − 1).

3 We use "nat" as the description length unit; thus, we use "ln" instead of "log."

The minimum description length to transmit k parameters and the compressed data is estimated as (Rissanen, 2007)

−N ⟨ln pθ⟩_{pd} + (k/2) ln N,

where k is the number of nonzero parameters and N is the number of samples in the training data. The total description length to transmit y, θy, and the compressed data becomes

−N ⟨ln pθ⟩_{pd} + (k/2) ln N + Σ_{y:θy≠0} Σ_{i:yi≠0} ln n(|Xi| − 1)
  = −N ⟨ln pθ⟩_{pd} + Σ_{y:θy≠0} ( (ln N)/2 + Σ_{i:yi≠0} ln n(|Xi| − 1) ).   (2.4)

We divide equation 2.4 by N and add ⟨ln pd⟩_{pd} to create an information-geometric quantity, and obtain the following cost function:

cost(θ, N) = KL(pd ∥ pθ) + r(θ, N),   r(θ, N) = Σ_{y:θy≠0} ry(N),

ry(N) = (1/N) ( (ln N)/2 + Σ_{i:yi≠0} ln n(|Xi| − 1) ).   (2.5)
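The per-parameter penalty ry(N) of equation 2.5 is cheap to evaluate. The following Python sketch is an illustration under assumed data structures (sizes[i] = |Xi|), not the authors' Java code.

```python
import math

def penalty_ry(y, sizes, N):
    """r_y(N) of equation 2.5; y is an index vector, sizes[i] = |X_i|, N = sample count."""
    n = len(sizes)
    tail = sum(math.log(n * (sizes[i] - 1)) for i in range(n) if y[i] != 0)
    return (0.5 * math.log(N) + tail) / N

def cost(kl_pd_ptheta, active_ys, sizes, N):
    """cost(theta, N) = KL(p_d || p_theta) + sum of r_y(N) over nonzero theta_y."""
    return kl_pd_ptheta + sum(penalty_ry(y, sizes, N) for y in active_ys)

# Example: 20 binary variables, one second-order basis function, N = 1000 samples.
sizes = [2] * 20
y = tuple(1 if i in (3, 7) else 0 for i in range(20))
print(penalty_ry(y, sizes, N=1000))   # (ln 1000 / 2 + 2 ln 20) / 1000
```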
2.2.2 Fast Algorithm to Compute θ̄. The following vector θ̄, referred to as the dual parameter of θ, plays important roles in exponential families (Amari, 2016):

θ̄y(θ) = ∂y ln Z(θ) = ⟨Φy⟩_{pθ}   (see the appendix for the derivation).   (2.6)

We identified an algorithm to compute θ̄ from pθ in O(|X| log |X|) time. This algorithm borrowed ideas from the multidimensional discrete Fourier transform (DFT) (Smith, 2010). Here, we consider a two-dimensional (2D-)DFT^4 for |X0| × |X1| pixels of data. The 2D-DFT transforms f into F by the following equation:

F(y0, y1) = Σ_{x0, x1} f(x0, x1) Φ_{y0 y1}(x0, x1),
Φ_{y0 y1}(x0, x1) = exp( (2πi/|X0|) y0 x0 ) exp( (2πi/|X1|) y1 x1 ).

To derive the value of F(y0, y1) for a specific (y0, y1), O(|X0||X1|) time is required; therefore, it appears that O(|X0|^2 |X1|^2) time is required to derive F(y0, y1) for all y0, y1. However, the 2D-DFT is usually realized in the following dimension-by-dimension manner:

F(y0, y1) = Σ_{x1} [ Σ_{x0} f(x0, x1) exp( (2πi/|X0|) y0 x0 ) ] exp( (2πi/|X1|) y1 x1 ),   (2.7)

where the inner bracket is the DFT by X0 and the outer sum is the DFT by X1. In equation 2.7, the DFT by X0 for all (y0, x1) ∈ X0 × X1 requires O(|X0|^2 |X1|) time, and the DFT by X1 for all y0 ∈ X0, y1 ∈ X1 requires O(|X0||X1|^2) time. Therefore, the entire 2D-DFT requires O(|X0||X1|(|X0| + |X1|)) time, which is smaller than O(|X0|^2 |X1|^2). The key here is that the basis function Φ_{y0 y1}(x0, x1) is a product of two univariate functions as follows:

Φ_{y0 y1}(x0, x1) = φ^0_{y0}(x0) φ^1_{y1}(x1),
φ^0_{y0}(x0) = exp( (2πi/|X0|) y0 x0 ),   φ^1_{y1}(x1) = exp( (2πi/|X1|) y1 x1 ).

We apply this principle to compute θ̄. Here, we consider how to derive θ̄ from pθ by the following equation:

θ̄y = ⟨Φy⟩_{pθ} = Σ_x pθ(x) Φy(x)
    = Σ_x pθ(x) Π_i φ^i_{yi}(xi)
    = Σ_{xn−1} ( ⋯ ( Σ_{x1} ( Σ_{x0} pθ(x) φ^0_{y0}(x0) ) φ^1_{y1}(x1) ) ⋯ ) φ^{n−1}_{yn−1}(xn−1).   (2.8)

We evaluate the right side of equation 2.8 from the innermost parenthesis to the outermost parenthesis; that is, we determine the following functions gi : X → R by the following recurrence:

g0(x) = pθ(x),
gi+1(y0, . . . , yi−1, yi, xi+1, . . . , xn−1) = Σ_{xi} gi(y0, . . . , yi−1, xi, xi+1, . . . , xn−1) φ^i_{yi}(xi).   (2.9)

Then the equation θ̄y = gn(y) holds. Since O(|X||Xi|) time is required to obtain gi+1 from gi, we can obtain gn from g0 in O(|X| Σ_i |Xi|) time. In the case where |Xi| = k, the computational cost to derive gn from g0 becomes O(|X|kn). Moreover, considering k as a constant, we obtain the computational cost as O(|X|n) = O(|X| log |X|).

We can use the same algorithm to obtain the vector d̄ from pd. In this case, let g0 = pd.

4 Not 2D-FFT but 2D-DFT.
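The recurrence of equation 2.9 translates directly into code. The following Python sketch (an illustration, not the authors' Java implementation) computes θ̄ from a table of pθ for binary variables with the ±1 local basis of equation 2.2; it touches each of the |X| entries once per variable, giving the stated O(|X| log |X|) cost.

```python
def dual_parameters(p, n):
    """Compute theta_bar[y] = <Phi_y>_p for n binary variables (equation 2.9).

    p is a list of length 2**n indexed by x = (x_0, ..., x_{n-1}) packed as
    sum_i x_i * 2**i; the local basis is phi_0 = (1, 1), phi_1 = (1, -1),
    i.e., the rows of the Walsh-Hadamard matrix H_2 (equation 2.2).
    """
    g = list(p)                       # g_0 = p_theta
    for i in range(n):                # replace axis i: x_i -> y_i
        stride = 1 << i
        for base in range(0, len(g), 2 * stride):
            for offset in range(stride):
                a = g[base + offset]               # x_i = 0
                b = g[base + offset + stride]      # x_i = 1
                g[base + offset] = a + b           # y_i = 0: phi_0 = (1, 1)
                g[base + offset + stride] = a - b  # y_i = 1: phi_1 = (1, -1)
    return g                          # g_n[y] = theta_bar_y

# Uniform distribution: theta_bar_0 = 1 and all other dual parameters are 0.
n = 3
p = [1.0 / 2**n] * 2**n
print(dual_parameters(p, n))          # [1.0, 0.0, ..., 0.0]
```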
2.2.3 Acceleration by Walsh-Hadamard Transform. In cases where |Xi| is large, we can accelerate the computation of θ̄ by using the Walsh-Hadamard transform (WHT) (Fino & Algazi, 1976). Let us recall the 2D-DFT in section 2.2.2. If |X0| and |X1| are powers of two, we can use the fast Fourier transform (FFT) (Smith, 2010) for the DFT by X0 and the DFT by X1 in equation 2.7 and can reduce the computational cost to O(|X0||X1| log(|X0||X1|)). We can apply this principle to computing θ̄.

Here, we fix the values of y0, . . . , yi−1, xi+1, . . . , xn−1 (only the ith component is omitted) in equation 2.9. Then we obtain

gi+1(. . . , yi, . . .) = Σ_{xi} gi(. . . , xi, . . .) φ^i_{yi}(xi).

Using matrix notation, the column vector ( gi+1(. . . , yi, . . .) )_{yi=0,...,|Xi|−1} is obtained by multiplying the column vector ( gi(. . . , xi, . . .) )_{xi=0,...,|Xi|−1} by the matrix [ φ^i_{yi}(xi) ]_{yi, xi = 0,...,|Xi|−1}.   (2.10)

We refer to this transform R^{|Xi|} → R^{|Xi|} as the local transform. The local transform usually requires O(|Xi|^2) time; however, if the matrix in equation 2.10 is a Walsh-Hadamard matrix (see equation 2.2), using the Walsh-Hadamard transform (WHT) (Fino & Algazi, 1976), we can perform the local transform in O(|Xi| log |Xi|) time.^5

Since the number of all combinations of y0, . . . , yi−1, xi+1, . . . , xn−1 (the ith component is omitted) is |X|/|Xi|, we can obtain the function gi+1 from gi in O(|X| log |Xi|) time. Moreover, gn is obtained from g0 in O( Σ_i |X| log |Xi| ) = O(|X| log |X|) time.

5 For k ≤ 4, directly multiplying the Hadamard matrix is faster than the WHT in our environment; therefore, we use the WHT only in cases where k ≥ 5.
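For a single length-2^k block, the local transform of equation 2.10 with the Walsh-Hadamard matrix of equation 2.2 can be done in place in O(2^k · k) time by the standard butterfly scheme. The sketch below is illustrative Python, not the authors' code.

```python
def local_walsh_hadamard(v):
    """In-place Walsh-Hadamard transform of a length-2**k list v.

    Equivalent to multiplying v by H_{2**k} of equation 2.2, but in
    O(|X_i| log |X_i|) instead of O(|X_i|**2) operations.
    """
    h = 1
    while h < len(v):
        for base in range(0, len(v), 2 * h):
            for offset in range(base, base + h):
                a, b = v[offset], v[offset + h]
                v[offset], v[offset + h] = a + b, a - b
        h *= 2
    return v

print(local_walsh_hadamard([1.0, 0.0, 0.0, 0.0]))   # [1.0, 1.0, 1.0, 1.0]
```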
2.2.4 O(1) Algorithm to Evaluate Candidate. In line 7 of algorithm 1, all three types of candidates—appending, adjusting, and removing—are evaluated for all y ∈ X. Here, we consider the following equation:

∂y θ̄y = ⟨Φy^2⟩_{pθ} − θ̄y^2   (see the appendix for the derivation)   (2.11)
       = 1 − θ̄y^2   (∵ Φy^2 ≡ 1).   (2.12)

Considering θ̄y as a univariate function of θy and equation 2.12 as an ordinary differential equation, we obtain the following general solution:

θ̄y(θy) = tanh(θy − c)   (see the appendix for the derivation),   (2.13)

where c is a constant determined by a boundary condition. For example, if θ̄y(θ^0_y) = θ̄^0_y, then c is given by c = θ^0_y − tanh^{−1} θ̄^0_y, and equation 2.13 becomes

θ̄y(θy) = tanh(θy − θ^0_y + tanh^{−1} θ̄^0_y).   (2.14)

Here, we define the following line Ay(θ^0) in P:

Ay(θ^0) = {θ | ∀y′ ≠ y, θ_{y′} = θ^0_{y′}}.   (2.15)

Equation 2.14 demonstrates that if θ̄^0_y is known, we can derive any θ̄y at a point on the line Ay(θ^0) in O(1) time. Here, the gradient vector of KL(pd ∥ pθ) is given by

∂y KL(pd ∥ pθ) = θ̄y − d̄y,   d̄y = ⟨Φy⟩_{pd}   (see the appendix for the derivation).   (2.16)

Therefore, for θ ∈ Ay(θ^0), we can obtain KL(pd ∥ pθ) − KL(pd ∥ pθ^0) by integrating equation 2.16 as follows:

KL(pd ∥ pθ) − KL(pd ∥ pθ^0) = ∫_{θ^0_y}^{θy} ( θ̄y(u) − d̄y ) du
  = ((1 + d̄y)/2) ln( (1 + θ̄^0_y)/(1 + θ̄y) ) + ((1 − d̄y)/2) ln( (1 − θ̄^0_y)/(1 − θ̄y) )   (2.17)

(see the appendix for the derivation). Then we can evaluate Δ(θ′) = cost(θ′) − cost(θ^0) for the three types of candidates θ′ as follows:

Appending θy: By equation 2.16, KL(pd ∥ pθ) is minimized when θ̄y = d̄y. Therefore,

Δ = ((1 + d̄y)/2) ln( (1 + θ̄^0_y)/(1 + d̄y) ) + ((1 − d̄y)/2) ln( (1 − θ̄^0_y)/(1 − d̄y) ) + ry.   (2.18)

Adjusting θy: By equation 2.16, KL(pd ∥ pθ) is minimized when θ̄y = d̄y. Therefore,

Δ = ((1 + d̄y)/2) ln( (1 + θ̄^0_y)/(1 + d̄y) ) + ((1 − d̄y)/2) ln( (1 − θ̄^0_y)/(1 − d̄y) ).

Removing θy: θy becomes 0. Therefore,

Δ = ((1 + d̄y)/2) ln( (1 + θ̄^0_y)/(1 + θ̄y(0)) ) + ((1 − d̄y)/2) ln( (1 − θ̄^0_y)/(1 − θ̄y(0)) ) − ry.

Among the three types of candidates—appending, adjusting, and removing—the group of candidates derived by appending θy has almost |X| candidates because θ is a very sparse vector. Therefore, it is important to reduce the computational cost of evaluating candidates derived by appending. Evaluating Δ in equation 2.18 is an expensive task for a central processing unit (CPU) because it involves logarithm computation.^6 Δ in equation 2.18 has the following lower bound:

Δ ≥ −(θ̄^0_y − d̄y)^2 / (1 − (θ̄^0_y)^2) + ry   (see the appendix for the derivation).   (2.19)

Evaluating the lower bound is much faster than evaluating Δ. In the candidate evaluation, the candidate with the lower cost is the "winner." If a candidate's lower bound is greater than the champion's cost—the lowest cost ever found—the candidate has no chance to win; therefore, we can discard the candidate without precise evaluation of Δ. Algorithm 2 presents the details of line 7 of algorithm 1. In line 5 of algorithm 2, if the lower bound of Δ(θ′) given by equation 2.19 is not smaller than the champion's Δ, the candidate θ′ is discarded and the exact evaluation of Δ(θ′) is skipped. This skipping effectively reduces the computational cost of evaluating candidates.^7

6 Logarithm computation is 30 times slower than addition or multiplication in our environment.
7 In our environment, this skipping makes the evaluation of candidates more than 10 times faster.
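The following Python sketch (an illustration with assumed inputs, not the authors' implementation) shows how equations 2.18 and 2.19 are used to screen an appending candidate: the cheap lower bound is computed first, and the exact Δ only when the candidate might beat the current champion.

```python
import math

def delta_append_exact(tb0, db, ry):
    """Exact Delta of equation 2.18 for appending theta_y (theta_bar_y moves to d_bar_y)."""
    return (0.5 * (1 + db) * math.log((1 + tb0) / (1 + db))
            + 0.5 * (1 - db) * math.log((1 - tb0) / (1 - db)) + ry)

def delta_append_lower_bound(tb0, db, ry):
    """Lower bound of equation 2.19; avoids logarithms."""
    return -((tb0 - db) ** 2) / (1 - tb0 ** 2) + ry

def evaluate_append(tb0, db, ry, champion_delta):
    """Return the exact Delta if the candidate may beat the champion, else None."""
    if delta_append_lower_bound(tb0, db, ry) >= champion_delta:
        return None                    # cannot win; skip the expensive evaluation
    return delta_append_exact(tb0, db, ry)

# theta_bar_y = 0.0 (theta_y currently 0), empirical dual d_bar_y = 0.3, penalty r_y = 0.01.
print(evaluate_append(0.0, 0.3, 0.01, champion_delta=-0.001))
```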
2.2.5 Updating θ and pθ. Let θ^0 denote θ before updating and θ^1 denote θ after updating in line 10 of algorithm 1. θ^1 differs from θ^0 only at the y′th component. Therefore, lθ^1(x) = lθ^0(x) + (θ^1_{y′} − θ^0_{y′}) Φ_{y′}(x) and

pθ^1(x) = Z(θ^1)^{−1} exp lθ^1(x)
        = Z(θ^1)^{−1} exp( lθ^0(x) + (θ^1_{y′} − θ^0_{y′}) Φ_{y′}(x) )
        = (Z(θ^0)/Z(θ^1)) pθ^0(x) exp( (θ^1_{y′} − θ^0_{y′}) Φ_{y′}(x) ).   (2.20)

Summing equation 2.20 for all x ∈ X, we obtain

1 = (Z(θ^0)/Z(θ^1)) Σ_x pθ^0(x) exp( (θ^1_{y′} − θ^0_{y′}) Φ_{y′}(x) ).

Therefore,

pθ^1(x) = (Z(θ^0)/Z(θ^1)) pθ^0(x) exp( (θ^1_{y′} − θ^0_{y′}) Φ_{y′}(x) ),
Z(θ^0)/Z(θ^1) = [ Σ_x pθ^0(x) exp( (θ^1_{y′} − θ^0_{y′}) Φ_{y′}(x) ) ]^{−1}.   (2.21)

Algorithm 3 presents the details of lines 10 and 11 in algorithm 1. This algorithm requires O(|X|) time. It should be noted that the expensive exponential computation exp((θ^1_{y′} − θ^0_{y′}) Φ_{y′}(x)) is not used in the for-loop.
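Because Φ_{y′}(x) only takes the values ±1, the two possible factors exp(±(θ^1_{y′} − θ^0_{y′})) can be precomputed outside the loop, which is what the remark about algorithm 3 refers to. A Python sketch of the O(|X|) update follows (illustrative only, not the authors' code).

```python
import math

def update_model_distribution(p, phi_yprime, delta_theta):
    """O(|X|) update of p_theta after changing one parameter (equation 2.21).

    p[x] is the current p_theta0(x), phi_yprime[x] is Phi_{y'}(x) in {+1, -1},
    and delta_theta = theta1_{y'} - theta0_{y'}.
    """
    factor = {+1: math.exp(delta_theta), -1: math.exp(-delta_theta)}  # outside the loop
    new_p = [p[x] * factor[phi_yprime[x]] for x in range(len(p))]
    s = sum(new_p)                       # equals Z(theta1) / Z(theta0)
    return [w / s for w in new_p]

p = [0.25, 0.25, 0.25, 0.25]
phi = [+1, -1, -1, +1]                   # e.g., Phi_y(x) = (-1)**(x_0 + x_1)
print(update_model_distribution(p, phi, 0.5))
```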
2.2.6 Memory Requirements. In the FSLL model, most of the memory consumption is dominated by four large tables for pθ, θ̄, d̄, and r, each of which stores |X| floating-point numbers. However, θ does not require a large amount of memory because it is a sparse vector. For example, if the FSLL model is allowed to use 4 GB of memory, it can handle up to 26 binary variables, 16 three-valued variables (|Xi| = 3), or 13 four-valued variables.

2.3 Convergence to Target Distribution. One major point of interest about algorithm 1 is whether the model distribution converges to a target distribution in the limit N → ∞ or not, where N is the number of samples in the training data. As a result, we can guarantee this convergence. Let us define the following symbols:

A(θ^t) = ∪_y Ay(θ^t)   (Ay is defined by equation 2.15),
my(θ^t, N) = min_{θ∈Ay(θ^t)} fN(θ)   (minimum in the line Ay),
m(θ^t, N) = min_{θ∈A(θ^t)} fN(θ) = min_y my(θ^t, N)   (minimum in all lines Ay),   (2.22)

Ymin = {y | my(θ^t, N) = m(θ^t, N)},
arg my(θ^t, N) = {θ | θ ∈ Ay(θ^t), fN(θ) = my(θ^t, N)},
arg m(θ^t, N) = {θ | θ ∈ A(θ^t), fN(θ) = m(θ^t, N)} = ∪_{y∈Ymin} arg my(θ^t, N).   (2.23)

Then algorithm 1 iteratively selects θ^{t+1} from arg m(θ^t, N).

We first consider the case where the cost function is a continuous function of θ with no regularization term. Here, if and only if f(θ) = min_{θ′∈A(θ)} f(θ′), we say that θ is an axis minimum of f. Then the following theorem holds (Beck, 2015):

Theorem 2. Let f : R^{|X|} → R be a continuous cost function such that B = {θ | f(θ) ≤ f(θ^0)} is a bounded closed set. Then any accumulation point of {θ^t} (t ∈ [0, ∞)) is an axis minimum of f.

The proof is provided in the appendix. Here, if f(θ) has a unique axis minimum, the following corollary is derived from theorem 2:

Corollary 1. Let f be a function satisfying the condition in theorem 2. If f has a unique axis minimum at θ = θmin, then θmin is also the global minimum and lim_{t→∞} θ^t = θmin.

The proof is provided in the appendix. By corollary 1, in the case where the cost function is f(θ) = KL(q ∥ pθ), where q is a positive distribution,^8 the equation lim_{t→∞} KL(q ∥ pθ^t) = 0 holds.

Then we extend the cost function to fN(θ) = f(θ) + r(θ, N), where f : R^{|X|} → R is a continuous function having the unique global minimum at θ = θmin, and r(θ, N) is the regularization term in equation 2.5. The following theorem holds:

Theorem 3. Let f be a function satisfying the condition in theorem 2. Then

lim_{N→∞} lim_{t→∞} f(θ^t(N)) = min_θ f(θ)

holds.

Here, do not confuse lim_{N→∞} lim_{t→∞} f(θ^t(N)) with lim_{t→∞} lim_{N→∞} f(θ^t(N)). The equation lim_{t→∞} lim_{N→∞} f(θ^t(N)) = min_θ f(θ) is trivial by corollary 1, while lim_{N→∞} lim_{t→∞} f(θ^t(N)) = min_θ f(θ) needs a proof. The proof is provided in the appendix. In the case where pd is a positive distribution for sufficiently large N and f(θ) = KL(pd ∥ pθ), theorem 3 guarantees lim_{N→∞} lim_{t→∞} KL(pd ∥ pθ^t(N)) = 0.

8 Positivity is needed to keep B bounded.

3 Experiments

In this section, to demonstrate the performance of the FSLL model, we compare a full-span log-linear model that we refer to as FL with two Boltzmann machines that we refer to as BM-DI and BM-PCD.

3.1 Full-Span Log-Linear Model FL. FL is the full-span log-linear model described in section 2. The model distribution of FL is given by equation 2.1. The cost function is given by equation 2.5. The learning algorithm is algorithm 1 described in section 2.2. The threshold to finish the cost minimization is determined as ε = 10^{−4}.

3.2 Boltzmann Machine BM-DI. BM-DI (Boltzmann machine with direct integration) is a fully connected Boltzmann machine having no hidden variables and no temperature parameter. To examine the ideal performance of the Boltzmann machine, we do not use the Monte Carlo approximation in BM-DI to evaluate equation 1.2. The model distribution of BM-DI is given by equation 1.1. The cost function of BM-DI is KL(pd ∥ pθ) and has no regularization term.

To minimize the cost function with fewer evaluations of the gradient vector, we used a quasi-Newton method, the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm (Nocedal & Wright, 2006), implemented in the Java statistical analysis tool (JSAT; Raff, 2017).

3.3 Boltzmann Machine BM-PCD. BM-PCD (Boltzmann machine with the persistent contrastive divergence method) is similar to BM-DI; however, BM-PCD uses the persistent contrastive divergence method (Tieleman, 2008), a popular Monte Carlo method in Boltzmann machine learning. BM-PCD has some hyperparameters. We tested various combinations of these hyperparameters and determined them as learning rate = 0.01, number of Markov chains = 100, and length of Markov chains = 10,000.

3.4 Training Data. We prepared six training data sets. These data sets are artificial; therefore, their true distributions p*(X) are known. Each is an independent and identically distributed (i.i.d.) data set drawn from its true distribution.

Figure 1: Graphical structure of Ising5 × 4S/L.

Figure 2: Graphical structure of BN20-37S/L.

3.4.1 Ising5 × 4S, Ising5 × 4L. These data sets were drawn from the distribution represented by the 2D Ising model (Newman & Barkema, 1999) with 5 × 4 nodes (see Figure 1). Every Xi takes the value 0 or 1. Ising5 × 4S has 1000 samples, while Ising5 × 4L has 100,000 samples. The true distribution is represented as

p*(x) = Z^{−1} exp( (1/2) Σ_{⟨i,j⟩} si sj ),   si = { 1 (xi = 1), −1 (xi = 0) },   Z = Σ_x exp( (1/2) Σ_{⟨i,j⟩} si sj ),

where ⟨i, j⟩ denotes the adjacent variables in Figure 1. Boltzmann machines can represent p*.
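As an illustration of how such a data set can be generated, the following Python sketch builds the true distribution over the 5 × 4 grid and draws samples from it. It assumes a row-major numbering of the grid nodes, which is not specified in the letter, and it is not the authors' data generator.

```python
import itertools
import math
import random

ROWS, COLS = 5, 4
EDGES = [(r * COLS + c, r * COLS + c + 1) for r in range(ROWS) for c in range(COLS - 1)] \
      + [(r * COLS + c, (r + 1) * COLS + c) for r in range(ROWS - 1) for c in range(COLS)]

def ising_true_distribution():
    """p*(x) = Z^{-1} exp(0.5 * sum over adjacent pairs of s_i s_j), s_i = 2 x_i - 1."""
    # Full enumeration of 2**20 states: slow in pure Python but feasible.
    states = list(itertools.product((0, 1), repeat=ROWS * COLS))
    weights = [math.exp(0.5 * sum((2 * x[i] - 1) * (2 * x[j] - 1) for i, j in EDGES))
               for x in states]
    Z = sum(weights)
    return states, [w / Z for w in weights]

def draw_samples(n_samples, seed=0):
    states, probs = ising_true_distribution()
    rng = random.Random(seed)
    return rng.choices(states, weights=probs, k=n_samples)

samples = draw_samples(1000)     # an "Ising5x4S"-like training set
```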
3.4.2 BN20-37S, BN20-37L. These data sets were drawn from the distribution represented by a Bayesian network with 20 nodes and 37 edges (see Figure 2). X0 has no parents, X1 has X0 as a parent, and each of the other Xi (i ≥ 2) has two parents Yi0, Yi1 ∈ {X0, . . . , Xi−1}. The graphical structure and the contents of the conditional distribution table of each node are determined randomly. Every Xi takes the value 0 or 1. BN20-37S has 1000 samples, while BN20-37L has 100,000 samples. The true distribution is represented as

p*(x) = Π_i p*(xi | yi0, yi1) = exp( Σ_i ln p*(xi | yi0, yi1) ).

Since the ln p*(xi | yi0, yi1) are third-order terms, third-order Boltzmann machines can represent p*.

3.4.3 BN20-54S, BN20-54L. These data sets were drawn from the distribution represented by a Bayesian network with 20 nodes and 54 edges. X0 has no parents, X1 has X0 as a parent, X2 has X0, X1 as parents, and each of the other Xi (i ≥ 3) has three parents Yi0, Yi1, Yi2 ∈ {X0, . . . , Xi−1}. The graphical structure and the contents of the conditional distribution table of each node are determined randomly. Every Xi takes the value 0 or 1. BN20-54S has 1000 samples, and BN20-54L has 100,000 samples. The true distribution is represented as

p*(x) = Π_i p*(xi | yi0, yi1, yi2) = exp( Σ_i ln p*(xi | yi0, yi1, yi2) ).

Since the ln p*(xi | yi0, yi1, yi2) are fourth-order terms, fourth-order Boltzmann machines can represent p*.

3.5 Experimental Platform. All experiments were conducted on a laptop PC (CPU: Intel Core i7-6700K @ 4 GHz; memory: 64 GB; operating system: Windows 10 Pro). All programs were written in and executed on Java 8.

3.6 Results. Table 1 presents performance comparisons between FL, BM-DI, and BM-PCD. We evaluated the accuracy of the learned distribution by KL(p* ∥ pθ). Figure 3 illustrates the comparison of KL(p* ∥ pθ).

For Ising5 × 4S/L, the performance difference between FL and the BMs (BM-DI and BM-PCD) was not remarkable because both FL and the BMs could represent the true distribution p*. The fact that KL(pd ∥ pθ) ≫ KL(p* ∥ pθ) implies that overfitting to pd was successfully suppressed. FL used fewer basis functions than the BMs used, which implies that some basis functions of the BMs were useless for representing p*.
Table 1: Performance Comparison between FL and BM.

Data        Model    KL(pd∥pθ)   KL(p*∥pθ)   #Basis   Time
Ising5×4S   FL       2.501 nat   0.012 nat   31       5 sec
            BM-DI    2.424       0.087       210      13
            BM-PCD   2.504       0.094       210      3
Ising5×4L   FL       0.476       0.004       37       9
            BM-DI    0.473       0.002       210      12
            BM-PCD   0.528       0.053       210      3
BN20-37S    FL       4.355       0.317       39       5
            BM-DI    4.746       0.863       210      17
            BM-PCD   4.803       0.903       210      3
BN20-37L    FL       0.697       0.026       105      12
            BM-DI    1.422       0.750       210      19
            BM-PCD   1.477       0.806       210      3
BN20-54S    FL       3.288       0.697       41       5
            BM-DI    3.743       1.301       210      23
            BM-PCD   3.826       1.338       210      3
BN20-54L    FL       0.430       0.057       192      23
            BM-DI    1.545       1.166       210      21
            BM-PCD   1.620       1.242       210      3

Notes: pd: empirical distribution of training data. p*: true distribution. #Basis: number of used (θy ≠ 0) basis functions. Time: CPU time for learning (median of three trials).

Regarding the accuracy of the model distribution, BM-PCD is less accurate than FL and BM-DI. This disadvantage becomes noticeable when the model distribution is close to the true distribution. Even when large training data are given, some error remains in the model distribution of BM-PCD (e.g., Ising5 × 4L).

For BN20-37S/L and BN20-54S/L, FL outperformed the BMs because only FL could represent p*. To fit p*, FL adaptively selected 39 basis functions for BN20-37S and 105 basis functions for BN20-37L from the |X| − 1 = 2^20 − 1 basis functions. This fact implies that FL constructed a more complex model to fit p* as the training data increased. Furthermore, a comparison of KL(p* ∥ pθ) revealed that the accuracy of the model distribution was remarkably improved as the size of the training data increased in FL. In contrast, the BMs could not fit p* even if a large training data set was supplied.

Figure 3: Comparing accuracy of model by KL(p* ∥ pθ) (lower is better).

Figure 4: Comparing CPU time to learn (lower is better).

Figure 4 illustrates the CPU time to learn the training data sets. BM-PCD was the fastest, and FL was faster than BM-DI for five out of six training data sets. The learning time of BM-PCD is constant because we used a fixed length (10,000) of Markov chains. FL had |X| − 1 = 2^20 − 1 basis functions, while BM-DI had n(n + 1)/2 = 210 basis functions. Nevertheless, FL was faster than BM-DI. For BN20-54L, FL takes a long time to learn because it uses 192 basis functions to construct the model distribution. Using the 192 basis functions, FL successfully constructed a model distribution that fit p*, while the BMs failed.

4 Discussion

The major disadvantage of the FSLL model is that it is not feasible for large problems due to memory consumption and learning speed. If we use a typical personal computer, the problem size is limited to |X| ≲ 2^25. However, as far as we use the FSLL model in this problem domain, the model has the following theoretical and practical advantages.

The first advantage is that the FSLL model can represent arbitrary distributions of X. Furthermore, it is guaranteed that the model distribution converges to any target distribution in the limit of infinite training data.
Here, we view learning machines from an information geometry perspective. The dimension of the distribution space P+ (all positive distributions of X) is |X| − 1, and a learning machine having M parameters spans an M-dimensional manifold in P+ to represent its model distribution (we refer to this manifold as the model manifold).

A learning machine with M < |X| − 1 parameters cannot represent arbitrary distributions in P. Moreover, if M < |X| − 1, there is no guarantee that the true distribution is close to the model manifold, and if the model manifold is remote from the true distribution, the machine's performance will be poor. This poor performance is not improved even if infinite training data are given.

The FSLL model extends the manifold's dimension to |X| − 1 by introducing higher-order factors. The model manifold becomes P itself; thus, there is no further expansion. Therefore, we refer to the model as the full-span log-linear model. The main interest of this letter is that as far as the problem size is |X| ≲ 2^25, the FSLL model is a feasible and practical model.

For example, suppose that we construct a full-span model by adding hidden nodes to a Boltzmann machine having 20 visible nodes. The number of parameters of the Boltzmann machine is E + n, where E is the number of edges and n is the number of nodes. Therefore, it is not practical to construct the full-span model for 20 visible nodes because it requires ≈ 2^20 edges.

The second advantage is that the FSLL model has no hyperparameters; therefore, no hyperparameter tuning is needed. For example, if we use a Boltzmann machine with hidden nodes that learns the true distribution with contrastive divergence methods, we need to determine hyperparameters such as the learning rate, the mini-batch size, and the number and graphical structure of the hidden nodes. The FSLL model, however, automatically learns the training data without human participation.

5 Conclusion and Extension

Suppose that we let the FSLL model learn training data consisting of 20 binary variables. The dimension of the function space spanned by possible positive distributions is 2^20 − 1. The FSLL model has 2^20 − 1 parameters and can fit arbitrary positive distributions. The basis functions of the FSLL model have the following properties:

• Each basis function is a product of univariate functions.
• The basis functions take values 1 or −1.

The proposed learning algorithm exploited these properties and realized fast learning. Our experiments demonstrated the following:

• The FSLL model could learn training data with 20 binary variables within 1 minute with a laptop PC.
• The FSLL model successfully learned the true distribution underlying the training data, even when higher-order terms that depend on three or more variables existed.
• The FSLL model constructed a more complex model to fit the true distribution as the training data increased; however, the learning time became longer.

In this letter, we have presented a basic version of the FSLL model; however, we can extend it as follows (Takabatake & Akaho, 2014, 2015):

• Introducing L1 regularization (Andrew & Gao, 2007)
• Introducing hidden variables

Appendix: Proofs and Derivations of Equations
A.1 Proof of Theorem 1. For brevity and clarity of expression, we use predicate logic notation here. The statement "{fi(X0)gj(X1)}(i ∈ I, j ∈ J) are linearly independent functions of (X0, X1)" is equivalent to the following proposition:

∀x0 ∀x1 [ Σ_{i,j} aij fi(x0) gj(x1) = 0 ] ⇒ ∀i ∀j [ aij = 0 ].

This proposition is proved as follows:

∀x0 ∀x1 [ Σ_{i,j} aij fi(x0) gj(x1) = 0 ]
⇒ ∀x0 [ ∀x1 [ Σ_j ( Σ_i aij fi(x0) ) gj(x1) = 0 ] ]
⇒ ∀x0 ∀j [ Σ_i aij fi(x0) = 0 ]   (∵ {gj} are linearly independent)
⇒ ∀i ∀j [ aij = 0 ]   (∵ {fi} are linearly independent).   □

A.2 Derivation of Equation 2.6.

θ̄y(θ) = ∂y ln Z = (1/Z) ∂y Z = (1/Z) Σ_x ∂y e^{lθ(x)} = (1/Z) Σ_x Φy(x) e^{lθ(x)} = Σ_x pθ(x) Φy(x) = ⟨Φy⟩_{pθ}.

A.3 Derivation of Equation 2.11.

∂y θ̄y = ∂y ⟨Φy⟩_{pθ} = Σ_x Φy(x) ∂y pθ(x)
  = Σ_x Φy(x) pθ(x) ∂y ln pθ(x)
  = Σ_x Φy(x) pθ(x) ∂y( lθ(x) − ln Z )
  = Σ_x Φy(x) pθ(x) ( Φy(x) − ∂y ln Z )
  = Σ_x pθ(x) Φy(x)^2 − θ̄y Σ_x pθ(x) Φy(x)
  = ⟨Φy^2⟩_{pθ} − θ̄y^2.

A.4 Derivation of Equation 2.13. By equation 2.12,

1 = ∂y θ̄y / (1 − θ̄y^2) = (1/2) ( ∂y θ̄y/(1 + θ̄y) + ∂y θ̄y/(1 − θ̄y) )
  = (1/2) ( ∂y ln(1 + θ̄y) − ∂y ln(1 − θ̄y) ).   (A.1)

Integrating equation A.1, we obtain

(1/2) ln( (1 + θ̄y)/(1 − θ̄y) ) = ∫ 1 dθy = θy − c,   (A.2)

where c is a constant of integration. Since the left side of equation A.2 equals tanh^{−1}(θ̄y), we obtain the equation θ̄y = tanh(θy − c).

A.5 Derivation of Equation 2.16.

∂y KL(pd ∥ pθ) = ∂y ( ⟨ln pd⟩_{pd} − ⟨ln pθ⟩_{pd} )
  = −∂y ⟨ln pθ⟩_{pd}
  = −∂y ⟨lθ − ln Z⟩_{pd}
  = −⟨∂y lθ⟩_{pd} + ∂y ln Z
  = −⟨Φy⟩_{pd} + θ̄y
  = θ̄y − d̄y.

A.6 Derivation of Equation 2.17.

∫_{θ^0_y}^{θy} ( θ̄y(u) − d̄y ) du
  = ∫_{θ^0_y}^{θy} tanh(u − c) du − d̄y (θy − θ^0_y)   (by equation 2.13)
  = [ ln cosh(u − c) ]_{u=θ^0_y}^{θy} − d̄y ( tanh^{−1} θ̄y − tanh^{−1} θ̄^0_y )
  = (1/2) ln( (1 − (θ̄^0_y)^2)/(1 − θ̄y^2) ) − (d̄y/2) ln( ((1 + θ̄y)(1 − θ̄^0_y)) / ((1 − θ̄y)(1 + θ̄^0_y)) )
    (∵ cosh^2(u − c) = 1/(1 − tanh^2(u − c)) and u − c = tanh^{−1} θ̄y(u))
  = ((1 + d̄y)/2) ln( (1 + θ̄^0_y)/(1 + θ̄y) ) + ((1 − d̄y)/2) ln( (1 − θ̄^0_y)/(1 − θ̄y) ).
A.7 Derivation of Equation 2.19.

Δ = ((1 + d̄y)/2) ln( (1 + θ̄^0_y)/(1 + d̄y) ) + ((1 − d̄y)/2) ln( (1 − θ̄^0_y)/(1 − d̄y) ) + ry
  = ((1 + d̄y)/2) ( −ln( (1 + d̄y)/(1 + θ̄^0_y) ) ) + ((1 − d̄y)/2) ( −ln( (1 − d̄y)/(1 − θ̄^0_y) ) ) + ry
  ≥ ((1 + d̄y)/2) ( 1 − (1 + d̄y)/(1 + θ̄^0_y) ) + ((1 − d̄y)/2) ( 1 − (1 − d̄y)/(1 − θ̄^0_y) ) + ry
    (∵ −ln x ≥ 1 − x, (1 + d̄y)/2 ≥ 0, (1 − d̄y)/2 ≥ 0)
  = ((θ̄^0_y − d̄y)/2) ( (1 + d̄y)/(1 + θ̄^0_y) − (1 − d̄y)/(1 − θ̄^0_y) ) + ry
  = ((θ̄^0_y − d̄y)/2) ( (1 + d̄y)(1 − θ̄^0_y) − (1 − d̄y)(1 + θ̄^0_y) ) / ( 1 − (θ̄^0_y)^2 ) + ry
  = ((θ̄^0_y − d̄y)/2) ( 2d̄y − 2θ̄^0_y ) / ( 1 − (θ̄^0_y)^2 ) + ry
  = −(θ̄^0_y − d̄y)^2 / ( 1 − (θ̄^0_y)^2 ) + ry,

which is the lower bound in equation 2.19.

A.8 Proof of Theorem 2. Since B is a bounded closed set, {θ^t} has one or more accumulation point(s) in B. As an assumption for a proof by contradiction, assume that a is an accumulation point of {θ^t} and the proposition ∃y ∃b ∈ Ay(a), f(b) < f(a) holds. Let η^j ∈ Ay(θ^j) be the point such that η^j_y = b_y. Since f(a) − f(b) >
0, f(θ) is continuous at θ = b, and a is an accumulation point of {θ^t}, the proposition ∃θ^j, f(η^j) < f(a) holds (in Figure 5, f(η^j) < f(a) for sufficiently small δ).

Figure 5: Existence of θ^j such that f(η^j) < f(a).

Here, η^j ∈ Ay(θ^j), and the inequality

f(θ^{j+1}) = min_{θ ∈ ∪_y Ay(θ^j)} f(θ) ≤ f(η^j)

holds. Therefore, the inequality f(θ^{j+1}) ≤ f(η^j) < f(a) holds. Since f(θ^t) monotonically decreases as t increases, the proposition ∀t ≥ j + 1, f(θ^t) ≤ f(η^j) < f(a) holds. Since f(θ) is continuous at θ = a, no subsequence of {θ^t} can converge to a; that is, a is not an accumulation point of {θ^t}. This fact contradicts the assumption we made at the beginning of this proof.   □

A.9 Proof of Corollary 1. By theorem 2, any accumulation point of {θ^t} is an axis minimum of f; however, the axis minimum of f is unique. Therefore, the unique axis minimum θmin is the only accumulation point of {θ^t}, and θ^t → θmin (t → ∞).   □

A.10 Proof of Theorem 3. Let us define the following symbols:

my(θ^t, ∞) = min_{θ ∈ Ay(θ^t)} f(θ),
arg my(θ^t, ∞) = {θ | θ ∈ Ay(θ^t), f(θ) = my(θ^t, ∞)}.

Figure 6: View of fN(θ).

Figure 6 illustrates the sectional view of fN(θ) along the line Ay(θ^t). As shown in the figure, fN(θ) has a gap with depth ry (see equation 2.5) at θy = 0. Here, we can ignore the gap at θy = 0 for sufficiently large N; that is, the proposition

∃N′ ∀N, N > N′ ⇒ arg my(θ^t, N) = arg my(θ^t, ∞)

holds. Moreover, the proposition

∃N′ ∀N ∀y, N > N′ ⇒ arg my(θ^t, N) = arg my(θ^t, ∞)   (A.3)

holds. By equation 2.23, the proposition

∃N′ ∀N, N > N′ ⇒ arg m(θ^t, N) = arg m(θ^t, ∞)

holds. Here, we compare a sequence {θ^t(N)} with the cost function fN and a sequence {θ^t(∞)} with the cost function f (= f∞). By corollary 1, lim_{t→∞} f(θ^t(∞)) = min_θ f(θ), that is,

∀ε > 0 ∃t′ ∀t, t ≥ t′ ⇒ f(θ^t(∞)) < min_θ f(θ) + ε   (A.4)

holds. Here, let T(N) be the smallest t such that m(θ^{t+1}, N) ≠ m(θ^{t+1}, ∞). Then lim_{N→∞} T(N) = ∞, that is,

∀t′ ∃N, T(N) ≥ t′   (A.5)

holds. By equations A.4 and A.5, the proposition

∀ε > 0 ∃N ∀t, t ≥ T(N) ⇒ f(θ^t(∞)) < min_θ f(θ) + ε   (A.6)

holds, and therefore the proposition

∀ε > 0 ∃N, f(θ^{T(N)}(∞)) < min_θ f(θ) + ε   (A.7)

also holds. Here, by the definition of T(N), θ^{T(N)}(∞) = θ^{T(N)}(N). Therefore, we can modify equation A.7 as

∀ε > 0 ∃N, f(θ^{T(N)}(N)) < min_θ f(θ) + ε.   (A.8)

Since f(θ^t(N)) monotonically decreases as t grows,

∀ε > 0 ∃N ∀t, t ≥ T(N) ⇒ f(θ^t(N)) < min_θ f(θ) + ε.

Using the notation of lim, we obtain

lim_{N→∞} lim_{t→∞} f(θ^t(N)) = min_θ f(θ).   □

Acknowledgments

This research is supported by KAKENHI 17H01793.

References

Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9(1), 147–169. 10.1207/s15516709cog0901_7
Amari, S. (2016). Information geometry and its applications. Berlin: Springer.
Andrew, G., & Gao, J. (2007). Scalable training of L1-regularized log-linear models. In Proceedings of the 24th International Conference on Machine Learning (pp. 33–40). New York: ACM.
Beck, A. (2015). On the convergence of alternating minimization for convex programming with applications to iteratively reweighted least squares and decomposition schemes. SIAM Journal on Optimization, 25(1), 185–209. 10.1137/13094829X
Dennis Jr., J. E., & Moré, J. J. (1977). Quasi-Newton methods, motivation and theory. SIAM Review, 19(1), 46–89. 10.1137/1019005
Fino, B. J., & Algazi, V. R. (1976). Unified matrix treatment of the fast Walsh-Hadamard transform. IEEE Transactions on Computers, 25(11), 1142–1146. 10.1109/TC.1976.1674569
Newman, M., & Barkema, G. (1999). Monte Carlo methods in statistical physics. New York: Oxford University Press.
Nocedal, J., & Wright, S. (2006). Numerical optimization. New York: Springer Science & Business Media.
Pratt, W. K., Andrews, H. C., & Kane, J. (1969). Hadamard transform image coding. Proceedings of the IEEE, 57(1), 58–68. 10.1109/PROC.1969.6869
Raff, E. (2017). JSAT: Java statistical analysis tool, a library for machine learning. Journal of Machine Learning Research, 18(23), 1–5.
Rissanen, J. (2007). Information and complexity in statistical modeling. Berlin: Springer.
Sejnowski, T. J. (1986). Higher-order Boltzmann machines. In AIP Conference Proceedings (vol. 151, pp. 398–403). Melville, NY: American Institute of Physics. 10.1063/1.36246
Smith, W. W. (2010). Handbook of real-time fast Fourier transforms. Piscataway, NJ: IEEE.
Takabatake, K., & Akaho, S. (2014). Basis functions for fast learning of log-linear models. IEICE Technical Report, 114(306), 307–312. (in Japanese)
Takabatake, K., & Akaho, S. (2015). Full-span log-linear model with L1 regularization and its performance. IEICE Technical Report, 115(323), 153–157. (in Japanese)
Tieleman, T. (2008). Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th International Conference on Machine Learning (pp. 1064–1071). New York: ACM.

Received August 25, 2021; accepted January 22, 2022.