Communicated by Scott Kirkpatrick

Deterministic Boltzmann Learning Performs Steepest
Descent in Weight-Space

Geoffrey E. Hinton
Department of Computer Science, University of Toronto,
10 King's College Road, Toronto M5S 1A4, Canada

The Boltzmann machine learning procedure has been successfully applied in deterministic networks of analog units that use a mean field approximation to efficiently simulate a truly stochastic system (Peterson and Anderson 1987). This type of "deterministic Boltzmann machine" (DBM) learns much faster than the equivalent "stochastic Boltzmann machine" (SBM), but since the learning procedure for DBM's is only based on an analogy with SBM's, there is no existing proof that it performs gradient descent in any function, and it has only been justified by simulations. By using the appropriate interpretation for the way in which a DBM represents the probability of an output vector given an input vector, it is shown that the DBM performs steepest descent in the same function as the original SBM, except at rare discontinuities. A very simple way of forcing the weights to become symmetrical is also described, and this makes the DBM more biologically plausible than back-propagation (Werbos 1974; Parker 1985; Rumelhart et al. 1986).

1 Introduction

The promising results obtained by Peterson and Anderson (1987) using a DBM are hard to assess because they present no mathematical guarantee that the learning does gradient descent in any error function (except in the limiting case of a very large net with small random weights). It is quite conceivable that in a DBM the computed gradient might have a small systematic difference from the true gradient of the normal performance measure for each training case, and when these slightly incorrect gradients are added together over many cases their resultant might bear little relation to the resultant of the true case-wise gradients (see Fig. 1).

2 The Learning Procedure for Stochastic Boltzmann Machines

A Boltzmann machine (Hinton and Sejnowski 1986) is a network of symmetrically connected binary units that asynchronously update their states



according to a stochastic decision rule. The units have states of 1 or 0 and
the probability that unit $i$ adopts the state 1 is given by

$$p_i = \sigma\!\left(\frac{1}{T}\sum_j s_j w_{ij}\right) \tag{2.1}$$

where $s_j$ is the state of the $j$th unit, $w_{ij}$ is the weight on the connection
between the $j$th and the $i$th unit, $T$ is the "temperature" and $\sigma$ is a smooth
non-linear function defined as

$$\sigma(x) = \frac{1}{1 + e^{-x}} \tag{2.2}$$

If the binary states of units are updated asynchronously and repeatedly
using equation 2.1, the network will reach "thermal equilibrium" so that
the relative probabilities of global configurations are determined by their
energies according to the Boltzmann distribution:

$$\frac{P_\alpha}{P_\beta} = e^{-(E_\alpha - E_\beta)/T} \tag{2.3}$$

where $P_\alpha$ is the probability of a global configuration and $E_\alpha$ is its energy
defined by

$$E_\alpha = -\sum_{i<j} s_i^\alpha s_j^\alpha w_{ij} \tag{2.4}$$

where $s_i^\alpha$ is the binary state of unit $i$ in the $\alpha$th global configuration, and
bias terms are ignored because they can always be treated as weights on
connections from a permanently active unit.
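As a concrete illustration of equations 2.1-2.4, the following minimal sketch (not from the original paper; the function names, sweep count, and random seeds are arbitrary assumptions) performs asynchronous stochastic updates on a small network at temperature $T$:

```python
import numpy as np

def settle(w, s, T=1.0, sweeps=50, rng=None):
    """Asynchronous stochastic updates of binary states s (equation 2.1).

    w is a symmetric weight matrix with zero diagonal; s is a 0/1 vector.
    Clamping a unit amounts to simply skipping it in the update loop.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(s)
    for _ in range(sweeps):
        for i in rng.permutation(n):
            net = w[i] @ s                              # sum_j s_j w_ij
            p_i = 1.0 / (1.0 + np.exp(-net / T))        # sigma(net / T), equations 2.1-2.2
            s[i] = 1.0 if rng.random() < p_i else 0.0
    return s

def energy(w, s):
    """Energy of a global configuration (equation 2.4), counting each pair once."""
    return -0.5 * s @ w @ s

# Tiny example: 4 units, random symmetric weights, one settling run.
rng = np.random.default_rng(1)
w = rng.normal(size=(4, 4))
w = np.triu(w, 1); w = w + w.T                          # symmetric, zero diagonal
s = rng.integers(0, 2, size=4).astype(float)
print(energy(w, settle(w, s, rng=rng)))
```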

Figure 1: The true gradients of the performance measure are $a$ and $b$ for two
training cases. Even fairly accurate estimates, $\hat{a}$ and $\hat{b}$, can have a resultant that
points in a very different direction.

At any given temperature, $T$, the Boltzmann distribution is the one
that minimizes the Helmholtz free energy, $F$, of the distribution. $F$ is
defined by the equation


$$F = \langle E \rangle - TH \tag{2.5}$$

where $\langle E \rangle$ is the expected value of the energy given the probability dis-
tribution over configurations and $H$ is the entropy of the distribution. It
can be shown that minima of $F$ (which will be denoted by $F^*$) satisfy
the equation

$$F^* = -T \log \sum_\alpha e^{-E_\alpha / T} \tag{2.6}$$

In a stochastic Boltzmann machine, the probability of an output vector,
$O_\beta$, given an input vector, $I_\alpha$, is represented by

$$P^-(O_\beta \mid I_\alpha) = e^{-(F^*_{\alpha\beta} - F^*_\alpha)/T} \tag{2.7}$$

where $F^*_{\alpha\beta}$ is the minimum free energy with $I_\alpha$ and $O_\beta$ clamped, and $F^*_\alpha$
is the minimum free energy with just $I_\alpha$ clamped. A very natural way
to observe $P^-(O_\beta \mid I_\alpha)$ is to allow the network to reach thermal equilib-
rium with $I_\alpha$ clamped, and to observe the probability of $O_\beta$. The key to
Boltzmann machine learning is the simple way in which a small change
to a weight, $w_{ij}$, affects the free energy and hence the log probability of
an output vector in a network at thermal equilibrium:

$$\frac{\partial F^*}{\partial w_{ij}} = -\langle s_i s_j \rangle \tag{2.8}$$

where $\langle s_i s_j \rangle$ is the expected value of $s_i s_j$ in the minimum free energy
distribution. The simple relationship between weight changes and log
probabilities of output vectors makes it easy to teach the network an
input-output mapping. The network is "shown" the mapping that it
is required to perform by clamping an input vector on the input units
and clamping the required output vector on the output units (with the
appropriate conditional probability). It is then allowed to reach thermal
equilibrium at $T = 1$, and at equilibrium each connection measures how
often the units it connects are simultaneously active. This is repeated
for all input-output pairs so that each connection can measure $\langle s_i s_j \rangle^+$,
the expected probability, averaged over all cases, that unit $i$ and unit
$j$ are simultaneously active at thermal equilibrium when the input and
output vectors are both clamped. The network must also be run in just
the same way but without clamping the output units to measure $\langle s_i s_j \rangle^-$,
the expected probability that both units are active at thermal equilibrium
when the output vector is determined by the network. Each weight is
then updated by

$$\Delta w_{ij} = \epsilon \left( \langle s_i s_j \rangle^+ - \langle s_i s_j \rangle^- \right) \tag{2.9}$$
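The two-phase procedure can be sketched as follows, reusing the numpy conventions of the earlier snippet; the helper names, clamping mask, and sampling schedule are assumptions rather than details from the paper. Here `pos_stats` would be the co-occurrence matrix averaged over training cases with inputs and outputs clamped, and `neg_stats` the matrix with only the inputs clamped.

```python
def equilibrium_stats(w, s, clamped, T=1.0, burn_in=20, samples=50, rng=None):
    """Estimate <s_i s_j> at thermal equilibrium, keeping `clamped` units fixed."""
    if rng is None:
        rng = np.random.default_rng(2)
    n = len(s)
    acc = np.zeros((n, n))
    for sweep in range(burn_in + samples):
        for i in rng.permutation(n):
            if clamped[i]:
                continue                                 # clamped units keep their states
            p = 1.0 / (1.0 + np.exp(-(w[i] @ s) / T))    # equation 2.1
            s[i] = 1.0 if rng.random() < p else 0.0
        if sweep >= burn_in:
            acc += np.outer(s, s)                        # accumulate s_i s_j statistics
    return acc / samples

def boltzmann_weight_update(w, pos_stats, neg_stats, eps=0.01):
    """Equation 2.9: Delta w_ij = eps * (<s_i s_j>+ - <s_i s_j>-)."""
    dw = eps * (pos_stats - neg_stats)
    np.fill_diagonal(dw, 0.0)                            # keep zero self-connections
    return w + dw
```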

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

/

e
d
u
n
e
c
o
a
r
t
i
c
e

p
d

/

l

f
/

/

/

/

/

1
1
1
4
3
8
1
1
8
1
5
n
e
c
o
1
9
8
9
1
1
1
4
3
p
d

.

.

.

.

.

f

b
y
g
u
e
s
t

t

o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3

146

Geoffrey E. Hinton

It follows from equation 2.7 and equation 2.8 that if $\epsilon$ is sufficiently
small this performs steepest descent in an information theoretic measure,
$G$, of the difference between the behavior of the output units when they
are clamped and their behavior when they are not clamped:

$$G = \sum_{\alpha\beta} P^+(I_\alpha, O_\beta) \log \frac{P^+(O_\beta \mid I_\alpha)}{P^-(O_\beta \mid I_\alpha)} \tag{2.10}$$

where $I_\alpha$ is a state vector over the input units, $O_\beta$ is a state vector over the
output units, $P^+$ is a probability measured at thermal equilibrium when
both the input and output units are clamped, and $P^-$ is a probability
measured when only the input units are clamped.
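The steepest-descent claim can be made explicit; the following short derivation is a sketch of the standard argument (not text from the original paper), combining equations 2.7, 2.8, and 2.10 at $T = 1$:

$$\frac{\partial G}{\partial w_{ij}}
= -\sum_{\alpha\beta} P^+(I_\alpha, O_\beta)\, \frac{\partial \log P^-(O_\beta \mid I_\alpha)}{\partial w_{ij}}
= -\sum_{\alpha\beta} P^+(I_\alpha, O_\beta) \left( \langle s_i s_j \rangle_{\alpha\beta} - \langle s_i s_j \rangle_{\alpha} \right)
= -\left( \langle s_i s_j \rangle^+ - \langle s_i s_j \rangle^- \right)$$

since by equation 2.7 (at $T = 1$) $\log P^-(O_\beta \mid I_\alpha) = F^*_\alpha - F^*_{\alpha\beta}$, and by equation 2.8 each free energy contributes its own co-occurrence statistic. The update of equation 2.9 is therefore $\Delta w_{ij} = -\epsilon\, \partial G / \partial w_{ij}$, i.e. steepest descent in $G$.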

Stochastic Boltzmann machines learn slowly, partly because of the
time required to reach thermal equilibrium and partly because the learning
is driven by the difference between two noisy variables, so these variables
must be sampled for a long time at thermal equilibrium to reduce
the noise. If we could achieve the same simple relationships between log
probabilities and weights in a deterministic system, learning would be
much faster.

3 Mean Field Theory

Under certain conditions, a stochastic system can be approximated by a
deterministic one by replacing the stochastic binary variables of equation
2.1 by deterministic real-valued variables that represent their mean values:

$$p_i = \sigma\!\left(\frac{1}{T}\sum_j p_j w_{ij}\right) \tag{3.1}$$

We could now perform discrete, asynchronous updates of the $p_i$ using
equation 3.1, or we could use a synchronous, discrete time approximation
of the set of differential equations

$$\frac{dp_i}{dt} = -p_i + \sigma\!\left(\frac{1}{T}\sum_j p_j w_{ij}\right) \tag{3.2}$$
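A minimal sketch of the synchronous, discrete-time approximation to equation 3.2 (the step size, tolerance, and clamping convention are assumptions, not values from the paper):

```python
def mean_field_settle(w, p, clamped, T=1.0, dt=0.2, tol=1e-6, max_steps=1000):
    """Synchronous discrete-time relaxation of equation 3.2.

    Each step moves the unclamped p_i a fraction dt toward sigma((w @ p) / T);
    at convergence the p_i satisfy the fixed-point condition of equation 3.1.
    """
    p = p.copy()
    for _ in range(max_steps):
        target = 1.0 / (1.0 + np.exp(-(w @ p) / T))
        new_p = np.where(clamped, p, p + dt * (target - p))   # clamped units stay fixed
        if np.max(np.abs(new_p - p)) < tol:
            return new_p
        p = new_p
    return p
```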

We shall view the $p_i$ as a representation of a probability distribution
over all binary global configurations. Since many different distributions
can give rise to the same mean values for the $p_i$, we shall assume that
the distribution being represented is the one that maximizes the entropy,
subject to the constraints imposed on the mean values by the $p_i$. Equiva-
lently, it is the distribution in which the $p_i$ are treated as the mean values
of independent stochastic binary variables. Using equation 2.5 we can cal-
culate the free energy of the distribution represented by the state of a
DBM (at $T = 1$):


$$F = -\sum_{i<j} p_i p_j w_{ij} + \sum_i \left[\, p_i \log p_i + (1 - p_i)\log(1 - p_i) \,\right] \tag{3.3}$$

Although the dynamics of the system defined by equation 3.2 do not consist in following the gradient of $F$, it can be shown that the system always moves in a direction that has a positive cosine with the gradient of $-F$, so it settles to one of the minima of $F$ (Hopfield 1984). Mean field systems are normally viewed as approximations to stochastic systems that really do contain the higher order statistics, but they can also be viewed as exact systems whose power to represent probability distributions is strongly limited, because they use only $N$ real values to represent a distribution over $2^N$ binary states. Within the limits of their representational powers, they are an efficient way of manipulating these large but constrained distributions.

4 Deterministic Boltzmann Machine Learning

In a DBM, we shall define the representation of $P^-(O_\beta \mid I_\alpha)$ exactly as in equation 2.7, except that now $F^*_{\alpha\beta}$ and $F^*_\alpha$ will refer to the free energies of the particular minima that the network actually settles into. Unfortunately, for a DBM this is no longer equivalent to the more obvious way of defining $P^-(O_\beta \mid I_\alpha)$, which is to clamp $I_\alpha$ on the input units, settle to a minimum of $F$, and interpret the states of the output units as a distribution over output vectors, using the maximum entropy assumption.

The reason for choosing the first definition rather than the second is this: provided the stable states do not change radically when the weights are changed slightly, the mean field version of the learning procedure changes each weight in proportion to the gradient of $\log P^-(O_\beta \mid I_\alpha)$, which is what is required to perform steepest descent in the performance measure $G$ of equation 2.10. When $w_{ij}$ is incremented by an infinitesimal amount $\epsilon$, two things happen to $F^*$ (see Fig. 2). First, the energy of the represented state is decreased by $\epsilon\, p_i p_j$, and, to first order, the energies of all nearby states are decreased by the same amount. Second, the point at which $F$ is minimized moves slightly, so the $p_i$ change slightly. But, because $F$ is at a minimum with respect to the $p_i$, this movement has, to first order, no effect on the value of $F^*$.
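For concreteness, the free energy of equation 3.3 and the deterministic version of the learning step can be sketched as follows, reusing `mean_field_settle` from the earlier snippet. The procedure follows the Peterson and Anderson style DBM learning described above, but the function names and the convention that the clamped values are carried in the initial $p$ vectors are assumptions, not details from the paper.

```python
def mean_field_free_energy(w, p):
    """Free energy of the represented distribution (equation 3.3, T = 1).

    Assumes 0 < p_i < 1 so that the entropy terms are finite.
    """
    entropy_term = np.sum(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))
    return -0.5 * p @ w @ p + entropy_term

def dbm_learning_step(w, p_plus_init, p_minus_init, io_clamp, input_clamp, eps=0.01):
    """One training-case update of the deterministic Boltzmann machine.

    Settle once with the input and output units clamped to their target values
    (carried in p_plus_init) and once with only the inputs clamped, then change
    each weight in proportion to the difference of the products p_i p_j -- the
    mean field analogue of equation 2.9.
    """
    p_plus = mean_field_settle(w, p_plus_init, clamped=io_clamp)
    p_minus = mean_field_settle(w, p_minus_init, clamped=input_clamp)
    dw = eps * (np.outer(p_plus, p_plus) - np.outer(p_minus, p_minus))
    np.fill_diagonal(dw, 0.0)                            # no self-connections
    return w + dw
```

Note that this weight change is automatically symmetric in $i$ and $j$, consistent with the symmetric connections the analysis assumes.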