REVIEW

Communicated by Dana Ballard

Predictive Coding, Variational Autoencoders,
and Biological Connections

Joseph Marino*
josephmarino@deepmind.com
Computation and Neural Systems, California Institute of Technology,
Pasadena, CA 91125, U.S.A.

We present a review of predictive coding, from theoretical neuroscience,
and variational autoencoders, from machine learning, identifying the
common origin and mathematical framework underlying both areas. As
each area is prominent within its respective field, more firmly connect-
ing these areas could prove useful in the dialogue between neuroscience
and machine learning. After reviewing each area, we discuss two possi-
ble correspondences implied by this perspective: cortical pyramidal den-
drites as analogous to (nonlinear) deep networks and lateral inhibition
as analogous to normalizing flows. These connections may provide new
directions for further investigations in each field.

1 Introduction

1.1 Cybernetics. Machine learning and theoretical neuroscience once
overlapped under the field of cybernetics (Wiener, 1948; Ashby, 1956).
Within this field, perception and control, in both biological and nonbi-
ological systems, were formulated in terms of negative feedback and
feedforward processes. Negative feedback attempts to minimize error
signals by feeding the errors back into the system, whereas feedforward
processing attempts to preemptively reduce error through prediction.
Cybernetics formalized these techniques using probabilistic models, which
estimate the likelihood of random outcomes, and variational calculus, a
technique for estimating functions, particularly probability distributions
(Wiener, 1948). This resulted in the first computational models of neuron
function and learning (McCulloch & Pitts, 1943; Rosenblatt, 1958; Widrow
& Hoff, 1960), a formal definition of information (Wiener, 1942; Shannon,
1948) (with connections to neural systems; Barlow, 1961b), and algorithms
for negative feedback perception and control (MacKay, 1956; Kalman,
1960). Yet with advances in these directions (see Prieto et al., 2016), the
cohesion of cybernetics diminished, with the new ideas taking root in, for

*The author is now at DeepMind, London, U.K.

Neural Computation 34, 1–44 (2022)
https://doi.org/10.1162/neco_a_01458

© 2021 Massachusetts Institute of Technology


Figure 1: Concept overview. Cybernetics influenced the areas that became the-
oretical neuroscience and machine learning, resulting in shared mathematical
concepts. This review explores the connections between predictive coding, from
theoretical neuroscience, and variational autoencoders, from machine learning.

example, theoretical neuroscience, machine learning, and control theory.
The transfer of ideas is shown in Figure 1.

1.2 Neuroscience and Machine Learning: Convergence and Diver-
gence. A renewed dialogue between neuroscience and machine learning
formed in the 1980s and 1990s. Neuroscientists, bolstered by new physi-
ological and functional analyses, began making traction in studying neu-
ral systems in probabilistic and information-theoretic terms (Laughlin,
1981; Srinivasan, Laughlin, & Dubs, 1982; Barlow, 1989; Bialek, Rieke,
Van Steveninck, & Warland, 1991). In machine learning, improvements in
probabilistic modeling (Pearl, 1986) and artificial neural networks (Rumel-
hart, Hinton, & Williams, 1986) combined with ideas from statistical me-
chanics (Hopfield, 1982; Ackley, Hinton, & Sejnowski, 1985) to yield new
classes of models and training techniques. This convergence of ideas,

primarily centered around perception, resulted in new theories of neural
processing and improvements in their mathematical underpinnings.

In particular, the notion of predictive coding emerged within neuro-
science (Srinivasan et al., 1982; Rao & Ballard, 1999). In its most general
form, predictive coding postulates that neural circuits are engaged in es-
timating probabilistic models of other neural activity and sensory inputs,
with feedback and feedforward processes playing a central role. These
models were initially formulated in early sensory areas, for example, in
the retina (Srinivasan et al., 1982) and thalamus (Dong & Atick, 1995), us-
ing feedforward processes to predict future neural activity. Similar notions
were extended to higher-level sensory processing in neocortex by David
Mumford (1991, 1992). Top-down neural projections (from higher-level to
lower-level sensory areas) were hypothesized to convey sensory predic-
tions, whereas bottom-up neural projections were hypothesized to convey
prediction errors. Through negative feedback, these errors then updated
state estimates. These ideas were formalized by Rao and Ballard (1999), for-
mulating a simplified artificial neural network model of images, reminis-
cent of a Kalman filter (Kalman, 1960).
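This negative feedback scheme can be sketched in a few lines. The following is a toy illustration only (not Rao and Ballard's actual model): a linear generative model predicts an observation from a latent state estimate, and the prediction error is repeatedly fed back to refine that estimate. All dimensions, parameters, and step sizes here are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear generative model: observations are predicted as W @ z.
W = rng.normal(size=(8, 3))  # generative weights (8-dim observation, 3-dim state)
z_true = rng.normal(size=3)  # underlying cause of the observation
x = W @ z_true               # noiseless observation, for simplicity

# Negative feedback inference: feed the prediction error back to refine
# the state estimate (gradient descent on the squared prediction error).
z = np.zeros(3)              # initial state estimate
step = 0.05
for _ in range(2000):
    error = x - W @ z        # bottom-up prediction error
    z = z + step * W.T @ error  # top-down update of the state estimate

# After inference, the refined estimate's prediction explains the observation.
print(np.linalg.norm(x - W @ z))
```

Each iteration reduces the prediction error, mirroring the description above: predictions flow top-down, errors flow bottom-up, and the error signal drives the state update.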

Feedback and feedforward processes also featured prominently in ma-
chine learning. Indeed, the primary training algorithm for artificial neural
networks, backpropagation (Rumelhart et al., 1986), literally feeds (prop-
agates) the output prediction errors back through the network—negative
feedback. During this period, the technique of variational inference was
rediscovered within machine learning (Hinton & Van Camp, 1993; Neal
& Hinton, 1998), recasting probabilistic inference using variational calcu-
lus. This technique proved essential in formulating the Helmholtz machine
(Dayan et al., 1995; Dayan & Hinton, 1996), a hierarchical unsupervised
probabilistic model parameterized by artificial neural networks. Similar ad-
vances were made in autoregressive probabilistic models (Frey, Hinton, &
Dayan, 1996; Bengio & Bengio, 2000), using artificial neural networks to
form sequential feedforward predictions, as well as new classes of invertible
probabilistic models (Comon, 1994; Parra, Deco, & Miesbach, 1995; Deco &
Brauer, 1995; Bell & Sejnowski, 1997).

These new ideas regarding variational inference and probabilistic mod-
els, particularly the Helmholtz machine (Dayan, Hinton, Neal, & Zemel,
1995), influenced predictive coding. Specifically, Karl Friston utilized vari-
ational inference to formulate hierarchical dynamical models of neocortex
(Friston, 2005, 2008a). In line with Mumford (1992), these models contain
multiple levels, with each level attempting to predict its future activity
(feedforward) as well as lower-level activity, closer to the input data. Predic-
tion errors across levels facilitate updating higher-level estimates (negative
feedback). Such models have incorporated many biological aspects, includ-
ing local learning rules (Friston, 2005) and attention (Spratling, 2008; Feld-
man & Friston, 2010; Kanai, Komura, Shipp, & Friston, 2015), and have been
compared with neural circuits (Bastos et al., 2012; Keller & Mrsic-Flogel,

2018; Walsh, McGovern, Clark, & O’Connell, 2020). While predictive cod-
ing and other Bayesian brain theories are increasingly popular (Doya, Ishii,
Pouget, & Rao, 2007; Friston, 2009; Clark, 2013), validating these models
is hampered by the difficulty of distinguishing between specific design
choices and general theoretical claims (Gershman, 2019). Further, a large
gap remains between the simplified implementations of these models and
the complexity of neural systems.

Progress in machine learning picked up in the early 2010s, with ad-
vances in parallel computing as well as standardized data sets (Deng
et al., 2009). In this era of deep learning (LeCun, Bengio, & Hinton, 2015;
Schmidhuber, 2015), that is, artificial neural networks with multiple layers,
a flourishing of ideas emerged around probabilistic modeling. Building off
previous work, more expressive classes of deep hierarchical (Gregor, Dani-
helka, Mnih, Blundell, & Wierstra, 2014; Mnih & Gregor, 2014; Kingma &
Welling, 2014; Rezende, Mohamed, & Wierstra, 2014), autoregressive (Uria,
Murray, & Larochelle, 2014; van den Oord, Kalchbrenner, & Kavukcuoglu,
2016), and invertible (Dinh, Krueger, & Bengio, 2015; Dinh, Sohl-Dickstein,
& Bengio, 2017) probabilistic models were developed. Of particular impor-
tance is a model class known as variational autoencoders (VAEs; Kingma
& Welling, 2014; Rezende et al., 2014), a relative of the Helmholtz machine,
which closely resembles hierarchical predictive coding. Unfortunately, de-
spite this similarity, the machine learning community remains largely obliv-
ious to the progress in predictive coding and vice versa.

1.3 Connecting Predictive Coding and VAEs. This review aims to
bridge the divide between predictive coding and VAEs. While this work
provides unique contributions, it is inspired by previous work at this in-
tersection. In particular, van den Broeke (2016) outlines hierarchical proba-
bilistic models in predictive coding and machine learning. Likewise, Lotter,
Kreiman, and Cox (2017, 2018) implement predictive coding techniques in
deep probabilistic models, comparing these models with neural phenom-
ena.

After reviewing background mathematical concepts in section 2, we dis-
cuss the basic formulations of predictive coding in section 3 and variational
autoencoders in section 4, and we identify commonalities in their model
formulations and inference techniques in section 5. Based on these connec-
tions, in section 6, we discuss two possible correspondences between ma-
chine learning and neuroscience seemingly suggested by this perspective:

• Dendrites of pyramidal neurons and deep artificial networks, af-
firming a more nuanced perspective over the analogy of biological
and artificial neurons

• Lateral inhibition and normalizing flows, providing a more general
framework for normalization.


Like the work of van den Broeke (2016) and Lotter et al. (2017, 2018), we
hope that these connections will inspire future research in exploring this
promising direction.

2 Background

2.1 Maximum Log Likelihood. Consider a random variable, x ∈ R^M,
with a corresponding distribution, p_data(x), defining the probability of ob-
serving each possible value. This distribution is the result of an underly-
ing data-generating process, for example, the emission and scattering of
photons. While we do not have direct access to p_data, we can sample obser-
vations, x ∼ p_data(x), yielding an empirical distribution, p̂_data(x). Often we
wish to model p_data, for example, for prediction or compression. We refer
to this model as p_θ(x), with parameters θ. Estimating the model parameters
involves maximizing the log likelihood of data samples under the model’s
distribution:

$$\theta^{*} \leftarrow \underset{\theta}{\arg\max}\; \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log p_{\theta}(x)\big]. \tag{2.1}$$

This is the maximum log-likelihood objective, which is found throughout
machine learning and probabilistic modeling (Murphy, 2012). In practice,
we do not have access to p_data(x) and instead approximate the objective us-
ing data samples, that is, using p̂_data(x).
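As a concrete illustration of this sample-based approximation, the sketch below fits a Gaussian model p_θ(x) with θ = (μ, log σ) by gradient ascent on the average log likelihood of samples. The data distribution, learning rate, and iteration count are all arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=10_000)  # samples from p_data

# Gradient ascent on the sample-based objective, approximating the
# expectation over p_data(x) with the empirical distribution.
mu, log_sigma = 0.0, 0.0
lr = 0.1
for _ in range(2000):
    sigma2 = np.exp(2 * log_sigma)
    # Gradients of the average Gaussian log likelihood w.r.t. mu and log_sigma.
    grad_mu = np.mean(data - mu) / sigma2
    grad_log_sigma = np.mean((data - mu) ** 2 / sigma2 - 1.0)
    mu += lr * grad_mu
    log_sigma += lr * grad_log_sigma

# The fitted parameters are close to the true loc=2.0 and scale=1.5.
print(round(mu, 2), round(np.exp(log_sigma), 2))
```

At the optimum, the gradients vanish exactly when μ equals the sample mean and σ² equals the sample variance, recovering the familiar closed-form Gaussian maximum likelihood estimates.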

2.2 Probabilistic Models.

2.2.1 Dependency Structure. A probabilistic model includes the depen-
dency structure (see section 2.2.1) and the parameterization of these depen-
dencies (see section 2.2.2). The dependency structure is the set of conditional
dependencies between variables (see Figure 2). One common form is given
by autoregressive models (Frey et al., 1996; Bengio & Bengio, 2000), which
use the chain rule of probability:

$$p_{\theta}(x) = \prod_{j=1}^{M} p_{\theta}(x_j \mid x_{<j}). \tag{2.2}$$

By inducing an ordering over the M dimensions of x, we can factor the joint
distribution, p_θ(x), into a product of M conditional distributions, each con-
ditioned on the previous dimensions, x_{<j}. A natural use case arises in mod-
eling sequential data, where time provides an ordering over a sequence of
T variables, x_{1:T}:

$$p_{\theta}(x_{1:T}) = \prod_{t=1}^{T} p_{\theta}(x_t \mid x_{<t}). \tag{2.3}$$
