REVIEW

Communicated by Dana Ballard

Predictive Coding, Variational Autoencoders,
and Biological Connections

Joseph Marino*
josephmarino@deepmind.com
Computation and Neural Systems, California Institute of Technology,
Pasadena, CA 91125, U.S.A.

We present a review of predictive coding, from theoretical neuroscience,
and variational autoencoders, from machine learning, identifying the
common origin and mathematical framework underlying both areas. As
each area is prominent within its respective field, more firmly connecting
these areas could prove useful in the dialogue between neuroscience
and machine learning. After reviewing each area, we discuss two possible
correspondences implied by this perspective: cortical pyramidal dendrites
as analogous to (nonlinear) deep networks and lateral inhibition
as analogous to normalizing flows. These connections may provide new
directions for further investigations in each field.

1 Introduction

1.1 Cybernetics. Machine learning and theoretical neuroscience once
overlapped under the field of cybernetics (Wiener, 1948; Ashby, 1956).
Within this field, perception and control, in both biological and nonbiological
systems, were formulated in terms of negative feedback and
feedforward processes. Negative feedback attempts to minimize error
signals by feeding the errors back into the system, whereas feedforward
processing attempts to preemptively reduce error through prediction.
Cybernetics formalized these techniques using probabilistic models, which
estimate the likelihood of random outcomes, and variational calculus, a
technique for estimating functions, particularly probability distributions
(Wiener, 1948). This resulted in the first computational models of neuron
function and learning (McCulloch & Pitts, 1943; Rosenblatt, 1958; Widrow
& Hoff, 1960), a formal definition of information (Wiener, 1942; Shannon,
1948) (with connections to neural systems; Barlow, 1961b), and algorithms
for negative feedback perception and control (MacKay, 1956; Kalman,
1960). Yet with advances in these directions (see Prieto et al., 2016), the
cohesion of cybernetics diminished, with the new ideas taking root in, for

*The author is now at DeepMind, London, U.K.

Neural Computation 34, 1–44 (2022)
https://doi.org/10.1162/neco_a_01458

© 2021 Massachusetts Institute of Technology

Figure 1: Concept overview. Cybernetics influenced the areas that became theoretical
neuroscience and machine learning, resulting in shared mathematical
concepts. This review explores the connections between predictive coding, from
theoretical neuroscience, and variational autoencoders, from machine learning.

example, theoretical neuroscience, machine learning, and control theory.
The transfer of ideas is shown in Figure 1.

1.2 Neuroscience and Machine Learning: Convergence and Divergence.
A renewed dialogue between neuroscience and machine learning
formed in the 1980s and 1990s. Neuroscientists, bolstered by new physiological
and functional analyses, began gaining traction in studying neural
systems in probabilistic and information-theoretic terms (Laughlin,
1981; Srinivasan, Laughlin, & Dubs, 1982; Barlow, 1989; Bialek, Rieke,
Van Steveninck, & Warland, 1991). In machine learning, improvements in
probabilistic modeling (Pearl, 1986) and artificial neural networks (Rumelhart,
Hinton, & Williams, 1986) combined with ideas from statistical mechanics
(Hopfield, 1982; Ackley, Hinton, & Sejnowski, 1985) to yield new
classes of models and training techniques. This convergence of ideas,
primarily centered around perception, resulted in new theories of neural
processing and improvements in their mathematical underpinnings.

In particular, the notion of predictive coding emerged within neuroscience
(Srinivasan et al., 1982; Rao & Ballard, 1999). In its most general
form, predictive coding postulates that neural circuits are engaged in estimating
probabilistic models of other neural activity and sensory inputs,
with feedback and feedforward processes playing a central role. These
models were initially formulated in early sensory areas, for example, in
the retina (Srinivasan et al., 1982) and thalamus (Dong & Atick, 1995), using
feedforward processes to predict future neural activity. Similar notions
were extended to higher-level sensory processing in neocortex by David
Mumford (1991, 1992). Top-down neural projections (from higher-level to
lower-level sensory areas) were hypothesized to convey sensory predictions,
whereas bottom-up neural projections were hypothesized to convey
prediction errors. Through negative feedback, these errors then updated
state estimates. These ideas were formalized by Rao and Ballard (1999), who
formulated a simplified artificial neural network model of images, reminiscent
of a Kalman filter (Kalman, 1960).
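
As a rough illustration of the negative feedback loop described above, the following Python sketch infers a state estimate for a linear generative model by repeatedly feeding prediction errors back into the estimate. The linear model, the weight matrix, the step size, and all function names are assumptions made for illustration; this is not the model of Rao and Ballard (1999).

```python
def predict(weights, state):
    """Top-down prediction of the input from the current state: x_hat = W z."""
    return [sum(w * z for w, z in zip(row, state)) for row in weights]


def infer_state(weights, observation, steps=200, step_size=0.1):
    """Estimate the state by gradient descent on the squared prediction error."""
    state = [0.0] * len(weights[0])
    for _ in range(steps):
        prediction = predict(weights, state)
        # Bottom-up signal: the error between the input and its prediction.
        errors = [x - x_hat for x, x_hat in zip(observation, prediction)]
        # Negative feedback: z <- z + step_size * W^T e.
        for k in range(len(state)):
            feedback = sum(weights[i][k] * errors[i] for i in range(len(errors)))
            state[k] += step_size * feedback
    return state


# Hypothetical weights and state, chosen only for illustration.
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
true_state = [0.5, -1.0]
observation = predict(weights, true_state)  # noiseless input
estimate = infer_state(weights, observation)
```

In this noiseless toy setting the estimate settles at the state that generated the input, at which point the prediction errors vanish and the feedback stops changing the estimate.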

Feedback and feedforward processes also featured prominently in machine
learning. Indeed, the primary training algorithm for artificial neural
networks, backpropagation (Rumelhart et al., 1986), literally feeds (propagates)
the output prediction errors back through the network, a form of negative
feedback. During this period, the technique of variational inference was
rediscovered within machine learning (Hinton & Van Camp, 1993; Neal
& Hinton, 1998), recasting probabilistic inference using variational calculus.
This technique proved essential in formulating the Helmholtz machine
(Dayan et al., 1995; Dayan & Hinton, 1996), a hierarchical unsupervised
probabilistic model parameterized by artificial neural networks. Similar advances
were made in autoregressive probabilistic models (Frey, Hinton, &
Dayan, 1996; Bengio & Bengio, 2000), using artificial neural networks to
form sequential feedforward predictions, as well as new classes of invertible
probabilistic models (Comon, 1994; Parra, Deco, & Miesbach, 1995; Deco &
Brauer, 1995; Bell & Sejnowski, 1997).

These new ideas regarding variational inference and probabilistic models,
particularly the Helmholtz machine (Dayan, Hinton, Neal, & Zemel,
1995), influenced predictive coding. Specifically, Karl Friston utilized variational
inference to formulate hierarchical dynamical models of neocortex
(Friston, 2005, 2008a). In line with Mumford (1992), these models contain
multiple levels, with each level attempting to predict its future activity
(feedforward) as well as lower-level activity, closer to the input data. Prediction
errors across levels facilitate updating higher-level estimates (negative
feedback). Such models have incorporated many biological aspects, including
local learning rules (Friston, 2005) and attention (Spratling, 2008; Feldman
& Friston, 2010; Kanai, Komura, Shipp, & Friston, 2015), and have been
compared with neural circuits (Bastos et al., 2012; Keller & Mrsic-Flogel,
2018; Walsh, McGovern, Clark, & O’Connell, 2020). While predictive coding
and other Bayesian brain theories are increasingly popular (Doya, Ishii,
Pouget, & Rao, 2007; Friston, 2009; Clark, 2013), validating these models
is hampered by the difficulty of distinguishing between specific design
choices and general theoretical claims (Gershman, 2019). Further, a large
gap remains between the simplified implementations of these models and
the complexity of neural systems.

Progress in machine learning picked up in the early 2010s, with advances
in parallel computing as well as standardized data sets (Deng
et al., 2009). In this era of deep learning (LeCun, Bengio, & Hinton, 2015;
Schmidhuber, 2015), that is, artificial neural networks with multiple layers,
a flourishing of ideas emerged around probabilistic modeling. Building off
previous work, more expressive classes of deep hierarchical (Gregor, Danihelka,
Mnih, Blundell, & Wierstra, 2014; Mnih & Gregor, 2014; Kingma &
Welling, 2014; Rezende, Mohamed, & Wierstra, 2014), autoregressive (Uria,
Murray, & Larochelle, 2014; van den Oord, Kalchbrenner, & Kavukcuoglu,
2016), and invertible (Dinh, Krueger, & Bengio, 2015; Dinh, Sohl-Dickstein,
& Bengio, 2017) probabilistic models were developed. Of particular importance
is a model class known as variational autoencoders (VAEs; Kingma
& Welling, 2014; Rezende et al., 2014), a relative of the Helmholtz machine,
which closely resembles hierarchical predictive coding. Unfortunately, despite
this similarity, the machine learning community remains largely oblivious
to the progress in predictive coding and vice versa.

1.3 Connecting Predictive Coding and VAEs. This review aims to
bridge the divide between predictive coding and VAEs. While this work
provides unique contributions, it is inspired by previous work at this intersection.
In particular, van den Broeke (2016) outlines hierarchical probabilistic
models in predictive coding and machine learning. Similarly, Lotter,
Kreiman, and Cox (2017, 2018) implement predictive coding techniques in
deep probabilistic models, comparing these models with neural phenomena.

After reviewing background mathematical concepts in section 2, we discuss
the basic formulations of predictive coding in section 3 and variational
autoencoders in section 4, and we identify commonalities in their model
formulations and inference techniques in section 5. Based on these connections,
in section 6, we discuss two possible correspondences between machine
learning and neuroscience seemingly suggested by this perspective:

• Dendrites of pyramidal neurons and deep artificial networks, affirming
a more nuanced perspective over the analogy of biological
and artificial neurons

• Lateral inhibition and normalizing flows, providing a more general
framework for normalization.

Like the work of van den Broeke (2016) and Lotter et al. (2017, 2018), we
hope that these connections will inspire future research in exploring this
promising direction.

2 Background

2.1 Maximum Log Likelihood. Consider a random variable, $x \in \mathbb{R}^M$,
with a corresponding distribution, $p_{\text{data}}(x)$, defining the probability of observing
each possible value. This distribution is the result of an underlying
data-generating process, for example, the emission and scattering of
photons. While we do not have direct access to $p_{\text{data}}$, we can sample observations,
$x \sim p_{\text{data}}(x)$, yielding an empirical distribution, $\hat{p}_{\text{data}}(x)$. Often we
wish to model $p_{\text{data}}$, for example, for prediction or compression. We refer
to this model as $p_\theta(x)$, with parameters $\theta$. Estimating the model parameters
involves maximizing the log likelihood of data samples under the model’s
distribution:

$$\theta^* \leftarrow \arg\max_{\theta} \, \mathbb{E}_{x \sim p_{\text{data}}(x)} \left[ \log p_\theta(x) \right]. \tag{2.1}$$

This is the maximum log-likelihood objective, which is found throughout
machine learning and probabilistic modeling (Murphy, 2012). In practice,
we do not have access to $p_{\text{data}}(x)$ and instead approximate the objective using
data samples, that is, using $\hat{p}_{\text{data}}(x)$.
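
To make the sample-based approximation concrete, the following Python sketch fits a univariate Gaussian model by maximizing the average log likelihood over samples. The Gaussian model family, the scalar setting (M = 1), the synthetic data-generating process, and all function names are assumptions made for illustration, not part of the original formulation.

```python
import math
import random


def gaussian_log_prob(x, mean, std):
    """Log density of a univariate Gaussian model, log p_theta(x)."""
    var = std ** 2
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


def average_log_likelihood(samples, mean, std):
    """Sample-based estimate of the expected log likelihood under the data."""
    return sum(gaussian_log_prob(x, mean, std) for x in samples) / len(samples)


def fit_gaussian(samples):
    """Closed-form maximizer of the average log likelihood for a Gaussian."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    return mean, math.sqrt(var)


rng = random.Random(0)
# Hypothetical data-generating process standing in for p_data.
samples = [rng.gauss(2.0, 0.5) for _ in range(5000)]
mean_hat, std_hat = fit_gaussian(samples)
```

For the Gaussian family the maximizer is available in closed form (the sample mean and the biased sample standard deviation); for most models of interest, the same objective is instead maximized iteratively, for example with gradient-based optimization.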

2.2 Probabilistic Models.

2.2.1 Dependency Structure. A probabilistic model includes the dependency
structure (see section 2.2.1) and the parameterization of these dependencies
(see section 2.2.2). The dependency structure is the set of conditional
dependencies between variables (see Figure 2). One common form is given
by autoregressive models (Frey et al., 1996; Bengio & Bengio, 2000), which
use the chain rule of probability:

$$p_\theta(x) = \prod_{j=1}^{M} p_\theta(x_j \mid x_{<j}). \tag{2.2}$$

By inducing an ordering over the $M$ dimensions of $x$, we can factor the joint distribution, $p_\theta(x)$, into a product of $M$ conditional distributions, each conditioned on the previous dimensions, $x_{<j}$. A natural use case arises in modeling sequential data, where time provides an ordering over a sequence of $T$ variables, $x_{1:T}$:

$$p_\theta(x_{1:T}) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}).$$
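
The chain-rule factorization can be sketched in a few lines of Python. The example below defines an autoregressive model over binary vectors and checks that the product of conditionals yields a normalized joint distribution. The binary variables, the logistic form of the conditional, and its coefficients are hypothetical choices made for illustration only.

```python
import itertools
import math


def conditional_prob_one(prefix):
    """Hypothetical p(x_j = 1 | x_{<j}): a logistic function of the prefix sum."""
    logit = -0.5 + 1.5 * sum(prefix)
    return 1.0 / (1.0 + math.exp(-logit))


def log_joint(x):
    """log p(x) = sum_j log p(x_j | x_{<j}), the chain rule in log space."""
    total = 0.0
    for j, x_j in enumerate(x):
        p_one = conditional_prob_one(x[:j])  # conditions only on earlier dimensions
        total += math.log(p_one if x_j == 1 else 1.0 - p_one)
    return total


M = 4
# Summing the joint over all 2^M binary configurations should give 1.
total_prob = sum(
    math.exp(log_joint(x)) for x in itertools.product([0, 1], repeat=M)
)
```

Because each conditional is itself a normalized distribution, the product is normalized for any choice of conditional, which is what makes autoregressive models convenient to parameterize with neural networks.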