Communicated by Justin Dauwels

Reverse-Engineering Neural Networks to Characterize
Their Cost Functions

Takuya Isomura
takuya.isomura@riken.jp
Brain Intelligence Theory Unit, 日本理化学研究所脑科学中心,
你的, 埼玉 351-0198, 日本

Karl Friston
k.friston@ucl.ac.uk
Wellcome Centre for Human Neuroimaging, Institute of Neurology,
伦敦大学学院, 伦敦, WC1N 3AR, U.K.

This letter considers a class of biologically plausible cost functions for
神经网络, where the same cost function is minimized by both neu-
ral activity and plasticity. We show that such cost functions can be cast
as a variational bound on model evidence under an implicit generative
模型. Using generative models based on partially observed Markov de-
cision processes (POMDP), we show that neural activity and plasticity
perform Bayesian inference and learning, 分别, by maximizing
model evidence. Using mathematical and numerical analyses, we estab-
lish the formal equivalence between neural network cost functions and
variational free energy under some prior beliefs about latent states that
generate inputs. These prior beliefs are determined by particular con-
stants (例如, thresholds) that define the cost function. This means that the
Bayes optimal encoding of latent or hidden states is achieved when the
network’s implicit priors match the process that generates its inputs. 这
equivalence is potentially important because it suggests that any hyper-
parameter of a neural network can itself be optimized—by minimization
with respect to variational free energy. 此外, it enables one to
characterize a neural network formally, in terms of its prior beliefs.

1 介绍

Cost functions are ubiquitous in scientific fields that entail optimization—
including physics, 化学, 生物学, 工程, 和机器学习.
此外, any optimization problem that can be specified using a cost
function can be formulated as a gradient descent. In the neurosciences, 这
enables one to treat neuronal dynamics and plasticity as an optimization
过程 (马尔, 1969; Albus, 1971; Schultz, Dayan, & Montague, 1997; Sut-
吨 & Barto, 1998; Linsker, 1988; 棕色的, Yamada, & Sejnowski, 2001). 这些

神经计算 32, 2085–2121 (2020) © 2020 麻省理工学院.
https://doi.org/10.1162/neco_a_01315
在知识共享下发布
归因 4.0 国际的 (抄送 4.0) 执照.

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

/

e
d

n
e
C

A
r
t

C
e

p
d

/

F
/

/

/

/

3
2
1
1
2
0
8
5
1
8
6
5
4
2
3
n
e
C

_
A
_
0
1
3
1
5
p
d

.

/

F


y
G

e
s
t

t


n
0
8
S
e
p
e


e
r
2
0
2
3

2086

时间. Isomura and K. 弗里斯顿

examples highlight the importance of specifying a problem in terms of cost
功能, from which neural and synaptic dynamics can be derived. 在
也就是说, cost functions provide a formal (IE。, normative) expression of
the purpose of a neural network and prescribe the dynamics of that neural
网络. 至关重要的是, once the cost function has been established and an initial
condition has been selected, it is no longer necessary to solve the dynam-
集成电路. 反而, one can characterize the neural network’s behavior in terms
of fixed points, basin of attraction and structural stability—based only on
the cost function. 简而言之, it is important to identify the cost function to
understand the dynamics, plasticity, and function of a neural network.

A ubiquitous cost function in neurobiology, theoretical biology, and ma-
chine learning is model evidence or equivalently, marginal likelihood or
surprise—namely, the probability of some inputs or data under a model of
how those inputs were generated by unknown or hidden causes (Bishop,
2006; Dayan & Abbott, 2001). 一般来说, the evaluation of surprise is in-
tractable (especially for neural networks) as it entails a logarithm of an in-
tractable marginal (IE。, integral). 然而, this evaluation can be converted
into an optimization problem by inducing a variational bound on surprise.
In machine learning, this is known as an evidence lower bound (ELBO; Blei,
Kucukelbir, & McAuliffe, 2017), while the same quantity is known as vari-
ational free energy in statistical physics and theoretical neurobiology.

Variational free energy minimization is a candidate principle that gov-
erns neuronal activity and synaptic plasticity (弗里斯顿, Kilner, & Harrison,
2006; 弗里斯顿, 2010). 这里, surprise reflects the improbability of sensory in-
puts given a model of how those inputs were caused. 反过来, 最小化
variational free energy, as a proxy for surprise, corresponds to inferring the
(unobservable) causes of (observable) 结果. To the extent that bi-
ological systems minimize variational free energy, it is possible to say that
they infer and learn the hidden states and parameters that generate their
sensory inputs (冯·亥姆霍兹, 1925; Knill & Pouget, 2004; DiCarlo, Zoc-
colan, & Rust, 2012) and consequently predict those inputs (饶 & Ballard,
1999; 弗里斯顿, 2005). This is generally referred to as perceptual inference
based on an internal generative model about the external world (Dayan,
欣顿, Neal, & Zemel, 1995; 乔治 & Hawkins, 2009; Bastos et al., 2012).
Variational free energy minimization provides a unified mathematical
formulation of these inference and learning processes in terms of self-
organizing neural networks that function as Bayes optimal encoders. 更多的-
超过, organisms can use the same cost function to control their surrounding
environment by sampling predicted (IE。, preferred) 输入. This is known
as active inference (弗里斯顿, Mattout, & Kilner, 2011). The ensuing free-
energy principle suggests that active inference and learning are mediated
by changes in neural activity, synaptic strengths, and the behavior of an or-
ganism to minimize variational free energy as a proxy for surprise. Cru-
cially, variational free energy and model evidence rest on a generative
model of continuous or discrete hidden states. A number of recent studies

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

/

e
d

n
e
C

A
r
t

C
e

p
d

/

F
/

/

/

/

3
2
1
1
2
0
8
5
1
8
6
5
4
2
3
n
e
C

_
A
_
0
1
3
1
5
p
d

.

/

F


y
G

e
s
t

t


n
0
8
S
e
p
e


e
r
2
0
2
3

Reverse-Engineering Neural Networks

2087

have used Markov decision process (MDP) generative models to elaborate
schemes that minimize variational free energy (弗里斯顿, FitzGerald, Rigoli,
Schwartenbeck, & Pezzulo, 2016, 2017; 弗里斯顿, Parr, & de Vries, 2017; Fris-
吨, 林等人。, 2017). This minimization reproduces various interesting dy-
namics and behaviors of real neuronal networks and biological organisms.
然而, it remains to be established whether variational free energy min-
imization is an apt explanation for any given neural network, as opposed
to the optimization of alternative cost functions.

原则, any neural network that produces an output or a decision
can be cast as performing some form of inference in terms of Bayesian deci-
sion theory. On this reading, the complete class theorem suggests that any
neural network can be regarded as performing Bayesian inference under
some prior beliefs; 所以, it can be regarded as minimizing variational
free energy. The complete class theorem (Wald, 1947; 棕色的, 1981) 状态
that for any pair of decisions and cost functions, there are some prior beliefs
(implicit in the generative model) that render the decisions Bayes optimal.
This suggests that it should be theoretically possible to identify an implicit
generative model within any neural network architecture, which renders
its cost function a variational free energy or ELBO. 然而, 虽然
complete class theorem guarantees the existence of a generative model, 它
does not specify its form. 下文中, we show that a ubiquitous class
of neural networks implements approximates Bayesian inference under a
generic discrete state space model with a known form.

In brief, we adopt a reverse-engineering approach to identify a plausible
cost function for neural networks and show that the resulting cost function
is formally equivalent to variational free energy. 这里, we define a cost func-
tion as a function of sensory input, neural activity, and synaptic strengths
and suppose that neural activity and synaptic plasticity follow a gradient
descent on the cost function (assumption 1). For simplicity, we consider
single-layer feedforward neural networks comprising firing-rate neuron
models—receiving sensory inputs weighted by synaptic strengths—whose
firing intensity is determined by the sigmoid activation function (assump-
的 2). We focus on blind source separation (BSS), namely the problem
of separating sensory inputs into multiple hidden sources or causes (是-
louchrani, Abed-Meraim, Cardoso, & Moulines, 1997; Cichocki, Zdunek,
Phan, & Amari, 2009; Comon & Jutten, 2010), which provides the minimum
setup for modeling causal inference. A famous example of BSS is the cocktail
party effect: the ability of a partygoer to disambiguate an individual’s voice
from the noise of a crowd (Brown et al., 2001; Mesgarani & 张, 2012).
Previously, we observed BSS performed by in vitro neural networks (Iso-
mura, Kotani, & Jimbo, 2015) and reproduced this self-supervised process
using an MDP and variational free energy minimization (Isomura & Fris-
吨, 2018). These works suggest that variational free energy minimization
offers a plausible account of the empirical behavior of in vitro networks.

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

/

e
d

n
e
C

A
r
t

C
e

p
d

/

F
/

/

/

/

3
2
1
1
2
0
8
5
1
8
6
5
4
2
3
n
e
C

_
A
_
0
1
3
1
5
p
d

.

/

F


y
G

e
s
t

t


n
0
8
S
e
p
e


e
r
2
0
2
3

2088

时间. Isomura and K. 弗里斯顿

在这项工作中, we ask whether variational free energy minimization can
account for the normative behavior of a canonical neural network that min-
imizes its cost function, by considering all possible cost functions, 之内
a generic class. Using mathematical analysis, we identify a class of cost
functions—from which update rules for both neural activity and synaptic
plasticity can be derived. The gradient descent on the ensuing cost func-
tion naturally leads to Hebbian plasticity (Hebb, 1949; Bliss & Lømo, 1973;
Malenka & Bear, 2004) with an activity-dependent homeostatic term. 我们
show that these cost functions are formally homologous to variational free
energy under an MDP. 至关重要的是, this means the hyperparameters (IE。, 任何
variables or constants) of the neural network can be associated with prior
beliefs of the generative model. 原则, this allows one to optimize the
neural network hyperparameters (例如, thresholds and learning rates), 给定
some priors over the causes (IE。, latent states) of inputs to the neural net-
工作. 此外, estimating hyperparameters from the dynamics of (在
silico or in vitro) neural networks allows one to quantify the network’s im-
plicit prior beliefs. In this letter, we focus on the mathematical foundations
for applications to in vitro and in vivo neuronal networks in subsequent
工作.

2 方法

在这个部分, we formulate the same computation in terms of variational
Bayesian inference and neural networks to demonstrate their correspon-
登塞. We first derive the form of a variational free energy cost function un-
der a specific generative model, a Markov decision process.1 We present the
derivations carefully, with a focus on the form of the ensuing Bayesian be-
lief updating. The functional form of this update will reemerge later, 什么时候
reverse engineering the cost functions implicit in neural networks. 这些
correspondences are depicted in Figure 1 和表 1. This section starts
with a description of Markov decision processes as a general kind of gener-
ative model and then considers the minimization of variational free energy
under these models.

2.1 Generative Models. Under an MDP model (see Figure 1A), a mini-
mal BSS setup (in a discrete space) reduces to the likelihood mapping from
Ns hidden sources or states st ≡
to No observations ot ≡
s(1)
(西德:2)
t
t
哦(1)
. Each source and observation takes a value of one (ON state)
t

, . . . , 哦(不)

, . . . , s(Ns )

(西德:3)

(西德:3)

(西德:2)

时间

时间

t

1

Strictly speaking, the generative model we use in this letter is a hidden Markov model
(HMM) because we do not consider probabilistic transitions between hidden states that
depend on control variables. 然而, for consistency with the literature on variational
treatments of discrete statespace models, we retain the MDP formalism noting that we are
using a special case (with unstructured state transitions).

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

/

e
d

n
e
C

A
r
t

C
e

p
d

/

F
/

/

/

/

3
2
1
1
2
0
8
5
1
8
6
5
4
2
3
n
e
C

_
A
_
0
1
3
1
5
p
d

.

/

F


y
G

e
s
t

t


n
0
8
S
e
p
e


e
r
2
0
2
3

Reverse-Engineering Neural Networks

2089

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

/

e
d

n
e
C

A
r
t

C
e

p
d

/

数字 1: Comparison between an MDP scheme and a neural network. (A) MDP
scheme expressed as a Forney factor graph (Forney, 2001; Dauwels, 2007) 基于
on the formulation in Friston, Parr et al. (2017). In this BSS setup, the prior
D determines hidden states st, while st determines observation ot through the
likelihood mapping A. Inference corresponds to the inversion of this gener-
ative process. 这里, D∗ indicates the true prior, while D indicates the prior
under which the network operates. If D = D∗, the inference is optimal; 其他-
明智的, it is biased. (乙) Neural network comprising a singlelayer feedforward net-
work with a sigmoid activation function. The network receives sensory inputs
)时间
)T that are generated from hidden states st
ot
t
, . . . , xtNx )时间 . 这里, xt j should encode the
and outputs neural activities xt
posterior expectation about a binary state s( j)
. In an analogy with the cocktail
t
party effect, st and ot correspond to individual speakers and auditory inputs,
分别.

, . . . , 哦(不)

, . . . , s(Ns )

= (哦(1)
t

= (s(1)
t

= (xt1

t

(西德:2)

or zero (OFF state) at each time step, 那是, s( j)
{1, 0}. Throughout this
t
letter, j denotes the jth hidden state, while i denotes the ith observation.
(西德:3)
(西德:3)
The probability of s( j)
,
t
, D( j)
where D( j)
0

(西德:2)
s( j)
t
= 1 (see Figure 1A, 顶部).

follows a categorical distribution P

∈ R2 with D( j)
1

+ D( j)
0

= Cat

D( j)
1

, 哦(我)
t

D( j)

The probability of an outcome is determined by the likelihood mapping
from all hidden states to each kind of observation in terms of a categor-
= Cat(A(我)) (see Figure 1A, 中间). 这里,
ical distribution, 磷
each element of the tensor A(我) ∈ R2×2Ns parameterizes the probability that

|st, A(我)

哦(我)
t

(西德:3)

(西德:3)

(西德:2)

(西德:2)

F
/

/

/

/

3
2
1
1
2
0
8
5
1
8
6
5
4
2
3
n
e
C

_
A
_
0
1
3
1
5
p
d

.

/

F


y
G

e
s
t

t


n
0
8
S
e
p
e


e
r
2
0
2
3

2090

时间. Isomura and K. 弗里斯顿

桌子 1: Correspondence of Variables and Functions.

Neural Network Formation

Neural activity

Sensory inputs

Synaptic strengths

Perturbation term

Threshold

h j1

Initial synaptic strengths

Variational
Bayes Formation

State posterior

观察结果

Parameter posterior

State prior

⇐⇒ s( j)
xt j
t1
ot ⇐⇒ ot
(西德:4)
−1

(西德:5)

Wj1

ˆWj1

A(·, j)

11

⇐⇒ sig
≡ sig(Wj1 ) ⇐⇒ A(·, j)
⇐⇒ ln D( j)
φ
j1
(西德:4)
1
(西德:2)1 − A(·, j)
· (西德:2)1 + ln D( j)
1

(西德:5)

11

11

⇐⇒ ln

λ

j1

(西德:7) ˆW init
j1

⇐⇒ a(·, j)

11

Parameter prior

(西德:3)

(西德:2)

(西德:2)

(西德:3)

(西德:2)

哦(我)
t

A(我)
·(西德:2)我

= k|st = (西德:2)我

(西德:3)
, where k ∈ {1, 0} are possible observations and (西德:2)l ∈ {1, 0}Ns

encodes a particular combination of hidden states. The prior distribution of
each column of A(我), denoted by A(我)
=
·(西德:2)我
∈ R2. We use Dirichlet distribu-
Dir
系统蒸发散, as they are tractable and widely used for random variables that take
a continuous value between zero and one. 此外, learning the likeli-
hood mapping leads to biologically plausible update rules, which have the
form of associative or Hebbian plasticity (see below and Friston et al., 2016,
欲了解详情).

with concentration parameter a(我)
·(西德:2)我

, has a Dirichlet distribution P

We use ˜o ≡ (o1

, . . . , ot ) and ˜s ≡ (s1

, . . . , st ) to denote sequences of obser-
vations and hidden states, 分别. With this notation in place, the gen-
erative model (IE。, the joint distribution over outcomes, hidden states and
the parameters of their likelihood mapping) can be expressed as

A(我)
·(西德:2)我

磷 ( ˜o, ˜s, A) =P (A)

t(西德:6)

τ =1

磷 ( | , A) 磷 ( )

=

不(西德:6)

我=1

磷(A(我)) ·


不(西德:6)

我=1

t(西德:6)

τ =1

(西德:4)
t | , A(我)
哦(我)

(西德:5) Ns(西德:6)

(西德:4)
s( j)
t

j=1


(西德:5)

.

(2.1)

Throughout this letter, t denotes the current time, and τ denotes an arbitrary
time from the past to the present, 1 ≤ τ ≤ t.

2.2 Minimization of Variational Free Energy. In this MDP scheme, 这
aim is to minimize surprise by minimizing variational free energy as a

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

/

e
d

n
e
C

A
r
t

C
e

p
d

/

F
/

/

/

/

3
2
1
1
2
0
8
5
1
8
6
5
4
2
3
n
e
C

_
A
_
0
1
3
1
5
p
d

.

/

F


y
G

e
s
t

t


n
0
8
S
e
p
e


e
r
2
0
2
3

Reverse-Engineering Neural Networks

2091

proxy, 那是, performing approximate or variational Bayesian inference.
From the generative model, we can motivate a mean-field approximation
to the posterior (IE。, 认出) density as follows,

问 ( ˜s, A) = Q (A) 问 ( ˜s) =

不(西德:6)

我=1

问(A(我)) ·

(西德:4)

s( j)
t

(西德:5)
,

t(西德:6)

Ns(西德:6)

τ =1

j=1

(2.2)

(西德:3)

(西德:2)

(西德:3)

(西德:3)

A(我)

= Dir

where A(我) is the likelihood mapping (IE。, tensor), and the marginal poste-
(西德:2)
rior distributions of s( j)
τ and A(我) have a categorical Q

(西德:2)
A(我)
Dirichlet distribution Q
, 分别. For simplicity, 我们
assume that A(我) factorizes into the product of the likelihood mappings from
the jth hidden state to the ith observation: A(我)
(在哪里
⊗ denotes the outer product and A(我, j) ∈ R2×2). 这 (mean-field) 大约-
mation simplifies the computation of state posteriors and serves to specify a
particular form of Bayesian model which corresponds to a class of canonical
神经网络 (见下文).

k· ⊗ · · · ⊗ A(我,Ns )

k· ≈ A(我,1)

(西德:2)
= Cat

s( j)
t

s( j)
t

(西德:3)

k

下文中, a case variable in bold indicates the posterior expecta-
tion of the corresponding variable in italics. 例如, s( j)
takes the value
t
0 或者 1, while the posterior expectation s( j)
τ ∈ R2 is the expected value of s( j)
t
that lies between zero and one. 而且, A(我, j) ∈ R2×2 denotes positive con-
centration parameters. 以下, we use the posterior expectation of ln A(我, j) 到
encode posterior beliefs about the likelihood, which is given by

熔点(我, j)
· j

≡ EQ(A(我, j))

= ln a(我, j)

·l

(西德:14)

(西德:13)
熔点(我, j)
· j
(西德:4)
A(我, j)

− ln

1我

(西德:4)

= ψ

(西德:5)

A(我, j)
·l
(西德:5)

− ψ
(西德:15)(西德:4)

(西德:4)
A(我, j)
(西德:5)−1

1我

+ A(我, j)
(西德:16)

0我

(西德:5)

+ A(我, j)

0我

+ 氧

A(我, j)
·l

,

(2.3)

where l ∈ {1, 0}. 这里, ψ (·) (西德:6)(西德:11)
(·) /(西德:6) (·) denotes the digamma function,
which arises naturally from the definition of the Dirichlet distribution. (看
Friston et al., 2016, for details.) EQ(A(我, j)) [·] denotes the expectation over the
posterior of A(我, j).

The ensuing variational free energy of this generative model is then

给出的

F ( ˜o, 问 ( ˜s) , 问 (A))

t(西德:17)

(西德:18)

EQ( )问(A) [− ln P ( | , A)] + D

(西德:19)
吉隆坡 [问 ( ) ||磷 ( )]

τ =1
+ D

吉隆坡 [问 (A) ||磷 (A)]

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

/

e
d

n
e
C

A
r
t

C
e

p
d

/

F
/

/

/

/

3
2
1
1
2
0
8
5
1
8
6
5
4
2
3
n
e
C

_
A
_
0
1
3
1
5
p
d

.

/

F


y
G

e
s
t

t


n
0
8
S
e
p
e


e
r
2
0
2
3

2092

时间. Isomura and K. 弗里斯顿

Ns(西德:17)

t(西德:17)

=

s( j)
t

·

j=1
(西德:22)

τ =1

(西德:20)

不(西德:17)

-

熔点(我, j) · o(我)
(西德:23)(西德:24)
accuracy+state complexity

我=1

t + ln s( j)

τ − ln D( j)

Ns(西德:17)

不(西德:17)

+

我=1
(西德:22)

j=1

{(A(我, j) − a(我, j)) · ln A(我, j) − ln B(A(我, j))}
,
(西德:25)

(西德:23)(西德:24)

(西德:21)

(西德:25)

(2.4)

parameter complexity

t

·

1我

0我

1我

0我

(西德:3)

(西德:2)

(西德:3)

(西德:3)

(西德:3)

(西德:2)

(西德:2)

(西德:2)

(西德:2)

(西德:2)

(西德:2)

(西德:3)

(西德:2)

(西德:3)(西德:3)

/(西德:6)

s( j)
t

(西德:3)
(西德:6)

A(我, j)

A(我, j)

A(我, j)

ln s( j)

A(我, j)
·1

A(我, j)
·l

with B

+ A(我, j)

(西德:2)
A(我, j)
·0

τ − ln D( j)

t , D
(西德:3)

和 (状态) 复杂

i=1 ln A(我, j) · o(我)

function. The first term in the final equality comprises the accuracy
(西德:26)

where ln A(我, j) · o(我)
hot encoded vector of o(我)
Kullback–Leibler divergence (库尔巴克 & 莱布勒, 1951) 和乙
(西德:2)
(西德:3)

(西德:6)

τ denotes the inner product of ln A(我, j) and a one-
吉隆坡 [·(西德:12)·] is the complexity as scored by the

A(我, j)
is the beta
− s( j)
·
t
. The accu-
racy term is simply the expected log likelihood of an observation, 尽管
complexity scores the divergence between prior and posterior beliefs. 在
也就是说, complexity reflects the degree of belief updating or degrees
of freedom required to provide an accurate account of observations. 两个都
belief updates to states and parameters incur a complexity cost: 国家
complexity increases with time t, while parameter complexity increases on
the order of ln t—and is thus negligible when t is large (see section A.1 for
细节). This means that we can ignore parameter complexity when the
scheme has experienced a sufficient number of outcomes. We drop the pa-
rameter complexity in subsequent sections. In the remainder of this section,
we show how the minimization of variational free energy transforms (IE。,
updates) priors into posteriors when the parameter complexity is evaluated
explicitly.

Inference optimizes posterior expectations about the hidden states by
minimizing variational free energy. The optimal posterior expectations are
obtained by solving the variation of F to give

s( j)
t

= σ

(西德:27)

不(西德:17)

我=1

(西德:28)

熔点(我, j) · o(我)
t

+ ln D( j)

= σ

(西德:4)
熔点(·, j) · ot + ln D( j)

(西德:5)

,

(2.5)

where σ (·) is the softmax function (see Figure 1A, 底部). As s( j)
is a binary
t
value in this work, the posterior expectation of s( j)
taking a value of one (在
t
状态) can be expressed as

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

/

e
d

n
e
C

A
r
t

C
e

p
d

/

F
/

/

/

/

3
2
1
1
2
0
8
5
1
8
6
5
4
2
3
n
e
C

_
A
_
0
1
3
1
5
p
d

.

/

F


y
G

e
s
t

t


n
0
8
S
e
p
e


e
r
2
0
2
3

Reverse-Engineering Neural Networks

2093

s( j)
t1

=

(西德:4)
熔点(·, j)
·1

经验值
(西德:4)
熔点(·, j)
·1

= sig

经验值

(西德:4)
熔点(·, j)
·1
(西德:5)
· ot + ln D( j)
1

(西德:5)

· ot + ln D( j)
1
(西德:4)
熔点(·, j)
·0

+ 经验值

(西德:5)

· ot + ln D( j)
0
(西德:5)

· ot − ln A(·, j)
·0

· ot + ln D( j)
1

− ln D( j)
0

(2.6)

= 1 − s( j)

= 0.5 in this BSS setup.

using the sigmoid function sig (z) 1/(1 + 经验值 (−z)). 因此, the posterior
expectation of s( j)
t1 . 这里, D( j)
taking a value zero (OFF state) is s( j)
t
t0
1
和D( j)
0 are constants denoting the prior beliefs about hidden states. Bayes
optimal encoding is obtained when and only when the prior beliefs match
the genuine prior distribution: D( j)
= D( j)
1
0
This concludes our treatment of inference about hidden states under
this minimal scheme. Note that the updates in equation 2.5 have a bi-
ological plausibility in the sense that the posterior expectations can be
associated with nonnegative sigmoid-shape firing rates (also known as neu-
rometric functions; Tolhurst, Movshon, & 院长, 1983; Newsome, Britten, &
Movshon, 1989), while the arguments of the sigmoid (softmax) function can
be associated with neuronal depolarization, rendering the softmax function
a voltage-firing rate activation function. (See Friston, FitzGerald et al., 2017,
for a more comprehensive discussion and simulations using this kind of
variational message passing to reproduce empirical phenomena, 例如
place fields, mismatch negativity responses, phase-precession, and preplay
activity in systems neuroscience.)

In terms of learning, by solving the variation of F with respect to a(我, j),

the optimal posterior expectations about the parameters are given by

A(我, j) = a(我, j) +

t(西德:17)

τ =1

τ ⊗ s( j)
哦(我)

τ = a(我, j) + 到(我)
t

⊗ s( j)
t

,

(2.7)

where a(我, j) 是先验的, 哦(我)
⊗ s( j)
encoded vector of o(我)
t
optimal posterior expectation of matrix A is

t , and o(我)
t

τ and s( j)

τ ⊗ s( j)

τ expresses the outer product of a one-hot
(西德:26)
t

τ =1 o(我)

τ ⊗ s( j)

t . 因此, 这

1
t

⎪⎪⎪⎪⎪⎪⎪⎨
⎪⎪⎪⎪⎪⎪⎪⎩

A(我, j)

11

=

A(我, j)

10

=

11

A(我, j)
+ A(我, j)

01

A(我, j)

11

10

A(我, j)
+ A(我, j)

00

A(我, j)

10

= to(我)
ts( j)
t1

t s( j)
+ A(我, j)

11

t1

+ A(我, j)

11

+ A(我, j)

01

= to(我)
ts( j)
t0

t s( j)
+ A(我, j)

10

t0

+ A(我, j)

10

+ A(我, j)

00

(西德:16)

,

(西德:16)

(西德:15)

(西德:15)

1
t

1
t

+ 氧

+ 氧

t s( j)
= o(我)
t1
s( j)
t1

t s( j)
= o(我)
t0
s( j)
t0

(2.8)

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

/

e
d

n
e
C

A
r
t

C
e

p
d

/

F
/

/

/

/

3
2
1
1
2
0
8
5
1
8
6
5
4
2
3
n
e
C

_
A
_
0
1
3
1
5
p
d

.

/

F


y
G

e
s
t

t


n
0
8
S
e
p
e


e
r
2
0
2
3

2094

时间. Isomura and K. 弗里斯顿

(西德:26)
t

(西德:26)
t

(西德:26)
t

t0

t1

01

00

= 1
t

τ s( j)

τ s( j)

= 1
t
= 1 − A(我, j)

t s( j)
τ =1 s( j)
t 1 , 哦(我)
11 和一个(我, j)

t 1 , s( j)
τ =1 o(我)
= 1
t
t 0 . 更远, A(我, j)
τ =1 s( j)

t s( j)
τ =1 o(我)
where o(我)
t 0 , 和
t1
(西德:26)
= 1 − A(我, j)
s( j)
= 1
t
10 . The prior
t0
t
of parameters a(我, j) is on the order of one and is thus negligible when t is
大的. The matrix A(我, j) expresses the optimal posterior expectations of o(我)
t
is ON (A(我, j)
taking the ON state when s( j)
10 ), or o(我)
taking the
t
01 ) or OFF (A(我, j)
OFF state when s( j)
00 ). Although this expression
t
may seem complicated, it is fairly straightforward. The posterior expecta-
tions of the likelihood simply accumulate posterior expectations about the
co-occurrence of states and their outcomes. These accumulated (Dirichlet)
parameters are then normalized to give a likelihood or probability. Cru-
cially, one can observe the associative or Hebbian aspect of this belief up-
日期, expressed here in terms of the outer products between outcomes and
posteriors about states in equation 2.7. We now turn to the equivalent up-
date for neural activities and synaptic weights of a neural network.

11 ) or OFF (A(我, j)

is ON (A(我, j)

t

2.3 Neural Activity and Hebbian Plasticity Models. 下一个, we consider
the neural activity and synaptic plasticity in the neural network (见图
1乙). The generation of observations ot is exactly the same as in the MDP
model introduced in section 2.1 (see Figure 1B, top to middle). We assume
that the jth neuron’s activity xt j (see Figure 1B, 底部) is given by

˙xt j

∝ − f

(西德:11)

(xt j )
(西德:22) (西德:23)(西德:24) (西德:25)

leakage

(西德:22)

+ Wj1ot − Wj0ot
(西德:25)
(西德:23)(西德:24)
synaptic input

+ h j1
(西德:22)

− h j0
(西德:25)
(西德:23)(西德:24)

.

临界点

(2.9)

∈ RNo and Wj0
∈ RNo comprise row vectors of synapses
We suppose that Wj1
∈ R are adaptive thresholds that depend on the val-
∈ R and h j0
and h j1
ues of Wj1 and Wj0, 分别. One may regard Wj1 and Wj0 as excitatory
and inhibitory synapses, 分别. We further assume that the nonlinear
leakage f (西德:11)
(·) (IE。, the leak current) is the inverse of the sigmoid function
(IE。, the logit function) so that the fixed point of xt j (IE。, the state of xt j that
gives ˙xt j

= 0) is given in the form of the sigmoid function:

xt j

= sig

=

经验值

(西德:3)

(西德:2)
Wj1ot − Wj0ot + h j1
(西德:2)
Wj1ot + h j1
(西德:2)
(西德:3)
Wj0ot + h j0
+ 经验值

经验值
(西德:2)
Wj1ot + h j1

− h j0
(西德:3)

(西德:3) .

(2.10)

Equations 2.9 和 2.10 are a mathematical expression of assumption 2. 毛皮-
ther, we consider a class of synaptic plasticity rules that comprise Hebbian

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

/

e
d

n
e
C

A
r
t

C
e

p
d

/

F
/

/

/

/

3
2
1
1
2
0
8
5
1
8
6
5
4
2
3
n
e
C

_
A
_
0
1
3
1
5
p
d

.

/

F


y
G

e
s
t

t


n
0
8
S
e
p
e


e
r
2
0
2
3

Reverse-Engineering Neural Networks

2095

plasticity with an activity-dependent homeostatic term as follows:

(西德:20)

(西德:8)Wj1 (t) ≡ Wj1 (t + 1) − Wj1 (t) ∝ Hebb1

(西德:8)Wj0 (t) ≡ Wj0 (t + 1) − Wj0 (t) ∝ Hebb0

(西德:2)

(西德:2)

xt j

, ot, Wj1

xt j

, ot, Wj0

(西德:3)

(西德:3)

+ Home1

+ Home0

(西德:2)

(西德:2)

xt j

, Wj1

xt j

, Wj0

(西德:3)

(西德:3) ,

(2.11)

where Hebb1 and Hebb0 denote Hebbian plasticity as determined by the
product of sensory inputs and neural outputs and Home1 and Home0 de-
note homeostatic plasticity determined by output neural activity. 方程
2.11 can be read as an ansatz: we will see below that a synaptic update rule
with the functional form of equation 2.11 emerges as a natural consequence
of assumption 1.

(西德:2)

(西德:3)

(西德:3)

(西德:3)

(西德:2)

11

10

10

−1

= sig

= sig

A(·, j)

and ln A(·, j)

(西德:2)
(西德:2)1 − A(·, j)

In the MDP scheme, posterior expectations about hidden states and pa-
rameters are usually associated with neural activity and synaptic strengths.
这里, we can observe a formal similarity between the solutions for the state
后部 (参见方程 2.6) and the activity in the neural network (见平等-
的 2.10; see also Table 1). By this analogy, xt j can be regarded as encod-
ing the posterior expectation of the ON state s( j)
t1 . 而且, Wj1 and Wj0
(西德:2)
correspond to ln A(·, j)
A(·, j)
(西德:2)1 -
− ln
(西德:3)
11
A(·, j)
−1
, 分别, in the sense that they express the ampli-
tude of ot influencing xt j or s( j)
t1 . 这里, (西德:2)1 = (1, . . . , 1) ∈ RNo is a vector of ones.
尤其, the optimal posterior of a hidden state taking a value of one
(参见方程 2.6) is given by the ratio of the beliefs about ON and OFF
状态, expressed as a sigmoid function. 因此, to be a Bayes optimal en-
编码员, the fixed point of neural activity needs to be a sigmoid function. 这
requirement is straightforwardly ensured when f (西德:11)
is the inverse of the
sigmoid function (参见方程 2.13). Under this condition the fixed point
or solution for xtk (参见方程 2.10) compares inputs from ON and OFF
pathways, and thus xt j straightforwardly encodes the posterior of the jth
hidden state being ON (IE。, xt j
t1 ). 简而言之, the above neural network
is effectively inferring the hidden state.

→ s( j)

− ln

xt j

10

11

(西德:2)

(西德:3)

If the activity of the neural network is performing inference, does the
Hebbian plasticity correspond to Bayes optimal learning? 换句话说,
does the synaptic update rule in equation 2.11 ensure that the neural activity
and synaptic strengths asymptotically encode Bayes optimal posterior be-
(西德:3)(西德:3)
liefs about hidden states
,
分别? 为此, we will identify a class of cost functions from
which the neural activity and synaptic plasticity can be derived and con-
sider the conditions under which the cost function becomes consistent with
variational free energy.

and parameters

→ s( j)
t1

(西德:2)
Wj1

→ sig

A(·, j)

xt j

−1

11

(西德:3)

(西德:2)

(西德:2)

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

/

e
d

n
e
C

A
r
t

C
e

p
d

/

F
/

/

/

/

3
2
1
1
2
0
8
5
1
8
6
5
4
2
3
n
e
C

_
A
_
0
1
3
1
5
p
d

.

/

F


y
G

e
s
t

t


n
0
8
S
e
p
e


e
r
2
0
2
3

2096

时间. Isomura and K. 弗里斯顿

2.4 Neural Network Cost Functions. 这里, we consider a class of func-
tions that constitute a cost function for both neural activity and synaptic
plasticity. We start by assuming that the update of the jth neuron’s ac-
活力 (参见方程 2.9) is determined by the gradient of cost function L j:
/∂xt j. By integrating the right-hand side of equation 2.9, we ob-
˙xt j
tain a class of cost functions as

∝ −∂L j

t(西德:17)
(西德:2)

(西德:2)

(西德:3)

F

xτ j

=

L j

− xτ jWj1oτ −

(西德:3)

(西德:2)

1 − xτ j

Wj0oτ − xτ jh j1

-

(西德:2)

1 − xτ j

(西德:3)

(西德:3)

h j0

+ 氧 (1)

τ =1

t(西德:17)

τ =1

⎝ f

=

(西德:28)

时间

(西德:27)(西德:27)

(西德:27)

(西德:2)

(西德:3)

xτ j

-

xτ j
1 − xτ j

(西德:28)

(西德:27)

(西德:28)(西德:28)⎞

Wj1

Wj0

+

h j1

h j0

⎠+ O(1),

(2.12)

(2.13)

where the O(1) 学期, which depends on Wj1 and Wj0, is of a lower order
than the other terms (as they are O (t)) and is thus negligible when t is large
(See section A.2 for the case where we explicitly evaluate the O(1) term to
demonstrate the formal correspondence between the initial values of synap-
tic strengths and the parameter prior p (A). The cost function of the entire
j=1 L j. When f (西德:11)
network is defined by L ≡
is the inverse of the sig-
moid function, 我们有
(西德:3)

xτ j

(西德:26)

Nx

(西德:2)

(西德:2)

(西德:3)

(西德:2)

(西德:3)

(西德:3)

(西德:2)

F

xτ j

= xτ j ln xτ j

+

1 − xτ j
(西德:2)

ln

(西德:3)

1 − xτ j
(西德:2)

−1

(西德:3)

up to a constant term (ensure f (西德:11)
). We further assume that
the synaptic weight update rule is given as the gradient descent on the same
cost function L j (see assumption 1). 因此, the synaptic plasticity is derived
as follows:

= sig

xτ j

xτ j

⎪⎪⎪⎪⎨
⎪⎪⎪⎪⎩

˙Wj1

∝ − 1
t

˙Wj0

∝ − 1
t

∂L j
∂Wj1
∂L j
∂Wj0

= xt joT
t

+ xt jh

(西德:11)
j1

(西德:2)

=

1 − xt j

(西德:3)

oT
t

+ 1 − xt jh

(西德:11)
j0

,

(2.14)

(西德:2)

(西德:3)

(西德:26)
t

(西德:26)
t
τ =1 xτ j,
≡ ∂h j1

(西德:26)
t
τ =1 xτ joT
(西德:2)
(西德:26)
t
τ =1

1
t
1
t

xt joT
t
t , 1 − xt j

1 − xt j
oT
t
/∂Wj1, and h(西德:11)

τ =1(1 -
1
t , xt j
在哪里
(西德:3)
t
, H(西德:11)
/∂Wj0.
1 − xτ j
xτ j )oT
j1
Note that the update of Wj1 is not directly influenced by Wj0, and vice versa
because they encode parameters in physically distinct pathways (IE。, 这
updates are local learning rules; 李, Girolami, 钟, & Sejnowski, 2000;
Kusmierz, Isomura, & Toyoizumi, 2017). The update rule for Wj1 can be
viewed as Hebbian plasticity mediated by an additional activity-dependent
term expressing homeostatic plasticity. 而且, the update of Wj0 can
be viewed as anti-Hebbian plasticity with a homeostatic term, in the sense

1
t
≡ ∂h j0

j0

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

/

e
d

n
e
C

A
r
t

C
e

p
d

/

F
/

/

/

/

3
2
1
1
2
0
8
5
1
8
6
5
4
2
3
n
e
C

_
A
_
0
1
3
1
5
p
d

.

/

F


y
G

e
s
t

t


n
0
8
S
e
p
e


e
r
2
0
2
3

Reverse-Engineering Neural Networks

2097

that Wj0 is reduced when input (ot ) and output (xt j ) fire together. The fixed
points of Wj1 and Wj0 are given by

⎪⎪⎪⎪⎪⎨
⎪⎪⎪⎪⎪⎩

Wj1

= h

(西德:11)−1
1

(西德:27)

(西德:28)

− xt joT
xt j

t

(西德:27)

(西德:2)

Wj0

= h

(西德:11)−1
0

-

(西德:3)

1 − xt j
1 − xt j

oT
t

(西德:28) .

(2.15)

至关重要的是, these synaptic strength updates are a subclass of the general
synaptic plasticity rule in equation 2.11 (see also section A.3 for the mathe-
matical explanation). 所以, if the synaptic update rule is derived from
the cost function underlying neural activity, the synaptic update rule has
a biologically plausible form comprising Hebbian plasticity and activity-
dependent homeostatic plasticity. The updates of neural activity and
synaptic strengths—via gradient descent on the cost function—enable us
to associate neural and synaptic dynamics with optimization. 虽然
steepest descent method gives the simplest implementation, other gradient
descent schemes, such as adaptive moment estimation (亚当; Kingma &
Ba, 2015), can be considered, while retaining the local learning property.

2.5 Comparison with Variational Free Energy. 这里, we establish a
formal relationship between the cost function L and variational free en-
ergy. We define ˆWj1
as the sigmoid func-
tions of synaptic strengths. We consider the case in which neural activity
=
is expressed as a sigmoid function and thus equation 2.13 holds. As Wj1
ln ˆWj1

≡ sig(Wj1) and ˆWj0

, 方程 2.12 becomes

(西德:2)
(西德:2)1 − ˆWj1

(西德:2)
Wj0

≡ sig

− ln

(西德:3)

(西德:3)

(西德:27)

Nx(西德:17)

t(西德:17)

j=1

τ =1

xτ j
1 − xτ j

(西德:28)

时间


(西德:27)

L =

(西德:27)

(西德:28)

(西德:27)

(西德:28)

×


(西德:2)1 − oτ

-

h j1

h j0

(西德:28)

ln


+

-

(西德:3)

ln xτ j
(西德:2)
1 − xτ j
(西德:4)
(西德:2)1 − ˆWj1
(西德:4)
(西德:2)1 − ˆWj0

ln

ln

ln ˆWj1

ln ˆWj0

⎪⎬

(西德:5)

(西德:5)


⎠(西德:2)1

(西德:4)
(西德:2)1 − ˆWj1
(西德:4)
(西德:2)1 − ˆWj0

ln

ln

(西德:5)

(西德:5)

+ 氧(1),

⎪⎭

(2.16)

在哪里 (西德:2)1 = (1, . . . , 1) ∈ RNo. One can immediately see a formal correspon-
dence between this cost function and variational free energy (参见方程
(西德:2)
11 , 和
2.4). 那是, when we assume that
ˆWj0
10 , 方程 2.16 has exactly the same form as the sum of the ac-
curacy and state complexity, which is the leading-order term of variational
free energy (see the first term in the last equality of equation 2.4).

T = s( j)
t

=A(·, j)

=A(·, j)

, 1 − xt j

ˆWj1

xt j

(西德:3)

,

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

/

e
d

n
e
C

A
r
t

C
e

p
d

/

F
/

/

/

/

3
2
1
1
2
0
8
5
1
8
6
5
4
2
3
n
e
C

_
A
_
0
1
3
1
5
p
d

.

/

F


y
G

e
s
t

t


n
0
8
S
e
p
e


e
r
2
0
2
3

2098

时间. Isomura and K. 弗里斯顿

(西德:2)
(西德:2)1 − ˆWj1

(西德:3)

(西德:3)

− ln

− ln

· (西德:2)1 = ln D( j)

(西德:2)
(西德:2)1 − ˆWj0

具体来说, when the thresholds satisfy h j1

· (西德:2)1 = ln D( j)
1
and h j0
0 , 方程 2.16 becomes equivalent to
方程 2.4 up to the ln t order term (that disappears when t is large).
所以, 在这种情况下, the fixed points of neural activity and synaptic
strengths become the posteriors; 因此, xt j asymptotically becomes the Bayes
optimal encoder for a large t limit (provided with D that matches the gen-
uine prior D∗

换句话说, we can define perturbation terms φ
(西德:3)
(西德:2)
(西德:2)1 − ˆWj0

(西德:2)
(西德:2)1 -
· (西德:2)1 as functions of Wj1 and Wj0, 重新指定-

ˆWj1
主动地, and can express the cost function as

· (西德:2)1 and φ

≡ h j0

≡ h j1

− ln

− ln

).

(西德:3)

j1

j0

L =

(西德:27)

Nx(西德:17)

t(西德:17)

τ =1

j=1
(西德:27)

xτ j
1 − xτ j
(西德:27)
(西德:28)

×


(西德:2)1 − oτ

-

(西德:28)

(西德:20) (西德:27)

时间

ln xτ j
(西德:2)
1 − xτ j

(西德:3)

ln

(西德:28)


-

ln ˆWj1

ln ˆWj0

(西德:4)
(西德:2)1 − ˆWj1
(西德:4)
(西德:2)1 − ˆWj0

ln

ln

(西德:5)


(西德:5)

(西德:28) (西德:21)

φ

φ

j1

j0

+ 氧(1).

(2.17)

这里, without loss of generality, we can suppose that the constant terms
= 1. 在下面
+ 经验值
j1 and φ
in φ
this condition,
can be viewed as the prior belief about
hidden states

j0 are selected to ensure that exp

, 经验值

(西德:2)
φ

(西德:2)
φ

经验值

(西德:3)(西德:3)

φ

φ

(西德:2)

(西德:3)

(西德:2)

(西德:3)

(西德:3)

(西德:2)

j0

j0

j1

j1


φ

φ

j1

j0

= ln D( j)
1
= ln D( j)
0

(2.18)

(西德:2)

(西德:3)

D( j)

and thus equation 2.17 is formally equivalent to the accuracy and state com-
plexity terms of variational free energy.

This means that when the prior belief about states

is a function
of the parameter posteriors (A(·, j)), the general cost function under consid-
eration can be expressed in the form of variational free energy, up to the
氧 (ln t) 学期. A generic cost function L is suboptimal from the perspective
of Bayesian inference unless φ
j0 are tuned appropriately to express
the unbiased (IE。, optimal) prior belief. In this BSS setup, φ
= const
is optimal; 因此, a generic L would asymptotically give an upper bound of
variational free energy with the optimal prior belief about states when t is
大的.

j1 and φ

= φ

j1

j0

2.6 Analysis on Synaptic Update Rules. To explicitly solve the fixed
points of Wj1 and Wj0 that provide the global minimum of L, we suppose
φ

j0 as linear functions of Wj1 and Wj0, 分别, 给出的

j1 and φ

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

/

e
d

n
e
C

A
r
t

C
e

p
d

/

F
/

/

/

/

3
2
1
1
2
0
8
5
1
8
6
5
4
2
3
n
e
C

_
A
_
0
1
3
1
5
p
d

.

/

F


y
G

e
s
t

t


n
0
8
S
e
p
e


e
r
2
0
2
3

Reverse-Engineering Neural Networks

(西德:20)

φ

φ

j1

j0

= α

= α

+ Wj1
+ Wj0

β

β

j1

j0

j1

j0

,

2099

(2.19)

j1

, A

ε R, and β

∈ RNo are constants. By solving the variation
where α
of L with respect to Wj1 and Wj0, we find the fixed point of synaptic strengths
作为

, β

j0

j1

j0

⎪⎪⎪⎪⎪⎨
⎪⎪⎪⎪⎪⎩

Wj1

= sig

−1

Wj0

= sig

−1

(西德:27)

(西德:28)

xt joT
t
xt j

+ β T
j1

(西德:27) (西德:2)

(西德:3)

1 − xt j
1 − xt j

oT
t

+ β T
j0

(西德:28) .

(2.20)

(西德:3)

(西德:7)

Since the update from t to t + 1 is expressed as sig
(西德:2) $ $(西德:8)Wj1
= ˆWj1
(西德:7) (西德:8)Wj1
and sig
≈ x(t+1) joT
/xt j
2 = x(t+1) joT
− x(t+1) jxt joT
t
we recover the following synaptic plasticity:

(西德:2)
(西德:2)1 − ˆWj1
/xt j

/xt j

+ 氧

$ $2

t+1

t+1

(西德:3)

(西德:3)

(西德:2)
Wj1
(西德:2)
Wj1
(西德:2)
-
ˆWj1

+ (西德:8)Wj1
+ (西德:8)Wj1
− β T
j1

(西德:3)

(西德:2)
− sig
Wj1
(西德:3)
− sig(Wj1)
/xt j,
X(t+1) j

(西德:3)


⎪⎪⎨

⎪⎪⎩

(西德:7)


⎪⎪⎬

X(t+1) joT
(西德:23)(西德:24)
(西德:22)

t+1
(西德:25)

- ( ˆWj1
(西德:22)

− β T
(西德:23)(西德:24)

j1)X(t+1) j
⎪⎪⎭
(西德:25)

Hebbian plasticity

homeostatic plasticity

(西德:8)Wj1

=

{ ˆWj1

(西德:7) ((西德:2)1 − ˆWj1)}(西德:7)−1

(西德:22)

xt j
(西德:23)(西德:24)

(西德:25)

adaptive learning rate

(西德:8)Wj0

=

{ ˆWj0

(西德:7) ((西德:2)1 − ˆWj0)}(西德:7)−1

(西德:22)

1 − xt j
(西德:23)(西德:24)

(西德:25)

adaptive learning rate

⎪⎪⎨

⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎨
⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎩

(西德:7)

⎪⎪⎩

(1 − x(t+1) j )oT
(西德:23)(西德:24)
(西德:22)

t+1
(西德:25)

- ( ˆWj0
(西德:22)

− β T

j0)(1 − x(t+1) j )
(西德:25)
(西德:23)(西德:24)

anti-Hebbian plasticity

homeostatic plasticity

(2.21)
(西德:2)
(西德:2)1 -
在哪里 (西德:7) denotes the elementwise (Hadamard) product and
(西德:2)
(西德:3)
(西德:2)1 − ˆWj1
ˆWj1
denotes the element-wise inverse of ˆWj1
. This synap-
tic plasticity rule is a subclass of the general synaptic plasticity rule in
方程 2.11.

(西德:18)
ˆWj1

(西德:3)(西德:19)(西德:7)−1

(西德:7)

(西德:7)

总之, we demonstrated that under a few minimal assumptions
and ignoring small contributions to weight updates, the neural network
under consideration can be regarded as minimizing an approximation to
model evidence because the cost function can be formulated in terms of

,


⎪⎪⎬

⎪⎪⎭

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

/

e
d

n
e
C

A
r
t

C
e

p
d

/

F
/

/

/

/

3
2
1
1
2
0
8
5
1
8
6
5
4
2
3
n
e
C

_
A
_
0
1
3
1
5
p
d

.

/

F


y
G

e
s
t

t


n
0
8
S
e
p
e


e
r
2
0
2
3

2100

时间. Isomura and K. 弗里斯顿

variational free energy. 下文中, we will rehearse our analytic results
and then use numerical analyses to illustrate Bayes optimal inference (和
学习) in a neural network when, and only when, it has the right priors.

3 结果

3.1 Analytical Form of Neural Network Cost Functions. The analysis

in the preceding section rests on the following assumptions:

1. Updates of neural activity and synaptic weights are determined by a

gradient descent on a cost function L.

2. Neural activity is updated by the weighted sum of sensory inputs

and its fixed point is expressed as the sigmoid function.

Under these assumptions, we can express the cost function for a neural net-
work as follows (参见方程 2.17):

(西德:28)

(西德:20) (西德:27)

时间

ln xτ j
(西德:2)
1 − xτ j

(西德:3)

ln

(西德:28)

-

ln ˆWj1

ln ˆWj0

(西德:4)
(西德:2)1 − ˆWj1
(西德:4)
(西德:2)1 − ˆWj0

ln

ln

(西德:5)

(西德:5)

L =

(西德:27)

Nx(西德:17)

t(西德:17)

τ =1

j=1
(西德:27)

xτ j
1 − xτ j
(西德:27)
(西德:28)

×


(西德:2)1 − oτ

-

(西德:28) (西德:21)

φ

φ

j1

j0

+ 氧(1),

(西德:3)

= sig

j1 and φ

hold and φ

= sig(Wj1) and ˆWj0

(西德:2)
where ˆWj1
Wj0
j0 are functions
of Wj1 and Wj0, 分别. The log-likelihood function (accuracy term)
and divergence of hidden states (complexity term) of variational free energy
emerge naturally under the assumption of a sigmoid activation function
(assumption 2). Additional terms denoted by φ
j1 and φ
j0 express the state
prior, indicating that a generic cost function L is variational free energy un-
(西德:2)
s( j)
= ln D( j) = φ
der a suboptimal prior belief about hidden states: ln P
j,
t
where φ
. This prior alters the landscape of the cost function in a
suboptimal manner and thus provides a biased solution for neural activities
and synaptic strengths, which differ from the Bayes optimal encoders.

(西德:2)
φ

, φ

(西德:3)

(西德:3)

j1

j0

j

For analytical tractability, we further assume the following:

3. The perturbation terms (φ

j0) that constitute the difference
between the cost function and variational free energy with optimal
prior beliefs can be expressed as linear equations of Wj1 and Wj0.

j1 and φ

From assumption 3, 方程 2.17 becomes

Nx(西德:17)

(西德:27)

t(西德:17)

j=1

τ =1

xτ j
1 − xτ j

(西德:28)

时间


(西德:27)

L =

ln xτ j
(西德:2)
1 − xτ j

(西德:3)

ln

(西德:28)

-

ln ˆWj1

ln ˆWj0

(西德:4)
(西德:2)1 − ˆWj1
(西德:4)
(西德:2)1 − ˆWj0

ln

ln

(西德:5)

(西德:5)

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

/

e
d

n
e
C

A
r
t

C
e

p
d

/

F
/

/

/

/

3
2
1
1
2
0
8
5
1
8
6
5
4
2
3
n
e
C

_
A
_
0
1
3
1
5
p
d

.

/

F


y
G

e
s
t

t


n
0
8
S
e
p
e


e
r
2
0
2
3

-

(西德:19)

Reverse-Engineering Neural Networks

(西德:27)

(西德:28)

(西德:27)

×


(西德:2)1 − oτ

A

A

j1

j0

+ Wj1
+ Wj0

β

β

j1

j0

(西德:28)(西德:21)’

+ 氧(1),

2101

(3.1)

(西德:18)

A

, A

, β

, β

j1

j0

j1

are constants. The cost function has degrees of free-
在哪里
j0
, A
dom with respect to the choice of constants
, which corre-
spond to the prior belief about states D( j). The neural activity and synaptic
strengths that give the minimum of a generic physiological cost function
L are biased by these constants, which may be analogous to physiological
constraints (参见部分 4 欲了解详情).

(西德:18)
A

, β

, β

(西德:19)

j1

j0

j0

j1

j. 因此, after fixing φ

The cost function of the neural networks considered is characterized only
(西德:3)
by φ
,
j by fixing constraints
the remaining degrees of freedom are the initial synaptic weights. 这些
correspond to the prior distribution of parameters P (A) in the variational
Bayesian formulation (see section A2).

, A

, β

A

β

(西德:3)

(西德:2)

(西德:2)

j1

j0

j0

j1

(西德:2)

(西德:3)

β

, β

The fixed point of synaptic strengths that give the minimum of L is given
analytically as equation 2.20, expressing that
deviates the center
of the nonlinear mapping—from Hebbian products to synaptic strengths—
from the optimal position (shown in equation 2.8). As shown in equation
2.14, the derivative of L with respect to Wj1 and Wj0 recovers the synaptic
update rules that comprise Hebbian and activity-dependent homeostatic
条款. Although equation 2.14 expresses the dynamics of synaptic strengths
that converge to the fixed point, it is consistent with a plasticity rule that
gives the synaptic change from t to t + 1 (参见方程 2.21).

j1

j0

因此, based on assumptions 1 和 2 (irrespective of assumption
3), we find that the cost function approximates variational free energy.
桌子 1 summarizes this correspondence. Under this condition, neural ac-
=
tivity encodes the posterior expectation about hidden states, xτ j
(西德:2)

(西德:3)
s( j)
τ = 1
, and synaptic strengths encode the posterior expectation of the
10 . 在阿迪-
参数,
的, based on assumption 3, the threshold is characterized by constants
(西德:18)
A
. From a Bayesian perspective, these constants can be
j0
(西德:3)
viewed as prior beliefs, ln P
.
j0
, 这
When and only when
cost function becomes variational free energy with optimal prior beliefs (为了
BSS) whose global minimum ensures Bayes optimal encoding.

(西德:2)
s( j)
(西德:3)
t
= (− ln 2, − ln 2) 和

= sig
ˆWj1
(西德:19)

+ Wj0
β
(西德:3)
(西德:2)
(西德:2)0,(西德:2)0

= ln D( j) =

+ Wj1
(西德:2)
β

and ˆWj0

=A(·, j)

=A(·, j)

(西德:2)
Wj1

(西德:2)
Wj0

= s( j)
t 1

= sig

β
j1
, β

(西德:2)
A

, A
(西德:3)

j0
=

, A

, A

, β

, β

A

11

(西德:3)

(西德:2)

(西德:3)

(西德:3)

j0

j1

j0

j1

j0

j1

j1

j1

简而言之, we identify a class of biologically plausible cost functions from
which the update rules for both neural activity and synaptic plasticity can
be derived. When the activation function for neural activity is a sigmoid
function, a cost function in this class is expressed straightforwardly as vari-
ational free energy. With respect to the choice of constants expressing phys-
iological constraints in the neural network, the cost function has degrees
of freedom that may be viewed as (potentially suboptimal) prior beliefs

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

/

e
d

n
e
C

A
r
t

C
e

p
d

/

F
/

/

/

/

3
2
1
1
2
0
8
5
1
8
6
5
4
2
3
n
e
C

_
A
_
0
1
3
1
5
p
d

.

/

F


y
G

e
s
t

t


n
0
8
S
e
p
e


e
r
2
0
2
3

2102

时间. Isomura and K. 弗里斯顿

from the Bayesian perspective. 现在, we illustrate the implicit inference and
learning in neural networks through simulations of BSS.

3.2 Numerical Simulations. 这里, we simulated the dynamics of neu-
ral activity and synaptic strengths when they followed a gradient descent
on the cost function in equation 3.1. We considered a BSS comprising two
hidden sources (or states) 和 32 observations (or sensory inputs), formu-
lated as an MDP. The two hidden sources show four patterns: st = s(1)

t
s(2)
t
likelihood mapping A(我), 定义为

= (0, 0) (1, 0) (0, 1) (1, 1). An observation o(我)

t was generated through the


⎪⎨

⎪⎩

(西德:4)
哦(我)
t


(西德:4)
哦(我)
t

(西德:5)

= 1|st, A(我)
(西德:5)

= 1|st, A(我)

=A(我)

1· =

(西德:2)
0, 3
4

, 1
4

=A(我)

1· =

(西德:2)
0, 1
4

, 3
4

(西德:3)

, 1
(西德:3)

, 1

为了 1 ≤ i ≤ 16

为了 17 ≤ i ≤ 32

.

(3.2)

t

(西德:3)

1· = 3/4 为了 1 ≤ i ≤ 16 is the probability of o(我)

这里, 例如, A(我)
采取
0· = (西德:2)1 − A(我)
one when st = (1, 0). The remaining elements were given by A(我)
1· .
The implicit state priors employed by a neural network were varied be-
tween zero and one in keeping with D( j)
= 1; 然而, the true state
(西德:2)
1
= (0.5, 0.5). Synaptic strengths were ini-
priors were fixed as
tialized as values close to zero. The simulations preceded over T = 104 时间
脚步. The simulations and analyses were conducted using Matlab. 尤其,
this simulation setup is exactly the same experimental setup as that we
used for in vitro neural networks (Isomura et al., 2015; Isomura & 弗里斯顿,
2018). We leverage this setup to clarify the relationship among our empir-
ical work, a feedforward neural network model, and variational Bayesian
formulations.

( j)
, D
0

+ D( j)
0

( j)
D
1

j0

j1

j1

j0

(西德:2)

(西德:3)

(西德:2)

(西德:3)

β

A

=

=

(西德:3)(西德:3)

(西德:2)(西德:2)

, β

, A

(西德:2)
(西德:2)0,(西德:2)0

第一的, as in Isomura and Friston (2018), we demonstrated that a network
(西德:3)
− ln 2, − ln 2
with a cost function with optimized constants

can perform BSS successfully (见图 2). 那里-
sponses of neuron 1 came to recognize source 1 after training, indicating
that neuron 1 learned to encode source 1 (see Figure 2A). 同时, 新-
ron 2 learned to infer source 2 (see Figure 2B). During training, synaptic
plasticity followed gradient descent on the cost function (see Figures 2C and
2D). This demonstrates that minimization of the cost function, with optimal
constants, is equivalent to variational free energy minimization and hence
is sufficient to emulate BSS. This process establishes a concise representa-
tion of the hidden causes and allows maximizing information retained in
the neural network (Linsker, 1988; Isomura, 2018).

下一个, we quantified the dependency of BSS performance on the form
of the cost function, by varying the above-mentioned constants (见图-
≤ 0.95, 尽管
乌尔 3). We varied

in a range of 0.05 ≤ exp

, A

A

A

(西德:2)

(西德:3)

(西德:2)

(西德:3)

j1

j0

j1

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

/

e
d

n
e
C

A
r
t

C
e

p
d

/

F
/

/

/

/

3
2
1
1
2
0
8
5
1
8
6
5
4
2
3
n
e
C

_
A
_
0
1
3
1
5
p
d

.

/

F


y
G

e
s
t

t


n
0
8
S
e
p
e


e
r
2
0
2
3

Reverse-Engineering Neural Networks

2103

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

/

e
d

n
e
C

A
r
t

C
e

p
d

/

F
/

/

/

/

3
2
1
1
2
0
8
5
1
8
6
5
4
2
3
n
e
C

_
A
_
0
1
3
1
5
p
d

.

/

F


y
G

e
s
t

t


n
0
8
S
e
p
e


e
r
2
0
2
3

数字 2: Emergence of response selectivity for a source. (A) Evolution of neu-
ron 1’s responses that learn to encode source 1, in the sense that the response is
high when source 1 takes a value of one (red dots), and it is low when source
1 takes a value of zero (blue dots). Lines correspond to smoothed trajectories
obtained using a discrete cosine transform. (乙) Emergence of neuron 2’s re-
sponse that learns to encode source 2. These results indicate that the neural net-
work succeeded in separating two independent sources. (C) Neural network
cost function L. It is computed based on equation 3.1 and plotted against the
averaged synaptic strengths, where W11_avg1 (z-axis) is the average of 1 到 16 el-
ements of W11, while W11_avg2 (x-axis) is the average of 17 到 32 elements of W11.
The red line depicts a trajectory of averaged synaptic strengths. (D) Trajectory
of synaptic strengths. Black lines show elements of W11, and magenta and cyan
lines indicate W11_avg1 and W11_avg2, 分别.

(西德:2)
A

(西德:3)

(西德:2)

(西德:3)

(西德:2)

(西德:3)

j1

j0

A

A

+ 经验值

= 1 and found that changing

maintaining exp
j0
从 (− ln 2, − ln 2) led to a failure of BSS. Because neuron 1 encodes source
1 with optimal α, the correlation between source 1 and the response of neu-
ron 1 is close to one, while the correlation between source 2 and the response
of neuron 1 is nearly zero. In the case of suboptimal α, these correlations fall
to around 0.5, indicating that the response of neuron 1 encodes a mixture
of sources 1 和 2 (Figure 3A). 而且, a failure of BSS can be induced
when the elements of β take values far from zero (see Figure 3B). 什么时候
the elements of β are generated from a zeromean gaussian distribution,

, A

j1

2104

时间. Isomura and K. 弗里斯顿

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

/

e
d

n
e
C

A
r
t

C
e

p
d

/

F
/

/

/

/

3
2
1
1
2
0
8
5
1
8
6
5
4
2
3
n
e
C

_
A
_
0
1
3
1
5
p
d

.

/

F


y
G

e
s
t

t


n
0
8
S
e
p
e


e
r
2
0
2
3

t

, xt1)| 和 |corr(s(2)

数字 3: Dependence of source encoding accuracy on constants. Left panels
show the magnitudes of the correlations between sources and responses of a
, xt1)|. The right
neuron expected to encode source 1: |corr(s(1)
t
panels show the magnitudes of the correlations between sources and responses
, xt2)|.
of a neuron expected to encode source 2: |corr(s(1)
t
(A) Dependence on the constant α that controls the excitability of a neuron when
(西德:3)
(西德:2)
β is fixed to zero. The dashed line (0.5) indicates the optimal value of exp
.
= (− ln 2, − ln 2). 他-
(乙) Dependence on constant β when α is fixed as
ements of β were randomly generated from a gaussian distribution with zero
意思是. The standard deviation of β was varied (horizontal axis), where zero
deviation was optimal. Lines and shaded areas indicate the mean and stan-
dard deviation of the source-response correlation, evaluated with 50 不同的
序列.

, xt2)| 和 |corr(s(2)

(西德:2)
A

, A

A

(西德:3)

j1

j0

j1

t

the accuracy of BSS—measured using the correlation between sources and
responses—decreases as the standard deviation increases.

Our numerical analysis, under assumptions 1 到 3, shows that a network
needs to employ a cost function that entails optimal prior beliefs to perform
BSS or, equivalently, causal inference. Such a cost function is obtained when
its constants, which do not appear in the variational free energy with the op-
timal generative model for BSS, become negligible. The important message
here is that in this setup, a cost function equivalent to variational free en-
ergy is necessary for Bayes optimal inference (Friston et al., 2006; 弗里斯顿,
2010).

Reverse-Engineering Neural Networks

2105

j

3.3 Phenotyping Networks. We have shown that variational free en-
ergy (under the MDP scheme) is formally homologous to the class of bi-
ologically plausible cost functions found in neural networks. The neural
network’s parameters φ
= ln D( j) determine how the synaptic strengths
change depending on the history of sensory inputs and neural outputs;
因此, the choice of φ
j provides degrees of freedom in the shape of the neural
network cost functions under consideration that determine the purpose or
function of the neural network. Among various φ
= (− ln 2, − ln 2)
can make the cost function variational free energy with optimal prior beliefs
for BSS. 因此, one could regard neural networks (of the sort considered
in this letter: single-layer feedforward networks that minimize their cost
function) as performing approximate Bayesian inference under priors that
may or may not be optimal. This result is as predicted by the complete class
theorem (棕色的, 1981; Wald, 1947) as it implies that any response of a neu-
ral network is Bayes optimal under some prior beliefs (and cost function).
所以, 原则, under the theorem, any neural network of this kind
is optimal when its prior beliefs are consistent with the process that gener-
ates outcomes. This perspective indicates the possibility of characterizing a
neural network model—and indeed a real neuronal network—in terms of
its implicit prior beliefs.

j, only φ

j

One can pursue this analysis further and model the responses or de-
cisions of a neural network using the Bayes optimal MDP scheme under
different priors. 因此, the priors in the MDP scheme can be adjusted to max-
imize the likelihood of empirical responses. This sort of approach has been
used in system neuroscience to characterize the choice behavior in terms
of subject-specific priors. (See Schwartenbeck & 弗里斯顿, 2016, for further
details.)

From a practical perspective for optimizing neural networks, 在下面-
standing the formal relationship between cost functions and variational free
energy enables us to specify the optimum value of any free parameter to
realize some functions. In the present setting, we can effectively optimize
the constants by updating the priors themselves such that they minimize
the variational free energy for BSS. Under the Dirichlet form for the priors,
the implicit threshold constants of the objective function can then be opti-
mized using the following updates:

φ

j

= ln D( j) = ψ

(西德:3)

(西德:2)

d( j)

(西德:2)

− ψ

(西德:3)

d( j)
1

+ d( j)
0

d( j) = d( j) +

t(西德:17)

τ =1

s( j)
t .

(3.3)

(See Schwartenbeck & 弗里斯顿, 2016, for further details.) 有效, 这
update will simply add the Dirichlet concentration parameters, d( j) =
(西德:2)
d( j)
, to the priors in proportion to the temporal summation of the
1

, d( j)
0

(西德:3)


posterior expectations about the hidden states. Therefore, by committing to cost functions that underlie variational inference and learning, any free parameter can be updated in a Bayes optimal fashion when a suitable generative model is available.
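To make equation 3.3 concrete, here is a minimal Python sketch (our own illustration, assuming NumPy and SciPy; the function and variable names are ours, and the authors' original MATLAB scripts are referenced under Data Availability). It accumulates posterior state expectations into Dirichlet concentration parameters and maps them to threshold constants through the digamma function ψ.

import numpy as np
from scipy.special import digamma

def update_threshold(d_prior, s_posterior):
    """Equation 3.3 (sketch): accumulate state posteriors into Dirichlet
    counts and return the implicit threshold phi_j = ln D^(j).
    d_prior     : array of shape (2,), prior concentration parameters (d1, d0)
    s_posterior : array of shape (T, 2), posterior expectations of the hidden
                  state (ON, OFF) at each time step
    """
    d = d_prior + s_posterior.sum(axis=0)   # d^(j) <- d^(j) + sum_tau s_tau^(j)
    phi = digamma(d) - digamma(d.sum())     # psi(d) - psi(d1 + d0)
    return phi, d

# toy usage: with roughly symmetric state expectations, phi approaches (-ln 2, -ln 2)
rng = np.random.default_rng(0)
s_on = rng.uniform(size=(1000, 1))
s = np.hstack([s_on, 1.0 - s_on])           # (ON, OFF) expectations sum to 1
phi, d = update_threshold(np.ones(2), s)
print(phi, np.log(0.5))                     # both entries are close to -0.693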

3.4 Reverse-Engineering Implicit Prior Beliefs. Another situation important from a neuroscience perspective is when belief updating in a neural network is slow in relation to experimental observations. In this case, the implicit prior beliefs can be viewed as being fixed over a short period of time. This is likely when the firing threshold is determined by homeostatic plasticity over longer timescales (Turrigiano & Nelson, 2004).

The considerations in the previous section speak to the possibility of using empirically observed neuronal responses to infer implicit prior beliefs. The synaptic weights (W_j1, W_j0) can be estimated statistically from response data, through equation 2.20. By plotting their trajectory over the training period as a function of the history of a Hebbian product, one can estimate the cost function constants. If these constants express a near-optimal φ_j, it can be concluded that the network has, effectively, the right sort of priors for BSS. As we have shown analytically and numerically, a cost function with φ_j far from (− ln 2, − ln 2) or with a large deviation of (a_j1, a_j0) fails as a Bayes optimal encoder for BSS. Since actual neuronal networks can perform BSS (Isomura et al., 2015; Isomura & Friston, 2018), one would envisage that the implicit cost function will exhibit a near-optimal φ_j. In particular, when φ_j can be viewed as a constant during a period of experimental observation, the characterization of thresholds is fairly straightforward: using empirically observed neuronal responses, through variational free energy minimization under the constraint e^{φ_j1} + e^{φ_j0} = 1, the estimator of φ_j is obtained as follows:


\boldsymbol{\phi}_j = \ln\!\left( \frac{1}{t} \sum_{\tau=1}^{t} \begin{pmatrix} x_{\tau j} \\ 1 - x_{\tau j} \end{pmatrix} \right).   (3.4)

Crucially, equation 3.4 is exactly the same as equation 3.3 up to a negligible term d^(j). However, equation 3.3 represents the adaptation of a neural network, while equation 3.4 expresses Bayesian inference about φ_j based on empirical data. We applied equation 3.4 to sequences of neural activity generated from the synthetic neural networks used in the simulations reported in Figures 2 and 3, and confirmed that the estimator was a good approximation to the true φ_j (see Figure 4A). This characterization enables the identification of implicit prior beliefs (D) and thus the reconstruction of the neural network cost function, that is, variational free energy (see Figure 4B).


Figure 4: Estimation of prior beliefs enables the prediction of subsequent learning. (A) Estimation of constants (A_11) that characterize thresholds (i.e., prior beliefs) based on sequences of neural activity. Lines and shaded areas indicate the mean and standard deviation. (B) Reconstruction of cost function L using the obtained threshold estimator through equation 3.1. The estimated L well approximates the true L shown in Figure 2C. (C) Prediction of the learning process with new sensory input data generated through a different likelihood mapping characterized by a random matrix A. Supplying the sensory (i.e., input) data to the ensuing cost function provides a synaptic trajectory (red line), which predicts the true trajectory (black line) in the absence of observed neural responses. The inset panel depicts a comparison between elements of the true and predicted W11 at t = 10^4.

Moreover, because a canonical neural network is supposed to use the same prior beliefs for the same sort of tasks, one can use the

reconstructed cost functions to predict subsequent inference and learning
without observing neural activity (see Figure 4C). These results highlight
the utility of reverse-engineering neural networks to predict their activity,
plasticity, and assimilation of input data.
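Because equation 3.4 is just an empirical average of the recorded responses, the reverse-engineering step can be sketched in a few lines of Python (again an illustrative sketch with NumPy; the names are ours, not the authors'):

import numpy as np

def estimate_phi(x):
    """Equation 3.4 (sketch): estimate the implicit threshold constants of
    neuron j from its response sequence x (array of shape (T,), values in (0, 1))."""
    mean_on = x.mean()                        # (1/t) sum_tau x_tau_j
    return np.log([mean_on, 1.0 - mean_on])   # log of the mean (ON, OFF) activity

# a network whose mean response is about 0.5 yields an estimate near (-ln 2, -ln 2),
# the condition identified above for Bayes optimal BSS
x = np.clip(np.random.default_rng(1).normal(0.5, 0.1, size=10000), 1e-6, 1 - 1e-6)
print(estimate_phi(x))   # approximately [-0.69, -0.69]

The estimated constants can then be plugged back into the reconstructed cost function to predict learning on new inputs, as in Figure 4C.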

4 Discussion

In this work, we investigated a class of biologically plausible cost functions for neural networks. A single-layer feedforward neural network with a sigmoid activation function that receives sensory inputs generated by hidden states (i.e., a BSS setup) was considered. We identified a class of cost functions by assuming that neural activity and synaptic plasticity minimize a common function L. The derivative of L with respect to synaptic strengths furnishes a synaptic update rule following Hebbian plasticity, equipped with activity-dependent homeostatic terms. We have shown that the dynamics of a single-layer feedforward neural network, which minimizes its cost function, is asymptotically equivalent to that of variational Bayesian inference under a particular but generic (latent variable) generative model. Therefore, the cost function of the neural network can be viewed as variational free energy, and biological constraints that characterize the neural network—in the form of thresholds and neuronal excitability—become prior beliefs


about hidden states. This relationship holds regardless of the true generative process of the external world. In short, this equivalence provides the insight that any neural and synaptic dynamics (in the class considered) have functional meaning and that any neural network variables and constants can be formally associated with quantities in the variational Bayesian formation, implying that Bayesian inference is a universal characterization of canonical neural networks.

According to the complete class theorem, any dynamics that minimizes a cost function can be viewed as performing Bayesian inference under some prior beliefs (Wald, 1947; Brown, 1981). This implies that any neural network whose activity and plasticity minimize the same cost function can be cast as performing Bayesian inference. Moreover, when a system has reached a (possibly nonequilibrium) steady state, the conditional expectation of internal states of an autonomous system can be shown to parameterize a posterior belief over the hidden states of the external milieu (Friston, 2013, 2019; Parr, Da Costa, & Friston, 2020). Again, this suggests that any (nonequilibrium) steady state can be interpreted as realizing some elemental Bayesian inference.

Having said this, we note that the implicit generative model that underwrites any (e.g., neural network) cost function is a more delicate problem—one that we have addressed in this work. In other words, it is a mathematical truism that certain systems can always be interpreted as minimizing a variational free energy under some prior beliefs (i.e., a generative model). However, this does not mean it is possible to identify the generative model by simply looking at systemic dynamics. To do this, one has to commit to a particular form of the model, so that the sufficient statistics of posterior beliefs are well defined. We have focused on discrete latent variable models that can be regarded as special (reduced) cases of partially observable Markov decision processes (POMDP).

Note that because our treatment is predicated on the complete class theorem (Brown, 1981; Wald, 1947), the same conclusions should, in principle, be reached when using continuous state-space models, such as hierarchical predictive coding models (Friston, 2008; Whittington & Bogacz, 2017; Ahmadi & Tani, 2019). Within the class of discrete state-space models, it is fairly straightforward to generate continuous outcomes from discrete latent states, as exemplified by discrete variational autoencoders (Rolfe, 2016) or mixed models, as described in Friston, Parr et al. (2017). We have described the generative model in terms of an MDP; however, we ignored state transitions. This means the generative model in this letter reduces to a simple latent variable model, with categorical states and outcomes. We have considered MDP models because they predominate in descriptions of variational (Bayesian) belief updating (e.g., Friston, FitzGerald et al., 2017). Clearly, many generative processes entail state transitions, leading to hidden Markov models (HMM). When state transitions depend on control variables, we have an MDP, and when states are only partially observed, we


have a partially observed MDP (POMDP). To deal with these general cases,
extensions of the current framework are required, which we hope to con-
sider in future work, perhaps with recurrent neural networks.

Our theory implies that Hebbian plasticity is a corollary (or realization) of cost function minimization. In particular, Hebbian plasticity with a homeostatic term emerges naturally from a gradient descent on the neural network cost function defined via the integral of neural activity. In other words, the integral of synaptic inputs W_j1 o_t in equation 2.9 yields x_tj W_j1 o_t, and its derivative yields a Hebbian product x_tj o_t^T in equation 2.14. This relationship indicates that this form of synaptic plasticity is natural for canonical neural networks. In contrast, a naive Hebbian plasticity (without a homeostatic term) fails to perform BSS because it updates synapses with false prior beliefs (see Figure 3). It is well known that a modification of Hebbian plasticity is necessary to realize BSS (Földiák, 1990; Linsker, 1997; Isomura & Toyoizumi, 2016), speaking to the importance of selecting the right priors for BSS.
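To illustrate why the homeostatic term matters, the following Python sketch (a toy illustration under our own assumptions, not the simulation code used for Figure 3) compares a naive Hebbian update with the gradient-derived update, in which the Hebbian product is opposed by an activity-dependent pull toward the sigmoid of the current weights; only the latter keeps the synaptic strengths bounded.

import numpy as np

def sig(w):
    return 1.0 / (1.0 + np.exp(-w))

rng = np.random.default_rng(2)
o = rng.integers(0, 2, size=(5000, 32)).astype(float)  # binary inputs
W_naive = np.zeros(32)
W_homeo = np.zeros(32)
eta = 0.01
for o_t in o:
    x_t = sig(W_homeo @ o_t / 32.0)                     # postsynaptic response (illustrative)
    W_naive += eta * x_t * o_t                          # naive Hebbian: grows without bound
    W_homeo += eta * (x_t * o_t - x_t * sig(W_homeo))   # Hebbian product plus homeostatic decay
print(W_naive.max(), W_homeo.max())  # the first keeps growing with T; the second settles at a fixed point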

The proposed equivalence between neural networks and Bayesian infer-
ence may offer insights into designing neural network architectures and
synaptic plasticity rules to perform a given task—by selecting the right
kind of prior beliefs—while retaining their biological plausibility. An in-
teresting extension of the proposed framework is an application to spiking
神经网络. Earlier work has highlighted relationships between spik-
ing neural networks and statistical inference (Bourdoukan, Barrett, Deneve,
& Machens, 2012; Isomura, Sakai, Kotani, & Jimbo, 2016). The current ap-
proach might be in a position to formally link spiking neuron models and
spike-timing dependent plasticity (Markram et al., 1997; Bi & Poo, 1998;
Froemke & Dan, 2002; Feldman, 2012) with variational Bayesian inference.
One can understand the nature of the constants (a_j1, a_j0, β_j1, β_j0) from the biological and Bayesian perspectives as follows: (a_j1, a_j0) determines the firing threshold and thus controls the mean firing rates. In other words, these parameters control the amplitude of excitatory and inhibitory inputs, which may be analogous to the roles of GABAergic inputs (Markram et al., 2004; Isaacson & Scanziani, 2011) and neuromodulators (Pawlak, Wickens, Kirkwood, & Kerr, 2010; Frémaux & Gerstner, 2016) in biological neuronal networks. Meanwhile, (β_j1, β_j0) encodes prior beliefs about states, which exert a large influence on the state posterior. The state posterior is biased if (β_j1, β_j0) is selected in a suboptimal manner—in relation to the process that generates inputs. Meanwhile, (a_j1, a_j0) determines the accuracy of synaptic strengths that represent the likelihood mapping of an observation o_t^(i) taking 1 (ON state) depending on hidden states (compare equation 2.8 and equation 2.20). Under a usual MDP setup where the state prior does not depend on the parameter posterior, the encoder becomes Bayes optimal when and only when (a_j1, a_j0, β_j1, β_j0) = (1⃗, 1⃗, 0⃗, 0⃗). These constants can represent biological constraints on synaptic strengths, such as the range of spine growth,


spinal fluctuations, or the effect of synaptic plasticity induced by sponta-
neous activity independent of external inputs. Although the fidelity of each
synapse is limited due to such internal fluctuations, the accumulation of in-
formation over a large number of synapses should allow accurate encoding
of hidden states in the current formulation.

In previous reports, we have shown that in vitro neural networks—comprising a cortical cell culture—perform BSS when receiving electrical stimulations generated from two hidden sources (Isomura et al., 2015). Furthermore, we showed that minimizing variational free energy under an MDP is sufficient to reproduce the learning observed in an in vitro network (Isomura & Friston, 2018). Our framework for identifying biologically plausible cost functions could be relevant for identifying the principles that underlie learning or adaptation processes in biological neuronal networks, using empirical response data. Here, we illustrated this potential in terms of the choice of function φ_j in the cost functions L. In particular, if φ_j is close to a constant (− ln 2, − ln 2), the cost function is expressed straightforwardly as a variational free energy with small state prior biases. In future work, we plan to apply this scheme to empirical data and examine the biological plausibility of variational free energy minimization.

The correspondence highlighted in this work enables one to identify a generative model (comprising likelihood and priors) that a neural network is using. The formal correspondence between neural network and variational Bayesian formations rests on the asymptotic equivalence between the neural network's cost functions and variational free energy (under some priors). Although variational free energy can take an arbitrary form, this correspondence provides biologically plausible constraints for neural networks that implicitly encode prior distributions. Hence, this formulation is potentially useful for identifying the implicit generative models that underlie the dynamics of real neuronal circuits. In other words, one can quantify the dynamics and plasticity of a neuronal circuit in terms of variational Bayesian inference and learning under an implicit generative model.

Minimization of the cost function can render the neural network optimal in a Bayesian sense, including the choice of the prior, as described in the previous section. The dependence between the likelihood function and the state prior vanishes when the network uses an optimal threshold to perform inference—if the true generative process does not involve dependence between the likelihood and the state prior. In other words, this dependence arises from a suboptimal choice of the prior. Indeed, any free parameters or constraints in a neural network can be optimized by minimizing variational free energy. This is because only variational free energy with the optimal priors—those that match the true generative process of the external world—can provide the global minimum among the class of neural network cost functions under consideration. This is an interesting observation because it suggests that the global minimum of the class of cost functions—which determine neural network dynamics—is characterized by and only


by statistical properties of the external world. This implies that the recapitulation of external dynamics is an inherent feature of canonical neural systems.

Finally, the free energy principle and the complete class theorem imply that any brain function can be formulated in terms of variational Bayesian inference. Our reverse engineering may enable the identification of neuronal substrates or process models underlying brain functions by identifying the implicit generative model from empirical data. Unlike conventional connectomics (based on functional connectivity), reverse engineering furnishes a computational architecture (e.g., a neural network), which encompasses neural activity, synaptic plasticity, and behavior. This may be especially useful for identifying neuronal mechanisms that underlie neurological or psychiatric disorders—by associating pathophysiology with false prior beliefs that may be responsible for things like hallucinations and delusions (Fletcher & Frith, 2009; Friston, Stephan, Montague, & Dolan, 2014).

In summary, we first identified a class of biologically plausible cost functions for neural networks that underlie changes in both neural activity and synaptic plasticity. We then identified an asymptotic equivalence between these cost functions and the cost functions used in variational Bayesian formations. Given this equivalence, changes in the activity and synaptic strengths of a neuronal network can be viewed as Bayesian belief updating—namely, a process of transforming priors over hidden states and parameters into posteriors, respectively. Thus, a cost function in this class becomes Bayes optimal when activity thresholds correspond to appropriate priors in an implicit generative model. In short, the neural and synaptic dynamics of neural networks can be cast as inference and learning, under a variational Bayesian formation. This is potentially important for two reasons. First, it means that there are some threshold parameters for any neural network (in the class considered) that can be optimized for applications to data when there are precise prior beliefs about the process generating those data. Second, in virtue of the complete class theorem, one can reverse-engineer the priors that any neural network is adopting. This may be interesting when real neuronal networks can be modeled using neural networks of the class that we have considered. In other words, if one can fit neuronal responses—using a neural network model parameterized in terms of threshold constants—it becomes possible to evaluate the implicit priors using the above equivalence. This may find a useful application when applied to in vitro (or in vivo) neuronal networks (Isomura & Friston, 2018; Levin, 2013) or, indeed, dynamic causal modeling of distributed neuronal responses from noninvasive data (Daunizeau, David, & Stephan, 2011). In this context, the neural network can, in principle, be used as a dynamic causal model to estimate threshold constants and implicit priors. This "reverse engineering" speaks to estimating the priors used by real neuronal systems, under ideal Bayesian assumptions; this is sometimes referred to as meta-Bayesian inference (Daunizeau et al., 2010).


Appendix: Supplementary Methods

A.1 Order of the Parameter Complexity. The order of the parameter complexity term

D_A \equiv \sum_{i=1}^{N_o} \sum_{j=1}^{N_s} \sum_{l \in \{1,0\}} \left\{ \left( \mathbf{a}^{(i,j)}_{\cdot l} - a^{(i,j)}_{\cdot l} \right) \cdot \ln \mathbf{A}^{(i,j)}_{\cdot l} - \ln B\!\left( \mathbf{a}^{(i,j)}_{\cdot l} \right) \right\}   (A.1)

is computed, where \mathbf{a} denotes the posterior and a the prior Dirichlet parameters. To avoid the divergence of \ln \mathbf{A}^{(i,j)}_{\cdot l}, all the elements of \mathbf{A}^{(i,j)}_{\cdot l} are assumed to be larger than a positive constant \varepsilon. This means that all the elements of \mathbf{a}^{(i,j)}_{\cdot l} are in the order of t. The first term of equation A.1 becomes \left( \mathbf{a}^{(i,j)}_{\cdot l} - a^{(i,j)}_{\cdot l} \right) \cdot \ln \mathbf{A}^{(i,j)}_{\cdot l} = \mathbf{a}^{(i,j)}_{\cdot l} \cdot \ln \mathbf{A}^{(i,j)}_{\cdot l} + O(1), since a^{(i,j)}_{\cdot l} is in the order of 1. Moreover, from equation 2.3, \mathbf{a}^{(i,j)}_{\cdot l} \cdot \ln \mathbf{A}^{(i,j)}_{\cdot l} = \mathbf{a}^{(i,j)}_{\cdot l} \cdot \ln \mathbf{a}^{(i,j)}_{\cdot l} - \left( \mathbf{a}^{(i,j)}_{1l} + \mathbf{a}^{(i,j)}_{0l} \right) \ln\!\left( \mathbf{a}^{(i,j)}_{1l} + \mathbf{a}^{(i,j)}_{0l} \right) + O(1). Meanwhile, the second term of equation A.1 comprises the logarithms of gamma functions, as \ln B\!\left( \mathbf{a}^{(i,j)}_{\cdot l} \right) = \ln \Gamma\!\left( \mathbf{a}^{(i,j)}_{1l} \right) + \ln \Gamma\!\left( \mathbf{a}^{(i,j)}_{0l} \right) - \ln \Gamma\!\left( \mathbf{a}^{(i,j)}_{1l} + \mathbf{a}^{(i,j)}_{0l} \right). From Stirling's formula,

\Gamma\!\left( \mathbf{a}^{(i,j)}_{1l} \right) = \sqrt{2\pi} \left( \mathbf{a}^{(i,j)}_{1l} \right)^{\mathbf{a}^{(i,j)}_{1l} - \frac{1}{2}} e^{-\mathbf{a}^{(i,j)}_{1l}} \left( 1 + O\!\left( \left( \mathbf{a}^{(i,j)}_{1l} \right)^{-1} \right) \right)   (A.2)

holds. The logarithm of \Gamma\!\left( \mathbf{a}^{(i,j)}_{1l} \right) is evaluated as

\ln \Gamma\!\left( \mathbf{a}^{(i,j)}_{1l} \right) = \frac{1}{2} \ln 2\pi - \frac{1}{2} \ln \mathbf{a}^{(i,j)}_{1l} + \mathbf{a}^{(i,j)}_{1l} \left( \ln \mathbf{a}^{(i,j)}_{1l} - 1 \right) + \ln\!\left( 1 + O\!\left( \left( \mathbf{a}^{(i,j)}_{1l} \right)^{-1} \right) \right) = \mathbf{a}^{(i,j)}_{1l} \ln \mathbf{a}^{(i,j)}_{1l} - \mathbf{a}^{(i,j)}_{1l} + O(\ln t).   (A.3)

Similarly, \ln \Gamma\!\left( \mathbf{a}^{(i,j)}_{0l} \right) = \mathbf{a}^{(i,j)}_{0l} \ln \mathbf{a}^{(i,j)}_{0l} - \mathbf{a}^{(i,j)}_{0l} + O(\ln t) and \ln \Gamma\!\left( \mathbf{a}^{(i,j)}_{1l} + \mathbf{a}^{(i,j)}_{0l} \right) = \left( \mathbf{a}^{(i,j)}_{1l} + \mathbf{a}^{(i,j)}_{0l} \right) \ln\!\left( \mathbf{a}^{(i,j)}_{1l} + \mathbf{a}^{(i,j)}_{0l} \right) - \left( \mathbf{a}^{(i,j)}_{1l} + \mathbf{a}^{(i,j)}_{0l} \right) + O(\ln t) hold. Thus, we obtain

\ln B\!\left( \mathbf{a}^{(i,j)}_{\cdot l} \right) = \mathbf{a}^{(i,j)}_{1l} \ln \mathbf{a}^{(i,j)}_{1l} + \mathbf{a}^{(i,j)}_{0l} \ln \mathbf{a}^{(i,j)}_{0l} - \left( \mathbf{a}^{(i,j)}_{1l} + \mathbf{a}^{(i,j)}_{0l} \right) \ln\!\left( \mathbf{a}^{(i,j)}_{1l} + \mathbf{a}^{(i,j)}_{0l} \right) + O(\ln t) = \mathbf{a}^{(i,j)}_{\cdot l} \cdot \ln \mathbf{A}^{(i,j)}_{\cdot l} + O(\ln t).   (A.4)

Hence, equation A.1 becomes

D_A = \sum_{i=1}^{N_o} \sum_{j=1}^{N_s} \sum_{l \in \{1,0\}} \left\{ \mathbf{a}^{(i,j)}_{\cdot l} \cdot \ln \mathbf{A}^{(i,j)}_{\cdot l} + O(1) - \mathbf{a}^{(i,j)}_{\cdot l} \cdot \ln \mathbf{A}^{(i,j)}_{\cdot l} + O(\ln t) \right\} = O(\ln t).   (A.5)

Therefore, we obtain

F(\tilde{o}, Q(\tilde{s}), Q(A)) = \sum_{j=1}^{N_s} \sum_{\tau=1}^{t} s^{(j)}_\tau \cdot \left( \ln s^{(j)}_\tau - \sum_{i=1}^{N_o} \ln \mathbf{A}^{(i,j)} \cdot o^{(i)}_\tau - \ln D^{(j)} \right) + O(\ln t).   (A.6)

Under the current generative model comprising binary hidden states and binary observations, the optimal posterior expectation of A can be obtained up to the order of \ln t / t even when the O(\ln t) term in equation A.6 is ignored. Solving the variation of F with respect to \mathbf{A}^{(i,j)}_{1l} yields the optimal posterior expectation. From \mathbf{A}^{(i,j)}_{0l} = 1 - \mathbf{A}^{(i,j)}_{1l}, we find

\delta F = \sum_{i=1}^{N_o} \sum_{j=1}^{N_s} \sum_{\tau=1}^{t} s^{(j)}_\tau \cdot \left( -\delta \ln \mathbf{A}^{(i,j)}_{1\cdot}\, o^{(i)}_\tau - \delta \ln\!\left( \vec{1} - \mathbf{A}^{(i,j)}_{1\cdot} \right) \left( 1 - o^{(i)}_\tau \right) \right)
   = t \sum_{j=1}^{N_s} \sum_{i=1}^{N_o} \left( -\delta \mathbf{A}^{(i,j)}_{1\cdot} \odot \left( \mathbf{A}^{(i,j)}_{1\cdot} \right)^{\odot -1} \cdot \overline{o^{(i)}_t \otimes s^{(j)}_t} + \delta \mathbf{A}^{(i,j)}_{1\cdot} \odot \left( \vec{1} - \mathbf{A}^{(i,j)}_{1\cdot} \right)^{\odot -1} \cdot \overline{\left( 1 - o^{(i)}_t \right) s^{(j)}_t} \right)
   = t \sum_{j=1}^{N_s} \sum_{i=1}^{N_o} \delta \mathbf{A}^{(i,j)}_{1\cdot} \odot \left( \mathbf{A}^{(i,j)}_{1\cdot} \right)^{\odot -1} \odot \left( \vec{1} - \mathbf{A}^{(i,j)}_{1\cdot} \right)^{\odot -1} \odot \left( \mathbf{A}^{(i,j)}_{1\cdot} \odot \overline{s^{(j)}_t} - \overline{o^{(i)}_t s^{(j)}_t} \right)   (A.7)

up to the order of \ln t. Here, \left( \mathbf{A}^{(i,j)} \right)^{\odot -1} denotes the element-wise inverse of \mathbf{A}^{(i,j)}. From \delta F = 0, we find

\mathbf{A}^{(i,j)}_{1\cdot} = \overline{o^{(i)}_t s^{(j)}_t} \odot \left( \overline{s^{(j)}_t} \right)^{\odot -1} + O\!\left( \frac{\ln t}{t} \right).   (A.8)

Therefore, we obtain the same result as equation 2.8 up to the order of ln t/t.
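The O(ln t) bookkeeping above rests on the Stirling expansion ln Γ(a) = a ln a − a + O(ln a). A quick numerical check (Python with SciPy; our own illustration) shows that the neglected terms indeed grow only logarithmically in the concentration parameters, which is what licenses treating the parameter complexity as negligible at large t.

import numpy as np
from scipy.special import gammaln

for a in [1e2, 1e4, 1e6]:
    exact = gammaln(a)              # ln Gamma(a)
    leading = a * np.log(a) - a     # a ln a - a
    print(a, exact - leading)       # residual grows like 0.5 * ln(2*pi*a), i.e., O(ln a)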


A.2 Correspondence between Parameter Prior Distribution and Initial Synaptic Strengths. In general, optimizing a model of observable quantities—including a neural network—can be cast as inference if there exists a learning mechanism that updates the hidden states and parameters of that model based on observations. (Exact and variational) Bayesian inference treats the hidden states and parameters as random variables and thus transforms prior distributions P(s_t), P(A) into posteriors Q(s_t), Q(A). In other words, Bayesian inference is a process of transforming the prior to the posterior based on observations o_1, ..., o_t under a generative model. From this perspective, the incorporation of prior knowledge about the hidden states and parameters is an important aspect of Bayesian inference.

The minimization of a cost function by a neural network updates its activity and synaptic strengths based on observations under the given network properties (e.g., activation function and thresholds). According to the complete class theorem, this process can always be viewed as Bayesian inference. We have demonstrated that a class of cost functions—for a single-layer feedforward network with a sigmoid activation function—has a form equivalent to variational free energy under a particular latent variable model. Here, neural activity x_t and synaptic strengths W come to encode the posterior distributions over hidden states Q′(s_t) and parameters Q′(A), respectively, where Q′(s_t) and Q′(A) follow categorical and Dirichlet distributions, respectively. Moreover, we identified that the perturbation factors φ_j, which characterize the threshold function, correspond to the logarithm of the state prior P(s_t) expressed as a categorical distribution.

However, one might ask whether the posteriors obtained using the network, Q′(s_t) and Q′(A), are formally different from those obtained using variational Bayesian inference, Q(s_t) and Q(A), since only the latter explicitly considers the prior distribution of parameters P(A). Thus, one may wonder whether the network merely implements update rules that are similar to those of variational Bayes but do not transform the priors P(s_t), P(A) into the posteriors Q(s_t), Q(A), despite the asymptotic equivalence of the cost functions. Below, we show that the initial values of the synaptic strengths, W^init_j1 and W^init_j0, correspond to the parameter prior P(A) expressed as a Dirichlet distribution, which establishes that a neural network indeed transforms the priors into the posteriors. For this purpose, we specify the order-1 term in equation 2.12 to make the dependence on the initial synaptic strengths explicit. Specifically, we modify equation 2.12 as

3

L_j = \sum_{\tau=1}^{t} \left\{ f\!\left( \begin{pmatrix} W_{j1} \\ W_{j0} \end{pmatrix} o_\tau + \begin{pmatrix} h_{j1} \\ h_{j0} \end{pmatrix} \right) - (x_{\tau j},\, 1 - x_{\tau j}) \left( \begin{pmatrix} W_{j1} \\ W_{j0} \end{pmatrix} o_\tau + \begin{pmatrix} h_{j1} \\ h_{j0} \end{pmatrix} \right) \right\}
   - \left( \lambda_{j1} \odot \hat{W}^{\mathrm{init}}_{j1},\, \lambda_{j0} \odot \hat{W}^{\mathrm{init}}_{j0} \right) (W_{j1}, W_{j0})^{\mathrm{T}}
   - (\lambda_{j1}, \lambda_{j0}) \left( \ln(\vec{1} - \hat{W}_{j1}),\, \ln(\vec{1} - \hat{W}_{j0}) \right)^{\mathrm{T}},   (A.9)

where \hat{W}^{\mathrm{init}}_{j1} \equiv \mathrm{sig}(W^{\mathrm{init}}_{j1}) and \hat{W}^{\mathrm{init}}_{j0} \equiv \mathrm{sig}(W^{\mathrm{init}}_{j0}) are the sigmoid functions of the initial synaptic strengths, and \lambda_{j1}, \lambda_{j0} \in \mathbb{R}^{N_o} are row vectors of inverse learning rate factors that express the insensitivity of the synaptic strengths to activity-dependent synaptic plasticity. The third term of equation A.9 expresses the integral of \hat{W}_{j1} and \hat{W}_{j0} (with respect to W_{j1} and W_{j0}, respectively). This ensures that when t = 0 (i.e., when the first term on the right-hand side of equation A.9 is zero), the derivative of L_j is given by \partial L_j / \partial W_{j1} = \lambda_{j1} \odot \hat{W}_{j1} - \lambda_{j1} \odot \hat{W}^{\mathrm{init}}_{j1}, and thus (W_{j1}, W_{j0}) = (W^{\mathrm{init}}_{j1}, W^{\mathrm{init}}_{j0}) provides the fixed point of L_j.

Similar to the transformation from equation 2.12 to equation 2.17, we compute equation A.9 as

L = \sum_{j=1}^{N_x} \sum_{\tau=1}^{t} (x_{\tau j},\, 1 - x_{\tau j}) \left\{ \begin{pmatrix} \ln x_{\tau j} \\ \ln(1 - x_{\tau j}) \end{pmatrix} - \begin{pmatrix} \ln \hat{W}_{j1} & \ln(\vec{1} - \hat{W}_{j1}) \\ \ln \hat{W}_{j0} & \ln(\vec{1} - \hat{W}_{j0}) \end{pmatrix} \begin{pmatrix} o_\tau \\ \vec{1} - o_\tau \end{pmatrix} - \begin{pmatrix} \phi_{j1} \\ \phi_{j0} \end{pmatrix} \right\}
   - \sum_{j=1}^{N_x} \left\{ \left( \ln \hat{W}_{j1},\, \ln(\vec{1} - \hat{W}_{j1}) \right) \left( \lambda_{j1} \odot \hat{W}^{\mathrm{init}}_{j1},\, \lambda_{j1} \odot (\vec{1} - \hat{W}^{\mathrm{init}}_{j1}) \right)^{\mathrm{T}} + \left( \ln \hat{W}_{j0},\, \ln(\vec{1} - \hat{W}_{j0}) \right) \left( \lambda_{j0} \odot \hat{W}^{\mathrm{init}}_{j0},\, \lambda_{j0} \odot (\vec{1} - \hat{W}^{\mathrm{init}}_{j0}) \right)^{\mathrm{T}} \right\}.   (A.10)

Note that we used W_{j1} = \ln \hat{W}_{j1} - \ln(\vec{1} - \hat{W}_{j1}). Crucially, analogous to the correspondence between \hat{W}_{j1} and the Dirichlet parameters of the parameter posterior \mathbf{a}^{(\cdot,j)}_{11}, \lambda_{j1} \odot \hat{W}^{\mathrm{init}}_{j1} can be formally associated with the Dirichlet parameters of the parameter prior a^{(\cdot,j)}_{11}. Thus, one can see the formal correspondence between the second and third terms on the right-hand side of equation A.10 and the expectation of the log parameter prior in equation 2.4:

E_{Q(A)}[\ln P(A)] = \sum_{i=1}^{N_o} \sum_{j=1}^{N_s} \ln \mathbf{A}^{(i,j)} \cdot a^{(i,j)} = \sum_{i=1}^{N_o} \sum_{j=1}^{N_s} \left( \ln \mathbf{A}^{(i,j)}_{\cdot 1} \cdot a^{(i,j)}_{\cdot 1} + \ln \mathbf{A}^{(i,j)}_{\cdot 0} \cdot a^{(i,j)}_{\cdot 0} \right).   (A.11)


Furthermore, the synaptic update rules are derived from equation A.10 as

\dot{W}_{j1} \propto -\frac{1}{t} \frac{\partial L}{\partial W_{j1}} = \overline{x_{tj} o_t^{\mathrm{T}}} - \overline{x_{tj}}\, \hat{W}_{j1} + \frac{1}{t} \left( \lambda_{j1} \odot \hat{W}^{\mathrm{init}}_{j1} - \lambda_{j1} \odot \hat{W}_{j1} \right) + \overline{x_{tj}}\, \phi'_{j1}
\dot{W}_{j0} \propto -\frac{1}{t} \frac{\partial L}{\partial W_{j0}} = \overline{(1 - x_{tj}) o_t^{\mathrm{T}}} - \overline{1 - x_{tj}}\, \hat{W}_{j0} + \frac{1}{t} \left( \lambda_{j0} \odot \hat{W}^{\mathrm{init}}_{j0} - \lambda_{j0} \odot \hat{W}_{j0} \right) + \overline{1 - x_{tj}}\, \phi'_{j0}.   (A.12)

The fixed point of equation A.12 is provided as

W_{j1} = \mathrm{sig}^{-1}\!\left( \left( t\, \overline{x_{tj}}\, \vec{1} + \lambda_{j1} \right)^{\odot -1} \odot \left( t\, \overline{x_{tj} o_t^{\mathrm{T}}} + t\, \overline{x_{tj}}\, \phi'_{j1} + \lambda_{j1} \odot \hat{W}^{\mathrm{init}}_{j1} \right) \right)
W_{j0} = \mathrm{sig}^{-1}\!\left( \left( t\, \overline{1 - x_{tj}}\, \vec{1} + \lambda_{j0} \right)^{\odot -1} \odot \left( t\, \overline{(1 - x_{tj}) o_t^{\mathrm{T}}} + t\, \overline{1 - x_{tj}}\, \phi'_{j0} + \lambda_{j0} \odot \hat{W}^{\mathrm{init}}_{j0} \right) \right).   (A.13)


Note that the synaptic strengths at t = 0 are computed as W_{j1} = \mathrm{sig}^{-1}\!\left( (\lambda_{j1})^{\odot -1} \odot \left( \lambda_{j1} \odot \hat{W}^{\mathrm{init}}_{j1} \right) \right) = W^{\mathrm{init}}_{j1}. Again, one can see the formal correspondence between the final values of the synaptic strengths given by equation A.13 in the neural network formation and the parameter posterior given by equation 2.8 in the variational Bayesian formation. As the Dirichlet parameter of the posterior \mathbf{a}^{(\cdot,j)}_{11} is decomposed into the outer product \overline{o_t \otimes s^{(j)}_t}\, t and the prior a^{(\cdot,j)}_{11}, they are associated with \overline{x_{tj} o_t^{\mathrm{T}}}\, t and \lambda_{j1} \odot \hat{W}^{\mathrm{init}}_{j1}, respectively. Thus, equation 2.8 corresponds to equation A.13. Hence, for a given constant set \left( W^{\mathrm{init}}_{j1}, W^{\mathrm{init}}_{j0}, \lambda_{j1}, \lambda_{j0} \right), we identify the corresponding parameter prior P\!\left( A^{(\cdot,j)} \right) = \mathrm{Dir}\!\left( a^{(\cdot,j)} \right), given by

a^{(\cdot,j)} = \begin{pmatrix} a^{(\cdot,j)}_{11} & a^{(\cdot,j)}_{10} \\ a^{(\cdot,j)}_{01} & a^{(\cdot,j)}_{00} \end{pmatrix} = \begin{pmatrix} \lambda_{j1} \odot \hat{W}^{\mathrm{init}}_{j1} & \lambda_{j0} \odot \hat{W}^{\mathrm{init}}_{j0} \\ \lambda_{j1} \odot (\vec{1} - \hat{W}^{\mathrm{init}}_{j1}) & \lambda_{j0} \odot (\vec{1} - \hat{W}^{\mathrm{init}}_{j0}) \end{pmatrix}.   (A.14)

In summary, one can establish the formal correspondence between neural network and variational Bayesian formations in terms of the cost functions (see equation 2.4 versus equation A.10), priors (see equations 2.18 and A.14), and posteriors (see equation 2.8 versus equation A.13). This means that a neural network successively transforms priors P(s_t), P(A) into posteriors Q(s_t), Q(A), as parameterized with neural activity, and initial and final synaptic strengths (and thresholds). Crucially, when increasing the number of observations, this process is asymptotically equivalent to that of variational Bayesian inference under a specific likelihood function.
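Read as a recipe, equation A.14 translates between the two parameterizations. The Python sketch below (our own illustration of the correspondence as reconstructed above; sig denotes the logistic sigmoid) converts a chosen Dirichlet prior into initial synaptic strengths and inverse learning rate factors, and back again.

import numpy as np

def sig(w):
    return 1.0 / (1.0 + np.exp(-w))

def prior_to_network(a1, a0):
    """Map Dirichlet prior parameters (a1, a0) for one synapse type to an
    initial synaptic strength and an inverse learning rate factor (cf. equation A.14)."""
    lam = a1 + a0                      # lambda = a1 + a0
    W_init = np.log(a1) - np.log(a0)   # sig^{-1}( a1 / (a1 + a0) )
    return W_init, lam

def network_to_prior(W_init, lam):
    """Inverse mapping: recover the Dirichlet prior from (W_init, lambda)."""
    return lam * sig(W_init), lam * (1.0 - sig(W_init))

a1, a0 = np.array([3.0, 0.5]), np.array([1.0, 1.5])
W0, lam = prior_to_network(a1, a0)
print(network_to_prior(W0, lam))       # recovers (a1, a0)

In this reading, a confident prior (large a1 + a0) corresponds to a large inverse learning rate, that is, to synapses that are initialized firmly and change slowly.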


A.3 Derivation of Synaptic Plasticity Rule. We consider synaptic strengths at time t, W_{j1} = W_{j1}(t), and define the change as \Delta W_{j1} \equiv W_{j1}(t+1) - W_{j1}(t). From equation 2.15, h'_1(W_{j1}) satisfies both

h'_1(W_{j1} + \Delta W_{j1}) - h'_1(W_{j1}) = h''_1(W_{j1}) \odot \Delta W_{j1} + O\!\left( \| \Delta W_{j1} \|^2 \right)   (A.15)

and

h'_1(W_{j1} + \Delta W_{j1}) - h'_1(W_{j1}) = -\frac{ x_{(t+1)j}\, o^{\mathrm{T}}_{t+1} + t\, \overline{x_{tj} o^{\mathrm{T}}_t} }{ x_{(t+1)j} + t\, \overline{x_{tj}} } + \frac{ \overline{x_{tj} o^{\mathrm{T}}_t} }{ \overline{x_{tj}} }
   \approx -\frac{ x_{(t+1)j}\, o^{\mathrm{T}}_{t+1} }{ t\, \overline{x_{tj}} } + \frac{ \overline{x_{tj} o^{\mathrm{T}}_t} }{ t\, \overline{x_{tj}}^{\,2} }\, x_{(t+1)j}
   = -\frac{1}{ t\, \overline{x_{tj}} } \left( x_{(t+1)j}\, o^{\mathrm{T}}_{t+1} + x_{(t+1)j}\, h'_1(W_{j1}) \right).   (A.16)

Thus, we find

\Delta W_{j1} = \underbrace{ -\frac{ h''_1(W_{j1})^{\odot -1} }{ t\, \overline{x_{tj}} } }_{\text{adaptive learning rate}} \odot \Bigl( \underbrace{ x_{(t+1)j}\, o^{\mathrm{T}}_{t+1} }_{\text{Hebbian term}} + \underbrace{ x_{(t+1)j}\, h'_1(W_{j1}) }_{\text{homeostatic term}} \Bigr).   (A.17)

Similarly,

\Delta W_{j0} = \underbrace{ -\frac{ h''_0(W_{j0})^{\odot -1} }{ t\, \overline{1 - x_{tj}} } }_{\text{adaptive learning rate}} \odot \Bigl( \underbrace{ (1 - x_{(t+1)j})\, o^{\mathrm{T}}_{t+1} }_{\text{anti-Hebbian term}} + \underbrace{ (1 - x_{(t+1)j})\, h'_0(W_{j0}) }_{\text{homeostatic term}} \Bigr).   (A.18)

These plasticity rules express (anti-) Hebbian plasticity with a homeostatic term.
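Written as a discrete update step, the rule in equations A.17 and A.18 can be sketched in Python as follows (our own illustration; the curvature factor −(h''_1)^{⊙−1} is replaced by an unspecified positive gain, and h'_1 is taken as −x̄o/x̄, consistent with equation A.16 above).

import numpy as np

def plasticity_step(W1, x_new, o_new, x_bar, xo_bar, t, gain=1.0):
    """One step of the Hebbian-plus-homeostatic rule (equation A.17, sketch).
    W1     : synaptic strengths of neuron j (shape (No,))
    x_new  : current postsynaptic response x_{(t+1)j} (scalar)
    o_new  : current input vector o_{t+1} (shape (No,))
    x_bar  : running average of x_{tj} (scalar)
    xo_bar : running average of x_{tj} * o_t (shape (No,))
    gain   : stand-in for the positive curvature factor
    """
    h1_prime = -xo_bar / x_bar           # h'_1(W_j1), per equation A.16
    lr = gain / (t * x_bar)              # adaptive learning rate ~ 1 / (t * mean activity)
    hebbian = x_new * o_new              # Hebbian term
    homeostatic = x_new * h1_prime       # homeostatic term
    return W1 + lr * (hebbian + homeostatic)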

Data Availability

All relevant data are within the letter. Matlab scripts are available at
https://github.com/takuyaisomura/reverse_engineering.

Acknowledgments

This work was supported in part by the grant of Joint Research by the National Institutes of Natural Sciences (NINS Program No. 01112005). T.I. is
D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

/

e
d

n
e
C

A
r
t

C
e

p
d

/

F
/

/

/

/

3
2
1
1
2
0
8
5
1
8
6
5
4
2
3
n
e
C

_
A
_
0
1
3
1
5
p
d

.

/

F


y
G

e
s
t

t


n
0
8
S
e
p
e


e
r
2
0
2
3

2118

时间. Isomura and K. 弗里斯顿

funded by the RIKEN Center for Brain Science. K.J.F. is funded by a Wellcome Principal Research Fellowship (088130/Z/09/Z). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

Ahmadi, A., & Tani, J. (2019). A novel predictive-coding-inspired variational RNN model for online prediction and recognition. Neural Comput., 31, 2025–2074.
Albus, J. S. (1971). A theory of cerebellar function. Math. Biosci., 10, 25–61.
Bastos, A. M., Usrey, W. M., Adams, R. A., Mangun, G. R., Fries, P., & Friston, K. J. (2012). Canonical microcircuits for predictive coding. Neuron, 76, 695–711.
Belouchrani, A., Abed-Meraim, K., Cardoso, J. F., & Moulines, E. (1997). A blind source separation technique using second-order statistics. IEEE Trans. Signal Process., 45, 434–444.
Bi, G. Q., & Poo, M. M. (1998). Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neurosci., 18, 10464–10472.
Bishop, C. M. (2006). Pattern recognition and machine learning. Berlin: Springer.
Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational inference: A review for statisticians. J. Am. Stat. Assoc., 112, 859–877.
Bliss, T. V., & Lømo, T. (1973). Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path. J. Physiol., 232, 331–356.
Bourdoukan, R., Barrett, D., Deneve, S., & Machens, C. K. (2012). Learning optimal spike-based representations. In F. Pereira, C. J. C. Burges, L. Bottou, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 25 (pp. 2285–2293). Red Hook, NY: Curran.
Brown, G. D., Yamada, S., & Sejnowski, T. J. (2001). Independent component analysis at the neural cocktail party. Trends Neurosci., 24, 54–63.
Brown, L. D. (1981). A complete class theorem for statistical problems with finite-sample spaces. Ann. Stat., 9, 1289–1300.
Cichocki, A., Zdunek, R., Phan, A. H., & Amari, S. I. (2009). Nonnegative matrix and tensor factorizations: Applications to exploratory multi-way data analysis and blind source separation. Hoboken, NJ: Wiley.
Comon, P., & Jutten, C. (2010). Handbook of blind source separation: Independent component analysis and applications. Orlando, FL: Academic Press.
Daunizeau, J., David, O., & Stephan, K. E. (2011). Dynamic causal modelling: A critical review of the biophysical and statistical foundations. NeuroImage, 58, 312–322.
Daunizeau, J., Den Ouden, H. E., Pessiglione, M., Kiebel, S. J., Stephan, K. E., & Friston, K. J. (2010). Observing the observer (I): Meta-Bayesian models of learning and decision-making. PLOS One, 5, e15554.
Dauwels, J. (2007). On variational message passing on factor graphs. In Proceedings of the International Symposium on Information Theory. Piscataway, NJ: IEEE.
Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience: Computational and mathematical modeling of neural systems. Cambridge, MA: MIT Press.


Dayan, P., Hinton, G. E., Neal, R. M., & Zemel, R. S. (1995). The Helmholtz machine. Neural Comput., 7, 889–904.
DiCarlo, J. J., Zoccolan, D., & Rust, N. C. (2012). How does the brain solve visual object recognition? Neuron, 73, 415–434.
Feldman, D. E. (2012). The spike-timing dependence of plasticity. Neuron, 75, 556–571.
Fletcher, P. C., & Frith, C. D. (2009). Perceiving is believing: A Bayesian approach to explaining the positive symptoms of schizophrenia. Nat. Rev. Neurosci., 10, 48–58.
Földiák, P. (1990). Forming sparse representations by local anti-Hebbian learning. Biol. Cybern., 64, 165–170.
Forney, G. D. (2001). Codes on graphs: Normal realizations. IEEE Trans. Info. Theory, 47, 520–548.
Frémaux, N., & Gerstner, W. (2016). Neuromodulated spike-timing-dependent plasticity, and theory of three-factor learning rules. Front. Neural Circuits, 9.
Friston, K. (2005). A theory of cortical responses. Philos. Trans. R. Soc. Lond. B Biol. Sci., 360, 815–836.
Friston, K. (2008). Hierarchical models in the brain. PLOS Comput. Biol., 4, e1000211.
Friston, K. (2010). The free-energy principle: A unified brain theory? Nat. Rev. Neurosci., 11, 127–138.
Friston, K. (2013). Life as we know it. J. R. Soc. Interface, 10, 20130475.
Friston, K. (2019). A free energy principle for a particular physics. arXiv:1906.10184.
Friston, K., FitzGerald, T., Rigoli, F., Schwartenbeck, P., & Pezzulo, G. (2016). Active inference and learning. Neurosci. Biobehav. Rev., 68, 862–879.
Friston, K., FitzGerald, T., Rigoli, F., Schwartenbeck, P., & Pezzulo, G. (2017). Active inference: A process theory. Neural Comput., 29, 1–49.
Friston, K., Kilner, J., & Harrison, L. (2006). A free energy principle for the brain. J. Physiol. Paris, 100, 70–87.
Friston, K. J., Lin, M., Frith, C. D., Pezzulo, G., Hobson, J. A., & Ondobaka, S. (2017). Active inference, curiosity and insight. Neural Comput., 29, 2633–2683.
Friston, K., Mattout, J., & Kilner, J. (2011). Action understanding and active inference. Biol. Cybern., 104, 137–160.
Friston, K. J., Parr, T., & de Vries, B. D. (2017). The graphical brain: Belief propagation and active inference. Netw. Neurosci., 1, 381–414.
Friston, K. J., Stephan, K. E., Montague, R., & Dolan, R. J. (2014). Computational psychiatry: The brain as a phantastic organ. Lancet Psychiatry, 1, 148–158.
Froemke, R. C., & Dan, Y. (2002). Spike-timing-dependent synaptic modification induced by natural spike trains. Nature, 416, 433–438.
George, D., & Hawkins, J. (2009). Towards a mathematical theory of cortical microcircuits. PLOS Comput. Biol., 5, e1000532.
Hebb, D. O. (1949). The organization of behavior: A neuropsychological theory. New York: Wiley.
Isaacson, J. S., & Scanziani, M. (2011). How inhibition shapes cortical activity. Neuron, 72, 231–243.
Isomura, T. (2018). A measure of information available for inference. Entropy, 20, 512.
Isomura, T., & Friston, K. (2018). In vitro neural networks minimize variational free energy. Sci. Rep., 8, 16926.


Isomura, T., Kotani, K., & Jimbo, Y. (2015). Cultured cortical neurons can perform blind source separation according to the free-energy principle. PLOS Comput. Biol., 11, e1004643.
Isomura, T., Sakai, K., Kotani, K., & Jimbo, Y. (2016). Linking neuromodulated spike-timing dependent plasticity with the free-energy principle. Neural Comput., 28, 1859–1888.
Isomura, T., & Toyoizumi, T. (2016). A local learning rule for independent component analysis. Sci. Rep., 6, 28073.
Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference for Learning Representations. ICLR-15.
Knill, D. C., & Pouget, A. (2004). The Bayesian brain: The role of uncertainty in neural coding and computation. Trends Neurosci., 27, 712–719.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. Ann. Math. Stat., 22, 79–86.
Kuśmierz, Ł., Isomura, T., & Toyoizumi, T. (2017). Learning with three factors: Modulating Hebbian plasticity with errors. Curr. Opin. Neurobiol., 46, 170–177.
Lee, T. W., Girolami, M., Bell, A. J., & Sejnowski, T. J. (2000). A unifying information-theoretic framework for independent component analysis. Comput. Math. Appl., 39, 1–21.
Levin, M. (2013). Reprogramming cells and tissue patterning via bioelectrical pathways: Molecular mechanisms and biomedical opportunities. Wiley Interdiscip. Rev. Syst. Biol. Med., 5, 657–676.
Linsker, R. (1988). Self-organization in a perceptual network. Computer, 21, 105–117.
Linsker, R. (1997). A local learning rule that enables information maximization for arbitrary input distributions. Neural Comput., 9, 1661–1665.
Malenka, R. C., & Bear, M. F. (2004). LTP and LTD: An embarrassment of riches. Neuron, 44, 5–21.
Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215.
Markram, H., Toledo-Rodriguez, M., Wang, Y., Gupta, A., Silberberg, G., & Wu, C. (2004). Interneurons of the neocortical inhibitory system. Nat. Rev. Neurosci., 5, 793–807.
Marr, D. (1969). A theory of cerebellar cortex. J. Physiol., 202, 437–470.
Mesgarani, N., & Chang, E. F. (2012). Selective cortical representation of attended speaker in multi-talker speech perception. Nature, 485, 233–236.
Newsome, W. T., Britten, K. H., & Movshon, J. A. (1989). Neuronal correlates of a perceptual decision. Nature, 341, 52–54.
Parr, T., Da Costa, L., & Friston, K. (2020). Markov blankets, information geometry and stochastic thermodynamics. Phil. Trans. R. Soc. A, 378, 20190159.
Pawlak, V., Wickens, J. R., Kirkwood, A., & Kerr, J. N. (2010). Timing is not everything: Neuromodulation opens the STDP gate. Front. Syn. Neurosci., 2, 146.
Rao, R. P., & Ballard, D. H. (1999). Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nat. Neurosci., 2, 79–87.
Rolfe, J. T. (2016). Discrete variational autoencoders. arXiv:1609.02200.
Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275, 1593–1599.


Schwartenbeck, P., & Friston, K. (2016). Computational phenotyping in psychiatry: A worked example. eNeuro, 3, e0049–16.2016.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning. Cambridge, MA: MIT Press.
Tolhurst, D. J., Movshon, J. A., & Dean, A. F. (1983). The statistical reliability of signals in single neurons in cat and monkey visual cortex. Vision Res., 23, 775–785.
Turrigiano, G. G., & Nelson, S. B. (2004). Homeostatic plasticity in the developing nervous system. Nat. Rev. Neurosci., 5, 97–107.
von Helmholtz, H. (1925). Treatise on physiological optics (Vol. 3). Washington, DC: Optical Society of America.
Wald, A. (1947). An essentially complete class of admissible decision functions. Ann. Math. Stat., 18, 549–555.
Whittington, J. C., & Bogacz, R. (2017). An approximation of the error backpropagation algorithm in a predictive coding network with local Hebbian synaptic plasticity. Neural Comput., 29, 1229–1262.

Received December 20, 2019; accepted June 22, 2020.
