ARTÍCULO
Communicated by Dirk Ostwald
Learning in Volatile Environments
With the Bayes Factor Surprise
Vasiliki Liakoni*
vasiliki.liakoni@epfl.ch
Alireza Modirshanechi*
alireza.modirshanechi@epfl.ch
Wulfram Gerstner
wulfram.gerstner@epfl.ch
Johanni Brea
johanni.brea@epfl.ch
École Polytechnique Fédérale de Lausanne, School of Computer and Communication
Sciences and School of Life Sciences, 1015 Lausanne, Suiza
Surprise-based learning allows agents to rapidly adapt to nonstationary
stochastic environments characterized by sudden changes. Nosotros mostramos que
exact Bayesian inference in a hierarchical model gives rise to a surprise-
modulated trade-off between forgetting old observations and integrating
them with the new ones. The modulation depends on a probability ratio,
which we call the Bayes Factor Surprise, that tests the prior belief against
the current belief. We demonstrate that in several existing approximate
algoritmos, the Bayes Factor Surprise modulates the rate of adaptation
to new observations. We derive three novel surprise-based algorithms,
one in the family of particle filters, one in the family of variational learn-
En g, and one in the family of message passing, that have constant scaling
in observation sequence length and particularly simple update dynam-
ics for any distribution in the exponential family. Empirical results show
that these surprise-based algorithms estimate parameters better than
alternative approximate approaches and reach levels of performance
comparable to computationally more expensive algorithms. The Bayes
Factor Surprise is related to but different from the Shannon Surprise. En
two hypothetical experiments, we make testable predictions for physio-
logical indicators that dissociate the Bayes Factor Surprise from the Shan-
non Surprise. The theoretical insight of casting various approaches as
surprise-based learning, as well as the proposed online algorithms, may
be applied to the analysis of animal and human behavior and to rein-
forcement learning in nonstationary environments.
*V.L. and A.M. made equal contributions to this article and are co-corresponding
autores.
Computación neuronal 33, 269–340 (2021)
https://doi.org/10.1162/neco_a_01352
© 2021 Instituto de Tecnología de Massachusetts
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
270
V. Liakoni, A. Modirshanechi, W.. Gerstner, y j. Brea
1 Introducción
animales, humanos, and similarly reinforcement learning agents may safely
assume that the world is stochastic and stationary during some intervals of
time interrupted by change points. The position of leaves on a tree, a stock
market index, or the time it takes to travel from A to B in a crowded city
is often well captured by stationary stochastic processes for extended peri-
ods of time. Then sudden changes may happen, such that the distribution
of leaf positions becomes different due to a storm, the stock market index
is affected by the enforcement of a new law, or a blocked road causes addi-
tional traffic jams. The violation of an agent’s expectation caused by such
sudden changes is perceived by the agent as surprise, which can be seen as
a measure of how much the agent’s current belief differs from reality.
Surprise, with its physiological manifestations in pupil dilation (Nassar
et al., 2012; Preuschoff, t Hart, & Einhauser, 2011) and EEG signals (Mars
et al., 2008; Modirshanechi, Kiani, & Aghajan, 2019; Ostwald et al., 2012),
is believed to modulate learning, potentially through the release of specific
neurotransmisores (Gerstner, Lehmann, Liakoni, Corneil, & Brea, 2018; Yu &
Dayán, 2005) so as to allow animals and humans to adapt quickly to sudden
cambios. The quick adaptation to novel situations has been demonstrated
in a variety of learning experiments (Behrens, lana rica, Walton, & Rush-
worth, 2007; Glaze, Kable, & Gold, 2015; Heilbron & Meyniel, 2019; Nassar
et al., 2012; Nassar, wilson, Heasly, & Gold, 2010; Yu & Dayán, 2005). El
bulk of computational work on surprise-based learning can be separated
into two groups. Studies in the field of computational neuroscience have
focused on biological plausibility with little emphasis on the accuracy of
aprendiendo (Behrens et al., 2007; Hombre rico, 2017; Faraji, Preuschoff, & Gerstner,
2018; Friston, 2010; Friston, FitzGerald, Rigoli, Schwartenbeck, & pezzulo,
2017; Nassar et al., 2012; Nassar, wilson, Heasly, & Gold, 2010; Ryali, Reddy,
& Yu, 2018; Schwartenbeck, FitzGerald, Dolan, & Friston, 2013; Yu & Dayán,
2005), whereas exact and approximate Bayesian online methods (Adams
& MacKay, 2007; Fearnhead & Liu, 2007) for change point detection and
parameter estimation have been developed without any focus on biologi-
cal plausibility (Aminikhanghahi & Cocinar, 2017; Cummings, Krehbiel, Mei,
Tuo, & zhang, 2018; lin, Sharpnack, Rinaldo, & Tibshirani, 2017; Masegosa
et al., 2017; wilson, Nassar, & Gold, 2010).
En este trabajo, we take a top-down approach to surprise-based learning.
We start with a generative model of change points similar to the one that has
been the starting point of multiple experiments (Behrens et al., 2007; Find-
ling, Chopin, & Koechlin, 2019; Glaze et al., 2015; Heilbron & Meyniel, 2019;
Nassar et al., 2012, 2010; Yu & Dayán, 2005). We demonstrate that Bayesian
inference on such a generative model can be interpreted as modulation of
learning by surprise; we show that this modulation leads to a natural def-
inition of surprise that is different from but closely related to the Shannon
Surprise (shannon, 1948). Además, we derive three novel approximate
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
Learning in Volatile Environments With the Bayes Factor Surprise
271
online algorithms with update rules that inherit the surprise-modulated
adaptation rate of exact Bayesian inference. The overall goal of this article
is to give a Bayesian interpretation for surprise-based learning in the brain
and to find approximate methods that are computationally efficient and bi-
ologically plausible while maintaining a high level of learning accuracy. Como
a by-product, our approach provides theoretical insights on commonalities
and differences among existing surprise-based and approximate Bayesian
approaches.
En tono rimbombante, our approach makes specific experimental
predicciones.
En la sección 2, we introduce the generative model and we present our
surprise-based interpretation of Bayesian inference and our three approxi-
mate algorithms. Próximo, we use simulations to compare our algorithms with
existing ones on two different tasks inspired by and closely related to real
experimentos (Behrens et al., 2007; Mars et al., 2008; Nassar et al., 2012, 2010;
Ostwald et al., 2012). At the end of section 2, we formalize two experimen-
tally testable predictions of our theory and illustrate them with simulations.
A brief review of related studies as well as a few directions for further work
are supplied in section 3.
2 Resultados
In order to study learning in an environment that exhibits occasional and
abrupt changes, we consider a hierarchical generative model (ver figura
1A) in discrete time, similar to existing model environments (Behrens et al.,
2007; Nassar et al., 2012, 2010; Yu & Dayán, 2005). At each time point t, el
observation Yt = y comes from a distribution with the time-invariant like-
lihood PY (y|i ) parameterized by (cid:3)t = θ , where both y and θ can be mul-
tidimensional. En general, we indicate random variables by capital letters
and values by lowercase letters. Whenever there is no risk of ambiguity, nosotros
drop the explicit notation of random variables to simplify notation. Abrupt
changes of the environment correspond to sudden changes of the parameter
θt. At every time t, there is a change probability pc ∈ (0, 1) for the parameter
θt to be drawn from its prior distribution π (0) independent of its previous
valor, and a probability 1 − pc to stay the same as θt−1. A change at time
t is specified by the event Ct = 1; de lo contrario, Ct = 0. Por lo tanto, the gener-
ative model can be formally defined, for any T ≥ 1, as a joint probability
distribution over (cid:3)
, . . . , (cid:3)t ), C1:t , and Y1:T as
1:T ≡ ((cid:3)
1
PAG(c1:t , i
1:t , y1:t ) =P(c1)PAG(i
1)PAG(y1
|i
1)
t(cid:2)
t=2
PAG(ct )PAG(θt|ct, θt−1)PAG(yt|θt ),
(2.1)
where P(i
1) = π (0)(i
1), PAG(c1) = re(c1
− 1), y
PAG(ct ) = Bernoulli(ct; pc),
(2.2)
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
272
V. Liakoni, A. Modirshanechi, W.. Gerstner, y j. Brea
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
= 1, the parameter (cid:3)
= θ , the observation Yt
Cifra 1: Nonstationary environment. (A) The generative model. At each time
∈ (0, 1) for a change in the environment. Cuando
point t there is a probability pc
there is a change in the environment, eso es, Ct
t is drawn
from its prior distribution π (0), independent of its previous value. Otherwise the
t retains its value from the previous time step t − 1. Given a parame-
value of (cid:3)
= y is drawn from a probability distribution
ter value (cid:3)
t
PY (y|i ). We indicate random variables by capital letters, and values by lower-
case letters. (B) Example of a nonstationary environment. Your friend meets you
every day at the coffee shop (blue dot) starting after work from her office (o-
ange dot) and crossing a river. The time of arrival of your friend is the observed
variable Yt, which due to the traffic or your friend’s workload may exhibit some
variability but has a stable expectation (i ). Si, sin embargo, a new bridge is opened
= 1 where t is the moment of change), your friend no longer needs to take a
(Ct
detour. There is, entonces, a sudden change in her observed daily arrival times.
PAG(θt|ct, θt−1) =
(cid:3)
δ(θt − θt−1)
Pi (0)(θt )
si
si
ct = 0,
ct = 1,
PAG(yt|θt ) = PY (yt|θt ).
(2.3)
(2.4)
Learning in Volatile Environments With the Bayes Factor Surprise
273
P stands for either probability density function (for the continuous vari-
ables) or probability mass function (for the discrete variables), and δ is the
Dirac or Kronecker delta distribution, respectivamente.
Given a sequence of observations y1:t, the agent’s belief π (t)(i ) about the
parameter (cid:3)t at time t is defined as the posterior probability distribution
PAG((cid:3)t = θ |y1:t ). In the online learning setting studied here, the agent’s goal
is to update the belief π (t)(i ) to the new belief π (t+1)(i ), or an approximation
thereof, upon observing yt+1.
A simplified real-world example of such an environment is illustrated in
Figura 1B. Imagine that every day, a friend meets you at the coffee shop,
starting after work from her office (ver Figura 1B, izquierda). para hacerlo, she needs
to cross a river via a bridge. The time of arrival of your friend (yt) exhibits
some variability due to various sources of stochasticity (p.ej., traffic and your
friend’s daily workload), but it has a stable average over time (θt). Sin embargo,
if a new bridge is opened, your friend arrives earlier, since she no longer
has to take a detour (ver Figura 1B, bien). The moment of opening the new
= 1 in our framework and the sudden change in
bridge is indicated by ct+1
the average arrival time of your friend by a sudden change from θt to θt+1.
Even without any explicit discussion with your friend about this situation
and only by observing her actual arrival time, you can notice the abrupt
change and hence adapt your schedule to the new situation.
2.1 Online Bayesian Inference Modulated by Surprise. According to
the definition of the hierarchical generative model (see Figure 1A and equa-
ciones 2.1 a 2.4), the value yt+1 of the observation at time t + 1 depends only
on the parameters θt+1, and is (given θt+1) independent of earlier observa-
tions and earlier parameter values. We exploit this Markovian property and
update, using Bayes’ rule, the belief π (t)(i ) ≡ P((cid:3)t = θ |y1:t ) at time t to the
new belief at time t + 1:
Pi (t+1)(i ) = PY (yt+1
|i )PAG((cid:3)t+1
|y1:t )
PAG(yt+1
= θ |y1:t )
.
(2.5)
Hasta ahora, equation 2.5 remains rather abstract. The aim of this section is to
rewrite it in the form of a surprise-modulated recursive update. The first
term in the numerator of equation 2.5 is the likelihood of the current ob-
servation given the parameter (cid:3)t+1
= θ , and the second term is the agent’s
estimated probability distribution of (cid:3)t+1 before observing yt+1. Porque
there is always the possibility of an abrupt change, the second term is
not the agent’s previous belief π (t) but P((cid:3)t+1
= θ |y1:t ) = (1 − pc)Pi (t)(i ) +
pcπ (0)(i ). Como resultado, it is possible to find a recursive formula for updating
the belief. For the derivation of this recursive rule, we define the following
terms.
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
274
V. Liakoni, A. Modirshanechi, W.. Gerstner, y j. Brea
Definición 1. The probability or density (for discrete and continuous variables
respectivamente) of observing y with a belief π (t(cid:5)
(cid:4)
) is denoted as
PAG(y; Pi (t(cid:5)
)) =
PY (y|i )Pi (t(cid:5)
)(i )dθ .
(2.6)
)) =P(Yt(cid:5)+1
)) for an arbitrary ˆπ (t(cid:5)
Note that if π is the exact Bayesian belief defined as above in equation 2.5,
= 0). En la sección 2.2 we will also use
). Two particularly interesting cases of equa-
; Pi (t)), the probability of a new observation yt+1 with the
; Pi (0)), the probability of a new observation
then P(y; Pi (t(cid:5)
PAG(y; ˆπ (t(cid:5)
ción 2.6 are P(yt+1
current belief π (t), y P(yt+1
yt+1 with the prior belief π (0).
= y|y1:t(cid:5) , ct(cid:5)+1
Definición 2. The Bayes Factor Surprise SBF of the observation yt+1 is defined as
= 1 (es decir., when there is a
the ratio of the probability of observing yt+1 given ct+1
= 0 (es decir., when there is no
cambiar), to the probability of observing yt+1 given ct+1
cambiar), eso es,
SBF(yt+1
; Pi (t)) =P(yt+1
PAG(yt+1
; Pi (0))
; Pi (t))
.
(2.7)
This definition of surprise measures how much more probable the cur-
rent observation is under the naive prior π (0) relative to the current be-
lief π (t) (mira la sección 3 for further interpretation). This probability ratio is
the Bayes factor (Efron & Hastie, 2016; Kass & Raftery, 1995) that tests the
prior belief π (0) against the current belief π (t). We emphasize that our def-
inition of surprise is not arbitrary, but essential in order to write the exact
inference in equation 2.5 on the generative model in the compact recursive
form indicated in the proposition that follows. Además, as we show later,
this term can be identified in multiple learning algorithms (among them,
Nassar et al., 2010, 2012), but it has never been interpreted as a surprise
measure. In the following sections we establish the generality of this com-
putational mechanism and identify it as a common feature of many learning
algoritmos.
Definición 3. Under the assumption of no change ct+1
recent belief π (t) as prior, the exact Bayesian update for π (t+1) is denoted as
= 0, and using the most
Pi (t+1)
B
(i ) = PY (yt+1
PAG(yt+1
|i )Pi (t)(i )
; Pi (t))
.
(2.8)
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
Pi (t+1)
B
(i ) describes the incorporation of the new information into the cur-
rent belief via Bayesian updating.
Learning in Volatile Environments With the Bayes Factor Surprise
275
Definición 4. The surprise-modulated adaptation rate is a function γ : R+ ×
R+ → [0, 1] specified as
γ (S, metro) = mS
1 + mS
,
(2.9)
where S ≥ 0 is a surprise value and m ≥ 0 is a parameter controlling the effect of
surprise on learning.
Using the above definitions and equation 2.5, we have for the generative
model of Figure 1A and equations 2.1 a 2.4 the following proposition:
Proposition. Exact Bayesian inference on the generative model is equivalent to
the recursive update rule,
Pi (t+1)(i ) =
(cid:5)
(cid:5)
1 − γ
(cid:5)
S(t+1)
BF
,
pc
1 − pc
(cid:6)
(cid:6)(cid:6)
Pi (t+1)
B
(i )
+ γ
S(t+1)
BF
,
pc
1 − pc
PAG(i |yt+1),
where S(t+1)
BF
= SBF(yt+1
; Pi (t)) is the Bayes Factor Surprise, y
PAG(i |yt+1) = PY (yt+1
PAG(yt+1
|i )Pi (0)(i )
; Pi (0))
(2.10)
(2.11)
is the posterior if we take yt+1 as the only observation.
The proposition indicates that the exact Bayesian inference on the gen-
erative model discussed above (ver figura 1) leads to an explicit trade-off
entre (1) integrating a new observation ynew (corresponding to yt+1) con
the old belief π old (corresponding to π (t)) into a distribution π integration (cor-
responding to π (t+1)
) y (2) forgetting the past observations, so as to restart
with the belief π reset (corresponding to P(i |yt+1)) which relies only on the
new observation and the prior π (0):
B
π new(i ) = (1 − γ ) π integration(i |ynew, π old) + γ π reset(i |ynew, Pi (0)).
(2.12)
This trade-off is governed by a surprise-modulated adaptation rate
γ (S, metro) ∈ [0, 1], where S = SBF
≥ 0 (corresponding to the Bayes Factor Sur-
prise) can be interpreted as the surprise of the most recent observation, y
m = pc
≥ 0 is a parameter controlling the effect of surprise on learning.
1−pc
Because the parameter of modulation m is equal to pc
, for a fixed value of
1−pc
surprise S, the adaptation rate γ is an increasing function of pc. Por lo tanto, en
more volatile environments, the same value of surprise S leads to a higher
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
276
V. Liakoni, A. Modirshanechi, W.. Gerstner, y j. Brea
adaptation rate than in a less volatile environment; in the case of pc → 1,
any surprise value leads to full forgetting, γ = 1.
As a conclusion, our first main result is that a split as in equation 2.12
with a weighting factor (“adaptation rate” γ ) as in equation 2.9 is exact and
always possible for the class of environments defined by our hierarchical
generative model. This surprise modulation gives rise to specific testable
experimental predictions discussed later.
2.2 Approximate Algorithms Modulated by Surprise. Despite the sim-
plicity of the recursive formula in equation 2.10, the updated belief π (t+1) es
generally not in the same family of distributions as the previous belief π (t),
Por ejemplo, the result of averaging two normal distributions is not a nor-
mal distribution. Hence it is in general impossible to find a simple and exact
update rule for, decir, some sufficient statistic. Como consecuencia, the mem-
ory demands for π (t+1) scale linearly in time, and updating π (t+1) using π (t)
needs O(t) operaciones. In the following sections, we investigate three ap-
proximations, algoritmos 1 a 3, that have simple update rules and finite
memory demands, so that the updated belief remains tractable over a long
sequence of observations.
As our second main result, we show that all our three approximate al-
gorithms inherit the surprise-modulated adaptation rate from the exact
Bayesian approach, ecuaciones 2.9 y 2.12. The first algorithm adapts an
earlier algorithm of surprise minimization learning (SMiLe, Faraji et al.,
2018) to variational learning. We refer to our novel algorithm as Variational
SMiLe and abbreviate it as VarSMiLe (see algorithm 1). The second algo-
rithm is based on message passing (Adams & MacKay, 2007) restricted to
a finite number of messages N. We refer to this algorithm as MPN (see al-
gorithm 2). The third algorithm uses the ideas of particle filtering (gordon,
Salmond, & Herrero, 1993) for an efficient approximation for our hierarchical
generative model. We refer to our approximate algorithm as Particle Fil-
tering with N particles and abbreviate it by pfN (see algorithm 3). All al-
gorithms are computationally efficient, have finite memory demands, y
are biologically plausible; Particle Filtering has possible neuronal imple-
mentations (Huang & Rao, 2014; Kutschireiter, Surace, Vocero, & Pfister,
2017; Legenstein & Maass, 2014; Shi & Griffiths, 2009), MPN can be seen
as a greedy version of pfN without sampling, and Variational SMiLe may
be implemented by neo-Hebbian (Gerstner et al., 2018; Lisman, Grace, &
Duzel, 2011) update rules. Simulation results show that the performance of
the three approximate algorithms is comparable to and more robust across
environments than other state-of-the-art approximations.
2.2.1 Variational SMiLe Rule: Algoritmo 1. A simple heuristic approxi-
mation to keep the updated belief in the same family as the previous be-
liefs consists in applying the weighted averaging of the exact Bayesian up-
date rule (see equation 2.10) to the logarithm of the beliefs rather than the
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
Learning in Volatile Environments With the Bayes Factor Surprise
277
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
beliefs themselves:
(cid:7)
(cid:8)
ˆπ (t+1)(i )
registro
= (1 − γt+1) registro
(cid:7)
ˆπ (t+1)
B
(cid:8)
(i )
+ γt+1 log
(cid:8)
(cid:7)
PAG(i |yt+1)
+ Const.,
(2.13)
(cid:7)
(cid:8)
; ˆπ (t)), metro
= γ
SBF(yt+1
where γt+1
is given by equation 2.9 with a free pa-
rameter m > 0, which can be tuned to each environment. By doing so, nosotros
still have the explicit trade-off between two terms as in equation 2.10, pero
in the logarithms, yet an advantageous consequence of averaging over log-
arithms is that if the likelihood function PY is in the exponential family and
the initial belief π (0) is its conjugate prior, then ˆπ (t+1) and π (0) are members
of the same family. In this particular case, we arrive at a simple update rule
for the parameters of ˆπ (t+1) (see algorithm 1 for pseudocode and section 4
for details). As it is common in variational approaches (Beal, 2003), the price
of this simplicity is that except for the trivial cases of pc = 0 and pc = 1, allá
is no evidence other than simulations that the update rule of equation 2.13
will end up at an approximate belief close to the exact Bayesian belief.
One way to interpret the update rule of equation 2.13 is to rewrite it as
the solution of a constraint optimization problem. The new belief ˆπ (t+1) es un
variational approximation of the Bayesian update ˆπ (t+1)
(mira la sección 4)
B
ˆπ (t+1)(i ) = arg min
q
(cid:9)
q(i )|| ˆπ (t+1)
B
(cid:10)
(i )
,
DKL
(2.14)
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
278
V. Liakoni, A. Modirshanechi, W.. Gerstner, y j. Brea
with a family of functions q(i ) constrained by the Kullback-Leibler diver-
gence,
(cid:10)
(cid:9)
q(i )||PAG(i |yt+1)
DKL
≤ Bt+1
,
(2.15)
where the bound Bt+1
tion of the Bayes Factor Surprise SBF(yt+1
PAG(i |yt+1) is given by equation 2.11.
B
(cid:9)
0, DKL[ ˆπ (t+1)
(cid:10)
(i )||PAG(i |yt+1)]
∈
is a decreasing func-
; ˆπ (t)) (mira la sección 4 for proof), y
Because of the similarity of the constraint optimization problem in equa-
ciones 2.14 y 2.15 to the Surprise Minimization Learning rule, (SMiLe)
(Faraji et al., 2018), we call this algorithm the Variational Surprise Minimiza-
tion Learning rule, or for short, the Variational SMiLe rule. The differences
between SMiLe and variational SMiLe are discussed in section 4.
Our variational method, and particularly its surprise-modulated adapta-
tion rate, is complementary to earlier studies (Masegosa et al., 2017; Özkan,
Šmídl, saha, Lundquist, & Gustafsson, 2013) in machine learning that as-
sumed different generative models and used additional assumptions and
different approaches for deriving the learning rule.
2.2.2 Message-Passing N: Algoritmo 2. For a hierarchical generative
model similar to ours, a message-passing algorithm has been used to per-
form exact Bayesian inference (Adams & MacKay, 2007), where the algo-
rithm’s memory demands and computational complexity scale linearly in
time t. En esta sección, we first explain the idea of the message-passing al-
gorithm of Adams and MacKay (2007) and its relation to our proposition.
We then present our approximate version of this algorithm, que tiene un
constant (in time) computational complexity and memory demands.
The history of change points up to time t is a binary sequence, for ex-
amplio, c1:t= {1, 0, 0, 1, 0, 1, 1}, where the value 1 indicates a change in the
corresponding time step. Following the idea of Adams and MacKay (2007),
we define the random variable Rt = min{n ∈ N : Ct−n+1
= 1} in order to de-
scribe the time since the last change point, which takes values between 1
y T. We can write the exact Bayesian expression for π (t)(i ) by marginaliz-
ing P((cid:3)t = θ , rt|y1:t ) over the t possible values of rt in the following way:
Pi (t)(i ) =
t−1(cid:11)
k=0
PAG(Rt = t − k|y1:t )PAG((cid:3)t = θ |Rt = t − k, y1:t ).
(2.16)
For consistency with algorithm 3 (es decir. Particle Filtering), we call each term in
the sum of equation 2.16 a “particle” and denote as π (t)
k (i ) =P((cid:3)t = θ |Rt =
t − k, y1:t ) the belief of the particle corresponding to Rt = t − k, and w(k)
=
t
PAG(Rt = t − k|y1:t ) its corresponding weight at time t:
Pi (t)(i ) =
t−1(cid:11)
k=0
w(k)
t
Pi (t)
k (i ).
(2.17)
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
Learning in Volatile Environments With the Bayes Factor Surprise
279
For each particle, the term π (t)
k (i ) is simple to compute, because when rt is
conocido, inference depends only on the observations after the last change
punto. Por lo tanto, the goal of online inference is to find an update rule for
the evolution of the weights w(k)
t over time.
We can apply the exact update rule of our proposition (see equation 2.10)
to the belief expressed in the form of equation 2.17. Upon each observation
of a new sample yt+1, a new particle is generated and added to the set of
particles, corresponding to P(i |yt+1) (eso es, π reset), modeling the possibil-
ity of a change point occurring at t + 1. According to the proposition, el
weight of the new particle (k = t) is equal to
,
(2.18)
w(t)
t+1
= γt+1
(cid:7)
(cid:8)
= γ
; Pi (t)),
where γt+1
SBF(yt+1
(cf. equation 2.9). The other t particles
coming from π (t) correspond to π (t+1)
(eso es, π integration) in the proposition.
The update rule (mira la sección 4 for derivation) for the weights of these parti-
cles (0 ≤ k ≤ t − 1) es
pc
1−pc
B
w(k)
t+1
= (1 − γt+1)w(k)
B,t+1
= (1 − γt+1)
PAG(yt+1
PAG(yt+1
; Pi (t)
k )
; Pi (t))
w(k)
t
.
(2.19)
Hasta ahora, we used the idea of Adams and MacKay (2007) to write the belief as in
equation 2.17 and used our proposition to arrive at the surprise-modulated
update rules in equations 2.18 y 2.19.
t
The computational complexity and memory requirements of the com-
plete message-passing algorithm increase linearly with time t. To deal with
this issue and to have a constant computation and memory demands over
tiempo, we implemented a message-passing algorithm of the form of equa-
ciones 2.17 a 2.19, but with a fixed number N of particles, chosen as those
with the highest weights w(k)
. Por lo tanto, our second algorithm adds a
new approximation step to the full message-passing algorithm of Adams
and MacKay (2007): Whenever t > N, after adding the new particle with
the weight as in equation 2.18 and updating the previous weights as in
equation 2.19, we discard the particle with the smallest weight (es decir., colocar
its weight equal to 0) and renormalize the weights. By doing so, we al-
ways keep the number of particles with nonzero weights equal to N. Nota
that for t ≤ N, our algorithm is exact and identical to the message-passing
algorithm of Adams and MacKay (2007). We call our modification of the
message-passing algorithm of Adams and MacKay (2007) “Message Pass-
ing N (MPN)."
To deal with the computational complexity and memory requirements,
one may alternatively keep only the particles with weights greater than
a cut-off threshold (Adams & MacKay, 2007). Sin embargo, such a constant
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
280
V. Liakoni, A. Modirshanechi, W.. Gerstner, y j. Brea
cut-off leads to a varying number (smaller or equal to t) of particles in time.
Our approximation MPN can therefore be seen as a variation of the thresh-
olding algorithm in Adams and MacKay (2007) with fixed number of parti-
cles N, and hence a variable cut-off threshold. The work of Fearnhead and
Liu (2007) follows the same principle as Adams and MacKay (2007), but em-
ploys stratified resampling to eliminate particles with negligible weights in
order to reduce the total number of particles. Their resampling algorithm
involves solving a complicated nonlinear equation at each time step, cual
makes it unsuitable for a biological implementation. Además, we experi-
enced that in some cases, the small errors introduced in the resampling step
of the algorithm of Fearnhead and Liu (2007) accumulated and led to worse
performance than our MPN algorithm, which simply keeps the N particles
with the highest weight at each time step.
For the case where the likelihood function PY (y|i ) is in the exponential
family and π (0) is its conjugate prior, the resulting algorithm of MPN has
a simple update rule for the belief parameters (see algorithm 2 and sec-
ción 4 for details). For comparison, we also implemented in our simulations
the full message-passing algorithm of Adams and MacKay (2007) con un
almost zero cut-off (machine precision), which we consider as our bench-
mark “Exact Bayes,” as well as the stratified optimal resampling algorithm
of Fearnhead and Liu, which we call “SORN.”
2.2.3 Particle Filtering: Algoritmo 3. Ecuación 2.17 demonstrates that the
exact Bayesian belief π (t) can be expressed as the marginalization of P((cid:3)t=
i , rt|y1:t ) over the time since the last change point rt. Equivalently, uno
can compute the exact Bayesian belief as the marginalization of P((cid:3)t=
i , c1:t|y1:t ) over the history of change points c1:t:
Pi (t)(i ) =
(cid:11)
c1:t
PAG(c1:t|y1:t )PAG((cid:3)t = θ |c1:t, y1:t )
= mi
PAG(C1:t |y1:t )
(cid:9)
(cid:10)
PAG((cid:3)t = θ |C1:t, y1:t )
.
(2.20)
The idea of our third algorithm is to approximate this expectation by par-
ticle filtering, eso es, sequential Monte Carlo sampling (Doucet, Godsill, &
Andrieu, 2000; Gordon et al., 1993) from P(C1:t|y1:t ).
We then approximate π (t) por
ˆπ (t)(i ) =
norte(cid:11)
yo=1
w(i)
t
ˆπ (t)
i
(i ) =
norte(cid:11)
yo=1
t P((cid:3)t = θ |C(i)
w(i)
1:t
, y1:t ),
(2.21)
dónde {C(i)
1:t
ticles) drawn from a proposal distribution Q(c1:t|y1:t ), {w(i)
t
}norte
i=1 is a set of N realizations (or samples) of c1:t (es decir., N par-
}norte
i=1 are their
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
Learning in Volatile Environments With the Bayes Factor Surprise
281
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
, y1:t ) is the ap-
corresponding weights at time t, and ˆπ (t)
proximate belief corresponding to particle i.
i
(i ) =P((cid:3)t = θ |C(i)
1:t
Upon observing yt+1, the update procedure for the approximate belief
ˆπ (t+1) of equation 2.21 includes two steps: (1) updating the weights and
(2) sampling the new hidden state ct+1 for each particle. The two steps
are coupled together through the choice of the proposal distribution Q, para
which we choose the optimal proposal function (Doucet et al., 2000; ver
sección 4). Como resultado, given this choice of proposal function, we show (ver
282
V. Liakoni, A. Modirshanechi, W.. Gerstner, y j. Brea
sección 4) that the first step amounts to
w(i)
t+1
= (1 − γt+1)w(i)
B,t+1
+ γt+1
w(i)
t
,
w(i)
B,t+1
=P(yt+1
PAG(yt+1
(cid:7)
; ˆπ (t)
)
i
; ˆπ (t))
w(i)
t
,
(2.22)
(cid:8)
; ˆπ (t)), metro
= γ
B,t+1
SBF(yt+1
with m = pc
1−pc
(cf. equation 2.9), y
i=1 are the weights corresponding to the Bayesian update ˆπ (t+1)
}norte
where γt+1
{w(i)
de
equation 2.8 (mira la sección 4). In the second step, we update each particle’s his-
tory of change points by going from the sequence {C(i)
}norte
yo=1, para
1:t
which we always keep the old sequence up to time t, and for each particle
i, we add a new element c(i)
, 0]
t+1
= [C(i)
or change c(i)
, 1]. Nota, sin embargo, that it is not needed to keep the
1:t
whole sequences c(i)
t+1 to update
to ˆπ (t+1)
ˆπ (t)
t+1 from the optimal proposal
i
i
distribution Q(C(i)
, y1:t+1) (Doucet et al., 2000), which is given by (ver
t+1
sección 4)
1:t+1 in memory, but instead one can use c(i)
∈ {0, 1} representing no change c(i)
. We sample the new element c(i)
i=1 to {C(i)
}norte
= [C(i)
1:t
|C(i)
1:t
1:t+1
1:t+1
1:t+1
B
q(C(i)
t+1
= 1|C(i)
1:t
, y1:t+1) = γ
SBF(yt+1
; ˆπ (t)
i
),
(cid:5)
(cid:6)
.
pc
1 − pc
(2.23)
Curiosamente, the above formulas entail the same surprise modulation
and the same trade-off as proposed by the proposition’s equation 2.10. Para
the weight update, there is a trade-off between an exact Bayesian update
and keeping the value of the previous time step, controlled by a adapta-
tion rate modulated exactly in the same way as in equation 2.10. Tenga en cuenta que
in contrast to equation 2.10, the trade-off for the particles’ weights is not
between forgetting and integrating, but between maintaining the previous
knowledge and integrating. Sin embargo, the change probability (see equation
2.23) for sampling is equal to the adaptation rate and is an increasing func-
tion of surprise. Como resultado, although the weights are updated less for sur-
prising events, a higher surprise causes a higher probability for change,
indicated by c(i)
= 1, which implies forgetting, because for a particle i
t+1
with c(i)
, y1:t+1)
=P((cid:3)t+1
t+1
is equal to P((cid:3)t+1
= 1, yt+1) =P(i |yt+1) (ver Figura 1A), cual es
equivalent to a reset of the belief as in equation 2.12. En otras palabras, mientras
in MPN and the exact Bayesian inference in the proposition’s equation
2.10, the trade-off between integration and reset is accomplished by adding
at each time step a new particle with weight γt+1, in Particle Filtering, él
is accomplished via sampling. As a conclusion, the above formulas are
essentially the same as the update rules of MPN (see equations 2.19 y
= 1, the associated belief ˆπ (t+1)
= θ |C(i)
t+1
= θ |C(i)
t+1
= 1, C(i)
1:t
i
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
Learning in Volatile Environments With the Bayes Factor Surprise
283
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
2.18) and have the same spirit as the recursive update of the proposition’s
equation 2.10.
Ecuaciones 2.21 y 2.23 can be applied to the case where the likelihood
function PY (y|i ) is in the exponential family and π (0) is its conjugate prior.
The resulting algorithm (algoritmo 3) has a particularly simple update rule
for the belief parameters (mira la sección 4 for details).
284
V. Liakoni, A. Modirshanechi, W.. Gerstner, y j. Brea
The theory of particle filter methods is well established (Doucet et al.,
2000; Gordon et al., 1993; Särkkä, 2013). Particle filters in simpler (Marrón
& Steyvers, 2009) or more complex (Findling et al., 2019) forms have also
been employed to explain human behavior. Here we derived a simple par-
ticle filter for the general case of generative models of equations 2.2 a 2.4.
Our main contribution is to show that the use of the optimal proposal dis-
tribution in this particle filter leads to a surprise-based update scheme.
2.2.4 Surprise-Modulation as a Framework for Other Algorithms. Other ex-
isting algorithms (Adams & MacKay, 2007; Faraji et al., 2018; Fearnhead &
Liu, 2007; Nassar et al., 2012, 2010) can also be formulated in the surprise-
modulation framework of equations 2.9 y 2.12 (mira la sección 4). Además,
in order to allow for a transparent discussion and for fair comparisons in
simulations, we extended the algorithms of Nassar et al. (2012, 2010) to a
more general setting. Here we give a brief summary of the algorithms we
consideró. A detailed analysis is provided in section 4.5.
The algorithms of Nassar et al. (2012, 2010) were originally designed for
a gaussian estimation task (mira la sección 2.3 for details of the task) con un
broad uniform prior. We extended them to the more general case of gaus-
sian tasks with gaussian priors, and we call our extended versions Nas10∗
and Nas12∗ for Nassar et al. (2010) and Nassar et al. (2012), respectivamente
(for a performance comparison between our extended algorithms and their
original versions see Figures 12 y 13). Both algorithms have the same
surprise-modulation as in our proposition (see equation 2.10). Hay
multiple interpretations of the approaches of Nas10∗ and Nas12∗ and links
to other algorithms. One such link we identify is in relation to Particle Fil-
tering with a single particle (pf1). More specifically, one can show that pf1
behaves in expectation similar to Nas10∗ and Nas12∗ (mira la sección 4 y el
appendix).
2.3 Simulaciones. With the goal of gaining a better understanding of
different approximate algorithms, we evaluated the departure of their
performance from the exact Bayesian algorithm in terms of mean squared
error (MSE) of estimating (cid:3)t (mira la sección 4), on two tasks inspired by and
closely related to real experiments (Behrens et al., 2007; Maheu, Dehaene,
& Meyniel, 2019; Mars et al., 2008; Nassar et al., 2012, 2010; Ostwald et al.,
2012): a gaussian and a categorical estimation task.
We compared our three novel algorithms VarSMiLe, Particle Filtering
(pfN, where N is the number of particles), and Message Passing with fi-
nite number of particles N (MPN) to the online exact Bayesian Message
Passing algorithm (Adams & MacKay, 2007) (Exact Bayes), which yields the
optimal solution with ˆ(cid:3)t = ˆ(cid:3)Opt
. Además, we included in the compar-
ison the stratified optimal resampling algorithm (Fearnhead & Liu, 2007)
(SORN, where N is the number of particles), our variant of Nassar et al.
t
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
Learning in Volatile Environments With the Bayes Factor Surprise
285
∗
∗
) and Nassar et al. (2012) (Nas12
(2010) (Nas10
), the Surprise-Minimization
Learning algorithm of Faraji et al. (2018) (SMiLe), as well as a simple Leaky
Integrator (Leaky; mira la sección 4).
The algorithms Exact Bayes and SORN come from the field of change
point detection, and whereas the former has high memory demands, el
latter has the same memory demands as our algorithms pfN and MPN. El
∗
algorithms Nas10
, and SMiLe, por otro lado, come from the
human learning literature and are more biologically oriented.
, Nas12
∗
∗
|μt+1
2.3.1 Gaussian Estimation Task. The task is a generalized version of the
experiment of Nassar et al. (2012, 2010). The goal of the agent is to estimate
the mean θt = μt of observed samples, which are drawn from a gaussian
distribution with known variance σ 2: yt+1
, pag 2). The mean
μt+1 is itself drawn from a gaussian distribution μt+1
∼ N (0, 1) whenever
the environment changes. En otras palabras, the task is a special case of the
generative model of equations 2.2 a 2.4, with π (0)(μt ) = N (μt; 0, 1) y
PY (yt|μt ) = N (yt; μt, pag 2). An example of the task is shown in Figure 2A.
∼ N (μt+1
∗
and Nas12
We simulated the task for all combinations of σ ∈ {0.1, 0.5, 1, 2, 5} y
pc ∈ {0.1, 0.05, 0.01, 0.005, 0.001, 0.0001}. For each combination of σ and pc,
we first tuned the free parameter of each algorithm—m for SMiLe and Vari-
ational SMiLe, the leak parameter for the Leaky Integrator, and the pc of
Nas10
—by minimizing the MSE on three random initializa-
tions of the task. For the Particle Filter (pfN), the Exact Bayes, the MPN, y
the SORN, we empirically checked that the true pc of the environment was
indeed the value that gave the best performance, and we used this value for
the simulations. We evaluated the performance of the algorithms on 10 dif-
ferent random task instances of 105 steps each for pc ∈ {0.1, 0.05, 0.01, 0.005}
y 106 steps each for pc ∈ {0.001, 0.0001} (in order to sample more change
puntos). Note that the parameter σ is not estimated, and its actual value is
used by all algorithms except the Leaky Integrator.
In Figure 2B we show the MSE[ ˆ(cid:3)t|Rt = n] in estimating the parameter
after n steps since the last change point, for each algorithm, computed over
multiple changes, for two exemplar task settings. The Particle Filter with
∗
20 particles (pf20), the VarSMiLe, and the Nas12
have an overall perfor-
mance very close to that of the Exact Bayes algorithm (MSE[ ˆ(cid:3)Opt
|Rt = n]),
with much lower memory requirements. VarSMiLe sometimes slightly out-
performs the other two early after an environmental change (see Figure 2B,
bien), but shows slightly higher error values at later phases. The MPN algo-
rithm is the closest one to the optimal solution (es decir., Exact Bayes) for low σ
(see Figure 2B, izquierda), but its performance is much worse for the case of high
σ and low pc (see Figure 2B, bien). For SORN we observe a counter intu-
itive behavior in the regime of low σ : the inclusion of more particles leads
to worse performance (see Figure 2B, izquierda). At higher σ levels, the perfor-
mance of SOR20 is close to optimal and better than the MP20 in later time
t
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
286
V. Liakoni, A. Modirshanechi, W.. Gerstner, y j. Brea
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
− μ
t )2/2pag 2) with changing mean μ
t ). In this example, σ = 1 and pc
Cifra 2: Gaussian estimation task: Transient performance after changes. (A) En
each time step an observation (depicted as black dot) is drawn from a gaussian
distribution ∼ exp(−(yt
t (marked in blue)
and known variance σ 2 (lower left panels). At every change of the environment
(marked with red lines), a new mean μ
t is drawn from a standard gaussian dis-
= 0.01. (B) Mean squared
tribution ∼ exp(−μ2
error for the estimation of μ
t at each time step n after an environmental change,
= 0.1 (left panel)
= norte] con el tiempo; σ = 0.1, pc
eso es, the average of MSE[ ˆ(cid:3)
t
= 0.01 (right panel). The shaded area corresponds to the standard
and σ = 5, pc
error of the mean. Abbreviations: pfN: Particle Filtering with N particles; MPN:
Message Passing with N particles; VarSMiLe: Variational SMiLe; SORN: Strat-
isfied Optimal Resampling with N particles (Fearnhead & Liu, 2007); SMiLe:
Faraji et al. (2018); Nas10∗, Nas12∗: Variants of Nassar et al. (2010) and Nas-
sar et al. (2012), respectivamente; Leaky: Leaky Integrator, Exact Bayes: Adams and
MacKay (2007).
|Rt
steps. This may be due to the fact that the MPN discards particles in a de-
terministic and greedy way (es decir., the one with the lowest weight), mientras
for the SORN, there is a component of randomness in the process of par-
ticle elimination, which may be important for environments with higher
stochasticity.
For the Leaky Integrator, we observe a trade-off between good perfor-
mance in the transient phase and the stationary phase; a fixed leak value
cannot fulfill both requirements. The SMiLe rule, by construction, never
narrows its belief ˆπ (i ) below some minimal value, which allows it to have
a low error immediately after a change but leads later to high errors. Its
∗
performance deteriorates for higher σ (see Figure 2B, bien). The Nas10
Learning in Volatile Environments With the Bayes Factor Surprise
287
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
Cifra 3: Gaussian estimation task: Steady-state performance. (A) Significar
squared error of the Exact Bayes algorithm (es decir., optimal solution) for each com-
bination of σ and pc averaged over time. (B–F) Difference between the mean
squared error of each algorithm and the optimal solution (of panel A), eso es,
el promedio de (cid:8)MSE[ ˆ(cid:3)
t] con el tiempo. The color bar of panel A applies to these
panels as well. Note that the black color for the MP20 indicates negative values,
which are due to the finite sample size for the estimation of MSE. Abbreviations:
See the Figure 2 caption.
performs well for low but not for higher values of σ . Despite the fact that
∗
a Particle Filter with one particle (pf1) is in expectation similar to Nas10
∗
and Nas12
(mira la sección 4), it performs worse than these two algorithms on
trial-by-trial measures. Still, it performs better than the MP1 and identically
to the SOR1.
In Figure 3A, we have plotted the average of MSE[ ˆ(cid:3)Opt
] of the Exact
Bayes algorithm over the whole simulation time for each of the consid-
ered σ and pc levels. The difference between the other algorithms and this
benchmark is called (cid:8)MSE[ ˆ(cid:3)t] (mira la sección 4) and is plotted in Figures
3C to 3F. All algorithms except for the SOR20 have lower average error
t
288
V. Liakoni, A. Modirshanechi, W.. Gerstner, y j. Brea
values for low σ and low pc than high σ and high pc. The Particle Filter pf20
and the Message Passing MP20 have the smallest difference from the opti-
mal solution. The average error of MP20 is higher than that of pf20 for high
σ and low pc, whereas pf20 is more robust across levels of environmental
parámetros. The worst-case performance for pf20 is (cid:8)MSE[ ˆ(cid:3)t] = 0.033 para
σ = 5 and pc = 0.0001, and for SOR20, es (cid:8)MSE[ ˆ(cid:3)t] = 0.061 for σ = 0.1
and pc = 0.1. The difference between these two worst-case scenarios is sig-
−6, two-sample t-test, 10 random seeds for each
nificant (p-value = 2.79 × 10
∗
algoritmo). Next in performance are the algorithms, Nas12
and VarSMiLe.
VarSMiLe exhibits its largest deviation from the optimal solution for high
σ and low pc, but is still more resilient compared to the MPN algorithms
for this type of environment. Among the algorithms with only one unit of
∗
memory demands—pf1, MP1, SOR1, VarSMiLe, SMiLe, Leaky, Nas10
y
Nas12
. The SOR20 has low error
overall but unexpectedly high error for environmental settings that are pre-
sumably more relevant for biological agents (intervals of low stochasticity
marked by abrupt changes). The simple Leaky Integrator performs well at
low σ and pc but deviates more from the optimal solution as these parame-
ters increase (see Figure 3F). The SMiLe rule performs best at lower σ , eso
es, more deterministic environments.
∗
—the winners are VarSMiLe and Nas12
A summary graph, where we collect the (cid:8)MSE[ ˆ(cid:3)t] across all levels of σ
∗
and pc, se muestra en la figura 4. We can see that pf20, Nas12
, and VarSMiLe
give the lowest worst case (lowest maximum value) (cid:8)MSE[ ˆ(cid:3)t] and are sta-
tistically better than the other eight algorithms (the error bars indicate the
standard error of the mean across the ten random task instances).
∗
|pt+1
2.3.2 Categorical Estimation Task. The task is inspired by the experiments
of Behrens et al. (2007), Maheu et al. (2019), Mars et al. (2008), and Ostwald
et al. (2012). The goal of the agent is to estimate the occurrence probabil-
∈ {1, . . . , 5} is drawn from
ity of five possible states. Each observation yt+1
a categorical distribution with parameters θt+1
∼
= pt+1, eso es, yt+1
= 1 in the environment, el
Cat(yt+1
parameters pt+1 are drawn from a Dirichlet distribution Dir(s · 1), dónde
s ∈ (0, ∞) is the stochasticity parameter. In relation to the generative model
of equations 2.2 a 2.4, we thus have π (0)(pt ) = Dir(pt; s · 1) and PY (yt|pt ) =
Cat(yt; pt ). An illustration of this task is depicted in Figure 5A.
; pt+1). When there is a change Ct+1
We considered the combinations of stochasticity levels s ∈ {0.01, 0.1, 0.14,
0.25, 1, 2, 5} and change probability levels pc ∈ {0.1, 0.05, 0.01, 0.005, 0.001,
0.0001}. The algorithms of Nassar et al. (2012, 2010) were specifically devel-
oped for a gaussian estimation task and cannot be applied here. All other
algorithms were first optimized for each combination of environmental pa-
rameters before an experiment starts, and then evaluated on 10 diferente
random task instances, para 105 steps each for pc ∈ {0.1, 0.05, 0.01, 0.005} y
106 steps each for pc ∈ {0.001, 0.0001}. The parameter s is not estimated, y
its actual value is used by all algorithms except the Leaky Integrator.
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
Learning in Volatile Environments With the Bayes Factor Surprise
289
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
Cifra 4: Gaussian estimation task: Steady-state performance summary. Dif-
ference between the mean squared error of each algorithm and the optimal
solución (Exact Bayes), eso es, el promedio de (cid:8)MSE[ ˆ(cid:3)
t] con el tiempo, for all combi-
nations of σ and pc together. For each algorithm we plot the 30 valores (5 σ times
6 pc values) of Figure 3 with respect to randomly jittered values in the x-axis. El
color coding is the same as in Figure 2. The error bars mark the standard error
of the mean across 10 random task instances. The difference between the worst
case of SOR20 and pf20 is significant (p-value = 2.79 × 10−6, two-sample t-test,
10 random seeds for each algorithm). Abbreviations: See the Figure 2 caption.
The Particle Filter pf20, the MP20, and the SOR20 have a performance
closest to that of Exact Bayes, the optimal solution (see Figure 5B). VarSMiLe
is the next in the ranking, with a behavior after a change similar to the gaus-
sian task. pf20 performs better for s > 2, and MP20 performs better for s ≤ 2
(ver figura 6). For this task, the biologically less plausible SOR20 is the win-
ner in performance, and it behaves most consistently across environmental
−5 for s =
parámetros. Its worst-case performance is (cid:8)MSE[ ˆ(cid:3)t] = 8.16 × 10
2 and pc = 0.01, and the worst-case performance for pf20 is (cid:8)MSE[ ˆ(cid:3)t] =
−12, two-sample
0.0048 for s = 0.25 and pc = 0.005 (p-value = 1.148 × 10
prueba t, 10 random seeds for each algorithm). For all the other algorithms
except MP20, the highest deviations from the optimal solution are ob-
served for medium stochasticity levels (see Figures 6B to 6F). Cuando el
290
V. Liakoni, A. Modirshanechi, W.. Gerstner, y j. Brea
Cifra 5: Categorical estimation task: Transient performance after changes.
(A) At each time step, the agent sees one out of five possible categories (negro
dots) drawn from a categorical distribution with parameters pt . Occasional
abrupt changes happen with probability pc and are marked with red lines. Af-
ter each change, a new pt vector is drawn from a Dirichlet distribution with
= 0.01. (B) Mean squared
stochasticity parameter s. In this example, s = 1 and pc
error for the estimation of pt at each time step n after an environmental change,
= 0.01 (left panel)
= norte] con el tiempo; s = 0.14, pc
eso es, the average of MSE[ ˆ(cid:3)
t
= 0.005 (right panel). The shaded area corresponds to the stan-
and s = 5, pc
dard error of the mean. Abbreviations: pfN: Particle Filtering with N parti-
cles; MPN: Message Passing with N particles; VarSMiLe: Variational SMiLe;
SORN: Stratisfied Optimal Resampling with N particles (Fearnhead & Liu,
2007); SMiLe: Faraji et al. (2018); Leaky: Leaky Integrator; Exact Bayes: Adams
and MacKay (2007).
|Rt
environment is nearly deterministic (p.ej., s = 0.001 so that the parameter
vectors pt have almost all mass concentrated in one component), or highly
stochastic (p.ej., s > 1 so that nearly uniform categorical distributions are
likely to be sampled), these algorithms achieve higher performance, mientras
the Particle Filter is the algorithm that is most resilient to extreme choices of
the stochasticity parameter s. For VarSMiLe in particular, the lowest mean
error is achieved for high s and high pc or low s and low pc.
A summary graph, con el (cid:8)MSE[ ˆ(cid:3)t] across all levels of s and pc, is in
Cifra 7. The algorithms with the lowest “worst case” are SOR20 and pf20.
The top four algorithms SOR20, pf20, MP20, and VarSMiLe are significantly
better than the others (the error bars indicate the standard error of the mean
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
Learning in Volatile Environments With the Bayes Factor Surprise
291
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
Cifra 6: Categorical estimation task: Steady-state performance. (A) Significar
squared error of the Exact Bayes algorithm (es decir., optimal solution) for each com-
bination of environmental parameters s and pc averaged over time. (B–F) Dif-
ference between the mean squared error of each algorithm and the optimal
solución (of panel A), eso es, el promedio de (cid:8)MSE[ ˆ(cid:3)
t] con el tiempo. The color
bar of panel A applies to these panels as well. Abbreviations: See the Figure 5
caption.
a través del 10 random task instances), whereas MP1 and SMiLe have the
largest error with a maximum at 0.53.
2.3.3 Summary of Simulation Results. En resumen, our simulation re-
sults of the two tasks collectively suggest that our Particle Filtering (pfN)
and Message Passing (MPN) algorithms achieve a high level of perfor-
mance, very close to the one of biologically less plausible algorithms with
más alto (Exact Bayes) and same (SORN) memory demands. Además, su
behavior is more consistent across tasks. Finalmente, among the algorithms with
memory demands of one unit, VarSMiLe performs best.
292
V. Liakoni, A. Modirshanechi, W.. Gerstner, y j. Brea
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
Cifra 7: Categorical estimation task: Steady-state performance summary. Dif-
ference between the mean squared error of each algorithm and the optimal
solución (Exact Bayes), eso es, el promedio de (cid:8)MSE[ ˆ(cid:3)
t] con el tiempo, for all com-
binations of s and pc together. For each algorithm, we plot the 42 valores (7 s
veces 6 pc values) of Figure 6 with respect to randomly jittered values in the
x-axis. The color coding is the same as in Figure 5. The error bars mark the stan-
dard error of the mean across 10 random task instances. The difference between
the worst case of SOR20 and pf20 is significant (p-value = 1.148 × 10−12, two-
sample t-test, 10 random seeds for each algorithm). Abbreviations: See the Fig-
ura 5 caption. Note that MP1 and SMiLe are out of bound with a maximum at
0.53.
2.3.4 Robustness against Suboptimal Parameter Choice. In all algorithms we
consideró, the environment’s hyperparameters are assumed to be known.
We can distinguish between two types of hyperparameters in our genera-
tive model: the parameters of the likelihood function (p.ej., σ in the gaussian
tarea), and the pc and the parameters of the conjugate prior (p.ej., s in the cat-
egorical task). Hyperparameters of the first type can be added to the param-
eter vector θ and be inferred with the same algorithm. Sin embargo, aprendiendo
the second type of hyperparameters is not straightforward. By assuming
that these hyperparameters are learned more slowly than θ , one can fine-
tune them after each n (p.ej., 10) change points, while change points can be
detected by looking at the particles for the Particle Filter and at the peaks
Learning in Volatile Environments With the Bayes Factor Surprise
293
of surprise values for VarSMiLe. Other approaches to hyperparameter esti-
mation can be found in Doucet and Tadi´c (2003), George and Doss (2017),
Liu and West (2001), and Wilson et al. (2010).
When the hyperparameters are fixed, a mismatch between the assumed
values and the true values is a possible source of errors. En esta sección, nosotros
investigate the robustness of the algorithms to a mismatch between the as-
sumed and the actual probability of change points. para hacerlo, we first tuned
each algorithm’s parameter for an environment with a change probability
pc and then tested the algorithms in environments with different change
probabilities while keeping the parameter fixed. For each new environment
with a different change probability, we calculated the difference between
the MSE of these fixed parameters and the optimal MSE, eso es, the re-
sulting MSE for the case that the Exact Bayes’ parameter is tuned for the
actual pc.
, pc] − MSE[ ˆ(cid:3)Opt
Más precisamente, if we denote as MSE[ ˆ(cid:3)t; pag(cid:5)
, pc] the MSE of an al-
gorithm with parameters tuned for an environment with p(cid:5)
C, aplicado
in an environment with pc, we calculated the mean regret, defined as
MSE[ ˆ(cid:3)t; pag(cid:5)
, pc], con el tiempo; note that the second term is
equal to MSE[ ˆ(cid:3)t; pc, pc] when the algorithm Exact Bayes is used for esti-
formación. The lower the values and the flatter the curve of the mean regret,
the better the performance and the robustness of the algorithm in the face of
lacking knowledge of the environment. The slope of the curve indicates the
degree of deviations of the performance as we move away from the opti-
mally tuned parameter. We ran three random (and same for all algorithms)
tasks initializations for each pc level.
C
C
t
C
c levels. For σ = 0.1 y P(cid:5)
En figura 8 we plot the mean regret for each algorithm for the gaussian
task for four pairs of s and p(cid:5)
= 0.04 (ver figura
8A), the Exact Bayes and the MP20 show the highest robustness (smallest
regret) and are closely followed by the pf20, VarSMiLe, and Nas12∗ (nota
the regret’s small range of values). The lower the actual pc, the higher the
regret, but the changes are still very small. The curves for the SMiLe and the
Leaky Integrator are also relatively flat, but the mean regret is much higher.
The SOR20 is the least robust algorithm.
Similar observations can be made for σ = 0.1 y P(cid:5)
C
= 0.004 (ver figura-
ure 8B). En este caso, the performance of all algorithms deteriorates strongly
when the actual pc is higher than the assumed one.
Sin embargo, for σ = 5 (see Figures 8C and 8D), the ranking of algorithms
cambios. The SOR20 is very robust for this level of stochasticity. The pf20
and MP20 perform similarly for pc = 0.04, but for lower p(cid:5)
C, the pf20 is
more robust and the MP20 exhibits high fluctuations in its performance.
The Nas12∗ is quite robust at this σ level. Overall for Exact Bayes, SOR20,
∗
, a mismatch of the assumed pc from the actual
pf20, VarSMiLe, and Nas12
one does not deteriorate the performance dramatically for σ = 5, pag(cid:5)
= 0.004
C
(see Figure 8D). The SMiLe and the Leaky Integrator outperform the other
algorithms for higher p(cid:5)
C (see Figure 8C). A potential reason is that
c if pc < p(cid:5)
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
n
e
c
o
_
a
_
0
1
3
5
2
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
294
V. Liakoni, A. Modirshanechi, W. Gerstner, and J. Brea
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
n
e
c
o
_
a
_
0
1
3
5
2
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Figure 8: Robustness to mismatch between actual and assumed probability of
changes for the gaussian estimation task. The mean regret is the mean squared
error obtained with assumed change probability p(cid:5)
c minus the mean squared
error obtained with the optimal parameter choice of Exact Bayes for the given
; p(cid:5)
actual pc, that is, the average of the quantity MSE[ ˆ(cid:3)
, pc]
c
over time. A red triangle marks the p(cid:5)
c value each algorithm was tuned for. We
plot the mean regret for the following parameter combinations: A, σ = 0.1 and
p(cid:5)
= 0.004; C, σ = 5 and p(cid:5)
=
c
c
0.004. Abbreviations: See the Figure 2 caption.
= 0.04; B, σ = 0.1 and p(cid:5)
c
= 0.04; D, σ = 5 and p(cid:5)
c
, pc] − MSE[ ˆ(cid:3)Opt
t
t
the optimal behavior for the Leaky Integrator (according to the tuned pa-
rameters) is to constantly integrate new observations into its belief (i.e., to
act like a Perfect Integrator) regardless of the p(cid:5)
c level. This feature makes it
blind to the pc and therefore very robust against the lack of knowledge of it
(see Figure 8C).
In summary, most of the time, the mean regret for Exact Bayes and MP20
is less than the mean regret for pf20 and VarSMiLe. However, the variability
Learning in Volatile Environments With the Bayes Factor Surprise
295
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
n
e
c
o
_
a
_
0
1
3
5
2
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Figure 9: Robustness to mismatch between actual and assumed probability of
changes for the categorical estimation task. The mean regret is the mean squared
error obtained with assumed change probability p(cid:5)
c minus the mean squared
error obtained with the optimal parameter choice of Exact Bayes for the given
; p(cid:5)
actual pc, that is, the average of the quantity MSE[ ˆ(cid:3)
, pc]
c
over time. A red triangle marks the p(cid:5)
c value each algorithm was tuned for. We
plot the mean regret for the following parameter combinations: A, s = 0.14 and
p(cid:5)
= 0.004; C, s = 5 and p(cid:5)
=
c
c
0.004. Abbreviations: See the Figure 5 caption.
= 0.04; B, s = 0.14 and p(cid:5)
c
= 0.04; D, s = 5 and p(cid:5)
c
, pc] − MSE[ ˆ(cid:3)Opt
t
t
in the mean regret for pf20 and VarSMiLe is smaller, and their curves are
flatter across pc levels, which makes their performance more predictable.
The results for the categorical estimation task are similar to those of the
gaussian task, with the difference that the SOR20 is very robust for this case
(see Figure 9).
2.4 Experimental Prediction. It has been experimentally shown that
indicators statistically
some important behavioral and physiological
296
V. Liakoni, A. Modirshanechi, W. Gerstner, and J. Brea
correlate with a measure of surprise or a prediction error. Examples of such
indicators are the pupil diameter (Joshi & Gold, 2020; Nassar et al., 2012;
Preuschoff et al., 2011), the amplitude of the P300, N400, and MMN compo-
nents of EEG (Kopp & Lange, 2013; Lieder, Daunizeau, Garrido, Friston, &
Stephan, 2013; Mars et al., 2008; Meyniel, Maheu, & Dehaene, 2016; Modir-
shanechi et al., 2019; Musiolek, Blankenburg, Ostwald, & Rabovsky, 2019;
Ostwald et al., 2012), the amplitude of MEG in specific time windows (Ma-
heu et al., 2019), BOLD responses in fMRI (Konovalov & Krajbich, 2018;
Loued-Khenissi, Pfeuffer, Einhäuser, & Preuschoff, 2020), and reaction time
(Huettel, Mack, & McCarthy, 2002; Meyniel et al., 2016). The surprise mea-
sure is usually the negative log probability of the observation, known as
Shannon Surprise (Shannon, 1948), and denoted here as SSh. However, as
we show in this section, as long as there is an uninformative prior over
observations, Shannon Surprise SSh is just an invertible function of our
modulated adaptation rate γ and hence an invertible function of the Bayes
Factor Surprise SBF. Thus, based on the results of previous works (Meyniel
et al., 2016; Modirshanechi et al., 2019; Nassar et al., 2012, 2010; Ostwald
et al., 2012), which always used uninformative priors, one cannot deter-
mine whether the physiological and behavioral indicators we have noted
correlate with SSh or SBF.
In this section, we first investigate the theoretical differences between
the Bayes Factor Surprise SBF and Shannon Surprise SSh. Then, based on
their observed differences, we formulate two experimentally testable pre-
dictions, with a detailed experimental protocol. Our predictions make it
possible to discriminate between the two measures of surprise and deter-
mine whether physiological or behavioral measurements are signatures of
SBF or of SSh.
2.4.1 Theoretical Difference between SBF and SSh. Shannon Surprise (Shan-
non, 1948) is defined as
SSh(yt+1
; π (t)) = −log
(cid:7)
(cid:8)
|y1:t )
,
P(yt+1
(2.24)
(cid:7)
; π (t)) = −log
|y1:t ), one should know the structure of
where for computing P(yt+1
the generative model. For the generative model of Figure 1A, we
find SSh(yt+1
. While the
Bayes Factor Surprise SBF depends on a ratio between the probability of the
new observation under the prior and the current beliefs, Shannon Surprise
depends on a weighted sum of these probabilities. Interestingly, it is pos-
sible to express (see section 4 for derivation) the adaptation rate γt+1 as a
function of the “difference in Shannon Surprise,”
; π (t)) + pcP(yt+1
(1 − pc)P(yt+1
(cid:8)
; π (0))
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
n
e
c
o
_
a
_
0
1
3
5
2
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
(cid:8)SSh(yt+1
(cid:8)
; π (t), π (0))
,
(cid:7)
= pcexp
γt+1
where (cid:8)SSh(yt+1
; π (t), π (0)) = SSh(yt+1
; π (t)) − SSh(yt+1
; π (0)),
(2.25)
Learning in Volatile Environments With the Bayes Factor Surprise
297
(cid:7)
S(t+1)
(cid:8)
, m
BF
= γ
where γt+1
depends on the Bayes Factor Surprise and the
saturation parameter m (cf. equation 2.9). Equation 2.25 shows that the
modulated adaptation rate is not just a function of Shannon Surprise upon
observing yt+1, but a function of the difference between the Shannon Surprise
of this observation under the current and the prior beliefs. Note that if the
prior is uninformative then the second term in equation 2.25 is a constant
with respect to observations, and γt+1 is an invertible function of SSh, thus
indistinguishable from SSh. We next exploit differences between SBF and SSh
to formulate our experimentally testable predictions.
2.4.2 Experimental Protocol. Consider the variant of the gaussian task of
Nassar et al. (2012, 2010), which we used in our simulations: PY (y|θ ) =
N (y; θ , σ 2) and π (0)(θ ) = N (θ ; 0, 1). Human subjects are asked to predict
the next observation yt+1 given what they have observed so far, that is, y1:t.
The experimental procedure is as follows:
1. Fix the hyperparameters σ 2 and pc.
2. At each time t, show the observation yt (produced in the aforemen-
tioned way) to the subject, and measure a physiological or behav-
ioral indicator Mt, for example, pupil diameter (Nassar et al., 2012,
2010).
3. At each time t, after observing yt, ask the subject to predict the next
observation ˆyt+1 and his or her confidence Ct about that prediction.
Note that the only difference between our task and the task of Nassar et al.
(2012, 2010) is the choice of prior for θ (i.e., gaussian instead of uniform).
The assumption is that according to the previous studies, there is a positive
correlation between Mt and a measure of surprise.
2.4.3 Prediction 1. Based on the results of Nassar et al. (2012, 2010), in
such a gaussian task, the best fit for subjects’ prediction ˆyt+1 is ˆθt, and the
confidence Ct is a monotonic function of the standard deviation of ˆθt, de-
ˆsigmat. In order to formalize our experimental prediction, we de-
noted as
fine, at time t, the prediction error as δt = yt − ˆyt and the “sign bias” as
st = sign(δt ˆyt ). The variable st is a crucial variable for our analysis. It shows
whether the prediction ˆyt is an overestimation in absolute value (st = +1)
or an underestimation in absolute value (st = −1). Figure 10A shows a
schematic for the case that both the current and prior beliefs are gaus-
sian distributions. The two observations indicated by dashed lines have the
same absolute error |δt| but differ in the sign bias s.
Given an absolute prediction value ˆy > 0, an absolute prediction error
δ > 0, a confidence value C > 0, and a sign bias s ∈ {−1, 1}, we can com-
pute the average of Mt over time for the time points with | ˆyt| ≈ ˆy, |δt| ≈ δ,
Ct ≈ C, and st = s, which we denote as ¯M1( ˆy, δ, s, C); the index 1 stands for
experimental prediction 1. The approximation notation ≈ is used for
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
298
V. Liakoni, A. Modirshanechi, W.. Gerstner, y j. Brea
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
Cifra 10: Experimental prediction 1. (A) Schematic of the task for the case
of a gaussian belief. The distributions of yt+1 under the prior belief π (0) y
the current belief π (t) are shown by black and red curves, respectivamente. Two
possible observations with equal absolute prediction error δ but opposite sign
bias s are indicated by dashed lines. The two observations are equally prob-
able under π (t) but not under π (0). SBF is computed as the ratio between the
red and black dots for a given observation, whereas SSh is a function of the
weighted sum of the two. This phenomenon is the basis of our experimental
predicción. (B) The average surprise values ¯SSh( ˆθ = 1, δ, s = ±1, pag
= 0.5) y
¯SBF( ˆθ = 1, δ, s = ±1, pag
= 0.5) encima 20 subjects (each with 500 observaciones) son
shown for two different learning algorithms (Nas12∗ and pf20). The mean ¯SBF
is higher for negative sign bias (marked in blue) than for positive sign bias
(marked in orange). The opposite is observed for the mean ¯SSh. This effect in-
creases with increasing values of prediction error δ. The shaded area corre-
sponds to the standard error of the mean. The experimental task is the same
= 0.1
as the gaussian task we used in the previous section, with σ = 0.5 and pc
(mira la sección 4 for details).
C
C
Learning in Volatile Environments With the Bayes Factor Surprise
299
continuous variables instead of equality due to practical limitations (es decir.,
for obtaining adequate number of samples for averaging). Note that for our
theoretical proofs, we use equality, but in our simulation, we include the
practical limitations of a real experiment, and hence, use an approximation.
The formal definitions can be found in section 4. It is worth noting that
the quantity ¯M1( ˆy, δ, s, C) is model independent; its calculation does not re-
quire any assumption on the learning algorithm the subject may employ.
¯M1( ˆy, δ, s, C) reflects SSh or SBF,
Depending on whether the measurement
its relationship to the defined four variables ( ˆy, δ, s, C) is qualitatively and
quantitatively different.
In order to prove and illustrate our prediction, we consider each subject
as an agent enabled with one of the learning algorithms that we discussed.
Similar to the above, given an absolute prediction ˆθ > 0 (correspondiente
to the subjects’ absolute prediction ˆy), an absolute prediction error δ > 0,
a standard deviation σ
C (corresponding to the subjects’ confidence value
C), and a sign bias s ∈ {−1, 1}, we can compute the average Shannon Sur-
prise SSh(yt; ˆπ (t−1)) and the average Bayes Factor Surprise SBF(yt; ˆπ (t−1))
con el tiempo, for the time points with | ˆθt−1
C, and st = s,
which we denote as ¯SSh( ˆθ , δ, s, pag
C) respectivamente. We can
show theoretically (mira la sección 4) and in simulations (see Figure 10B and
sección 4) that for any value of ˆθ , δ, and σ
C) >
¯SSh( ˆθ , δ, s = −1, pag
C) for the Shannon Surprise, and exactly the opposite re-
lación, ¯SBF( ˆθ , δ, s = +1, pag
C), for the Bayes Factor Sur-
prise. Además, this effect increases with increasing δ.
C, we have ¯SSh( ˆθ, δ, s = +1, pag
C) < ¯SBF( ˆθ , δ, s = −1, σ | ≈ ˆθ , |δt| ≈ δ, ˆσt ≈ σ C) and ¯SBF( ˆθ , δ, s, σ It should be noted that such an effect is due to the essential difference of SSh and SBF in using the prior belief π (0)(θ ). Our experimental predic- tion is theoretically provable for the cases that each subject’s belief ˆπ (t) is a ∗ gaussian distribution, which is the case if they employ VarSMiLe, Nas10 , ∗ Nas12 , pf1, MP1, or Leaky Integrator as their learning rule (see section 4). For the cases that different learning rules (e.g., pf20) are used, where the posterior belief is a weighted sum of gaussians, the theoretical analysis is more complicated, but our simulations show the same results (see Figure 10B and section 4). Therefore, independent of the learning rule, we have the same experimental prediction on the manifestation of different surprise measures on physiological signals, such as pupil dilation. Our first exper- imental prediction can be summarized as the set of hypotheses shown in Table 1. 2.4.4 Prediction 2. Our second prediction follows the same experimen- tal procedure as the one for the first prediction. The main difference is that for the second prediction, we need to fit a model to the experimental data. Given one of the learning algorithms, the fitting procedure can be done by tuning the free parameters of the algorithm with the goal of minimiz- ing the mean squared error between the model’s prediction ˆθt and a sub- ject’s prediction ˆyt+1 (similar to Nassar et al., 2012, 2010) or with the goal of l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / e d u n e c o a r t i c e - p d / l f / / / / 3 3 2 2 6 9 1 8 9 6 8 0 8 n e c o _ a _ 0 1 3 5 2 p d . / f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 300 V. Liakoni, A. Modirshanechi, W. Gerstner, and J. Brea Table 1: Experimental Hypotheses and Predictions 1. Hypothesis Prediction The indicator reflects SBF The indicator reflects SSh The prior is not used for inference (cid:8) ¯M1( ˆθ, δ, C) = 0 (cid:8) ¯M1( ˆθ, δ, C) < 0 and (cid:8) ¯M1( ˆθ, δ, C) > 0 y
∂(cid:8) ¯M1 ( ˆθ,δ,C)
∂δ
∂(cid:8) ¯M1 ( ˆθ,δ,C)
∂δ
< 0 > 0
Nota: (cid:8) ¯M1( ˆθ, δ, C) stands for ¯M1( ˆθ, δ, s = +1, C) − ¯M1( ˆθ, δ, s = −1, C).
maximizing the likelihood of subject’s prediction ˆπ (t)( ˆyt+1). Our prediction
is independent of the learning algorithm, but in an actual experiment, nosotros
recommend using model selection to find the model that fits the human
data best.
Having a fitted model, we can compute the probabilities P(yt+1
; ˆπ (t)) y
; ˆπ (0)). For the case that these probabilities are equal, PAG(yt+1
; ˆπ (t)) =
PAG(yt+1
; ˆπ (0)) = p, the Bayes Factor Surprise SBF is equal to 1, independiente
PAG(yt+1
of the value of p (see equation 2.7). Sin embargo, the Shannon Surprise SSh is
equal to − log p and varies with p. Figure 11A shows a schematic for the case
that both current and prior beliefs are gaussian distributions. Two cases for
; ˆπ (0)) = p, for two different p values,
; ˆπ (t)) =P(yt+1
which we have P(yt+1
are marked by black dots at the intersections of the curves.
Given a probability p > 0, we can compute the average of Mt over time
; ˆπ (0)) ≈ p, which we de-
; ˆπ (t)) ≈ p and P(yt+1
for the time points with P(yt+1
note as ¯M2(pag); the index 2 stands for experimental prediction 2. Analogous
to the first prediction, the approximation notation ≈ is used due to practi-
cal limitations. Entonces, if ¯M2(pag) is independent of p, its behavior is consistent
with SBF, whereas if it decreases by increasing p, it can be a signature of
SSh. Our second experimental prediction can be summarized as the two hy-
potheses shown in Table 2. Note that in contrast to our first prediction, con
the assumption that the standard deviation of the prior belief is fitted us-
ing the behavioral data, we do not consider the hypothesis that the prior is
not used for inference, because this is indistinguishable from a very large
variance of the prior belief.
In order to illustrate the possible results and the feasibility of the exper-
mento, we ran a simulation and computed ¯SBF(pag) and ¯SSh(pag) for the time
; ˆπ (0)) ≈ p (mira la sección 4 for details).
; ˆπ (t)) ≈ p and P(yt+1
points with P(yt+1
The results of the simulation are shown in Figure 11B.
3 Discusión
We have shown that performing exact Bayesian inference on a gen-
erative world model naturally leads to a definition of surprise and a
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
Learning in Volatile Environments With the Bayes Factor Surprise
301
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
Cifra 11: Experimental prediction 2. (A) Schematic of the task for the case of
a gaussian belief. The probability distribution of observations under the prior
belief is shown by the solid black curve. Two different possible current beliefs
(determined by the letters A and B) are shown by dashed red curves. The inter-
sections of the dashed red curves with the prior belief determine observations
whose SBF is same and equal to one, but their SSh is a function of their prob-
abilities under the prior belief p. (B) The average surprise values ¯SSh(pag) y
¯SBF(pag) encima 20 subjects (each with 500 observaciones) are shown for two different
learning algorithms (Nas12∗ and pf20). The mean SBF is constant (equal to 1)
and independent of p, whereas the mean SSh is a decreasing function of p. El
shaded area corresponds to the standard error of the mean. The experimental
task is the same as the gaussian task we used in the previous section. Obser-
vations yt are drawn from a gaussian distribution with σ = 0.5, whose mean
changes with change point probability pc
= 0.1 (mira la sección 4 for details).
surprise-modulated adaptation rate. We have proposed three approximate
algoritmos (VarSMiLe, MPN, and pfN) for learning in nonstationary envi-
ambientes, which all exhibit the surprise-modulated adaptation rate of the
302
V. Liakoni, A. Modirshanechi, W.. Gerstner, y j. Brea
Mesa 2: Experimental Hypotheses and Predictions 2.
Hipótesis
The indicator reflects SBF
The indicator reflects SSh
Prediction
∂ ¯M2 (pag)
∂ p
∂ ¯M2 (pag)
∂ p
= 0
< 0
exact Bayesian approach and are biologically plausible. Empirically we ob-
served that our algorithms achieve levels of performance comparable to
approximate Bayesian methods with higher memory demands (Adams &
MacKay, 2007) and are more resilient across different environments com-
pared to methods with similar memory demands (Faraji et al., 2018; Fearn-
head & Liu, 2007; Nassar et al., 2012, 2010).
Learning in a volatile environment has been studied for a long time in
the fields of Bayesian learning, neuroscience, and signal processing. In the
following, we discuss the biological relevance of our work, and we briefly
review some of the previously developed algorithms, with particular focus
on the ones that have studied environments that can be modeled with a
generative model similar to the one in Figure 1. We then discuss further our
results and propose directions for future work on surprise-based learning.
3.1 Biological Interpretation. Humans are able to quickly adapt to
changes (Behrens et al., 2007; Nassar et al., 2012, 2010), but human behav-
ior is also often observed to be suboptimal, compared to the normative ap-
proach of exact Bayesian inference (Glaze et al., 2015; Mathys, Daunizeau,
Friston, & Stephan, 2011; Nassar et al., 2010; Prat-Carrabin, Wilson, Cohen,
& Da Silveira, 2020; Wilson, Nassar, & Gold, 2013). In general, biological
agents have limited resources and possibly inaccurate assumptions about
hyperparameters, yielding suboptimal behavior, as we also see with our
algorithms whose accuracies degrade with a suboptimal choice of hyper-
parameters. Performance also deteriorates with a decreasing number of
particles in the sampling-based algorithms, which might be another pos-
sible explanation of suboptimal human behavior. Previously, Particle Fil-
tering has been shown to explain the behavior of human subjects in chang-
ing environments: Daw and Courville (2008) use a single particle, Brown
and Steyvers (2009) use a simple heuristic form of particle filtering based
on direct simulation, Findling et al. (2019) combine Particle Filtering with a
noisy inference, and Prat-Carrabin et al. (2020) use it for a task with tempo-
ral structure.
At the level of neuronal implementation, we do not propose a specific
suggestion. However, there are several hypotheses about neural implemen-
tations of related particle filters (Huang & Rao, 2014; Kutschireiter et al.,
2017; Legenstein & Maass, 2014; Shi & Griffiths, 2009) on which, a neural
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
n
e
c
o
_
a
_
0
1
3
5
2
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Learning in Volatile Environments With the Bayes Factor Surprise
303
model of pfN and—its greedy version—MPN could be based. In a similar
spirit, the updating scheme of Variational SMiLe may be implemented in
biological neural networks (for distributions in the exponential family).
Our theoretical framework for modulation of learning by the Bayes
Factor Surprise SBF is related to the body of literature on neo-Hebbian
three-factor learning rules (Frémaux & Gerstner, 2016; Gerstner et al., 2018;
Lisman et al., 2011), where a third factor indicating reward or surprise en-
ables or modulates a synaptic change or a belief update (Yu, 2012; Yu &
Dayan, 2005). We have shown how Bayesian or approximate Bayesian in-
ference naturally leads to such a third factor that modulates learning via
the surprise modulated adaptation rate γ (SBF
, m). This may offer novel in-
terpretations of behavioral and neurophysiological data and help in under-
standing how three-factor learning computations may be implemented in
the brain.
3.2 Related Work.
3.2.1 Exact Bayesian Inference. As already described in section 2.2.2, for
the generative model in Figure 1, it is possible to find an exact online
Bayesian update of the belief using a message-passing algorithm (Adams
& MacKay, 2007). The space and time complexity of the algorithm increases
linearly with t, which makes it unsuitable for an online learning setting.
However, approximations like dropping messages below a certain thresh-
old (Adams & MacKay, 2007) or stratified resampling (Fearnhead & Liu,
2007) allow reducing the computational complexity. The former has a vari-
able number of particles in time, and the latter needs solving a complicated
nonlinear equation at each time step in order to reduce the number of par-
ticles to N (called SORN in section 2).
Our message-passing algorithm with a finite number of particles (mes-
sages) N (MPN; see algorithm 3) is closely related to these algorithms and
can be seen as a biologically more plausible variant of the other two. All
three algorithms have the same update rules given by equations 2.19 and
2.18. Hence the algorithms of both Adams and MacKay (2007) and Fearn-
head and Liu (2007) have the same surprise modulation as our MPN. The
difference lies in their approaches to eliminate less “important” particles.
In the literature of switching state-space models (Barber, 2012), the gen-
erative models of the kind in Figure 1 are known as “reset models,” and the
message-passing algorithm of Adams and MacKay (2007) is known to be
the standard algorithm for inference over these models (Barber, 2012). See
Barber (2006, 2012), and Ghahramani & Hinton (2000) for other variations of
switching state-space models and examples of approximate inference over
them.
3.2.2 Leaky Integration and Variations of Delta Rules. In order to esti-
mate some statistics, leaky integration of new observations is a particularly
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
n
e
c
o
_
a
_
0
1
3
5
2
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
304
V. Liakoni, A. Modirshanechi, W. Gerstner, and J. Brea
simple form of a trade-off between integrating and forgetting. After a tran-
sient phase, the update of a leaky integrator takes the form of a delta rule
that can be seen as an approximation of exact Bayesian updates (Heilbron
& Meyniel, 2019; Meyniel et al., 2016; Ryali et al., 2018; Yu & Cohen, 2009).
This update rule was found to be biologically plausible and consistent with
human behavioral data (Meyniel et al., 2016; Yu & Cohen, 2009). However,
Behrens et al. (2007) and Heilbron and Meyniel (2019) demonstrated that in
some situations, the exact Bayesian model is significantly better than leaky
integration in explaining human behavior. The inflexibility of leaky integra-
tion with a single, constant leak parameter can be overcome by a weighted
combination of multiple leaky integrators (Wilson et al., 2013), where the
weights are updated in a similar fashion as in the exact online methods
(Adams & MacKay, 2007; Fearnhead & Liu, 2007), or by considering an
adaptive leak parameter (Nassar et al., 2012, 2010). We have shown that
the two algorithms of Nassar et al. (2012, 2010) can be generalized to gaus-
sian prior beliefs (Nas10∗
∗
). Our results show that these algo-
and Nas12
rithms also inherit the surprise modulation of the exact Bayesian inference.
Our surprise-dependent adaptation rate γ can be interpreted as a surprise-
modulated leak parameter.
3.2.3 Other Approaches. Learning in the presence of abrupt changes has
also been considered without explicit assumptions about the underlying
generative model. One approach uses a surprise-modulated adaptation
rate (Faraji et al., 2018) similar to equation 2.9. The Surprise-Minimization
Learning (SMiLe) algorithm of Faraji et al. (2018) has an updating rule sim-
ilar to the one of VarSMiLe (see equations 2.14 and 2.15). The adaptation
rate modulation, however, is based on the Confidence Corrected Surprise
(Faraji et al., 2018) rather than the Bayes Factor Surprise, and the trade-off in
its update rule is between resetting and staying with the latest belief rather
than between resetting and integrating (see section 4).
Other approaches use different generative models, such as conditional
sampling of the parameters also when there is a change (Glaze et al., 2015;
Yu & Dayan, 2005), a deeper hierarchy without fixed change probability pc
(Wilson et al., 2010), or drift in the parameters (Gershman, Radulescu, Nor-
man, & Niv, 2014; Mathys et al., 2011). A recent work shows that inference on
a simpler version the generative model of Figure 1, with no change points
but with a noisy inference style, can explain human behavior well even
when the true generative model of the environment is different and more
complicated (Findling et al., 2019). They develop a heuristic approach to
add noise in the inference process of a Particle Filter. Their algorithm can be
interpreted as a surprise-modulated Particle Filter, where the added noise
scales with a measure of surprise—conceptually equivalent to Bayesian sur-
prise (Itti & Baldi, 2006; Schmidhuber, 2010; Storck, Hochreiter, & Schmid-
huber, 1995). Moreover, another recent work (Prat-Carrabin et al., 2020)
shows that approximate sampling algorithms (like Particle Filtering) can
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
n
e
c
o
_
a
_
0
1
3
5
2
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Learning in Volatile Environments With the Bayes Factor Surprise
305
explain human behavior better than their alternatives in tasks closely re-
lated to the generative model of Figure 1. The signal processing literature
provides further methods to address the problem of learning in nonsta-
tionary environments with abrupt changes; see Aminikhanghahi and Cook
(2017) for a review, and Cummings et al. (2018), Lin et al. (2017), Masegosa
et al. (2017), and Özkan et al. (2013) for a few recent examples.
3.3 Surprise Modulation as a Generic Phenomenon. Learning rate
modulation similar to the one in equation 2.9 has been previously proposed
in the neuroscience literature with either heuristic arguments (Faraji et al.,
2018) or Bayesian arguments for a particular experimental task, for exam-
ple, when samples are drawn from a gaussian distribution (Nassar et al.,
2012, 2010). The fact that the same form of modulation is at the heart of
Bayesian inference for our relatively general generative model, that it is de-
rived without any further assumptions, and is not a priori defined, is in our
view an important contribution to the field of adaptive learning algorithms
in computational neuroscience.
Furthermore, the results of our three approximate methods (Particle Fil-
tering, Variational SMiLe, and Message Passing with a fixed N number of
messages) as well as some previously developed ones (Adams & MacKay,
2007; Fearnhead & Liu, 2007; Nassar et al., 2012, 2010) demonstrate that the
surprise-based modulation of the learning rate is a generic phenomenon.
Therefore, regardless of whether the brain uses Bayesian inference or an
approximate algorithm (Bogacz, 2017, 2019; Findling et al., 2019; Friston,
2010; Gershman, 2019; Gershman et al., 2014; Mathys et al., 2011; Nassar
et al., 2012, 2010; Prat-Carrabin et al., 2020), the notion of Bayes Factor Sur-
prise and the way it modulates learning (see equations 2.12 and 2.9) look
generic.
The generality of the way surprise should modulate learning depends on
an agent’s inductive biases about its environment and is directly associated
with the assumed generative model of the world. The generative model we
considered in this work involves abrupt changes. However, one can think of
other realistic examples, where an improbable observation does not indicate
a persistent change but a singular event or an outlier, similar to d’Acremont
and Bossaerts (2016) and Nassar, Bruckner, and Frank (2019). In such sit-
uations, the belief should not be changed, and surprise should attenuate
learning rather than accelerate it. Interestingly, we can show that exact and
approximate Bayesian inference on such a generative model naturally leads
to a surprise-modulated adaptation rate γ (SBF
, m), with the same definition
of SBF, where the trade-off is not between integrating and resetting, but be-
tween integrating and ignoring the new observation.1
1
Liakoni, V., Lehmann, M. P., Modirshanechi, A., Brea, J., Herzog, M. H., Gerstner,
W., & Preuschoff, K. Dissociating brain regions encoding reward prediction error and surprise.
Manuscript in preparation.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
n
e
c
o
_
a
_
0
1
3
5
2
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
306
V. Liakoni, A. Modirshanechi, W. Gerstner, and J. Brea
An aspect that the generative model we considered does not capture is
the potential return to a previous state of the environment rather than a
change to a completely new situation. If in our example of Figure 1B, the
bridge with the shortest path is temporarily closed for repairs, your friend
would again have to take the longer detour; therefore, her arrival times will
return to their previous values (i.e. increase). In such cases, an agent should
infer whether the surprising observation stems from a new hidden state or
from an old state stored in memory. Relevant generative models have been
studied in Fox, Sudderth, Jordan, and Willsky (2011), Gershman, Monfils,
Norman, and Niv (2017), Gershman et al. (2014) and are out of the scope of
this article.
3.4 Bayes Factor Surprise as a Novel Measure of Surprise. In view of
a potential application in the neurosciences, a definition of surprise should
reflect how unexpected an event is and should modulate learning. Surpris-
ing events indicate that our belief is far from the real world and suggest
updating our model of the world, or, for a large surprise, simply forgetting
it. Forgetting is the same as returning to the prior belief. However, an ob-
servation yt+1 can be unexpected under both the prior π (0) and the current
beliefs π (t). In these situations, it is not obvious whether forgetting helps.
Therefore, the modulation between forgetting or not should be based on
a comparison between the probability of an event under the current belief
P(yt+1
; π (t)) and its probability under the prior belief P(yt+1
The definition of the Bayes Factor Surprise SBF as the ratio of P(yt+1
; π (t))
; π (0)) exploits this insight. The Bayes Factor Surprise appears as
and P(yt+1
a modulation factor in the recursive form of the exact Bayesian update rule
for a hierarchical generative model of the environment. When two events
are equally probable under the prior belief, the one that is less expected
under the current belief is more surprising, satisfying the first property. At
the same time, when two events are equally probable under the current be-
lief, the one that is more expected under the prior belief is more surprising,
signaling that forgetting may be beneficial.
; π (0)).
SBF can be written (using equation 2.6) in a more explicit way as
SBF(yt+1
; π (t)) = P(yt+1
P(yt+1
; π (0))
; π (t))
=
(cid:9)
PY (yt+1
PY (yt+1
(cid:9)
(cid:10)
|(cid:3))
|(cid:3))
(cid:10) .
Eπ (0)
Eπ (t)
(3.1)
Note that the definition by itself is independent of the specific form of the
generative model. In other words, even in the cases where data are gen-
erated with another generative model (e.g., the real world), SBF could be
a candidate surprise measure in order to interpret brain activity or pupil
dilation.
We formally discussed the connections between the Bayes Factor Sur-
prise and Shannon Surprise (Shannon, 1948) and showed that they are
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
n
e
c
o
_
a
_
0
1
3
5
2
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Learning in Volatile Environments With the Bayes Factor Surprise
307
closely linked. We showed that the modulated adaptation rate (γ ) used in
(approximate) Bayesian inference is a function of the difference between
the Shannon Surprise under the current and the prior beliefs, but cannot be
expressed solely by the Shannon Surprise under the current one. Our for-
mal comparisons between these two different measures of surprise lead to
specific experimentally testable predictions.
The Bayesian Surprise SBa (Itti & Baldi, 2006; Schmidhuber, 2010; Storck
et al., 1995) and the Confidence Corrected Surprise SCC (Faraji et al., 2018)
are two other measures of surprise in neuroscience. The learning modula-
tion derived in our generative model cannot be expressed as a function of
SBa and SCC. However, one can hypothesize that SBa is computed after the
update of the belief to measure the information gain of the observed event
and is therefore not a good candidate for online learning modulation. The
Confidence Corrected Surprise SCC takes into account the shape of the belief
and therefore includes the effects of confidence, but it does not consider any
¯M1( ˆθ , δ, s = +1, C) =
information about the prior belief. Hence, a result of
¯M1( ˆθ , δ, s = −1, C) in our first experimental prediction would be consistent
with the corresponding behavioral or physiological indicator reflecting the
SCC.
3.5 Difference in Shannon Surprise, an Alternative Perspective. Fol-
lowing our formal comparison in section 2.4, SBF can be expressed as a de-
terministic function of the difference in Shannon Surprise as
SBF
= (1 − pc)e(cid:8)SSh
1 − pce(cid:8)SSh
.
(3.2)
All of our theoretical results can be rewritten by replacing SBF with this
function of (cid:8)SSh. Moreover, because there is a one-to-one mapping between
SBF and (cid:8)SSh, from a systemic point of view, it is not possible to specify
whether the brain computes the former or the latter by analysis of behav-
ioral data and biological signals. This suggests an alternative interpretation
of surprise-modulated learning as an approximation of Bayesian inference:
What the brain computes and perceives as surprise or prediction error may
be Shannon Surprise, but the modulating factor in a three-factor synaptic
plasticity rule (Frémaux & Gerstner, 2016; Gerstner et al., 2018; Lisman et al.,
2011) may be implemented by comparing the Shannon Surprise values un-
der the current and the prior beliefs.
3.6 Future Directions. A natural continuation of our study is to test our
experimental predictions in human behavior and physiological signals in
order to investigate which measures of surprise are used by the brain. In
a similar direction, our approximate learning algorithms can be evaluated
on human behavioral data from experiments that use a similar generative
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
n
e
c
o
_
a
_
0
1
3
5
2
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
308
V. Liakoni, A. Modirshanechi, W. Gerstner, and J. Brea
model (Behrens et al., 2007; Glaze et al., 2015; Heilbron & Meyniel, 2019;
Nassar et al., 2012, 2010; Wilson et al., 2013; Yu & Dayan, 2005) in order to
assess if our proposed algorithms achieve similar or better performance in
explaining data.
Finally, our methods can potentially be applied to model-based rein-
forcement learning in nonstationary environments. In recent years, there
has been growing interest in adaptive or continually learning agents in
changing environments in the form of continual learning and meta-learning
(Lomonaco, Desai, Culurciello, & Maltoni, 2019; Traoré et al., 2019). Many
continual learning model-based approaches make use of some procedure to
detect changes (Lomonaco et al., 2019; Nagabandi et al., 2018). Integrating
SBF and a learning rate γ (SBF) into a reinforcement learning agent would
be an interesting future direction.
4 Methods
4.1 Proof of the proposition. By definition,
π (t+1)(θ ) ≡ P((cid:3)t+1
= θ |y1:t+1).
(4.1)
We exploit the Markov property of the generative model in equations 2.2 to
2.4, condition on the fixed past y1:t and rewrite
π (t+1)(θ ) = PY (yt+1
|θ )P((cid:3)t+1
|y1:t )
P(yt+1
= θ |y1:t )
.
(4.2)
By marginalization over the hidden state Ct+1, the second factor in the nu-
merator of equation 4.2 can be written as
P((cid:3)t+1
= θ |y1:t ) = (1 − pc)π (t)(θ ) + pcπ (0)(θ ).
(4.3)
The denominator in equation 4.2 can be written as
P(yt+1
|y1:t ) =
(cid:4)
PY (yt+1
(cid:4)
|θ )P((cid:3)t+1
= θ |y1:t )dθ
(cid:4)
= (1 − pc)
PY (yt+1
|θ )π (t)(θ )dθ + pc
PY (yt+1
|θ )π (0)(θ )dθ
= (1 − pc)P(yt+1
; π (t)) + pcP(yt+1
; π (0)),
(4.4)
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
n
e
c
o
_
a
_
0
1
3
5
2
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Learning in Volatile Environments With the Bayes Factor Surprise
309
where we used the definition in equation 2.6. Using these two expanded
forms, equation 4.2 can be rewritten as
π (t+1)(θ ) =
(cid:12)
|θ )
PY (yt+1
(1 − pc)P(yt+1
(cid:13)
(1 − pc)π (t)(θ ) + pcπ (0)(θ )
; π (0))
; π (t)) + pcP(yt+1
.
(4.5)
We define P(θ |yt+1) as the posterior given a change in the environment as
P(θ |yt+1) = PY (yt+1
P(yt+1
|θ )π (0)(θ )
; π (0))
.
Then, we can write equation 4.5 as
(4.6)
(θ ) + pcP(yt+1
; π (t)) + pcP(yt+1
; π (0))P(θ |yt+1)
; π (0))
π (t+1)(θ ) = (1 − pc)P(yt+1
; π (t))π (t+1)
B
(1 − pc)P(yt+1
;π (0) )
;π (t) ) P(θ |yt+1)
;π (0) )
;π (t) )
P(yt+1
P(yt+1
P(yt+1
P(yt+1
(θ ) + γt+1P(θ |yt+1),
=
π (t+1)
B
(θ ) + pc
1−pc
1 + pc
1−pc
= (1 − γt+1)π (t+1)
B
where π (t+1)
B
(θ ) is defined in equation 2.8, and
(cid:5)
γt+1
= γ
SBF(yt+1
; π (t)),
(cid:6)
pc
1 − pc
(4.7)
(4.8)
with SBF defined in equation 2.7, and γ (S, m) defined in equation 2.9. Thus,
our calculation yields a specific choice of surprise (S = SBF) and a specific
value for the saturation parameter m = pc
1−pc
.
4.2 Derivation of the Optimization-Based Formulation of VarSMiLe:
Algorithm 1. To derive the optimization-based update rule for the Varia-
tional SMiLe rule and the relation of the bound Bt+1 with surprise, we used
the same approach used in Faraji et al. (2018).
4.2.1 Derivation of the Update Rule. Consider the general form of the fol-
lowing variational optimization problem:
∗
q
(θ ) = argmin DKL
q(θ ) s.t. DKL
(cid:10)
(cid:9)
q(θ )||p2(θ )
(cid:9)
(cid:10)
q(θ )||p1(θ )
< B and Eq[1] = 1,
(4.9)
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
n
e
c
o
_
a
_
0
1
3
5
2
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
310
V. Liakoni, A. Modirshanechi, W. Gerstner, and J. Brea
(cid:9)
(cid:10)
0, DKL[p1(θ )||p2(θ )]
where B ∈
solutions
. On the extremes of B, we will have trivial
(cid:3)
∗
q
(θ ) =
p2(θ )
p1(θ )
if B = 0
if B = DKL[p1(θ )||p2(θ )].
(4.10)
Note that the Kullback–Leibler divergence is a convex function with re-
spect to its first argument—q in our setting. Therefore, both the objective
function and the constraints of the optimization problem in equation 4.9
are convex. For convenience, we assume that the parameter space for θ is
discrete, but the final results can be generalized also to the continuous case
with some considerations (see Beal, 2003, and Faraji et al., 2018). For the
discrete setting, the optimization problem in equation 4.9 can be rewritten
as
∗
q
(θ ) = argmin
(cid:11)
(cid:7)
log(q(θ )) − log(p1(θ ))
q(θ )
q(θ ) s.t.
(cid:11)
(cid:7)
q(θ )
θ
θ
log(q(θ )) − log(p2(θ )) < B and
(cid:11)
θ
q(θ ) = 1.
(4.11)
For solving the mentioned problem, one should find a q that satisfies the
Karush-Kuhn-Tucker (KKT) conditions (Boyd & Vandenberghe, 2004) for
L =
(cid:11)
θ
(cid:5)
q(θ )log
(cid:6)
q(θ )
p1(θ )
+ λ
(cid:11)
θ
q(θ )log
(cid:5)
(cid:6)
q(θ )
p2(θ )
− λB + α − α
∂L
∂q(θ )
(cid:5)
(cid:6)
(cid:5)
(cid:6)
= log
q(θ )
p1(θ )
q(θ )
p2(θ )
= (1 + λ)log(q(θ )) − log(p1(θ )) − λlog(p2(θ )) + 1 + λ − α,
+ 1 + λlog
+ λ − α
(cid:11)
θ
q(θ ),
(4.12)
(4.13)
where λ and α are the parameters of the dual problem. Defining γ = λ
1+λ
and considering the partial derivative to be zero, we have
∗
log(q
(θ )) = (1 − γ )log(p1(θ )) + γ log(p2(θ )) + Const(α, γ ),
(4.14)
where α is always specified in a way to have Const(α, γ ) as the normaliza-
tion factor:
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
n
e
c
o
_
a
_
0
1
3
5
2
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Const (α, γ ) = −log(Z(γ ))
(cid:11)
θ
where Z(γ ) =
p1−γ
1
γ
(θ )p
2 (θ ).
(4.15)
Learning in Volatile Environments With the Bayes Factor Surprise
311
According to the KKT conditions, λ ≥ 0, and as a result, γ ∈ [0, 1]. There-
fore, considering p1(θ ) = ˆπ (t+1)
(θ ) and p2(θ ) = P(θ |yt+1), the solution to the
B
optimization problem of equations 2.14 and 2.15 is 2.13.
4.2.2 Proof of the Claim That B Is a Decreasing Function of Surprise. Accord-
ing to the KKT conditions,
(cid:7)
λ
DKL[q
∗
(θ )||p2(θ )] − B
(cid:8)
= 0.
For the case that λ (cid:12)= 0 (i.e., γ (cid:12)= 0), we have B as a function of γ :
B(γ ) = DKL[q
∗
(θ )||p2(θ )]
(cid:5)
(cid:14)
= (1 − γ )Eq∗
log
(cid:6)(cid:15)
− log(Z(γ )).
p1(θ )
p2(θ )
(4.16)
(4.17)
Now, we show that the derivate of B(γ ) with respect to γ is always nonpos-
itive. To do so, we first compute the derivative of Z(γ ) as
∂log(Z(γ ))
∂γ
= 1
Z(γ )
∂
∂γ
(cid:11)
θ
p1−γ
1
γ
(θ )p
2 (θ )
(cid:5)
= 1
Z(γ )
(cid:14)
(cid:11)
θ
(cid:5)
= Eq∗
log
p1−γ
1
γ
(θ )p
2 (θ )log
(cid:6)(cid:15)
,
p2(θ )
p1(θ )
(cid:6)
p2(θ )
p1(θ )
(4.18)
and the derivate of Eq∗ [h(θ )] for an arbitrary h(θ ) as
∂Eq∗ [h(θ )]
∂γ
∂
∂γ
(cid:11)
θ
(cid:11)
=
=
=
(cid:11)
θ
∗
q
(θ )h(θ )
∗
q
(θ )h(θ )
∂
∂γ log(q
∗
(θ ))
∗
q
(θ )h(θ )
∂
∂γ
(cid:7)
(cid:8)
(1 − γ )log(p1(θ )) + γ log(p2(θ )) − log(Z(γ ))
θ
= Eq∗
(cid:5)
(cid:14)
h(θ )log
(cid:6)(cid:15)
p2(θ )
p1(θ )
(cid:10)
(cid:9)
h(θ )
Eq∗
− Eq∗
(cid:14)
(cid:5)
log
(cid:6)(cid:15)
.
p2(θ )
p1(θ )
(4.19)
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
n
e
c
o
_
a
_
0
1
3
5
2
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
312
V. Liakoni, A. Modirshanechi, W. Gerstner, and J. Brea
Using the last previous equations, we have
(cid:5)
(cid:14)(cid:5)
= −(1 − γ )
Eq∗
log
∂B(γ )
∂γ
(cid:5)
(cid:14)
(cid:5)
= −(1 − γ )Varq∗
log
p2(θ )
p1(θ )
p2(θ )
p1(θ )
(cid:6)(cid:15)
(cid:6)(cid:6)2(cid:15)
(cid:14)
(cid:5)
− Eq∗
log
(cid:6)(cid:15)2(cid:6)
p2(θ )
p1(θ )
≤ 0,
(4.20)
which means that B is a decreasing function of γ . Because γ is an increasing
function of surprise, B is also a decreasing function of surprise.
4.3 Derivations of Message Passing N: Algorithm 2. For the sake of
clarity and coherence, we repeat here some steps performed in section 2.
Following the idea of Adams and MacKay (2007) we first define the ran-
dom variable Rt = min{n ∈ N : Ct−n+1
= 1}. This is the time window from
the last change point. Then the exact Bayesian form for π (t)(θ ) can be writ-
ten as
π (t)(θ ) = P((cid:3)t+1
= θ |y1:t )
=
t(cid:11)
rt =1
P(rt|y1:t )P((cid:3)t+1
= θ |rt, y1:t ).
(4.21)
To have a formulation similar to the one of Particle Filtering, we rewrite the
belief as
π (t)(θ ) =
t−1(cid:11)
k=0
w(k)
t P((cid:3)t+1
= θ |Rt = t − k, y1:t ) =
t−1(cid:11)
k=0
w(k)
t
π (t)
k (θ ),
(4.22)
where π (t)
t − k, and w(k)
k (θ ) = P((cid:3)t = θ |Rt = t − k, y1:t ) is the term corresponding to Rt =
= P(Rt = t − k|y1:t ) is its corresponding weight at time t.
t
To update the belief after observing yt+1, one can use the exact Bayesian
(θ )
recursive formula, equation 2.10, for which one needs to compute π (t+1)
as
B
π (t+1)
B
(θ ) =
π (t)(θ )PY (yt+1
; π (t))
P(yt+1
|θ )
= PY (yt+1
P(yt+1
|θ )
; π (t))
t−1(cid:11)
k=0
w(k)
t P((cid:3)t+1
= θ |Rt = t − k, y1:t ).
(4.23)
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
n
e
c
o
_
a
_
0
1
3
5
2
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Learning in Volatile Environments With the Bayes Factor Surprise
313
Using Bayes’ rule and the conditional independence of observations, we
have
π (t+1)
B
(θ ) = PY (yt+1
P(yt+1
|θ )
; π (t))
=
1
; π (t))
P(yt+1
t−1(cid:11)
k=0
t−1(cid:11)
k=0
P(yk+1:t
w(k)
t
|(cid:3)t+1
P(yk+1:t
= θ , Rt = t − k)π (0)(θ )
|Rt = t − k)
(cid:16)
t+1
i=k+1 PY (yi
P(yk+1:t
|θ )π (0)(θ )
|Rt = t − k)
w(k)
t
,
(4.24)
and once again, by using Bayes’ rule and the conditional independence of
observations, we find
π (t+1)
B
(θ ) =
=
1
; π (t))
P(yt+1
t−1(cid:11)
w(k)
t
P(yk+1:t+1
|Rt+1
= t − k + 1)
P(yk+1:t
|Rt = t − k)
× P((cid:3)t+1
= t − k + 1, yk+1:t+1)
1
; π (t))
P(yt+1
w(k)
t P(yt+1
|Rt+1
= t − k + 1, y1:t )
k=0
= θ |Rt+1
t−1(cid:11)
k=0
= θ |Rt+1
× P((cid:3)t+1
= t − k + 1, y1:t+1).
(4.25)
This gives us
π (t+1)
B
(θ ) =
t−1(cid:11)
w(k)
t
P(yt+1
P(yt+1
; π (t)
k )
; π (t))
×
k=0
× P((cid:3)t+1
= θ |Rt+1
= t − k + 1, y1:t+1),
(4.26)
and finally,
w(k)
B,t+1
= P(yt+1
P(yt+1
; π (t)
k )
; π (t))
w(k)
t
.
(4.27)
Using the recursive formula, the update rule for the weights for 0 ≤ k ≤
t − 1 is
w(k)
t+1
= (1 − γt+1)w(k)
B,t+1
= (1 − γt+1)
P(yt+1
P(yt+1
; π (t)
k )
; π (t))
w(k)
t
,
and for the newly added particle t,
w(t)
t+1
where γt+1
,
= γt+1
(cid:7)
= γ
SBF(yt+1
; π (t)), m = pc
1−pc
(cid:8)
of equation 2.9.
(4.28)
(4.29)
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
n
e
c
o
_
a
_
0
1
3
5
2
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
314
V. Liakoni, A. Modirshanechi, W. Gerstner, and J. Brea
The MPN algorithm uses equations 4.22, 4.28, and 4.29 for computing the
belief for t ≤ N, which is same as the exact Bayesian inference. For t > norte,
it first updates the weights in the same fashion as equations 4.28 y 4.29,
keeps the greatest N weights, and sets the rest weights equal to 0. After nor-
malizing the new weights, it uses equation 4.22 (but only over the particles
with nonzero weights) to compute the belief ˆπ (t). (For the particular case of
exponential family, see algorithm 2 for the pseudocode.)
4.4 Derivation of the Weight Update for Particle Filtering: Algoritmo
3. We derive here the weight update for the particle filter. The difference in
our formalism from a standard derivation (Särkkä, 2013) is the absence of
, y1:t ) (cid:12)=
the Markov property of conditional observations (es decir., PAG(yt+1
|ct+1)). Our goal is to perform the approximation
PAG(yt+1
|c1:t+1
PAG(c1:t+1
|y1:t+1) ≈
norte(cid:11)
yo=1
w(i)
t+1
δ(c1:t+1
− c(i)
1:t+1).
(4.30)
Given a proposal sampling distribution Q, for the weight of particle i at time
t + 1, tenemos
w(i)
t+1
∝
1:t+1
PAG(C(i)
q(C(i)
1:t+1
|y1:t+1)
|y1:t+1)
∝
PAG(C(i)
1:t+1
q(C(i)
, yt+1
|y1:t )
|y1:t+1)
1:t+1
,
w(i)
t+1
∝
PAG(yt+1
q(C(i)
t+1
, C(i)
t+1
|C(i)
1:t
|C(i)
, y1:t )PAG(C(i)
|y1:t )
1:t
1:t
, y1:t+1)q(C(i)
|y1:t )
1:t
,
(4.31)
where the only assumption for the proposal distribution Q is that the previ-
ous hidden states c(i)
1:t are independent of the next observation, yt+1, cual
allows keeping the previous samples c(i)
1:t to c(i)
1:t+1 and
writing the update of the weights in a recursive way (Särkkä, 2013).
1:t when going from c(i)
Notice that w(i)
t
step. Por lo tanto,
∝ P(C(i)
1:t
q(C(i)
1:t
|y1:t )
|y1:t )
are the weights calculated at the previous time
w(i)
t+1
∝
PAG(yt+1
q(C(i)
t+1
, C(i)
t+1
|C(i)
1:t
|C(i)
, y1:t )
1:t
, y1:t+1)
w(i)
t
.
(4.32)
For the choice of Q, we use the optimal proposal function in terms of vari-
ance of the weights (Doucet et al., 2000):
q(C(i)
t+1
|C(i)
1:t
, y1:t+1) =P(C(i)
t+1
|C(i)
1:t
, y1:t+1).
(4.33)
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
Learning in Volatile Environments With the Bayes Factor Surprise
315
Using Bayes’ rule and equations 4.32 y 4.33, after a few steps of algebra,
tenemos
w(i)
t+1
∝
PAG(yt+1
PAG(C(i)
t+1
, C(i)
t+1
|C(i)
1:t
(cid:12)
|C(i)
, y1:t )
1:t
, y1:t+1)
w(i)
t
=P(yt+1
|C(i)
1:t
, y1:t )w(i)
t
∝
(1 − pc)PAG(yt+1
|C(i)
1:t
, y1:t, C(i)
t+1
= 0) + pcP(yt+1
|C(i)
1:t
, y1:t, C(i)
t+1
(cid:13)
= 1)
w(i)
t
.
(4.34)
Using the definition in equation 2.6, we have P(yt+1
PAG(yt+1
= 1) =P(yt+1
) y P(yt+1
, y1:t, C(i)
t+1
; ˆπ (t)
i
|C(i)
1:t
|C(i)
1:t
, y1:t, C(i)
= 0) =
t+1
; Pi (0)). Por lo tanto, tenemos
(cid:17)
w(i)
t+1
=
(1 − pc)PAG(yt+1
; ˆπ (t)
i
) + pcP(yt+1
(cid:18)
; Pi (0))
w(i)
t
/z,
where Z is the normalization factor,
Z = (1 − pc)PAG(yt+1
; ˆπ (t)) + pcP(yt+1
; Pi (0)),
where we have
PAG(yt+1
; ˆπ (t)) =
norte(cid:11)
yo=1
w(i)
t P(yt+1
; ˆπ (t)
i
).
(4.35)
(4.36)
(4.37)
We now compute the weights corresponding to π (t+1)
2.8
B
as defined in equation
w(i)
B,t+1
=P(yt+1
PAG(yt+1
; ˆπ (t)
)
i
; ˆπ (t))
w(i)
t
.
(4.38)
Combining equations 4.35, 4.36, y 4.38, we can then rewrite the weight
update rule as
w(i)
t+1
= (1 − γt+1)w(i)
B,t+1
+ γt+1
w(i)
t
,
(cid:12)
(cid:13)
(4.39)
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
where γt+1
= γ
SBF(yt+1
; ˆπ (t)),
pc
1−pc
of equation 2.9.
At every time step t + 1 we sample each particle’s hidden state ct+1 from
the proposal distribution. Using equation 4.33, tenemos
316
V. Liakoni, A. Modirshanechi, W.. Gerstner, y j. Brea
q(C(i)
t+1
= 1|C(i)
1:t
, y1:t+1) =
(1 − pc)PAG(yt+1
(cid:5)
pcP(yt+1
; ˆπ (t)
i
; Pi (0))
) + pcP(yt+1
(cid:6)
; Pi (0))
= γ
SBF(yt+1
; ˆπ (t)
i
),
.
(4.40)
pc
1 − pc
We implemented the Sequential Importance Resampling algorithm
(Doucet et al., 2000; Gordon et al., 1993), where the particles are resampled
when their effective number falls below a threshold. The effective number
of the particles is defined as (Doucet et al., 2000; Särkkä, 2013)
Neff
≈
(cid:19)
1
yo=1(w(i)
norte
t )2
.
(4.41)
When Neff is below a critical threshold, the particles are resampled with
replacement from the categorical distribution defined by their weights, y
all their weights are set to w(i)
= 1/N. We did not optimize the parameter
t
Neff, and following Doucet and Johansen (2009), we performed resampling
when Neff
≤ N/2.
4.5 Surprise-Modulation as a Framework for Other Algorithms.
4.5.1 SMiLe Rule. The Confidence Corrected Surprise (Faraji et al., 2018)
es
SCC(yt+1
; ˆπ (t)) = DKL
(cid:9)
(cid:10)
ˆπ (t)(i )|| ˜P(i |yt+1)
,
where ˜P(i |yt+1) is the scaled likelihood defined as
˜P(i |yt+1) =
(cid:20)
PY (yt+1
PY (yt+1
|i )
|i (cid:5))dθ (cid:5)
.
(4.42)
(4.43)
Note that this becomes equal to P(i |yt+1) if the prior belief π (0) is a uni-
form distribution (see equation 2.11).
With the aim of minimizing the Confidence Corrected Surprise by up-
dating the belief during time, Faraji et al. (2018) suggested an update rule
solving the optimization problem,
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
ˆπ (t+1)(i ) = arg min
DKL
(cid:9)
q
s.t. DKL
(cid:9)
(cid:10)
q(i )|| ˜P(i |yt+1)
(cid:10)
q(i )|| ˆπ (t)(i )
≤ Bt+1
,
(4.44)
Learning in Volatile Environments With the Bayes Factor Surprise
317
(cid:10)
(cid:9)
0, DKL[PAG(i |yt+1)|| ˆπ (t)(i )]
where Bt+1
showed that the solution to this optimization problem is
∈
is an arbitary bound. Los autores
(cid:7)
(cid:8)
ˆπ (t+1)(i )
registro
= (1 − γt+1) registro
(cid:8)
(cid:7)
ˆπ (t)(i )
+ γt+1 log
(cid:7)
(cid:8)
˜P(i |yt+1)
+ Const.,
(4.45)
where γt+1
4.44.
∈ [0, 1] is specified so that it satisfies the constraint in equation
Although equation 4.45 looks very similar to equation 2.13, it signifies
a trade-off between the latest belief ˆπ (t) and the belief updated by only the
most recent observation ˜P(i |yt+1), eso es, a trade-off between adherence to
the current belief and reset. While SMiLe adheres to the current belief ˆπ (t),
Variational SMiLe integrates the new observation with the current belief to
get ˆπ (t)
B , which leads to a trade-off similar to the one of the exact Bayesian
inferencia (see equations 2.12 y 2.10).
To modulate the learning rate by surprise, Faraji et al. (2018) consideró
the boundary Bt+1 as a function of the Confidence Corrected Surprise,
Bt+1
= Bmax
γ
(cid:13)
(cid:12)
SCC(yt+1), metro
where Bmax
= DKL[PAG(i |yt+1)|| ˆπ (t)(i )],
(4.46)
where m is a free parameter. Entonces, γt+1 is found by satisfying the constraint
of the optimization problem in equation 4.44 using equations 4.45 y 4.46.
|μt+1
∼ N (μt+1
4.5.2 Nassar’s Algorithm. For the particular case that observations are
drawn from a gaussian distribution with known variance and unknown
, pag 2) and θt = μt, Nassar et al. (2012, 2010) estafa-
significar, yt+1
sidered the problem of estimating the expected μt and its variance rather
than a probability distribution (es decir., belief) over it, implicitly assuming that
the belief is always a gaussian distribution. The algorithms of Nassar et al.
(2012, 2010) were developed for the case that whenever the environment
cambios, the mean μt+1 is drawn from a uniform prior with a range of val-
ues much larger than the width of the gaussian likelihood function. El
authors showed that in this case, the expected μt+1 (es decir., ˆμt+1) estimated by
the agent upon observing a new sample yt+1 is
ˆμt+1
= ˆμt + αt+1(yt+1
− ˆμt ),
with αt+1 the adaptive learning rate given by
αt+1
= 1 + (cid:12)t+1 ˆrt
1 + ˆrt
,
(4.47)
(4.48)
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
318
V. Liakoni, A. Modirshanechi, W.. Gerstner, y j. Brea
Cifra 12: Gaussian estimation task: Transient performance after changes for
original algorithms of Nassar et al. (2010) and Nassar et al. (2012). Mean squared
error for the estimation of μ
t at each time step n after an environmental change,
= 0.1 (left panel)
= norte] con el tiempo; σ = 0.1, pc
eso es, the average of MSE[ ˆ(cid:3)
t
= 0.01 (right panel). The shaded area corresponds to the standard
and σ = 5, pc
error of the mean. Nas10∗, Nas12∗: variants of Nassar et al. (2010) and Nassar
et al. (2012), respectivamente, Nas10 Original, Nas12 Original: Original algorithms
of Nassar et al. (2010) and Nassar et al. (2012), respectivamente.
|Rt
where ˆrt is the estimated time since the last change point (es decir., the esti-
mated Rt = min{n ∈ N : Ct−n+1
= 1|y1:t+1) the prob-
= 1) y (cid:12)t+1
ability of a change given the observation. Note that this quantity (es decir. el
posterior change point probability) is the same as our adaptation rate γt+1
of equation 2.9.
=P(ct+1
We next extend their approach to a more general case where the prior is
a gaussian distribution with arbitrary variance, μt+1
0 ). Nosotros entonces
discuss the relation of this method to Particle Filtering. A performance com-
∗
parison between our extended algorithms Nas10
and their
original versions Nas10 and Nas12 is depicted in Figures 12 and 13.
∗
and Nas12
∼ N (μ
0
, σ 2
4.5.3 Nas10∗ and Nas12∗ Algorithms. Let us consider that y1:t are ob-
served, the time since the last change point rt is known, and the agent’s
current estimation of μt is ˆμt. It can be shown (see the appendix for the
derivation) that the expected μt+1 (es decir., ˆμt+1) upon observing the new sam-
ple yt+1 is
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
(cid:5)
= (1 − γt+1)
ˆμt +
ˆμt+1
1
ρ + rt + 1
(cid:6)
− ˆμt )
(yt+1
(cid:6)
(cid:5)
+ γt+1
μ
0
+ 1
ρ + 1
(yt+1
− μ
0)
,
(4.49)
Learning in Volatile Environments With the Bayes Factor Surprise
319
Cifra 13: Gaussian estimation task: Steady-state performance for original al-
gorithms of Nassar et al. (2010) and Nassar et al. (2012). Difference between the
mean squared error of each algorithm and the optimal solution (Exact Bayes),
eso es, the average (cid:8)MSE[ ˆ(cid:3)
t] over time for each combination of environmental
parameters σ and pc. Abbreviations: Nas10∗, Nas12∗: Variants of Nassar et al.
(2010) and Nassar et al. (2012) respectivamente; Nas10 Original, Nas12 Original:
Original algorithms of Nassar et al. (2010) and Nassar et al. (2012) respectivamente.
where ρ = σ 2
σ 2
0
tation rate of equation 2.9.
, μ
0 is the mean of the prior distribution, and γt+1 is the adap-
We can see that the updated mean is a weighted average, with surprise-
modulated weights, between integrating the new observation with the cur-
rent mean ˆμt and integrating it with the prior mean μ
0, in the same spirit as
the other algorithms we considered here. Ecuación 4.49 can also be seen as a
surprise-modulated weighted sum of two delta rules: one including a pre-
− ˆμt )
diction error between the new observation and the current mean (yt+1
and one including a prediction error between the observed sample and the
prior mean (yt+1
− μ
0).
In order to obtain a form similar to the one of Nassar et al. (2012, 2010),
we can rewrite the above formula as
ˆμt+1
=
ρ
ρ + 1
(cid:12)
ˆμt + γt+1(μ
0
(cid:13)
− ˆμt )
+ 1
ρ + 1
(cid:12)
ˆμt + αt+1(yt+1
(cid:13)
− ˆμt )
,
(4.50)
= ρ+γt+1rt +1
ρ+rt +1
where we have defined αt+1
. Por eso, the update rule takes the
form of a weighted average, with fixed weights, between two delta rules:
one including a prediction error between the prior mean and the current
significar (μ
− ˆμt ) and one including a prediction error between the observed
0
− ˆμt ), both with surprise-modulated
sample and the current mean (yt+1
learning rates.
In Nassar et al. (2012, 2010), the true new mean after a change point is
drawn from a uniform distribution with a range of values much larger than
the width of the gaussian likelihood. Their derivations implicitly approx-
imate the uniform distribution with a gaussian distribution with σ
(cid:14) σ .
0
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
320
V. Liakoni, A. Modirshanechi, W.. Gerstner, y j. Brea
Note that if σ
= 1+γt+1rt
appears, and αt+1
1+rt
gorithm in equations 4.47 y 4.48, with γt+1
(cid:14) pag , then ρ → 0, so that the first term of equation 4.50 dis-
. This results in the delta rule of the original al-
= (cid:12)t+1.
0
All of the calculations so far were done by assuming that rt is known.
Sin embargo, for the case of a nonstationary regime with a history of change
puntos, the time interval rt is not known. Nassar et al. (2012, 2010) used the
expected time interval ˆrt as an estimate. We make a distinction here between
Nassar et al. (2012) and Nassar et al. (2010):
In Nassar et al. (2010) ˆrt is calculated recursively on each trial in the same
= (1 − γt+1)(ˆrt + 1) + γt+1, eso es, at each time
spirit as equation 2.10: ˆrt+1
step, there is a probability (1 − γt+1) that ˆrt increments by 1 and a prob-
ability γt+1 that it is reset to 1. So ˆrt+1 is the weighted sum of these two
resultados. Por eso, equation 4.50 combined with the expected time interval
ˆrt constitutes a generalization of the update rule of Nassar et al. (2010) para
the case of gaussian prior N (μ
(see the
0
appendix for pseudocode).
0 ). We call this algorithm Nas10
, σ 2
∗
In Nassar et al. (2012), the variance ˆσ 2
t+1
=Var[μt+1
= σ 2
ˆσ 2
t+1
|y1:t+1] is estimated
− σ 2
is computed.
σ 2
0
given ˆμt, ˆrt, and ˆσ 2
t . Based on this variance, ˆrt+1
The derivation of the recursive computation of ˆσ 2
t+1 for the case of gaussian
priors can be found in the appendix. We call the combination of equation
∗
4.50 with this way of computing the expected time interval ˆrt Nas12
(ver
the appendix for pseudocode). These two versions of calculating ˆrt in Nas-
sar et al. (2010) and Nassar et al. (2012) give different results, and we com-
∗
pare our algorithms with both Nas10
in our simulations. Nota
eso, as discussed in section 2.1, the posterior belief at time t + 1 does not
generally belong to the same family of distributions as the belief of time t.
Sin embargo, we therefore approximate for both algorithms the posterior belief
PAG(i |y1:t+1) by a gaussian.
∗
and Nas12
4.5.4 Nassar’s Algorithm and Particle Filtering with One Particle. En el
case of Particle Filtering (see equation 2.23) with only one particle, at each
time step, we sample the particle’s hidden state with change probability
, y1:t+1) = γt+1, generating a posterior belief that takes two
q(C(1)
t+1
possible values with probability (according to the proposal distribution)
= 1|C(1)
1:t
(cid:12)
ˆπ (t+1)(i ) = ˆπ (t+1)
q
(cid:12)
ˆπ (t+1)(i ) =P(i |yt+1)| yt+1
(i )| yt+1
B
q
(cid:13)
(cid:13)
= 1 − γt+1
,
= γt+1
.
(4.51)
So, in expectation, the updated belief will be
mi
q[ ˆπ (t+1)(i )] = (1 − γt+1) ˆπ (t+1)
B
(i ) + γt+1P(i |yt+1).
(4.52)
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
Learning in Volatile Environments With the Bayes Factor Surprise
321
If we apply equation 4.52 to ˆμt+1, we find that E
generalization of Nassar et al. (2010) (see equation 4.49).
q[ ˆμ(t+1)], is identical to the
Además, in Particle Filtering with a single particle, we sample the parti-
cle’s hidden state, which is equivalent to sampling the interval ˆRt+1. Porque
ˆRt+1 takes the value ˆrt + 1 con (1 − γt+1) and the value of 1 (=reset) con
probability γt+1, the expected value of ˆRt+1 is
mi
q[ ˆRt+1] = (1 − γt+1)(ˆrt + 1) + γt+1
.
(4.53)
En otras palabras, in Nassar et al. (2010), the belief is updated based on the ex-
pected ˆrt, whereas in Particle Filtering with one particle, the belief is updated
using the sampled ˆrt.
En resumen, the two methods will give different estimates on a trial-per-
trial basis, but the same result in expectation. The pseudocode for Particle
Filtering with one particle for the particular case of the gaussian estimation
task is in the appendix.
4.6 Application to the Exponential Family. For our three algorithms—
Variational SMiLe, Message Passing with fixed number N of particles,
ˆπ (t+1)(i )
and Particle Filtering—we derive compact update rules for
when the likelihood function PY (y|i ) is in the exponential family and
Pi (0)(i ) is its conjugate prior. In that case, the likelihood function has the
forma
(cid:7)
PY (y|i ) = h(y)exp.
(cid:8)
θ T φ(y) − A(i )
,
(4.54)
where θ is the vector of natural parameters, h(y) is a positive function, Fi(y) es
the vector of sufficient statistics, y un(i ) is the normalization factor. Entonces
the conjugate prior π (0) has the form
(cid:7)
Pi (0)(i ) = Pπ
(cid:3) = θ ; χ (0), norte (0)
(cid:8)
= ˜h(i ) F
(cid:7)
χ (0), norte (0)
(cid:8)
(cid:7)
exp.
(cid:8)
θ T χ (0) − ν (0)A(i )
,
(4.55)
where χ (0) and ν (0) are the distribution parameters, ˜h(i ) is a positive
(cid:7)
función, yf
is the normalization factor. For this setting and
(cid:8)
(cid:7)
while π (t) = Pπ
, the Bayes Factor Surprise has the compact
forma
χ (0), norte (0)
(cid:3) = θ ; χ (t), norte (t)
(cid:8)
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
(cid:12)
SBF
yt+1
; Pπ
(cid:7)
(cid:3) = θ ; χ (t), norte (t)
(cid:8)(cid:13)
(cid:7)
(cid:7)
= f
F
χ (t) + Fi(yt+1), norte (t) + 1
χ (0) + Fi(yt+1), norte (0) + 1
(cid:7)
(cid:8)
(cid:8) F
F
χ (0), norte (0)
(cid:7)
χ (t), norte (t)
(cid:8)
(cid:8) .
(4.56)
322
V. Liakoni, A. Modirshanechi, W.. Gerstner, y j. Brea
The pseudocode for Variational SMiLe, MPN, and Particle Filtering can
be seen in algorithms 1, 2, y 3, respectivamente.
4.7 Simulation Task. En esta sección, we first argue why the mean
squared error is a proper measure for comparing different algorithms with
entre sí, and then we explain the version of Leaky integrator that we used
for simulations.
4.7.1 Mean Squared Error as an Optimality Measure. Consider the case
that at each time point t, the goal of an agent is to have an estimation
of the parameter (cid:3)t as a function of the observations Y1:t, eso es, ˆ(cid:3)t=
F (Y1:t ). The estimator that minimizes the mean squared error MSE[ ˆ(cid:3)t] =
mi
PAG(Y1:t ,(cid:3)t )
( ˆ(cid:3)t − (cid:3)t )2
es
(cid:10)
(cid:9)
ˆ(cid:3)Opt
t
= mi
PAG((cid:3)t |Y1:t )
(cid:10)
(cid:9)
(cid:3)t
= Eπ (t)
(cid:9)
(cid:3)t
(cid:10)
,
(4.57)
which is the expected value of (cid:3)t conditioned on the observations Y1:t, o, en
other words under the Bayes-optimal current belief (see Papoulis & Saun-
ders, 1989, for a proof). The MSE for any other estimator ˆ(cid:3)t can be written
como (see below for the proof)
MSE[ ˆ(cid:3)t] = MSE[ ˆ(cid:3)Opt
t
] + (cid:8)MSE[ ˆ(cid:3)t],
(cid:17)
(cid:18)
dónde (cid:8)MSE[ ˆ(cid:3)t] = mi
PAG(Y1:t )
( ˆ(cid:3)t − ˆ(cid:3)Opt
t
)2
≥ 0.
(4.58)
This means that the MSE for any arbitrary estimator ˆ(cid:3)t includes two terms:
the optimal MSE and the mismatch of the actual estimator from the opti-
mal estimator ˆ(cid:3)Opt
. Como resultado, if the estimator we are interested in is the
expected value of (cid:3)t under the approximate belief ˆπ (t) computed by each of
(cid:10)
our algorithms (es decir., ˆ(cid:3)(cid:5)
), the second term in equation 4.58, eso
t
es, the deviation from optimality, is a measure of how good the approxima-
tion is.
(cid:9)
(cid:3)t
= mi
ˆπ (t)
t
Proof for the Algorithms without Sampling. Consider ˆ(cid:3)Opt
= fOpt(Y1:t ).
Entonces, for any other arbitrary estimator ˆ(cid:3)t = f (Y1:t ) (except for the ones with
sampling), tenemos
t
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
MSE[ ˆ(cid:3)t] = mi
= mi
PAG(Y1:t ,(cid:3)t )
PAG(Y1:t ,(cid:3)t )
= mi
PAG(Y1:t ,(cid:3)t )
(cid:9)
(cid:10)
(cid:9)
( ˆ(cid:3)t − (cid:3)t )2
( F (Y1:t ) − (cid:3)t )2
(cid:17)(cid:7)
(cid:10)
(cid:8)
( F (Y1:t ) − fOpt(Y1:t )) + ( fOpt(Y1:t ) − (cid:3)t )
(cid:18)
.
2
(4.59)
Learning in Volatile Environments With the Bayes Factor Surprise
323
The quadratic term in the last line can be expanded and written as
MSE[ ˆ(cid:3)t] = mi
PAG(Y1:t ,(cid:3)t )
(cid:17)
(cid:18)
( F (Y1:t ) − fOpt(Y1:t ))2
(cid:18)
(cid:17)
+ mi
PAG(Y1:t ,(cid:3)t )
+ 2mi
PAG(Y1:t ,(cid:3)t )
( fOpt(Y1:t ) − (cid:3)t )2
(cid:17)
(cid:18)
.
( F (Y1:t ) − fOpt(Y1:t ))( fOpt(Y1:t ) − (cid:3)t )
(4.60)
The random variables in the expected value of the first line are not depen-
dent on (cid:3)t, so it can be computed over Y1:t. The expected value of the second
line is equal to MSE[ ˆ(cid:3)Opt
]. It can also be shown that the expected value of
the third line is equal to 0,
t
3rd line = 2E
PAG(Y1:t )
(cid:17)
(cid:17)
(cid:17)
(cid:18)(cid:18)
mi
PAG((cid:3)t |Y1:t )
( F (Y1:t ) − fOpt(Y1:t ))( fOpt(Y1:t ) − (cid:3)t )
= 2E
PAG(Y1:t )
( F (Y1:t ) − fOpt(Y1:t ))( fOpt(Y1:t ) − E
(cid:18)
PAG((cid:3)t |Y1:t )[(cid:3)t])
= 0,
(4.61)
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
where in the last line, we used the definition of the optimal estimator. To-
juntos, tenemos
MSE[ ˆ(cid:3)t] = MSE[ ˆ(cid:3)Opt
t
] + mi
PAG(Y1:t )
( ˆ(cid:3)t − ˆ(cid:3)Opt
t
)2
(cid:17)
(cid:18)
.
(4.62)
(cid:2)
Proof for the Algorithms with Sampling. For particle filtering (and any
kind of estimator with sampling), the estimator is not a deterministic func-
tion of observations Y1:t. Bastante, the estimator is a function of observations
as well as a set of random variables (muestras) drawn from a distribution
that is also a function of observations Y1:t. In our case, the samples are the
sequence of hidden states C1:t. The estimator can be written as
ˆ(cid:3)PF
t
= f (Y1:t, C(1:norte)
1:t
),
(4.63)
1:t
where C(1:norte)
are N independent and identically distributed samples drawn
from the proposal distribution Q(C1:t|Y1:t ). MSE for this estimator should
also be averaged over the samples, which leads to
MSE[ ˆ(cid:3)PF
t ] = mi
PAG(Y1:t ,(cid:3)t )q(C(1:norte)
1:t
|Y1:t )
= mi
PAG(Y1:t ,(cid:3)t )q(C(1:norte)
1:t
|Y1:t )
(cid:9)
(cid:9)
(cid:10)
( ˆ(cid:3)PF
t
− (cid:3)t )2
( F (Y1:t, C(1:norte)
1:t
) − (cid:3)t )2
(cid:10)
.
(4.64)
324
V. Liakoni, A. Modirshanechi, W.. Gerstner, y j. Brea
Similar to what we did before, the MSE for particle filtering can be written
como
MSE[ ˆ(cid:3)PF
t ] = MSE[ ˆ(cid:3)Opt
t
] + mi
= MSE[ ˆ(cid:3)Opt
t
] + mi
1:t
PAG(Y1:t )q(C(1:norte)
(cid:17)
mi
PAG(Y1:t )
(cid:17)
|Y1:t )
( ˆ(cid:3)PF
t
(cid:9)
(cid:18)
− ˆ(cid:3)Opt
t
)2
(cid:10)(cid:18)
,
− ˆ(cid:3)Opt
t
)2
(4.65)
( ˆ(cid:3)PF
t
q(C(1:norte)
1:t
|Y1:t )
which can be written in terms of bias and variance over samples as
MSE[ ˆ(cid:3)PF
t ] = MSE[ ˆ(cid:3)Opt
t
] + mi
PAG(Y1:t )
+ BiasQ(C(1:norte)
1:t
|Y1:t )( ˆ(cid:3)PF
t
(cid:17)
VarQ(C(1:norte)
1:t
(cid:18)
.
, ˆ(cid:3)Opt
t
)2
|Y1:t )( ˆ(cid:3)PF
t
)
(4.66)
(cid:2)
4.7.2 Leaky Integration. For the gaussian task, the goal is to have an es-
timation of the mean of the gaussian distribution at each time t, denotado
by ˆθt. Given a leak parameter ω ∈ (0, 1], the leaky integrator estimation
es
ˆθt =
(cid:19)
t
k=1
(cid:19)
t
k=1
ωt−kyk
ωt−k
.
(4.67)
For the categorical task, the goal is to have an estimation of the param-
eters of the categorical distribution at each time t, denoted by ˆθt = [ ˆθ
i,t]norte
yo=1
for the case that there are N categories. Given a leak parameter ω ∈ (0, 1],
the leaky integrator estimation is
ˆθ
i,t
=
(cid:19)
t
k=1
ωt−kδ(yk
(cid:19)
ωt−k
t
k=1
− i)
,
where δ is the Kronecker delta function.
(4.68)
4.8 Derivation of the Formula Relating Shannon Surprise to the Mod-
ulated Learning Rate. Given the defined generative model, the Shannon
surprise upon observing yt+1 can be written as
SSh(yt+1
; Pi (t)) = log
= log
(cid:6)
1
PAG(yt+1
|y1:t )
(cid:5)
(cid:5)
1
(cid:6)
(1 − pc)PAG(yt+1
; Pi (t)) + pcP(yt+1
; Pi (0))
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
Learning in Volatile Environments With the Bayes Factor Surprise
325
(cid:5)
= log
(cid:6)
1
; Pi (0))
PAG(yt+1
+ registro
= SSh(yt+1
; Pi (0)) + registro
(cid:5)
γt+1
pc
(cid:12)
(cid:13)
(cid:6)
1
1 + 1
metro
1
SBF (yt+1
;Pi (t) )
(cid:5)
1
pc
(cid:6)
,
(4.69)
= γ
; Pi (t)), m = pc
where γt+1
of equation 2.9. Como resultado, el
1−pc
modulated adaptation rate can be written as in equation 2.25 and the Bayes
Factor Surprise as in equation 3.2.
SBF(yt+1
4.9 Experimental Predictions.
4.9.1 Setting. Consider a gaussian task where Yt can take values in R.
The likelihood function PY (y|i ) is defined as
PY (y|i ) = N (y; i , pag 2),
(4.70)
where σ ∈ R+
is the standard deviation and θ ∈ R is the mean of the distri-
bution, eso es, the parameter of the likelihood. Whenever there is a change
in the environment (with probability pc ∈ (0, 1)), the value θ is drawn from
the prior distribution π (0)(i ) = N (i ; 0, 1).
4.9.2 Theoretical Proofs for Prediction 1. For our theoretical derivations for
our first prediction, we consider the specific but relatively mild assumption
that the subjects’ belief π (t) at each time is a gaussian distribution,
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
Pi (t)(i ) = N (i ; ˆθt, ˆσ 2
t ),
(4.71)
where ˆθt and ˆσt are determined by the learning algorithm and the sequence
of observations y1:t. This is the case when the subjects use either VarSMiLe,
Nas10∗, Nas12∗, pf1, MP1, or Leaky Integration as their learning rule. Con
such assumptions, the inferred probability distribution P(y; Pi (t)) can be
written as
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
PAG(y; Pi (t)) = N (y; ˆθt, pag 2 + ˆσ 2
t ).
(4.72)
− ˆθt and the “sign bias” as st+1
As mentioned in section 2, we define, at time t, the prediction error as
δt+1
= sign(δt+1 ˆθt ). Entonces, given an
= yt+1
absolute prediction ˆθ > 0, an absolute prediction error δ > 0, a standard de-
viation σ
C, and a sign bias s ∈ {−1, 1}, the average Bayes Factor Surprise is
computed as
326
V. Liakoni, A. Modirshanechi, W.. Gerstner, y j. Brea
(cid:11)
SBF(yt; ˆπ (t−1)),
¯SBF( ˆθ , δ, s, pag
C) = 1
|t |
t∈T
where T = {t : | ˆθt−1
| = ˆθ , |δt| = re, ˆσt = σ
C
, st = s}.
(4.73)
It can easily be shown that the value SBF(yt; ˆπ (t−1)) is the same for all t ∈ T ,
and hence the average surprise is the same as the surprise for each time
punto. Por ejemplo, the average surprise for s = +1 is equal to
¯SBF( ˆθ , δ, s = +1, pag
C) =
norte ( ˆθ + δ; 0, pag 2 + 1)
norte (δ; 0, pag 2 + pag 2
C )
.
(4.74)
Similar formulas can be computed for ¯SBF( ˆθ , δ, s = −1, pag
ference (cid:8) ¯SBF( ˆθ , δ, pag
computed as
C) = ¯SBF( ˆθ , δ, s = +1, pag
C) − ¯SBF( ˆθ , δ, s = −1, pag
C). Then the dif-
C) can be
(cid:8) ¯SBF( ˆθ, δ, pag
C) =
norte ( ˆθ + δ; 0, pag 2 + 1) − N ( ˆθ − δ; 0, pag 2 + 1)
norte (δ; 0, pag 2 + pag 2
C )
.
(4.75)
It can be shown that
(cid:8) ¯SBF( ˆθ, δ, pag
C) < 0 and
∂
∂δ
(cid:8) ¯SBF( ˆθ , δ, σ
C) < 0
(4.76)
for all ˆθ > 0, δ > 0, and σ
C
for the second inequality is given below.
> 0. The first inequality is trivial, and the proof
The average Shannon Surprise can be computed in a similar way. Para
ejemplo, for s = +1, tenemos
¯SSh( ˆθ , δ, s = +1, pag
C)
= − log
(cid:12)
(cid:13)
,
pcN ( ˆθ + δ; 0, pag 2 + 1) + (1 − pc)norte (δ; 0, pag 2 + pag 2
C )
(4.77)
and then the difference (cid:8) ¯SSh( ˆθ , δ, pag
−1, pag
C) can be computed as
C) = ¯SSh( ˆθ , δ, s = +1, pag
C) − ¯SSh( ˆθ , δ, s =
(cid:8) ¯SSh( ˆθ , δ, pag
C) = log
(cid:5)
1 + m ¯SBF( ˆθ , δ, s = −1, pag
1 + m ¯SBF( ˆθ , δ, s = +1, pag
C)
C)
(cid:6)
,
(4.78)
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
Learning in Volatile Environments With the Bayes Factor Surprise
327
where m = pc
1−pc
tener
. Entonces, using the results for the Bayes Factor Surprise, nosotros
(cid:8) ¯SSh( ˆθ, δ, pag
C) > 0 y
∂
∂δ
(cid:8) ¯SSh( ˆθ , δ, pag
C) > 0,
for all ˆθ > 0, δ > 0, and σ
C
> 0.
Proof of the Second Inequality for SBF. Let us define the variables
pag 2
d
pag 2
norte
,
= σ 2 + pag 2
C
= σ 2 + 1,
as well as the functions
f1(δ) = ¯SBF( ˆθ , δ, s = +1, pag
C) =
f2(δ) = ¯SBF( ˆθ , δ, s = −1, pag
C) =
(cid:5)
exp.
(cid:5)
exp.
pag
d
σn
pag
d
σn
δ2
2pag 2
d
δ2
2pag 2
d
− (δ + ˆθ )2
2pag 2
norte
− (δ − ˆθ )2
2pag 2
norte
(cid:6)
(cid:6)
,
,
F (δ) = (cid:8) ¯SBF( ˆθ , δ, pag
C) = f1(δ) − f2(δ).
The following inequalities hold true:
F (δ) < 0 ⇒ f1(δ) < f2(δ), < σ 2 0 = 1 ⇒ σ 2 d < σ 2 n . σ 2 C Then, the derivative of f (δ) can be computed as (cid:5) d dδ f (δ) = f1(δ) (cid:7) ˆθ σ 2 n = − δ σ 2 d − δ + ˆθ σ 2 n (cid:8) f1(δ) + f2(δ) (cid:6) δ − ˆθ σ 2 n − (cid:5) − f2(δ) (cid:5) δ σ 2 d 1 − f1(δ) − f2(δ) f1(δ) + f2(δ) δ ˆθ (cid:6) (cid:5) (cid:6)(cid:6) σ 2 n σ 2 d − 1 < 0. Therefore, we have ∂ ∂δ (cid:8) ¯SBF( ˆθ , δ, σ C) = d dδ f (δ) < 0. (4.79) (4.80) (4.81) (4.82) (4.83) (cid:2) l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . / e d u n e c o a r t i c e - p d / l f / / / / 3 3 2 2 6 9 1 8 9 6 8 0 8 n e c o _ a _ 0 1 3 5 2 p d . / f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Proof of the Second Inequality for SSh. Using the functions we defined for the previous proof and after computing the partial derivative of equation 4.78, we have 328 V. Liakoni, A. Modirshanechi, W. Gerstner, and J. Brea ∂ ∂δ (cid:8) ¯SSh( ˆθ , δ, σ C) = ∂ ∂δ log = d dδ log (cid:6) (cid:5) (cid:5) 1 + m ¯SBF( ˆθ , δ, s = −1, σ 1 + m ¯SBF( ˆθ , δ, s = +1, σ (cid:6) 1 + m f2(δ) . 1 + m f1(δ) C) C) (4.84) The derivative of the last term can be written in terms of the derivates of f1 and f2, indicated by f (cid:5) 2, respectively, 1 and f (cid:5) ∂ ∂δ (cid:8) ¯SSh( ˆθ , δ, σ C) = m f (cid:5) 2(δ) 1 + m f2(δ) − m f (cid:5) 1(δ) 1 + m f1(δ) (δ) 1 + m f2(δ) −m f (cid:5) (cid:8)(cid:7) = (cid:7) 1 + m f1(δ) (cid:8) + (cid:7) m2( f1 f (cid:5) 2 1 + m f1(δ) − f (cid:5) (cid:8)(cid:7) 1 f2)(δ) 1 + m f2(δ) (cid:8) . (4.85) The first term is always positive based on the proof for SBF. The second term is also always positive because ( f1 f (cid:5) 2 − f (cid:5) 1 f2)(δ) = f1(δ) f2(δ) (cid:5)(cid:5) δ σ 2 d − δ − ˆθ σ 2 n (cid:6) (cid:5) − δ σ 2 d − δ + ˆθ σ 2 n (cid:6)(cid:6) = f1(δ) f2(δ) 2 ˆθ σ 2 n > 0.
Como resultado, tenemos
∂
∂δ
(cid:8) ¯SSh( ˆθ , δ, pag
C) > 0.
(4.86)
(cid:2)
4.9.3 Simulation Procedure for Prediction 1. In order to relax the main as-
sumption of our theoretical proofs (the belief is always a gaussian distribu-
ción), to include the practical difficulties of a real experiment (p.ej., to use
|δt| ≈ δ instead of |δt| = re), and to have an estimation of the effect size, nosotros
also performed simulations for our first experimental prediction.
For each simulated subject, the procedure of our simulation was as
follows:
1. We fixed the hyperparameters σ 2 and pc for producing samples.
2. We selected a learning algorithm (p.ej., pf20) and fixed its correspond-
ing tuned parameters (based on our simulations in section 2).
3. We applied the learning algorithm over a sequence of observations
y1:t . Note that in a real experiment, this step can be done through
a few episodes, which makes it possible to have a long sequence of
observaciones (es decir., large T).
4. At each time t, we saved the values yt, ˆθt, ˆσt, δt, st, SSh(yt; ˆπ (t−1)), y
SBF(yt; ˆπ (t−1)).
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
/
mi
d
tu
norte
mi
C
oh
a
r
t
i
C
mi
–
pag
d
/
yo
F
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
norte
mi
C
oh
_
a
_
0
1
3
5
2
pag
d
.
/
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
Learning in Volatile Environments With the Bayes Factor Surprise
329
Entonces, given an absolute prediction ˆθ > 0, an absolute prediction error
> 0, and a sign bias s ∈ {−1, 1}, we defined
δ > 0, a standard deviation σ
C
the set of time points,
T = {1 < t ≤ T : || ˆθt−1 | − ˆθ | < (cid:8)θ , ||δt| − δ| < (cid:8)δ, | ˆσt − σ C | < (cid:8)σ C , st = s}, (4.87) | ≈ ˆθ , |δt| ≈ δ, and ˆσt ≈ σ | − ˆθ| < (cid:8)θ , ||δt| − δ| < (cid:8)δ, and | ˆσt − σ C where || ˆθt−1 C are equivalent to | ˆθt−1 C are positive real values that should be determined based on practical limitations (mainly the length of the observation sequence T). We then computed the average surprise values as C, respectively. (cid:8)θ , (cid:8)δ, and (cid:8)σ | < (cid:8)σ ¯SBF( ˆθ , δ, s, σ C) = 1 |T | ¯SSh( ˆθ , δ, s, σ C) = 1 |T | (cid:11) t∈T (cid:11) t∈T SBF(yt; ˆπ (t−1)), SSh(yt; ˆπ (t−1)). (4.88) We repeated this procedure for N different simulated subjects (with differ- ent random seeds). The average of ¯SBF and ¯SSh over N = 20 subjects, for ∗ = 0.5, two learning algorithms (Nas12 (cid:8)θ = 0.25, (cid:8)δ = 0.1, and (cid:8)σ = 1 is shown in Figure 10B. The results are C the same as what our theoretical analysis predicted. and pf20), and for T = 500, ˆθ = 1, σ C 4.9.4 Simulation Procedure for Prediction 2. For our second prediction, the theoretical proof is trivial. However, in order to have a setting similar to ; ˆπ (t)) = p), a real experiment (e.g., to use P(yt+1 and to have an estimation of the effect size, we used simulations also for our second experimental predictions. ; ˆπ (t)) ≈ p instead of P(yt+1 We followed the same procedure as the one for the simulation of the first prediction. For each simulated subject and at each time t, we saved the ; ˆπ (0)), SSh(yt; ˆπ (t−1)), and SBF(yt; ˆπ (t−1)). Then, quantities P(yt+1 for a given a probability value p > 0, we defined the set of time points,
; ˆπ (t)), PAG(yt+1
T = {0 ≤ t ≤ T : |PAG(yt+1
; ˆπ (t)) − p| < (cid:8)p, |P(yt+1
; ˆπ (0)) − p| < (cid:8)p},
(4.89)
; ˆπ (t)) ≈ p and P(yt+1
; ˆπ (t)) − p| < (cid:8)p and |P(yt+1
where |P(yt+1
; ˆπ (0)) − p| < (cid:8)p are equivalent
; ˆπ (0)) ≈ p, respectively. (cid:8)p is a positive real
to P(yt+1
value that should be determined based on practical limitations (mainly
the length of the observation sequence T). We then computed the aver-
age surprise ¯SBF(p) and ¯SSh(p) over T for each value of p. We repeated
this procedure for N different simulated subjects (with different random
seeds). The average of ¯SBF and ¯SSh over N = 20 subjects, for two learning
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
n
e
c
o
_
a
_
0
1
3
5
2
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
330
V. Liakoni, A. Modirshanechi, W. Gerstner, and J. Brea
and pf20), and for T = 500 and (cid:8)p = 0.0125 is shown in
∗
algorithms (Nas12
Figure 11B.
Appendix
A.1 Modified Algorithm of Nassar et al. (2012, 2010): Adaptation for
Gaussian Prior.
A.1.1 Recursive Update of the Estimated Mean for Gaussian Prior. Let us
first consider the case of a stationary regime (no change points) where ob-
served samples are drawn from a gaussian distribution with known vari-
|θ ∼ N (θ , σ 2), and the parameter θ is also drawn from a gaussian
ance, yt+1
distribution θ ∼ N (μ
, ..., yt+1, it
0
can be shown that, using Bayes’ rule, the posterior distribution P(θ |y1:t+1) =
π (t+1)
B
0 ). After having observed samples y1
(θ ) is
, σ 2
(cid:21)
(cid:21)
P(θ |y1:t+1) = N
θ ; μB,t+1
=
1
+ t+1
σ 2
1
σ 2
0
+
μ
0
σ 2
0
(cid:19)
(cid:22)
t+1
i=1 yi
σ 2
(cid:22)
, σ 2
B,t+1
=
1
+ t+1
σ 2
.
1
σ 2
0
(A.1)
An estimate of θ is its expected value E(θ |y1:t+1) = μB,t+1.
In a nonstationary regime where, after having observed y1
, . . . , yt from
the same hidden state, there is the possibility for a change point upon ob-
serving yt+1, the posterior distribution is
P(θ |y1:t+1) = (1 − γt+1)P(θ |y1:t+1
, ct+1
= 0) + γt+1P(θ |yt+1
, ct+1
= 1).
(A.2)
To facilitate notation, we denote ct+1
so that
= 0 as “stay” and ct+1
= 1 as “change”
P(θ |y1:t+1) = (1 − γt+1)P(θ |y1:t+1
, stay) + γt+1P(θ |yt+1
, change).
(A.3)
Note that the above is equivalent to the Bayesian recursive formula, equa-
tion 2.10 of the main text, where γt+1 is the adaptation rate we saw in equa-
tion 2.9 of the main text, and is essentially the probability to change given
= 1|y1:t+1). In Nassar et al. (2010) this quantity
the new observation, P(ct+1
is denoted as (cid:12)t+1. Taking equation A.1 into account, we have
E(θ |y1:t+1
, stay) = μB,t+1
=
1
σ 2
0
E(θ |y1:t+1
, change) =
1
+ 1
σ 2
1
σ 2
0
(cid:21)
μ
0
σ 2
0
(cid:19)
t+1
i=t+1−rt yi
σ 2
(cid:22)
,
+
(cid:6)
,
(A.4)
1
+ rt +1
σ 2
(cid:5)
μ
0
σ 2
0
+ yt+1
σ 2
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
n
e
c
o
_
a
_
0
1
3
5
2
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Learning in Volatile Environments With the Bayes Factor Surprise
331
where rt is the time interval of observations coming from the same hidden
state, calculated at time t. Taking the expectation of equation A.3 the esti-
mated mean upon observing the new sample yt+1 is
ˆμt+1
= (1 − γ )
(cid:21)
+
μ
0
σ 2
0
(cid:22)
(cid:19)
t+1
i=t+1−rt yi
σ 2
+ γ
1
+ 1
σ 2
1
σ 2
0
(cid:5)
μ
0
σ 2
0
(cid:6)
,
+ yt+1
σ 2
1
+ rt +1
σ 2
1
σ 2
0
(A.5)
where we dropped the subscript t + 1 in γ to simplify notations. We have
ˆμt+1
= (1 − γ )
1
σ 2
0
+ γ
1
σ 2
0
1
+ 1
σ 2
(cid:21)
+
μ
0
σ 2
0
(cid:19)
t
i=t+1−rt yi
σ 2
+ yt+1
σ 2
(cid:22)
(cid:6)
.
+ yt+1
σ 2
(A.6)
1
+ rt +1
σ 2
(cid:5)
μ
0
σ 2
0
Because ˆμt = 1
+ rt
σ 2
1
σ 2
0
(cid:12)
+
μ
0
σ 2
0
(cid:19)
t
i=t+1−rt
σ 2
(cid:13)
yi
, after a few lines of algebra, we have
ˆμt+1
= (1 − γ ) ˆμt + γ μ
0
+ (1 − γ )
1
+ rt + 1
σ 2
σ 2
0
(yt+1
− ˆμt )
+ γ
1
+ 1
σ 2
σ 2
0
(yt+1
− μ
0).
We now define ρ = σ 2
σ 2
0
and find
ˆμt+1
= (1 − γ ) ˆμt + γ μ
0
+ (1 − γ )
1
ρ + rt + 1
(yt+1
− ˆμt )
+ γ
1
ρ + 1
(yt+1
− μ
0).
(A.7)
(A.8)
A rearrangement of the terms and inclusion of the dependency of γ on
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
n
e
c
o
_
a
_
0
1
3
5
2
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
time yields
(cid:5)
= (1 − γt+1)
ˆμt +
ˆμt+1
(cid:5)
+ γt+1
μ
0
+ 1
ρ + 1
1
ρ + rt + 1
(cid:6)
− ˆμt )
(yt+1
(cid:6)
(yt+1
− μ
0)
.
(A.9)
332
V. Liakoni, A. Modirshanechi, W. Gerstner, and J. Brea
In order to obtain a form similar to the one of Nassar et al. (2012, 2010)
we continue and spell out the terms that include the quantities ˆμt, μ
0, and
yt+1:
ˆμt+1
= (1 − γ ) ˆμt − (1 − γ )
1
ρ + rt + 1
ˆμt
+ γ μ
0
− γ
+ (1 − γ )
μ
0
1
ρ + 1
1
ρ + rt + 1
yt+1
+ γ
1
ρ + 1
.
yt+1
(A.10)
Using that
1
ρ+rt +1
= 1
ρ+1
−
rt
(ρ+1)(ρ+rt +1) , we have
ˆμt+1
= (1 − γ ) ˆμt − (1 − γ )
1
ρ + 1
ˆμt + (1 − γ )
rt
(ρ + 1)(ρ + rt + 1)
ˆμt
+ γ μ
0
− γ
μ
0
1
ρ + 1
1
ρ + 1
+ (1 − γ )
yt+1
− (1 − γ )
rt
(ρ + 1)(ρ + rt + 1)
yt+1
+ γ
1
ρ + 1
.
yt+1
(A.11)
After a further step of algebra, we arrive at
ˆμt+1
=
(cid:12)
ρ
ρ + 1
(cid:13)
(1 − γ ) ˆμt + γ μ
0
(cid:5)
+ 1
ρ + 1
(1 − γ )
rt
ρ + rt + 1
(cid:6)
( ˆμt − yt+1) + yt+1
.
(A.12)
If we define 1 − α = (1 − γ )
rt
ρ+rt +1
and rearrange the terms, we have
⇒ α = 1 − (1 − γ )
rt
ρ+rt +1
⇒ α = ρ+γ rt +1
ρ+rt +1
ˆμt+1
=
ˆμt+1
=
(cid:12)
(cid:12)
ρ
ρ + 1
ρ
ρ + 1
(cid:13)
(1 − γ ) ˆμt + γ μ
0
ˆμt + γ (μ
0
(cid:13)
− ˆμt )
+ 1
(cid:12)
(cid:13)
ρ + 1
(cid:12)
ˆμt + α(yt+1
(1 − α) ˆμt + αyt+1
(cid:13)
.
− ˆμt )
+ 1
ρ + 1
(A.13)
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
n
e
c
o
_
a
_
0
1
3
5
2
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Learning in Volatile Environments With the Bayes Factor Surprise
333
Adding back the dependency of γ and α on time, we finally have
ˆμt+1
=
ρ
ρ + 1
(cid:12)
ˆμt + γt+1(μ
0
(cid:13)
− ˆμt )
+ 1
ρ + 1
(cid:12)
ˆμt + αt+1(yt+1
(cid:13)
.
− ˆμt )
(A.14)
A.1.2 Recursive Update of the Estimated Variance for Gaussian Prior. Nassar
= Var(θ |y1:t+1) and then, based
et al. (2012) first calculate the variance ˆσ 2
on this, compute ˆrt+1. We derive here these calculations for the case of gaus-
sian prior. We note again that
t+1
P(θ |y1:t+1) = (1 − γt+1)P(θ |y1:t+1
, stay) + γt+1P(θ |yt+1
, change).
(A.15)
Then for the variance ˆσ 2
t+1
= Var(θ |y1:t+1), we have
ˆσ 2
t+1
= (1 − γ )σ 2
stay
+ γ σ 2
change
+ (1 − γ )γ (μstay − μ
change)2
= (1 − γ )σ 2
B,t+1
+ γ σ 2
change
+ (1 − γ )γ (μB,t+1
− μ
change)2,
(A.16)
where σ 2
=
B,t+1
and σ 2
1
+ rt
We have already defined ρ = σ 2
σ 2
0
change
+1
σ 2
1
σ 2
0
= 1
1
σ 2
0
+ 1
σ 2
.
so that
A = (1 − γ )σ 2
B,t+1
+ γ σ 2
change
= (1 − γ )
σ 2
ρ + rt + 1
+ γ
σ 2
ρ + 1
,
(A.17)
Using, as before, that
1
ρ+rt +1
= 1
ρ+1
−
rt
(ρ+1)(ρ+rt +1) , we have
(cid:5)
A =
σ 2
ρ + 1
1 − (1 − γ )
(cid:6)
.
1
ρ + rt + 1
(A.18)
We already defined the learning rate α = 1 − (1 − γ )
rt
ρ+rt +1 , so we can write
A =
σ 2
ρ + 1
α.
Note that μt = 1
+ rt
σ 2
1
σ 2
0
we have
(cid:12)
+
μ
0
σ 2
0
(cid:19)
t
i=t+1−rt
σ 2
(cid:13)
yi
so for the calculation of the last term,
(A.19)
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
n
e
c
o
_
a
_
0
1
3
5
2
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
B = μB,t+1
− μ
change
(cid:5)
1
+ rt +1
σ 2
=
1
σ 2
0
(cid:19)
t+1
i=t+1−rt yi
σ 2
(cid:6)
−
+
(cid:5)
μ
0
σ 2
0
(cid:6)
.
+ yt+1
σ 2
1
+ 1
σ 2
1
σ 2
0
(A.20)
μ
0
σ 2
0
334
V. Liakoni, A. Modirshanechi, W. Gerstner, and J. Brea
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
n
e
c
o
_
a
_
0
1
3
5
2
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
We now rearrange the terms
(cid:5)
B = μt +
1
ρ + 1
−
rt
(ρ + 1)(ρ + rt + 1)
(cid:6)
(yt+1
− μt ) − μ
0
− 1
ρ + 1
(yt+1
− μ
0),
and finally we have
ˆσ 2
t+1
=
σ 2
ρ + 1
α + (1 − γ )γ B2.
(A.21)
(A.22)
A.2 Implementation of Nas10∗, Nas12∗, and Particle Filtering with
One Particle for the Gaussian Estimation Task. We provide here the pseu-
∗
docode for the algorithms Nas10
, and Particle Filtering with one
, Nas12
particle for the gaussian estimation task (see algorithms A1, A2, and A3).
Observations are drawn from a gaussian distribution with known vari-
, σ 2) and θt = μt. When there
ance and unknown mean: yt+1
is a change, the parameter μ is also drawn from a gaussian distribution,
μ ∼ N (μ
0 ). All three algorithms estimate the expected μt+1 (i.e., ˆμt+1)
0
upon observing a new sample yt+1.
∼ N (μt+1
|μt+1
, σ 2
∗
After rewriting equation 4.56, we have
(cid:12)
(cid:13)
SBF
yt+1
; ˆμt, ˆσt
=
(cid:14)
exp
−
σ 2 + ˆσ 2
t
σ 2 + σ 2
0
μ2
0
2σ 2
0
− ˆμ2
t
2 ˆσ 2
t
+
ˆμt
ˆσ 2
t
2(
+ yt+1
σ 2
+ 1)
σ 2
ˆσ 2
t
+
(cid:15)
.
μ
0
σ 2
0
2(
+ yt+1
σ 2
+ 1)
σ 2
σ 2
0
(A.23)
Learning in Volatile Environments With the Bayes Factor Surprise
335
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
n
e
c
o
_
a
_
0
1
3
5
2
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Note that the pseudocode for pf1 provided here is a translation of al-
gorithm 3 to the case of a single particle (where there are no weights to
calculate) and for the gaussian distribution as a particular instance of the
exponential family.
336
V. Liakoni, A. Modirshanechi, W. Gerstner, and J. Brea
Acknowledgments
We thank both reviewers for their constructive and helpful comments. This
research was supported by the Swiss National Science Foundation (200020
184615) and the European Union Horizon 2020 Framework Program under
grant agreement 785907 (Human Brain Project, SGA2).
References
Adams, R. P., & MacKay, D. J. (2007). Bayesian online changepoint detection.
arXiv:0710.3742.
Aminikhanghahi, S., & Cook, D. J. (2017). A survey of methods for time series change
point detection. Knowledge and Information Systems, 51(2), 339–367.
Barber, D. (2006). Expectation correction for smoothed inference in switching linear
dynamical systems. Journal of Machine Learning Research, 7, 2515–2540.
Barber, D. (2012). Bayesian reasoning and machine learning. Cambridge: Cambridge
University Press.
Beal, M. J. (2003). Variational algorithms for approximate Bayesian inference. Ph.D. diss.,
University College London.
Behrens, T. E., Woolrich, M. W., Walton, M. E., & Rushworth, M. F. (2007). Learning
the value of information in an uncertain world. Nature Neuroscience, 10(9), 1214.
Bogacz, R. (2017). A tutorial on the free-energy framework for modelling perception
and learning. Journal of Mathematical Psychology, 76, 198–211.
Bogacz, R. (2019). Dopamine role in learning and action inference. bioRxiv:837641.
Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge
University Press.
Brown, S. D., & Steyvers, M. (2009). Detecting and predicting changes. Cognitive Psy-
chology, 5(1), 49–67.
Cummings, R., Krehbiel, S., Mei, Y., Tuo, R., & Zhang, W. (2018). Differentially pri-
vate change-point detection. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman,
N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in neural information processing sys-
tems, 31 (pp. 10825–10834). Red Hook, NY: Curran.
d’Acremont, M., & Bossaerts, P. (2016). Neural mechanisms behind identification of
leptokurtic noise and adaptive behavioral response. Cerebral Cortex, 26(4), 1818–
1830.
Daw, N., & Courville, A. (2008). The pigeon as particle filter. In J. C. Platt, D. Koller,
Y. Singer, & S. T. Roweis (Eds.), Advances in neural information processing systems,
20 (pp. 369–376). Red Hook, NY: Curran.
Doucet, A., Godsill, S., & Andrieu, C. (2000). On sequential Monte Carlo sampling
methods for Bayesian filtering. Statistics and Computing, 10(3), 197–208.
Doucet, A., & Johansen, A. M. (2009). A tutorial on particle filtering and smooth-
ing: Fifteen years later. In D. Crisan & B. Rozovskii (Eds.), The Oxford handbook of
nonlinear filtering (pp. 656–704), 3. Oxford: Oxford University Press.
Doucet, A., & Tadi´c, V. B. (2003). Parameter estimation in general state-space models
using particle methods. Annals of the Institute of Statistical Mathematics, 55(2), 409–
422.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
n
e
c
o
_
a
_
0
1
3
5
2
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Learning in Volatile Environments With the Bayes Factor Surprise
337
Efron, B., & Hastie, T. (2016). Computer age statistical inference. Cambridge: Cambridge
University Press.
Faraji, M., Preuschoff, K., & Gerstner, W. (2018). Balancing new against old infor-
mation: The role of puzzlement surprise in learning. Neural Computation, 30(1),
34–83.
Fearnhead, P., & Liu, Z. (2007). On-line inference for multiple changepoint problems.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(4), 589–
605.
Findling, C., Chopin, N., & Koechlin, E. (2019). Imprecise neural computations as source
of human adaptive behavior in volatile environments. bioRxiv:799239.
Fox, E., Sudderth, E. B., Jordan, M. I., & Willsky, A. S. (2011). Bayesian nonparametric
inference of switching dynamic linear models. IEEE Transactions on Signal Process-
ing, 59(4), 1569–1585.
Frémaux, N., & Gerstner, W. (2016). Neuromodulated spike-timing-dependent plas-
ticity, and theory of three-factor learning rules. Frontiers in Neural Circuits, 9, 85.
Friston, K. (2010). The free-energy principle: A unified brain theory? Nature Reviews
Neuroscience, 1(2), 127.
Friston, K., FitzGerald, T., Rigoli, F., Schwartenbeck, P., & Pezzulo, G. (2017). Active
inference: A process theory. Neural Computation, 29(1), 1–49.
George, C. P., & Doss, H. (2017). Principled selection of hyperparameters in the latent
Dirichlet allocation model. Journal of Machine Learning Research, 18, 1–38.
Gershman, S. J. (2019). What does the free energy principle tell us about the brain?
arXiv:1901/07945.
Gershman, S. J., Monfils, M.-H., Norman, K. A., & Niv, Y. (2017). The computational
nature of memory modification. Elife, 6, e23763.
Gershman, S. J., Radulescu, A., Norman, K. A., & Niv, Y. (2014). Statistical computa-
tions underlying the dynamics of memory updating. PLOS Computational Biology,
10(11), e1003939.
Gerstner, W., Lehmann, M., Liakoni, V., Corneil, D., & Brea, J. (2018). Eligibility traces
and plasticity on behavioral time scales: Experimental support of neo-Hebbian
three-factor learning rules. Frontiers in Neural Circuits, 12.
Ghahramani, Z., & Hinton, G. E. (2000). Variational learning for switching state-
space models. Neural Computation, 12(4), 831–864.
Glaze, C. M., Kable, J. W., & Gold, J. I. (2015). Normative evidence accumulation in
unpredictable environments. Elife, 4, e08825.
Gordon, N.
J., Salmond, D.
(1993). Novel approach to
J., & Smith, A. F.
nonlinear/non-gaussian Bayesian state estimation. In IEE Proceedings (Radar and
Signal Processing), 140, 107–113.
Heilbron, M., & Meyniel, F. (2019). Confidence resets reveal hierarchical adaptive
learning in humans. PLOS Computational Biology, 15(4), e1006972.
Huang, Y., & Rao, R. P. (2014). Neurons as Monte Carlo samplers: Bayesian inference
and learning in spiking networks. In Z. Ghahramani, M. Welling, C. Cortes, N.
Lawrence, & K. Q. Weinberger (Eds.), Advances in neural information processing
systems, 27 (pp. 1943–1951). Red Hook, NY: Curran.
Huettel, S. A., Mack, P. B., & McCarthy, G. (2002). Perceiving patterns in random
series: Dynamic processing of sequence in prefrontal cortex. Nature Neuroscience,
5(5), 485–490.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
n
e
c
o
_
a
_
0
1
3
5
2
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
338
V. Liakoni, A. Modirshanechi, W. Gerstner, and J. Brea
Itti, L., & Baldi, P. F. (2006). Bayesian surprise attracts human attention. In Y. Weiss,
B. Schölkopf, & J. Platt (Eds.), Advances in neural information processing systems, 18
(pp. 547–554). Cambridge, MA: MIT Press.
Joshi, S., & Gold, J. I. (2020). Pupil size as a window on neural substrates of cognition.
Trends in Cognitive Sciences, 24.
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical
Association, 9(430), 773–795.
Konovalov, A., & Krajbich, I. (2018). Neurocomputational dynamics of sequence
learning. Neuron, 98(6), 1282–1293.
Kopp, B., & Lange, F. (2013). Electrophysiological indicators of surprise and entropy
in dynamic task-switching environments. Frontiers in Human Neuroscience, 7,
300.
Kutschireiter, A., Surace, S. C., Sprekeler, H., & Pfister, J.-P. (2017). Nonlinear
Bayesian filtering and learning: A neuronal dynamics for perception. Scientific
Reports, 7(1), 8722.
Legenstein, R., & Maass, W. (2014). Ensembles of spiking neurons with noise support
optimal probabilistic inference in a dynamically changing environment. PLOS
Computational Biology, 10(10).
Lieder, F., Daunizeau, J., Garrido, M. I., Friston, K. J., & Stephan, K. E. (2013). Mod-
elling trial-by-trial changes in the mismatch negativity. PLOS Computational Biol-
ogy, 9(2).
Lin, K., Sharpnack, J. L., Rinaldo, A., & Tibshirani, R. J. (2017). A sharp error analysis
for the fused lasso, with application to approximate changepoint screening. In
I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, &
R. Garnett (Eds.), Advances in neural information processing systems, 30 (pp. 6884–
6893). Red Hook, NY: Curran.
Lisman, J., Grace, A. A., & Duzel, E. (2011). A neohebbian framework for episodic
memory; role of dopamine-dependent late LTP. Trends in Neurosciences, 34(10),
536–547.
Liu, J., & West, M. (2001). Combined parameter and state estimation in simulation-
based filtering. In A. Doucet, N. de Freitas, & N. Gordon (Eds.), Sequential Monte
Carlo methods in practice (pp. 197–223). Berlin: Springer.
Lomonaco, V., Desai, K., Culurciello, E., & Maltoni, D. (2019). Continual reinforcement
learning in 3D non-stationary environments. arXiv:1905.1011.
Loued-Khenissi, L., Pfeuffer, A., Einhäuser, W., & Preuschoff, K. (2020). Anterior in-
sula reflects surprise in value-based decision-making and perception. NeuroIm-
age, 116549.
Maheu, M., Dehaene, S., & Meyniel, F. (2019). Brain signatures of a multiscale process
of sequence learning in humans. Elife, 8, e41541.
Mars, R. B., Debener, S., Gladwin, T. E., Harrison, L. M., Haggard, P., Rothwell, J. C.,
& Bestmann, S. (2008). Trial-by-trial fluctuations in the event-related electroen-
cephalogram reflect dynamic changes in the degree of surprise. Journal of Neuro-
science, 28(47), 12539–12545.
Masegosa, A., Nielsen, T. D., Langseth, H., Ramos-López, D., Salmerón, A., & Mad-
sen, A. L. (2017). Bayesian models of data streams with hierarchical power priors.
In Proceedings of the 34th International Conference on Machine Learning, vol. 70 (pp.
2334–2343). PMLR.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
n
e
c
o
_
a
_
0
1
3
5
2
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Learning in Volatile Environments With the Bayes Factor Surprise
339
Mathys, C., Daunizeau, J., Friston, K. J., & Stephan, K. E. (2011). A Bayesian founda-
tion for individual learning under uncertainty. Frontiers in Human Neuroscience,
5, 39.
Meyniel, F., Maheu, M., & Dehaene, S. (2016). Human inferences about sequences:
A minimal transition probability model. PLOS Computational Biology, 12(12),
e1005260.
Modirshanechi, A., Kiani, M. M., & Aghajan, H. (2019). Trial-by-trial surprise-
decoding model for visual and auditory binary oddball tasks. NeuroImage, 196,
302–317.
Musiolek, L., Blankenburg, F., Ostwald, D., & Rabovsky, M. (2019). Modeling the
n400 brain potential as semantic Bayesian surprise. In Proceedings of the 2019
Conference on Cognitive Computational Neuroscience. https://ccneuro.org/2019/
Papers/AcceptedPapers.asp
Nagabandi, A., Clavera, I., Liu, S., Fearing, R. S., Abbeel, P., Levine, S., & Finn,
C. (2018). Learning to adapt in dynamic, real-world environments through meta-
reinforcement learning. arXiv:1803.11347.
Nassar, M. R., Bruckner, R., & Frank, M. J. (2019). Statistical context dictates the re-
lationship between feedback-related EEG signals and learning. Elife, 8, e46975.
Nassar, M. R., Rumsey, K. M., Wilson, R. C., Parikh, K., Heasly, B., & Gold, J. I. (2012).
Rational regulation of learning dynamics by pupil-linked arousal systems. Nature
Neuroscience, 15(7), 1040–1046.
Nassar, M. R., Wilson, R. C., Heasly, B., & Gold, J. I. (2010). An approximately
Bayesian delta-rule model explains the dynamics of belief updating in a changing
environment. Journal of Neuroscience, 30(37), 12366–12378.
Ostwald, D., Spitzer, B., Guggenmos, M., Schmidt, T. T., Kiebel, S. J., & Blankenburg,
F. (2012). Evidence for neural encoding of Bayesian surprise in human somatosen-
sation. NeuroImage, 62(1), 177–188.
Özkan, E., Šmídl, V., Saha, S., Lundquist, C., & Gustafsson, F. (2013). Marginal-
ized adaptive particle filtering for nonlinear models with unknown time-varying
noise parameters. Automatica, 49(6), 1566–1575.
Papoulis, A., & Saunders, H. (1989). Probability, random variables and stochastic pro-
cesses. American Society of Mechanical Engineers Digital Collection.
Prat-Carrabin, A., Wilson, R. C., Cohen, J. D., & Da Silveira, R. A. (2020). Human
inference in changing environments with temporal structure. bioRxiv:720516.
Preuschoff, K., t Hart, B. M., & Einhauser, W. (2011). Pupil dilation signals surprise:
Evidence for noradrenaline’s role in decision making. Frontiers in Neuroscience, 5,
115.
Ryali, C., Reddy, G., & Yu, A. J. (2018). Demystifying excessively volatile human
learning: A Bayesian persistent prior and a neural approximation. In S. Bengio,
H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett, Advances
in neural information processing systems, 31 (pp. 2781–2790). Red Hook, NY: Curran.
Särkkä, S. (2013). Bayesian filtering and smoothing (Vol. 3). Cambridge: Cambridge Uni-
versity Press.
Schmidhuber, J. (2010). Formal theory of creativity, fun, and intrinsic motivation
(1990–2010). IEEE Transactions on Autonomous Mental Development, 2(3), 230–247.
Schwartenbeck, P., FitzGerald, T., Dolan, R., & Friston, K. (2013). Exploration, nov-
elty, surprise, and free energy minimization. Frontiers in Psychology, 4, 710.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
n
e
c
o
_
a
_
0
1
3
5
2
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
340
V. Liakoni, A. Modirshanechi, W. Gerstner, and J. Brea
Shannon, C. (1948). A mathematical theory of communication. Bell System Technical
Journal, 20, 623–656; 27, 379-423.
Shi, L., & Griffiths, T. L. (2009). Neural implementation of hierarchical Bayesian
inference by importance sampling. In Y. Bengio, D, Schuurmans, J. Lafferty, C.
Williams, & A. Culotta (Eds.), Advances in neural information processing systems, 22
(pp. 1669–1677). Red Hook, NY: Curran.
Storck, J., Hochreiter, S., & Schmidhuber, J. (1995). Reinforcement driven information
acquisition in non-deterministic environments. In Proceedings of the International
Conference on Artificial Neural Networks (Vol. 2, pp. 159–164). Piscataway, NJ: IEEE.
Traoré, R., Caselles-Dupré, H., Lesort, T., Sun, T., Cai, G., Díaz-Rodríguez, N., &
Filliat, D. (2019). Discorl: Continual reinforcement learning via policy distillation.
arXiv:1907.05855.
Wilson, R. C., Nassar, M. R., & Gold, J. I. (2010). Bayesian online learning of the
hazard rate in change-point problems. Neural Computation, 22(9), 2452–2476.
Wilson, R. C., Nassar, M. R., & Gold, J. I. (2013). A mixture of delta-rules approxima-
tion to Bayesian inference in change-point problems. PLOS Computational Biology,
9(7), e1003150.
Yu, A. J. (2012). Change is in the eye of the beholder. Nature Neuroscience, 15(7), 933.
Yu, A. J., & Cohen, J. D. (2009). Sequential effects: Superstition or rational behavior?
In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in neural
information processing systems, 21 (pp. 1873–1880). Red Hook, NY: Curran.
Yu, A. J., & Dayan, P. (2005). Uncertainty, neuromodulation, and attention. Neuron,
46(4), 681–692.
Received April 27, 2020; accepted September 8, 2020.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
e
d
u
n
e
c
o
a
r
t
i
c
e
-
p
d
/
l
f
/
/
/
/
3
3
2
2
6
9
1
8
9
6
8
0
8
n
e
c
o
_
a
_
0
1
3
5
2
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3