ARTICLE

Communicated by Tim Verbelen

Reward Maximization Through Discrete Active Inference

Lancelot Da Costa
l.da-costa@imperial.ac.uk
Department of Mathematics, Imperial College London, London SW7 2AZ, U.K.

Noor Sajid
noor.sajid.18@ucl.ac.uk
Thomas Parr
thomas.parr.12@ucl.ac.uk
Karl Friston
k.friston@ucl.ac.uk
Wellcome Centre for Human Neuroimaging, University College London,
London WC1N 3AR, U.K.

Ryan Smith
rsmith@laureateinstitute.org
Laureate Institute for Brain Research, Tulsa, OK 74136, U.S.A.

Active inference is a probabilistic framework for modeling the behavior
of biological and artificial agents, which derives from the principle of
minimizing free energy. In recent years, this framework has been applied
successfully to a variety of situations where the goal was to maximize re-
ward, often offering comparable and sometimes superior performance to
alternative approaches. In this article, we clarify the connection between
reward maximization and active inference by demonstrating how and
when active inference agents execute actions that are optimal for max-
imizing reward. Precisely, we show the conditions under which active
inference produces the optimal solution to the Bellman equation, a
formulation that underlies several approaches to model-based rein-
forcement learning and control. On partially observed Markov decision
processes, the standard active inference scheme can produce Bellman
optimal actions for planning horizons of 1 but not beyond. In contrast,
a recently developed recursive active inference scheme (sophisticated
inference) can produce Bellman optimal actions on any finite tempo-
ral horizon. We append the analysis with a discussion of the broader
relationship between active inference and reinforcement learning.

1 Introduction

1.1 Active Inference. Active inference is a normative framework for
modeling intelligent behavior in biological and artificial agents. It simulates

Neural Computation 35, 807–852 (2023)
https://doi.org/10.1162/neco_a_01574

© 2023 Massachusetts Institute of Technology


behavior by numerically integrating equations of motion thought to de-
scribe the behavior of biological systems, a description based on the free
energy principle (Barp et al., 2022; Friston et al., 2022). Active inference com-
prises a collection of algorithms for modeling perception, learning, and de-
cision making in the context of both continuous and discrete state spaces
(Barp et al., 2022; Da Costa et al., 2020; Friston et al., 2021, 2010; Friston,
Parr, et al., 2017). Briefly, building active inference agents entails (1) equip-
ping the agent with a (generative) model of the environment, (2) fitting the
model to observations through approximate Bayesian inference by mini-
mizing variational free energy (i.e., optimizing an evidence lower bound; Beal, 2003; Bishop, 2006; Blei et al., 2017; Jordan et al., 1998), and (3) selecting actions that minimize expected free energy, a quantity that can be decomposed into risk (i.e., the divergence between predicted and pre-
ferred paths) and ambiguity, leading to context-specific combinations of ex-
ploratory and exploitative behavior (Millidge, 2021; Schwartenbeck et al.,
2019). This framework has been used to simulate and explain intelligent be-
havior in neuroscience (Adams et al., 2013; Parr, 2019; Parr et al., 2021; Sajid
et al., 2022), psychology and psychiatry (Smith, Khalsa, et al., 2021; Smith, Kirlic, Stewart, Touthang, Kuplicki, Khalsa, et al., 2021; Smith, Kirlic, Stewart, Touthang, Kuplicki, McDermott, et al., 2021; Smith, Kuplicki, Feinstein, et al., 2020; Smith, Kuplicki, Teed, et al., 2020; Smith, Mayeli, et al., 2021; Smith, Schwartenbeck, Stewart, et al., 2020; Smith, Taylor, et al., 2022), ma-
chine learning (Çatal et al., 2020; Fountas et al., 2020; Mazzaglia et al., 2021;
Millidge, 2020; Tschantz et al., 2019; Tschantz, Millidge, et al., 2020), and
robotics (Çatal et al., 2021; Lanillos et al., 2020; Oliver et al., 2021; Pezzato
et al., 2020; Pio-Lopez et al., 2016; Sancaktar et al., 2020; Schneider et al.,
2022).

1.2 Reward Maximization through Active Inference? In contrast, the traditional approaches to simulating and explaining intelligent behavior—stochastic optimal control (Bellman, 1957; Bertsekas & Shreve, 1996) and reinforcement learning (RL; Barto & Sutton, 1992)—derive from the normative principle of executing actions to maximize a reward function scoring the utility afforded by each state of the world. This idea dates back to expected utility theory (Von Neumann & Morgenstern, 1944), an economic model of rational choice behavior, which also underwrites game theory (Von Neumann & Morgenstern, 1944) and decision theory (Berger, 1985; Dayan & Daw, 2008). Several empirical studies have shown that active inference can successfully perform tasks that involve collecting reward, often (but not always) showing comparable or superior performance to RL (Cullen et al., 2018; Marković et al., 2021; Mazzaglia et al., 2021; Millidge, 2020; Paul et al., 2021; Sajid, Ball, et al., 2021; Smith, Kirlic, Stewart, Touthang, Kuplicki, Khalsa, et al., 2021; Smith, Kirlic, Stewart, Touthang, Kuplicki, McDermott, et al., 2021; Smith, Schwartenbeck, Stewart, et al., 2020; Smith, Taylor, et al., 2022; van der Himst & Lanillos, 2020) and marked improvements when


interacting with volatile environments (Marković et al., 2021; Sajid, Ball,
et al., 2021). Given the prevalence and historical pedigree of reward maxi-
mization, we ask: How and when do active inference agents execute actions that
are optimal with respect to reward maximization?

1.3 Organization of Paper. In this article, we explain (and prove)
how and when active inference agents exhibit (Bellman) optimal reward-
maximizing behavior.

For this, we start by restricting ourselves to the simplest problem: maximizing reward on a finite horizon Markov decision process (MDP) with
known transition probabilities—a sequential decision-making task with
complete information. In this setting, we review the backward-induction al-
gorithm from dynamic programming, which forms the workhorse of many
optimal control and model-based RL algorithms. This algorithm furnishes a
Bellman optimal state-action mapping, which means that it provides prov-
ably optimal decisions from the point of view of reward maximization
(see section 2).

We then introduce active inference on finite horizon MDPs (see section
3)—a scheme consisting of perception as inference followed by planning
as inference, which selects actions so that future states best align with pre-
ferred states.

In section 4, we show how and when active inference maximizes reward in MDPs. Specifically, when the preferred distribution is a (uniform mixture of) Dirac distribution(s) over reward-maximizing trajectories, selecting action sequences according to active inference maximizes reward (see section 4.1). Yet active inference agents, in their standard implementation, can select actions that maximize reward only when planning one step ahead (see section 4.2). It takes a recursive, sophisticated form of active inference to select actions that maximize reward—in the sense of a Bellman optimal state-action mapping—on any finite time horizon (see section 4.3).

In section 5, we introduce active inference on partially observable
Markov decision processes with known transition probabilities—a se-
quential decision-making task where states need to be inferred from
observations—and explain how the results from the MDP setting generalize
to this setting.

In section 6, we step back from the focus on reward maximization and
briefly discuss decision making beyond reward maximization, learning un-
known environments and reward functions, and outstanding challenges
in scaling active inference. We append this with a broader discussion of
the relationship between active inference and reinforcement learning in
appendix A.

Our findings are summarized in section 7.
All of our analyses assume that the agent knows the environmental dy-
namics (i.e., transition probabilities) and reward function. In appendix A,
we discuss how active inference agents can learn their world model and


rewarding states when these are initially unknown—and the broader rela-
tionship between active inference and RL.

2 Reward Maximization on Finite Horizon MDPs

In this section, we consider the problem of reward maximization in Markov decision processes (MDPs) with known transition probabilities.

2.1 Basic Definitions. MDPs are a class of models specifying environ-
mental dynamics widely used in dynamic programming, model-based RL,
and more broadly in engineering and artificial intelligence (Barto & Sutton, 1992; Stone, 2019). They are used to simulate sequential decision-making
tasks with the objective of maximizing a reward or utility function. An MDP
specifies environmental dynamics unfolding in discrete space and time un-
der the actions pursued by an agent.

Definition 1 (Finite Horizon MDP). A finite horizon MDP comprises the fol-
lowing collection of data:

• S, a finite set of states.
• T = {0, . . . , T}, a finite set that stands for discrete time. T is the temporal horizon (a.k.a. planning horizon).
• A, a finite set of actions.
• P(s_t = s' | s_{t−1} = s, a_{t−1} = a), the probability that action a ∈ A in state s ∈ S at time t − 1 will lead to state s' ∈ S at time t. The s_t are random variables over S that correspond to the state being occupied at time t = 0, . . . , T.
• P(s_0 = s), the probability of being at state s ∈ S at the start of the trial.
• R(s), the finite reward received by the agent when at state s ∈ S.


The dynamics afforded by a finite horizon MDP (see Figure 1) can be written globally as a probability distribution over state trajectories $s_{0:T} := (s_0, \ldots, s_T)$, given a sequence of actions $a_{0:T-1} := (a_0, \ldots, a_{T-1})$, which factorizes as
\[
P(s_{0:T} \mid a_{0:T-1}) = P(s_0) \prod_{\tau=1}^{T} P(s_\tau \mid s_{\tau-1}, a_{\tau-1}).
\]

Remark 1 (On the Definition of Reward). More generally, the reward function can be taken to be dependent on the previous action and previous state: $R_a(s' \mid s)$ is the reward received after transitioning from state s to state s' due to action a (Barto & Sutton, 1992; Stone, 2019). However, given an MDP with such a reward function, we can recover our simplified setting by defining a new MDP where the new states comprise the previous action, previous state, and current state in the original MDP. By inspection, the resulting reward function on the new MDP depends only on the current state (i.e., R(s)).
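To make definition 1 concrete, the following is a minimal sketch—ours, not part of the original article—of a finite horizon MDP as a plain Python container, together with a sampler that draws a state trajectory from the factorization above. All names (FiniteHorizonMDP, sample_trajectory) are illustrative.

    import numpy as np
    from dataclasses import dataclass

    @dataclass
    class FiniteHorizonMDP:
        transition: np.ndarray  # transition[a, s, s2] = P(s_t = s2 | s_{t-1} = s, a_{t-1} = a)
        initial: np.ndarray     # initial[s] = P(s_0 = s)
        reward: np.ndarray      # reward[s] = R(s)
        horizon: int            # T

    def sample_trajectory(mdp, actions, rng=None):
        # Draw s_{0:T} from P(s_0) * prod_tau P(s_tau | s_{tau-1}, a_{tau-1}).
        rng = rng or np.random.default_rng()
        s = rng.choice(len(mdp.initial), p=mdp.initial)
        states = [int(s)]
        for a in actions:  # one action per transition, len(actions) == mdp.horizon
            s = rng.choice(mdp.transition.shape[2], p=mdp.transition[a, s])
            states.append(int(s))
        return states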


Figure 1: Finite horizon Markov decision process. This is a Markov decision process pictured as a Bayesian network (Jordan et al., 1998; Pearl, 1998). A finite horizon MDP comprises a finite sequence of states, indexed in time. The transition from one state to the next depends on action. As such, for any given action sequence, the dynamics of the MDP form a Markov chain on state-space. In this fully observed setting, actions can be selected under a state-action policy, $\Pi$, indicated with a dashed line: this is a probabilistic mapping from state-space and time to actions.

Remark 2 (Admissible Actions). In general, it is possible that only some actions can be taken at each state. In this case, one defines $A_s$ to be the finite set of (allowable) actions from state s ∈ S. All forthcoming results concerning MDPs can be extended to this setting.

To formalize what it means to choose actions in each state, we introduce

the notion of a state-action policy.
Definition 2 (State-Action Policy). A state-action policy $\Pi$ is a probability distribution over actions that depends on the state that the agent occupies and on time. Explicitly,
\[
\Pi : A \times S \times T \to [0, 1], \qquad (a, s, t) \mapsto \Pi(a \mid s, t), \qquad \forall (s, t) \in S \times T : \sum_{a \in A} \Pi(a \mid s, t) = 1.
\]
When $s_t = s$, we will write $\Pi(a \mid s_t) := \Pi(a \mid s, t)$. Note that the action at the temporal horizon T is redundant, as no further reward can be reaped from the environment. Therefore, one often specifies state-action policies only up to time T − 1, as $\Pi : A \times S \times \{0, \ldots, T-1\} \to [0, 1]$. The state-action policy—as defined here—can be regarded as a generalization of a deterministic state-action policy that assigns a probability of 1 to an available action and 0 otherwise.

Remark 3 (Time-Dependent State-Action Policies). The way an agent
chooses actions at the end of its life is usually going to be very different from
the way it chooses them when it has a longer life ahead of it. In finite horizon
decision problems, state-action policies should generally be considered to
be time-dependent, as time-independent optimal state-action policies may
not exist. To see this, consider the following simple example: S = Z/5Z


(integers mod 5), T = {0, 1, 2}, A = {−1, 0, +1}, R(0) = R(2) = R(3) =
0, R(1) = 1, R(4) = 6. Optimal state-action policies are necessarily time-
dependent as the reward-maximizing trajectory from state 2 at time 0
consists of reaching state 4, while the optimal trajectory from state 2 at time
1 consists of reaching state 1. This is particular to finite-horizon decisions,
as, in infinite-horizon (discounted) problems, optimal state-action policies
can always be taken to be time-independent (Puterman, 2014, theorem
6.2.7).
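The time-dependence in this example can be checked directly by brute force. The sketch below is ours and assumes the natural reading that actions shift the state deterministically, s' = (s + a) mod 5; it enumerates the remaining action sequences and reports the reward-maximizing first action.

    import itertools

    A = (-1, 0, +1)
    R = {0: 0, 1: 1, 2: 0, 3: 0, 4: 6}
    step = lambda s, a: (s + a) % 5        # assumed deterministic transition on Z/5Z

    def best_first_action(s, n_steps):
        # Best first action from state s with n_steps decisions remaining.
        value = {}
        for plan in itertools.product(A, repeat=n_steps):
            state, total = s, 0
            for a in plan:
                state = step(state, a)
                total += R[state]
            value[plan[0]] = max(value.get(plan[0], float('-inf')), total)
        top = max(value.values())
        return [a for a, v in value.items() if v == top], top

    print(best_first_action(2, 2))   # at time 0: ([1], 6)  -- head toward state 4
    print(best_first_action(2, 1))   # at time 1: ([-1], 1) -- settle for state 1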

Remark 4 (Conflicting Terminologies: Policy in Active Inference). In active inference, a policy is defined as a sequence of actions indexed in time.1 To avoid terminological confusion, we use action sequences to denote policies under active inference.

1 These are analogous to temporally extended actions or options introduced under the options framework in RL (Stolle & Precup, 2002).

At time t, the goal is to select an action that maximizes future cumulative reward:
\[
R(s_{t+1:T}) := \sum_{\tau = t+1}^{T} R(s_\tau).
\]

Specifically, this entails following a state-action policy $\Pi$ that maximizes the state-value function:
\[
v_\Pi(s, t) := \mathbb{E}_\Pi[R(s_{t+1:T}) \mid s_t = s]
\]
for any (s, t) ∈ S × T. The state-value function scores the expected cumulative reward if the agent pursues state-action policy $\Pi$ from the state $s_t = s$. When the state $s_t = s$ is clear from context, we will often write $v_\Pi(s_t) := v_\Pi(s, t)$. Loosely speaking, we will call the expected reward the return.
Remark 5 (Notation $\mathbb{E}_\Pi$). While standard in RL (Barto & Sutton, 1992; Stone, 2019), the notation $\mathbb{E}_\Pi[R(s_{t+1:T}) \mid s_t = s]$ can be confusing. It denotes the expected reward, under the transition probabilities of the MDP and a state-action policy $\Pi$, that is,
\[
\mathbb{E}_{P(s_{t+1:T} \mid a_{t:T-1}, s_t = s)\, \Pi(a_{t:T-1} \mid s_{t+1:T-1}, s_t = s)}[R(s_{t+1:T})].
\]
It is important to keep this correspondence in mind, as we will use both notations depending on context.

Remark 6 (Temporal Discounting). In infinite horizon MDPs (i.e., when T is infinite), RL often seeks to maximize the discounted sum of rewards,


\[
v_\Pi(s, t) := \mathbb{E}_\Pi\!\left[\sum_{\tau = t}^{\infty} \gamma^{\tau - t} R(s_{\tau + 1}) \,\middle|\, s_t = s\right],
\]
for a given temporal discounting term γ ∈ (0, 1) (Barto & Sutton, 1992; Bertsekas & Shreve, 1996; Kaelbling et al., 1998). Indeed, temporal discounting is added to ensure that the infinite sum of future rewards converges to a finite value (Kaelbling et al., 1998). In finite horizon MDPs, temporal discounting is not necessary, so we set γ = 1 (see Schmidhuber, 2006, 2010).

To find the best state-action policies, we would like to rank them in terms
of their return. We introduce a partial ordering such that a state-action pol-
icy is better than another if it yields a higher return in any situation:

\[
\Pi \geq \Pi' \iff \forall (s, t) \in S \times T : v_\Pi(s, t) \geq v_{\Pi'}(s, t).
\]
Similarly, a state-action policy $\Pi$ is strictly better than another $\Pi'$ if it yields strictly higher returns:
\[
\Pi > \Pi' \iff \Pi \geq \Pi' \text{ and } \exists (s, t) \in S \times T : v_\Pi(s, t) > v_{\Pi'}(s, t).
\]

2.2 Bellman Optimal State-Action Policies. A state-action policy is

Bellman optimal if it is better than all alternatives.
Definition 3 (Bellman Optimality). A state-action policy $\Pi^*$ is Bellman optimal if and only if it is better than all other state-action policies:
\[
\Pi^* \geq \Pi, \quad \forall \Pi.
\]
In other words, it maximizes the state-value function $v_\Pi(s, t)$ for any state s at time t.

It is important to verify that this concept is not vacuous.

Proposition 1 (Existence of Bellman Optimal State-Action Policies). Given a finite horizon MDP as specified in definition 1, there exists a Bellman optimal state-action policy $\Pi^*$.

A proof is found in appendix B.1. Note that the uniqueness of the Bellman optimal state-action policy is not implied by proposition 1; indeed, multiple Bellman optimal state-action policies may exist (Bertsekas & Shreve, 1996; Puterman, 2014).

Now that we know that Bellman optimal state-action policies exist, we can characterize them as a return-maximizing action followed by a Bellman optimal state-action policy.

Proposition 2 (Characterization of Bellman Optimal State-Action Policies). For a state-action policy $\Pi$, the following are equivalent:


1. $\Pi$ is Bellman optimal.
2. $\Pi$ is both
   a. Bellman optimal when restricted to {1, . . . , T}. In other words, for every state-action policy $\Pi'$ and $(s, t) \in S \times \{1, \ldots, T\}$,
      \[
      v_\Pi(s, t) \geq v_{\Pi'}(s, t).
      \]
   b. At time 0, $\Pi$ selects actions that maximize return:
      \[
      \Pi(a \mid s, 0) > 0 \iff a \in \arg\max_{a \in A} \mathbb{E}_\Pi[R(s_{1:T}) \mid s_0 = s, a_0 = a], \quad \forall s \in S. \tag{2.1}
      \]

A proof is in appendix B.2. Note that this characterization offers a recur-
sive way to construct Bellman optimal state-action policies by successively
selecting the best action, as specified by equation 2.1, starting from T and
inducting backward (Puterman, 2014).

2.3 Backward Induction. Proposition 2 suggests a straightforward re-
cursive algorithm to construct Bellman optimal state-action policies known
as backward induction (Puterman, 2014). Backward induction has a long history. It was developed by the German mathematician Zermelo in 1913 to prove that chess has Bellman optimal strategies (Zermelo, 1913). In stochastic control, backward induction is one of the main methods for solving the Bellman equation (Adda & Cooper, 2003; Miranda & Fackler, 2002; Sargent,
2000). In game theory, the same method is used to compute subgame perfect
equilibria in sequential games (Fudenberg & Tirole, 1991).

Backward induction entails planning backward in time, from a goal state
at the end of a problem, by recursively determining the sequence of actions
that enables reaching the goal. It proceeds by first considering the last time
at which a decision might be made and choosing what to do in any situa-
tion at that time in order to get to the goal state. Using this information, one can then determine what to do at the second-to-last decision time. This process continues backward until one has determined the best action for every
possible situation or state at every point in time.

Proposition 3 (Backward Induction: Construction of Bellman Optimal
State-Action Policies). Backward induction

\[
\begin{aligned}
\Pi(a \mid s, T-1) > 0 &\iff a \in \arg\max_{a \in A} \mathbb{E}[R(s_T) \mid s_{T-1} = s, a_{T-1} = a], && \forall s \in S \\
\Pi(a \mid s, T-2) > 0 &\iff a \in \arg\max_{a \in A} \mathbb{E}_\Pi[R(s_{T-1:T}) \mid s_{T-2} = s, a_{T-2} = a], && \forall s \in S \\
&\;\;\vdots
\end{aligned}
\]


\[
\Pi(a \mid s, 0) > 0 \iff a \in \arg\max_{a \in A} \mathbb{E}_\Pi[R(s_{1:T}) \mid s_0 = s, a_0 = a], \quad \forall s \in S \tag{2.2}
\]
defines a Bellman optimal state-action policy $\Pi$. Furthermore, this characterization is complete: all Bellman optimal state-action policies satisfy the backward induction relation, equation 2.2.

A proof is in appendix B.3.
Intuitively, the backward induction algorithm 2.2 consists of planning backward, by starting from the end goal and working out the actions needed to achieve the goal. To give a concrete example of this kind of planning, backward induction would consider the following actions in the order shown:

1. Desired goal: I would like to go to the grocery store.
2. Intermediate action: I need to drive to the store.
3. Current best action: I should put my shoes on.

Proposition 3 tells us that to be optimal with respect to reward maxi-
mization, one must plan like backward induction. This will be central to
our analysis of reward maximization in active inference.
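The backward induction relation 2.2 translates directly into a dynamic programming routine. The sketch below is ours, reusing the FiniteHorizonMDP container introduced after remark 1; it computes a (deterministic) Bellman optimal state-action policy together with the associated state-value function.

    import numpy as np

    def backward_induction(mdp):
        # Returns policy[t, s] (a Bellman optimal action) and value[t, s] = v(s, t),
        # where value[t, s] is the expected reward accumulated over times t+1, ..., T.
        n_a, n_s, _ = mdp.transition.shape
        T = mdp.horizon
        value = np.zeros((T + 1, n_s))          # value[T] = 0: nothing left to reap
        policy = np.zeros((T, n_s), dtype=int)
        for t in reversed(range(T)):
            # q[a, s] = E[R(s_{t+1}) + v(s_{t+1}, t+1) | s_t = s, a_t = a]
            q = mdp.transition @ (mdp.reward + value[t + 1])
            policy[t] = q.argmax(axis=0)
            value[t] = q.max(axis=0)
        return policy, value

On the Z/5Z example of remark 3, this routine should recover the time-dependent optimal actions found by brute force above.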

3 Active Inference on Finite Horizon MDPs

We now turn to introducing active inference agents on finite horizon MDPs
with known transition probabilities. We assume that the agent’s generative
model of its environment is given by the previously defined finite horizon
MDP (see definition 1). We do not consider the case where the transitions
have to be learned but comment on it in appendix A.2 (see also Da Costa
et al., 2020; Friston et al., 2016).

In what follows, we fix a time t ≥ 0 and suppose that the agent has been in states $s_0, \ldots, s_t$. To ease notation, we let $\vec{s} := s_{t+1:T}$, $\vec{a} := a_{t:T}$ be the future states and future actions. We define Q to be the predictive distribution, which encodes the predicted future states and actions given that the agent is in state $s_t$:
\[
Q(\vec{s}, \vec{a} \mid s_t) := \prod_{\tau = t}^{T-1} Q(s_{\tau+1} \mid s_\tau, a_\tau)\, Q(a_\tau \mid s_\tau).
\]

3.1 Perception as Inference. In active inference, perception entails in-
ferences about future, past, and current states given observations and a se-
quence of actions. When states are partially observed, this is done through
variational Bayesian inference by minimizing a free energy functional also
known as an evidence bound (Beal, 2003; Bishop, 2006; Blei et al., 2017; Wainwright & Jordan, 2007).


In the MDP setting, past and current states are known, so it is necessary only to infer future states given the current state and action sequence, $P(\vec{s} \mid \vec{a}, s_t)$. These posterior distributions can be computed exactly in virtue of the fact that the transition probabilities of the MDP are known; hence, variational inference becomes exact Bayesian inference:
\[
Q(\vec{s} \mid \vec{a}, s_t) := P(\vec{s} \mid \vec{a}, s_t) = \prod_{\tau = t}^{T-1} P(s_{\tau+1} \mid s_\tau, a_\tau). \tag{3.1}
\]
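As a minimal illustration of equation 3.1 (ours, not from the article), the per-time-step predictive marginals under a fixed action sequence follow from repeated push-forwards of the known current state through the transition probabilities:

    import numpy as np

    def predictive_marginals(transition, s_t, actions):
        # transition[a, s, s2] = P(s' = s2 | s, a); returns Q(s_tau | actions, s_t)
        # for tau = t+1, ..., t+len(actions).
        q = np.zeros(transition.shape[1])
        q[s_t] = 1.0                      # Dirac on the known current state
        marginals = []
        for a in actions:
            q = q @ transition[a]         # push forward one step
            marginals.append(q.copy())
        return marginals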

3.2 Planning as Inference. Now that the agent has inferred future states
given alternative action sequences, we must assess these alternative plans
by examining the resulting state trajectories. The objective that active in-
ference agents optimize—in order to select the best possible actions—is the
expected free energy (Barp et al., 2022; Da Costa et al., 2020; Friston et al.,
2021). Under active inference, agents minimize expected free energy in or-
der to maintain themselves distributed according to a target distribution C
over the state-space S encoding the agent’s preferences.

Definition 4 (Expected Free Energy on MDPs). On MDPs, the expected free energy of an action sequence $\vec{a}$ starting from $s_t$ is defined as (Barp et al., 2022, section 5):
\[
G(\vec{a} \mid s_t) = D_{\mathrm{KL}}[Q(\vec{s} \mid \vec{a}, s_t) \,\|\, C(\vec{s})], \tag{3.2}
\]
where $D_{\mathrm{KL}}$ is the KL-divergence. Therefore, minimizing expected free energy corresponds to making the distribution over predicted states close to the distribution C that encodes prior preferences. Note that the expected free energy in partially observed MDPs comprises an additional ambiguity term (see section 5), which is dropped here as there is no ambiguity about observed states.

Since the expected free energy assesses the goodness of inferred future states under a course of action, we can refer to planning as inference (Attias, 2003; Botvinick & Toussaint, 2012). The expected free energy may be rewritten as
\[
G(\vec{a} \mid s_t) = \underbrace{\mathbb{E}_{Q(\vec{s} \mid \vec{a}, s_t)}[-\log C(\vec{s})]}_{\text{Expected surprise}} - \underbrace{H[Q(\vec{s} \mid \vec{a}, s_t)]}_{\text{Entropy of future states}}. \tag{3.3}
\]
Hence, minimizing expected free energy minimizes the expected surprise of states2 according to C and maximizes the entropy of Bayesian beliefs over future states (a maximum entropy principle (Jaynes, 1957a), which is sometimes cast as keeping options open (Klyubin et al., 2008)).

2 The surprise (also known as self-information or surprisal) of states, $-\log C(\vec{s})$, is information-theoretic nomenclature (Stone, 2015) that scores the extent to which an observation is unusual under C. It does not imply that the agent experiences surprise in a subjective or declarative sense.



Remark 7 (Numerical Tractability). The expected free energy is straightforward to compute using linear algebra. Given an action sequence $\vec{a}$, $C(\vec{s})$ and $Q(\vec{s} \mid \vec{a}, s_t)$ are categorical distributions over $S^{T-t}$. Let their parameters be $\mathbf{c}, \mathbf{s}_{\vec{a}} \in [0, 1]^{|S|^{T-t}}$, where $|\cdot|$ denotes the cardinality of a set. Then the expected free energy reads
\[
G(\vec{a} \mid s_t) = \mathbf{s}_{\vec{a}}^{\top}(\log \mathbf{s}_{\vec{a}} - \log \mathbf{c}). \tag{3.4}
\]
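As a minimal numerical sketch (ours) of equation 3.4: given the flattened categorical parameters of the predictive distribution over future-state trajectories and of the preference distribution, the expected free energy is a single inner product.

    import numpy as np

    def expected_free_energy(s_a, c, eps=1e-16):
        # G(a~ | s_t) = s_a^T (log s_a - log c), i.e., KL[Q || C] for categorical parameters.
        s_a, c = np.asarray(s_a, float), np.asarray(c, float)
        return float(s_a @ (np.log(s_a + eps) - np.log(c + eps)))  # eps: our guard against log(0)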

Notwithstanding, equation 3.4 is expensive to evaluate repeatedly when all possible action sequences are considered. In practice, one can adopt a temporal mean field approximation over future states (Millidge, Tschantz, & Buckley, 2020):
\[
Q(\vec{s} \mid \vec{a}, s_t) = \prod_{\tau = t+1}^{T} Q(s_\tau \mid \vec{a}, s_{\tau-1}) \approx \prod_{\tau = t+1}^{T} Q(s_\tau \mid \vec{a}, s_t),
\]
which yields the simplified expression
\[
G(\vec{a} \mid s_t) \approx \sum_{\tau = t+1}^{T} D_{\mathrm{KL}}[Q(s_\tau \mid \vec{a}, s_t) \,\|\, C(s_\tau)]. \tag{3.5}
\]

Expression 3.5 is much easier to handle: for each action sequence $\vec{a}$, one evaluates the summands sequentially τ = t + 1, . . . , T, and if and when the sum up to τ becomes significantly higher than the lowest expected free energy encountered during planning, $G(\vec{a} \mid s_t)$ is set to an arbitrarily high value. Setting $G(\vec{a} \mid s_t)$ to a high value is equivalent to pruning away unlikely trajectories. This bears some similarity to decision tree pruning procedures used in RL (Huys et al., 2012). It finesses exploration of the decision tree in full depth and provides an Occam's window for selecting action sequences.
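The following is a sketch (ours) of the pruning strategy just described, under the mean-field form of equation 3.5: summands are accumulated per action sequence, and a branch is abandoned once it falls outside an Occam's window around the best score found so far. The window width is an illustrative free parameter, not a value from the article.

    import numpy as np

    def efe_with_pruning(marginals_per_seq, log_c, window=3.0):
        # marginals_per_seq: {action_sequence: [Q(s_tau | a~, s_t) for tau = t+1..T]}
        # log_c: log C(s) over single states. Returns G(a~ | s_t) per sequence (np.inf if pruned).
        best, scores = np.inf, {}
        for seq, marginals in marginals_per_seq.items():
            g = 0.0
            for q in marginals:
                g += float(q @ (np.log(q + 1e-16) - log_c))
                if g > best + window:      # prune: set to an arbitrarily high value
                    g = np.inf
                    break
            scores[seq] = g
            best = min(best, g)
        return scores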

Complementary approaches can help make planning tractable. For example, hierarchical generative models factorize decisions into multiple levels. By abstracting information at a higher level, lower levels entertain
fewer actions (Friston et al., 2018), which reduces the depth of the decision
tree by orders of magnitude. Another approach is to use algorithms that
search the decision tree selectively, such as Monte Carlo tree search (Cham-
pion, Bowman, et al., 2021; Champion, Da Costa, et al., 2021; Fountas et al.,



2020; Maisto et al., 2021; Silver et al., 2016) and amortizing planning using
artificial neural networks (i.e., learning to plan) (Çatal et al., 2019; Fountas
et al., 2020; Millidge, 2019; Sajid, Tigas, et al., 2021).

4 Reward Maximization on MDPs through Active Inference

Here, we show how active inference solves the reward maximization problem.

4.1 Reward Maximization as Reaching Preferences. From the defini-
tion of expected free energy, equation 3.2, active inference on MDPs can
be thought of as reaching and remaining at a target distribution C over
state-space.

The basic observation that underwrites the following is that the agent
will maximize reward when the stationary distribution has all of its mass on
reward maximizing states. To illustrate this, we define a preference distribution $C_\beta$, β > 0, over state-space S, such that preferred states are rewarding states:3
\[
C_\beta(\sigma) := \frac{\exp \beta R(\sigma)}{\sum_{\varsigma \in S} \exp \beta R(\varsigma)} \propto \exp(\beta R(\sigma)), \qquad \forall \sigma \in S
\]
\[
\iff -\log C_\beta(\sigma) = -\beta R(\sigma) - c(\beta), \qquad \forall \sigma \in S, \text{ for some } c(\beta) \in \mathbb{R} \text{ constant w.r.t. } \sigma.
\]

The (inverse temperature) parameter β > 0 scores how motivated the
agent is to occupy reward-maximizing states. Note that states s ∈ S that
maximize the reward R(s) maximize Cβ (s) and minimize − log Cβ (s) para cualquier
β > 0.
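In code, $C_\beta$ is simply a softmax of the reward scaled by the inverse temperature; the sketch below (ours) also shows mass concentrating on reward-maximizing states as β grows, anticipating the zero temperature limit of equation 4.3.

    import numpy as np

    def preference_distribution(reward, beta):
        # C_beta(s) ∝ exp(beta * R(s)) over a finite state-space.
        logits = beta * np.asarray(reward, float)
        logits -= logits.max()            # numerical stabilization only
        p = np.exp(logits)
        return p / p.sum()

    R = np.array([0.0, 1.0, 0.0, 0.0, 6.0])       # rewards from the example of remark 3
    for beta in (0.5, 2.0, 10.0):
        print(beta, np.round(preference_distribution(R, beta), 3))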

Using the additive property of the reward function, we can extend $C_\beta$ to a probability distribution over trajectories $\vec{\sigma} := (\sigma_1, \ldots, \sigma_T) \in S^T$. Specifically, $C_\beta$ scores to what extent a trajectory is preferred over another trajectory:
\[
C_\beta(\vec{\sigma}) := \frac{\exp \beta R(\vec{\sigma})}{\sum_{\vec{\varsigma} \in S^T} \exp \beta R(\vec{\varsigma})} = \prod_{\tau = 1}^{T} \frac{\exp \beta R(\sigma_\tau)}{\sum_{\varsigma \in S} \exp \beta R(\varsigma)} = \prod_{\tau = 1}^{T} C_\beta(\sigma_\tau), \qquad \vec{\sigma} \in S^T
\]
\[
\iff -\log C_\beta(\vec{\sigma}) = -\beta R(\vec{\sigma}) - c'(\beta) = -\sum_{\tau = 1}^{T} \beta R(\sigma_\tau) - c'(\beta), \qquad \vec{\sigma} \in S^T, \tag{4.1}
\]
where $c'(\beta) := c(\beta)\, T \in \mathbb{R}$ is constant with regard to $\vec{\sigma}$.

3

Note the connection with statistical mechanics: β is an inverse temperature parame-
ter, −R is a potential function, and Cβ is the corresponding Gibbs distribution (Pavliotis,
2014; Rahme & Adams, 2019).


When preferences are defined in this way, the preference distribution assigns exponentially more mass to states or trajectories that have a higher reward. Put simply, for trajectories $\vec{\sigma}, \vec{\varsigma} \in S^T$ with reward $R(\vec{\sigma}) > R(\vec{\varsigma})$, the ratio of preference mass will be the exponential of the weighted difference in reward, where the weight is the inverse temperature:
\[
\frac{C_\beta(\vec{\sigma})}{C_\beta(\vec{\varsigma})} = \frac{\exp(\beta R(\vec{\sigma}))}{\exp(\beta R(\vec{\varsigma}))} = \exp(\beta (R(\vec{\sigma}) - R(\vec{\varsigma}))). \tag{4.2}
\]

As the temperature tends to zero, the ratio diverges so that $C_\beta(\vec{\sigma})$ becomes infinitely larger than $C_\beta(\vec{\varsigma})$. As $C_\beta$ is a probability distribution (with a maximal value of one), we must have $C_\beta(\vec{\varsigma}) \xrightarrow{\beta \to +\infty} 0$ for any suboptimal trajectory $\vec{\varsigma}$ and positive preference for reward maximizing trajectories (as all preferences must sum to one). Moreover, all reward maximizing trajectories have the same probability mass by equation 4.2. Thus, in the zero temperature limit, preferences become a uniform mixture of Dirac distributions over reward-maximizing trajectories:
\[
\lim_{\beta \to +\infty} C_\beta \propto \sum_{\vec{\sigma} \in I^{T-t}} \operatorname{Dirac}_{\vec{\sigma}}, \qquad I := \arg\max_{s \in S} R(s). \tag{4.3}
\]
Of course, the above holds for preferences over individual states as it does for preferences over trajectories.

We now show how reaching preferred states can be formulated as reward maximization:
Lemma 1. The sequence of actions that minimizes expected free energy also maximizes expected reward in the zero temperature limit β → +∞ (see equation 4.3):
\[
\lim_{\beta \to +\infty} \arg\min_{\vec{a}} G(\vec{a} \mid s_t) \subseteq \arg\max_{\vec{a}} \mathbb{E}_{Q(\vec{s} \mid \vec{a}, s_t)}[R(\vec{s})].
\]
Moreover, of those action sequences that maximize expected reward, the expected free energy minimizers will be those that maximize the entropy of future states $H[Q(\vec{s} \mid \vec{a}, s_t)]$.

A proof is in appendix B.4. In the zero temperature limit β → +∞, minimizing expected free energy corresponds to choosing the action sequence $\vec{a}$ such that $Q(\vec{s} \mid \vec{a}, s_t)$ has most mass on reward-maximizing states or trajectories (see Figure 2). Of those reward-maximizing candidates, the minimizer of expected free energy maximizes the entropy of future states $H[Q(\vec{s} \mid \vec{a}, s_t)]$, thus keeping options open.

4.2 Reward Maximization on MDPs with a Temporal Horizon of 1. In this section, we first consider the case of a single-step decision problem (i.e., a temporal horizon of T = 1) and demonstrate how the standard active inference scheme maximizes reward on this problem in the limit β → +∞. This will act as an important building block for when we subsequently consider more general multistep decision problems.

Figure 2: Reaching preferences and the zero temperature limit. We illustrate how active inference selects actions such that the predictive distribution $Q(\vec{s} \mid \vec{a}, s_t)$ most closely matches the preference distribution $C_\beta(\vec{s})$ (top right). We illustrate this with a temporal horizon of one, so that state sequences are states, which are easier to plot, but all holds analogously for sequences of arbitrary finite length. In this example, the state-space is a discretization of a real interval, and the predictive and preference distributions have a gaussian shape. The predictive distribution Q is assumed to have a fixed variance with respect to action sequences, such that the only parameter that can be optimized by action selection is its mean. In the zero temperature limit, equation 4.3, $C_\beta$ becomes a Dirac distribution over the reward-maximizing state (bottom). Thus, minimizing expected free energy corresponds to selecting the action such that the predicted states assign most probability mass to the reward-maximizing state (bottom right). Here, $Q^* := Q(\vec{s} \mid \vec{a}^*, s_t)$ denotes the predictive distribution over states given the action sequence that minimizes expected free energy, $\vec{a}^* = \arg\min_{\vec{a}} G(\vec{a} \mid s_t)$.


The standard decision-making procedure in active inference consists of assigning each action sequence a probability given by the softmax of the negative expected free energy (Barp et al., 2022; Da Costa et al., 2020; Friston, FitzGerald, et al., 2017):
\[
Q(\vec{a} \mid s_t) \propto \exp(-G(\vec{a} \mid s_t)).
\]


Table 1: Standard Active Inference Scheme on Finite Horizon MDPs (Barp et al., 2022, section 5).

Process                 Computation
Perceptual inference    $Q(\vec{s} \mid \vec{a}, s_t) = P(\vec{s} \mid \vec{a}, s_t) = \prod_{\tau=t}^{T-1} P(s_{\tau+1} \mid s_\tau, a_\tau)$
Planning as inference   $G(\vec{a} \mid s_t) = D_{\mathrm{KL}}[Q(\vec{s} \mid \vec{a}, s_t) \,\|\, C(\vec{s})]$
Decision making         $Q(\vec{a} \mid s_t) \propto \exp(-G(\vec{a} \mid s_t))$
Action selection        $a_t \in \arg\max_{a \in A} Q(a_t = a \mid s_t)$, where $Q(a_t = a \mid s_t) = \sum_{\vec{a}} Q(a_t = a \mid \vec{a})\, Q(\vec{a} \mid s_t)$

Agents then select the most likely action under this distribution:
\[
a_t \in \arg\max_{a \in A} Q(a \mid s_t) = \arg\max_{a \in A} \sum_{\vec{a}} Q(a \mid \vec{a})\, Q(\vec{a} \mid s_t) = \arg\max_{a \in A} \sum_{\vec{a}} Q(a \mid \vec{a}) \exp(-G(\vec{a} \mid s_t)) = \arg\max_{a \in A} \sum_{\vec{a} : (\vec{a})_t = a} \exp(-G(\vec{a} \mid s_t)).
\]
In summary, this scheme selects the first action within action sequences that, on average, maximize their exponentiated negative expected free energies. As a corollary, if the first action is in a sequence with a very low expected free energy, this adds an exponentially large contribution to the selection of this particular action. We summarize this scheme in Table 1.
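A minimal sketch (ours) of the decision rule just derived: score every candidate action sequence by its expected free energy, exponentiate, and accumulate onto the first action.

    import itertools
    import numpy as np

    def select_action_standard(efe_fn, n_actions, horizon, s_t):
        # a_t = argmax_a of sum over sequences a~ with (a~)_t = a of exp(-G(a~ | s_t)).
        scores = np.zeros(n_actions)
        for seq in itertools.product(range(n_actions), repeat=horizon):
            scores[seq[0]] += np.exp(-efe_fn(seq, s_t))
        return int(np.argmax(scores))

Here efe_fn(seq, s_t) is any function returning $G(\vec{a} \mid s_t)$, for example one of the expected free energy evaluations sketched in section 3.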

Theorem 1. In MDPs with known transition probabilities and in the zero temperature limit β → +∞ (see equation 4.3), the scheme of Table 1,
\[
a_t \in \lim_{\beta \to +\infty} \arg\max_{a \in A} \sum_{\vec{a} : (\vec{a})_t = a} \exp(-G(\vec{a} \mid s_t)), \qquad G(\vec{a} \mid s_t) = D_{\mathrm{KL}}[Q(\vec{s} \mid \vec{a}, s_t) \,\|\, C_\beta(\vec{s})], \tag{4.4}
\]
is Bellman optimal for the temporal horizon T = 1.

A proof is in appendix B.5. Importantly, the standard active inference scheme, equation 4.4, falls short in terms of Bellman optimality on planning horizons greater than one; this rests on the fact that it does not coincide with backward induction. Recall that backward induction offers a complete description of Bellman optimal state-action policies (see proposition 3). In contrast, active inference plans by adding weighted expected free energies of each possible future course of action. In other words, unlike backward induction, it considers future courses of action beyond the subset that will subsequently minimize expected free energy, given subsequently encountered states.


4.3 Reward Maximization on MDPs with Finite Temporal Hori-
zons. To achieve Bellman optimality on finite temporal horizons, we turn
to the expected free energy of an action given future actions that also mini-
mize expected free energy. To do this, we can write the expected free energy
recursively, as the immediate expected free energy, plus the expected free
energy that one would obtain by subsequently selecting actions that mini-
mize expected free energy (Friston et al., 2021). The resulting scheme con-
sists of minimizing an expected free energy defined recursively, desde el
last time step to the current time step. In finite horizon MDPs, this reads

GRAMO(aT−1

| sT−1) = DKL[q(sT | aT−1

GRAMO( | ) = DKL[q( +1

, sT−1) | (sT )]
| , ) | ( +1)]

+ mi

q( +1

, +1

| , )[GRAMO( +1

| +1)],

τ = t, . . . , T − 2,

dónde, at each time step, actions are chosen to minimize expected free
energía:

q( +1

| +1) > 0 ⇐⇒ aτ +1

∈ arg min
a∈A

GRAMO(a | +1).

(4.5)

To make sense of this formulation, we unravel the recursion,
\[
\begin{aligned}
G(a_t \mid s_t) &= D_{\mathrm{KL}}[Q(s_{t+1} \mid a_t, s_t) \,\|\, C(s_{t+1})] + \mathbb{E}_{Q(a_{t+1}, s_{t+1} \mid a_t, s_t)}[G(a_{t+1} \mid s_{t+1})] \\
&= D_{\mathrm{KL}}[Q(s_{t+1} \mid a_t, s_t) \,\|\, C(s_{t+1})] + \mathbb{E}_{Q(a_{t+1}, s_{t+1} \mid a_t, s_t)}\!\left[ D_{\mathrm{KL}}[Q(s_{t+2} \mid a_{t+1}, s_{t+1}) \,\|\, C(s_{t+2})] \right] \\
&\quad + \mathbb{E}_{Q(a_{t+1:t+2}, s_{t+1:t+2} \mid a_t, s_t)}[G(a_{t+2} \mid s_{t+2})] \\
&= \cdots = \mathbb{E}_{Q(\vec{a}, \vec{s} \mid a_t, s_t)}\!\left[ \sum_{\tau = t}^{T-1} D_{\mathrm{KL}}[Q(s_{\tau+1} \mid s_\tau, a_\tau) \,\|\, C(s_{\tau+1})] \right] \\
&= \mathbb{E}_{Q(\vec{a}, \vec{s} \mid a_t, s_t)} D_{\mathrm{KL}}[Q(\vec{s} \mid \vec{a}, s_t) \,\|\, C(\vec{s})],
\end{aligned} \tag{4.6}
\]
which shows that this expression is exactly the expected free energy under action $a_t$, if one is to pursue future actions that minimize expected free energy, equation 4.5. We summarize this "sophisticated inference" scheme in Table 2.

The crucial improvement over the standard active inference scheme (see Table 1) is that planning is now performed based on subsequent counterfac-
tual actions that minimize expected free energy as opposed to considering
all future courses of action. Translating this into the language of state-action
policies yields ∀s ∈ S:


Table 2: Sophisticated active inference scheme on finite horizon MDPs (Friston et al., 2021).

Process                 Computation
Perceptual inference    $Q(s_{\tau+1} \mid s_\tau, a_\tau) = P(s_{\tau+1} \mid s_\tau, a_\tau)$
Planning as inference   $G(a_\tau \mid s_\tau) = D_{\mathrm{KL}}[Q(s_{\tau+1} \mid s_\tau, a_\tau) \,\|\, C(s_{\tau+1})] + \mathbb{E}_{Q(s_{\tau+1}, a_{\tau+1} \mid s_\tau, a_\tau)}[G(a_{\tau+1} \mid s_{\tau+1})]$
Decision making         $Q(a_\tau \mid s_\tau) > 0 \iff a_\tau \in \arg\min_{a \in A} G(a \mid s_\tau)$
Action selection        $a_t \sim Q(a_t \mid s_t)$

\[
\begin{aligned}
a_{T-1}(s) &\in \arg\min_{a \in A} G(a \mid s_{T-1} = s) \\
a_{T-2}(s) &\in \arg\min_{a \in A} G(a \mid s_{T-2} = s) \\
&\;\;\vdots \\
a_{1}(s) &\in \arg\min_{a \in A} G(a \mid s_{1} = s) \\
a_{0}(s) &\in \arg\min_{a \in A} G(a \mid s_{0} = s).
\end{aligned} \tag{4.7}
\]

Equation 4.7 is strikingly similar to the backward induction algorithm
(proposition 3), and indeed we recover backward induction in the limit β →
+∞.

Theorem 2 (Backward Induction as Active Inference). In MDPs with known transition probabilities and in the zero temperature limit β → +∞, equation 4.3, the scheme of Table 2,
\[
\begin{aligned}
&Q(a_\tau \mid s_\tau) > 0 \iff a_\tau \in \lim_{\beta \to +\infty} \arg\min_{a \in A} G(a \mid s_\tau) \\
&G(a_\tau \mid s_\tau) = D_{\mathrm{KL}}[Q(s_{\tau+1} \mid s_\tau, a_\tau) \,\|\, C(s_{\tau+1})] + \mathbb{E}_{Q(s_{\tau+1}, a_{\tau+1} \mid s_\tau, a_\tau)}[G(a_{\tau+1} \mid s_{\tau+1})],
\end{aligned} \tag{4.8}
\]
is Bellman optimal on any finite temporal horizon as it coincides with the backward induction algorithm from proposition 3. Furthermore, if there are multiple actions that maximize future reward, those that are selected by active inference also maximize the entropy of future states $H[Q(\vec{s} \mid \vec{a}, a, s_0)]$.

Note that maximizing the entropy of future states keeps the agent’s op-
tions open (Klyubin et al., 2008) in the sense of committing the least to a
specified sequence of states. A proof of theorem 2 is in appendix B.6.
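The recursion in equation 4.8 lends itself to a dynamic programming implementation that mirrors backward induction. The sketch below (ours, not from the article) computes $G(a \mid s)$ backward in time on a finite horizon MDP with preferences $C_\beta \propto \exp(\beta R)$, exploiting the fact that all expected-free-energy-minimizing actions at the next step share the same value of G.

    import numpy as np

    def sophisticated_active_inference(P, R, beta, T):
        # P[a, s, s2] = transition probabilities; returns G[t, s, a] and a greedy policy[t, s].
        n_a, n_s, _ = P.shape
        log_c = beta * R - np.log(np.exp(beta * R).sum())   # log C_beta over single states
        G = np.zeros((T, n_s, n_a))
        policy = np.zeros((T, n_s), dtype=int)
        G_next = np.zeros(n_s)              # expected free energy under subsequent minimizers
        for t in reversed(range(T)):
            for s in range(n_s):
                for a in range(n_a):
                    q = P[a, s]             # Q(s_{t+1} | s_t = s, a_t = a)
                    risk = q @ (np.log(q + 1e-16) - log_c)
                    G[t, s, a] = risk + q @ G_next
            policy[t] = G[t].argmin(axis=1)
            G_next = G[t].min(axis=1)
        return G, policy

For large β, following policy[t][s] should reproduce the Bellman optimal actions returned by the backward induction sketch of section 2.3, in line with theorem 2.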


5 Generalization to POMDPs

Partially observable Markov decision processes (POMDPs) generalize
MDPs in that the agent observes a modality ot, which carries incomplete
information about the current state st, as opposed to the current state itself.

Definition 5 (Finite Horizon POMDP). A finite horizon POMDP is an MDP (see definition 1) with the following additional data:

• O is a finite set of observations.
• P(o_t = o | s_t = s) is the probability that the state s ∈ S at time t will lead to the observation o ∈ O at time t. The o_t are random variables over O that correspond to the observation being sampled at time t = 0, . . . , T.

5.1 Active Inference on Finite Horizon POMDPs. We briefly introduce
active inference agents on finite horizon POMDPs with known transition
probabilities (for more details, see Da Costa et al., 2020; Parr et al., 2022;
Smith, Friston, et al., 2022). We assume that the agent's generative model of its environment is given by the POMDP (see definition 5).4

Dejar (cid:2)s := s0:t ,(cid:2)a := a0:T−1 be all states and actions (pasado, present, and fu-
tura), let ˜o := o0:t be the observations available up to time t, and let(cid:2)oh := ot+1:t
be the future observations. The agent has a predictive distribution over
states given actions

q((cid:2)s | (cid:2)a, ˜o) :=

T−1(cid:2)

τ =0

q( +1

| , , ˜o),

which is continuously updated following new observations.

5.1.1 Perception as Inference. In active inference, perception entails inferences about (past, present, and future) states given observations and a sequence of actions. When states are partially observed, the posterior distribution $P(\vec{s} \mid \vec{a}, \tilde{o})$ is intractable to compute directly. Thus, one approximates it by optimizing a variational free energy functional $F_{\vec{a}}$ (also known as an evidence bound; Beal, 2003; Bishop, 2006; Blei et al., 2017; Wainwright & Jordan, 2007) over a space of probability distributions $Q(\cdot \mid \vec{a}, \tilde{o})$ called the variational family:
\[
\begin{aligned}
P(\vec{s} \mid \vec{a}, \tilde{o}) &= \arg\min_{Q} F_{\vec{a}}[Q(\vec{s} \mid \vec{a}, \tilde{o})] = \arg\min_{Q} D_{\mathrm{KL}}[Q(\vec{s} \mid \vec{a}, \tilde{o}) \,\|\, P(\vec{s} \mid \vec{a}, \tilde{o})] \\
F_{\vec{a}}[Q(\vec{s} \mid \vec{a}, \tilde{o})] &:= \mathbb{E}_{Q(\vec{s} \mid \vec{a}, \tilde{o})}[\log Q(\vec{s} \mid \vec{a}, \tilde{o}) - \log P(\tilde{o}, \vec{s} \mid \vec{a})].
\end{aligned} \tag{5.1}
\]

4 We do not consider the case where the model parameters have to be learned but comment on it in appendix A.2 (details in Da Costa et al., 2020; Friston et al., 2016).


Here, $P(\tilde{o}, \vec{s} \mid \vec{a})$ is the POMDP, which is supplied to the agent, and $P(\vec{s} \mid \vec{a}, \tilde{o})$ is the corresponding exact posterior. When the free energy minimum (see equation 5.1) is reached, the inference is exact:
\[
Q(\vec{s} \mid \vec{a}, \tilde{o}) = P(\vec{s} \mid \vec{a}, \tilde{o}). \tag{5.2}
\]
For numerical tractability, the variational family may be constrained to a parametric family of distributions, in which case equality is not guaranteed:
\[
Q(\vec{s} \mid \vec{a}, \tilde{o}) \approx P(\vec{s} \mid \vec{a}, \tilde{o}). \tag{5.3}
\]

5.1.2 Planning as Inference. The objective that active inference minimizes in order to select the best possible courses of action is the expected free energy (Barp et al., 2022; Da Costa et al., 2020; Friston et al., 2021). In POMDPs, the expected free energy reads (Barp et al., 2022, section 5)
\[
G(\vec{a} \mid \tilde{o}) = \underbrace{D_{\mathrm{KL}}[Q(\vec{s} \mid \vec{a}, \tilde{o}) \,\|\, C(\vec{s})]}_{\text{Risk}} + \underbrace{\mathbb{E}_{Q(\vec{s} \mid \vec{a}, \tilde{o})} H[P(\vec{o} \mid \vec{s})]}_{\text{Ambiguity}}.
\]
The expected free energy on POMDPs is the expected free energy on MDPs plus an extra term called ambiguity. This ambiguity term accommodates the uncertainty implicit in partially observed problems. The reason that this resulting functional is called expected free energy is because it comprises a relative entropy (risk) and expected energy (ambiguity). The expected free energy objective subsumes several decision-making objectives that predominate in statistics, machine learning, and psychology, which confers it with several useful properties when simulating behavior (see Figure 3 for details).
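A minimal sketch (ours) of the risk plus ambiguity decomposition above for a single future time step, with a categorical predictive distribution over states, a likelihood matrix A with A[o, s] = P(o | s), and log-preferences over states:

    import numpy as np

    def efe_pomdp_step(q_s, A, log_c, eps=1e-16):
        # q_s[s] = Q(s | a~, o~); returns risk + ambiguity for this time step.
        risk = q_s @ (np.log(q_s + eps) - log_c)             # D_KL[Q(s) || C(s)]
        H_o_given_s = -(A * np.log(A + eps)).sum(axis=0)     # H[P(o | s)] for each state s
        ambiguity = q_s @ H_o_given_s                        # E_{Q(s)} H[P(o | s)]
        return float(risk + ambiguity)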

5.2 Maximizing Reward on POMDPs. Crucially, our reward maximization results translate to the POMDP case. To make this explicit, we rehearse lemma 1 in the context of POMDPs.

Proposition 4 (Reward Maximization on POMDPs). In POMDPs with known transition probabilities, provided that the free energy minimum is reached (see equation 5.2), the sequence of actions that minimizes expected free energy also maximizes expected reward in the zero temperature limit β → +∞ (see equation 4.3):
\[
\lim_{\beta \to +\infty} \arg\min_{\vec{a}} G(\vec{a} \mid \tilde{o}) \subseteq \arg\max_{\vec{a}} \mathbb{E}_{Q(\vec{s} \mid \vec{a}, \tilde{o})}[R(\vec{s})].
\]
Moreover, of those action sequences that maximize expected reward, the expected free energy minimizers will be those that maximize the entropy of future states minus the (expected) entropy of outcomes given states, $H[Q(\vec{s} \mid \vec{a}, \tilde{o})] - \mathbb{E}_{Q(\vec{s} \mid \vec{a}, \tilde{o})} H[P(\vec{o} \mid \vec{s})]$.

Figure 3: Active inference. The top panels illustrate the perception-action loop in active inference, in terms of minimization of variational and expected free energy. The lower panels illustrate how expected free energy relates to several descriptions of behavior that predominate in psychology, machine learning, and economics. These descriptions are disclosed when one removes particular terms from the objective. For example, if we ignore extrinsic value, we are left with intrinsic value, variously known as expected information gain (Lindley, 1956; MacKay, 2003). This underwrites intrinsic motivation in machine learning and robotics (Barto et al., 2013; Deci & Ryan, 1985; Oudeyer & Kaplan, 2007) and expected Bayesian surprise in visual search (Itti & Baldi, 2009; Sun et al., 2011) and the organization of our visual apparatus (Barlow, 1961, 1974; Linsker, 1990; Optican & Richmond, 1987). In the absence of ambiguity, we are left with minimizing risk, which corresponds to aligning predicted states to preferred states. This leads to risk-averse decisions in behavioral economics (Kahneman & Tversky, 1979) and formulations of control as inference in engineering such as KL control (van den Broek et al., 2010). If we then remove intrinsic value, we are left with expected utility in economics (Von Neumann & Morgenstern, 1944) that underwrites RL and behavioral psychology (Barto & Sutton, 1992). Bayesian formulations of maximizing expected utility under uncertainty are also the basis of Bayesian decision theory (Berger, 1985). Finally, if we only consider a fully observed environment with no preferences, minimizing expected free energy corresponds to a maximum entropy principle over future states (Jaynes, 1957b, 1957a). Note that here C(o) denotes the preferences over observations derived from the preferences over states. These are related by P(o | s)C(s) = P(s | o)C(o).



From proposition 4, we see that if there are multiple reward-maximizing action sequences, those that are selected maximize
\[
\underbrace{H[Q(\vec{s} \mid \vec{a}, \tilde{o})]}_{\text{Entropy of future states}} - \underbrace{\mathbb{E}_{Q(\vec{s} \mid \vec{a}, \tilde{o})}[H[P(\vec{o} \mid \vec{s})]]}_{\text{Entropy of observations given future states}}.
\]
In other words, they least commit to a prespecified sequence of future states and ensure that their expected observations are maximally informative of states. Of course, when inferences are inexact, the extent to which proposition 4 holds depends on the accuracy of the approximation, equation 5.3. A proof of proposition 4 is in appendix B.7.

The schemes of Tables 1 and 2 exist in the POMDP setting (e.g., Barp et al., 2022, section 5, and Friston et al., 2021, respectively). Thus, in POMDPs with known transition probabilities, provided that inferences are exact (see equation 5.2) and in the zero temperature limit β → +∞ (see equation 4.3), standard active inference (Barp et al., 2022, section 5) maximizes reward on temporal horizons of one but not beyond, and a recursive scheme such as sophisticated active inference (Friston et al., 2021) maximizes reward on finite temporal horizons. Note that for computational tractability, the sophisticated active inference scheme presented in Friston et al. (2021) does not generally perform exact inference; thus, the extent to which it will maximize reward in practice will depend on the accuracy of its inferences. Nevertheless, our results indicate that sophisticated active inference will vastly outperform standard active inference in most reward-maximization tasks.

6 Discussion

In this article, we have examined a specific notion of optimality, namely,
Bellman optimality, defined as selecting actions to maximize future ex-
pected rewards. We demonstrated how and when active inference is Bell-
man optimal on finite horizon POMDPs with known transition probabilities
and reward function.

These results highlight important relationships among active inference,
stochastic control, and RL, as well as conditions under which they would
and would not be expected to behave similarly (e.g., environments with multiple reward-maximizing trajectories, those affording ambiguous observations). We refer readers to appendix A for a broader discussion of the
relationship between active inference and reinforcement learning.

6.1 Decision Making beyond Reward Maximization. More broadly,
it is important to ask if reward maximization is the right objective


underwriting intelligent decision making. This is an important question for
decision neuroscience. Eso es, do humans optimize a reward signal, ex-
pected free energy, or other planning objectives? This can be addressed by
comparing the evidence for these competing hypotheses based on empirical
data (Smith, Kirlic, Stewart, Touthang, Kuplicki, Khalsa, et al., 2021; Smith, Kirlic, Stewart, Touthang, Kuplicki, McDermott, et al., 2021; Smith, Schwartenbeck, Stewart, et al., 2020; Smith, Taylor, et al., 2022). Current empirical evidence suggests that humans are not purely reward-maximizing agents; they also engage in both random and directed exploration (Daw et al., 2006; Gershman, 2018; Mirza et al., 2018; Schulz & Gershman, 2019; Wilson et al., 2021, 2014; Xu et al., 2021) and keep their options open (Schwartenbeck, FitzGerald, Mathys, Dolan, Kronbichler, et al., 2015). As we have illustrated, active inference implements a clear form of directed exploration through minimizing expected free energy. Although not covered in detail here, active inference can also accommodate random exploration by sampling actions from the posterior belief over action sequences, as opposed to selecting the most likely action as presented in Tables 1 and 2.

Note that behavioral evidence favoring models that do not solely maximize reward within reward-maximization tasks—that is, where “maximize
reward” is the explicit instruction—is not a contradiction. Rather, gathering
information about the environment (exploration) generally helps to reap
more reward in the long run, as opposed to greedily maximizing reward
based on imperfect knowledge (Cullen et al., 2018; Sajid, Ball, et al., 2021).
This observation is not new, and many approaches to simulating adaptive
agents employed today differ significantly from their reward-maximizing
antecedents (see appendix A.3).

6.2 Learning. When the transition probabilities or reward function are
unknown to the agent, the problem becomes one of reinforcement learning
(RL; Shoham et al., 2003) as opposed to stochastic control. Although we did
not explicitly consider it above, this scenario can be accommodated by active inference by simply equipping the generative model with a prior and
updating the model using variational Bayesian inference to best fit observed
data. Depending on the specific learning problem and generative model
structure, this can involve updating the transition probabilities and/or the
target distribution C. In POMDPs it can also involve updating the probabilities of observations under each state. We refer to appendix A.2 for
discussion of reward learning through active inference and connections to
representative RL approaches, and Da Costa et al. (2020) and Friston et al.
(2016) for learning transition probabilities through active inference.

6.3 Scaling Active Inference. When comparing RL and active inference
approaches generally, one outstanding issue for active inference is whether
it can be scaled up to solve the more complex problems currently handled


by RL in machine learning contexts (Çatal et al., 2020, 2021; Fountas et al.,
2020; Mazzaglia et al., 2021; Millidge, 2020; Tschantz et al., 2019). This is an
area of active research.

One important issue along these lines is that planning ahead by evaluat-
ing all or many possible sequences of actions is computationally prohibitive
in many applications. Three complementary solutions have emerged:
(1) employing hierarchical generative models that factorize decisions into
multiple levels and reduce the size of the decision tree by orders of magni-
tude (Çatal et al., 2021; Friston et al., 2018; Parr et al., 2021); (2) efficiently
searching the decision tree using algorithms like Monte Carlo tree search
(Champion, Bowman, et al., 2021; Champion, Da Costa, et al., 2021; Foun-
tas et al., 2020; Maisto et al., 2021; Silver et al., 2016); y (3) amortizing
planning using artificial neural networks (Çatal et al., 2019; Fountas et al.,
2020; Millidge, 2019; Sajid, Tigas, et al., 2021).

Another issue rests on learning the generative model. Active inference
may readily learn the parameters of a generative model; however, more
work needs to be done on devising algorithms for learning the structure of
generative models themselves (Friston, Lin, et al., 2017; Smith, Schwartenbeck, Parr, et al., 2020). This is an important research problem in generative
modeling, called Bayesian model selection or structure learning (Gershman
& Niv, 2010; Tervo et al., 2016).

Note that these issues are not unique to active inference. Model-based
RL algorithms deal with the same combinatorial explosion when evaluating decision trees, which is one primary motivation for developing efficient
model-free RL algorithms. However, other heuristics have also been developed for efficiently searching and pruning decision trees in model-based RL
(Huys et al., 2012; Lally et al., 2017). Moreover, model-based RL suffers
the same limitation regarding learning generative model structure. Yet RL
may have much to offer active inference in terms of efficient implementation and the identification of methods to scale to more complex applications
(Fountas et al., 2020; Mazzaglia et al., 2021).

7 Conclusion

In summary, we have shown that under the specification that the active
inference agent prefers maximizing reward, equation 4.3:

1. On finite horizon POMDPs with known transition probabilities, the
objective optimized for action selection in active inference (i.e., expected free energy) produces reward-maximizing action sequences
when state estimation is exact. When there are multiple reward-maximizing candidates, this selects those sequences that maximize
the entropy of future states—thereby keeping options open—and
that minimize the ambiguity of future observations so that they are
maximally informative. More generally, the extent to which action


sequences will be reward maximizing will depend on the accuracy
of state estimation.

2. The standard active inference scheme (e.g., Barp et al., 2022, section
5) produces Bellman optimal actions for planning horizons of one
when state estimation is exact but not beyond.

3. A sophisticated active inference scheme (e.g., Friston et al., 2021) produces Bellman optimal actions on any finite planning horizon when
state estimation is exact. Moreover, this scheme generalizes the
well-known backward induction algorithm from dynamic programming to partially observed environments. Note that for computational efficiency, the scheme presented in Friston et al. (2021) does
not generally perform exact state estimation; thus, the extent to which
it will maximize reward in practice will depend on the accuracy of its
inferences. Nevertheless, it is clear from our results that sophisticated
active inference will vastly outperform standard active inference in
most reward-maximization tasks.


In conclusion, the sophisticated active inference scheme should be the
method of choice when applying active inference to optimally solve the
reward-maximization problems considered here.

Appendix A: Active Inference and Reinforcement Learning

This article considers how active inference can solve the stochastic control
problem. In this appendix, we discuss the broader relationship between active inference and RL.

Loosely speaking, RL is the field of methodologies and algorithms that
learn reward-maximizing actions from data and seek to maximize reward
in the long run. Because RL is a data-driven field, algorithms are selected
based on how well they perform on benchmark problems. This has produced a plethora of diverse algorithms, many designed to solve specific
problems, each with its own strengths and limitations. This makes RL difficult to characterize as a whole. Fortunately, many approaches to model-based RL and control can be traced back to approximating the optimal
solution to the Bellman equation (Bellman & Dreyfus, 2015; Bertsekas &
Shreve, 1996) (although this may become computationally intractable in
high dimensions; Barto & Sutton, 1992). Our results showed how and when
decisions under active inference and such RL approaches are similar.


This appendix discusses how active inference and RL relate and differ
more generally. Their relationship has become increasingly important to
understand, as a growing body of research has begun to (1) compare the
performance of active inference and RL models in simulated environments
(Cullen et al., 2018; Millidge, 2020; Sajid, Ball, et al., 2021), (2) apply active
inference to model human behavior on reward learning tasks (Smith, Kirlic, Stewart, Touthang, Kuplicki, Khalsa, et al., 2021; Smith, Kirlic, Stewart, Touthang, Kuplicki, McDermott, et al., 2021; Smith, Schwartenbeck,
Stewart, et al., 2020; Smith, Taylor, et al., 2022), and (3) consider the complementary predictions and interpretations each offers in computational neuroscience, psychology, and psychiatry (Cullen et al., 2018; Huys et al., 2012;
Schwartenbeck, FitzGerald, Mathys, Dolan, & Friston, 2015; Schwartenbeck
et al., 2019; Tschantz, Seth, et al., 2020).

A.1 Main Differences between Active Inference and Reinforcement Learning.

A.1.1 Philosophy. Active inference and RL differ profoundly in their philosophy. RL derives from the normative principle of maximizing reward
(Barto & Sutton, 1992), while active inference describes systems that maintain their structural integrity over time (Barp et al., 2022; Friston et al., 2022).
Despite this difference, these frameworks have many practical similarities.
For example, recall that behavior in active inference is completely determined by the agent’s preferences, specified as priors in their generative
model. Crucially, log priors can be interpreted as reward functions and vice
versa, which is how behavior under RL and active inference can be related.

A.1.2 Model Based and Model Free. Active inference agents always embody a generative (i.e., forward) model of their environment, while RL
comprises both model-based and simpler model-free algorithms. In brief,
“model-free” means that agents learn a reward-maximizing state-action
mapping, based on updating cached state-action pair values through initially random actions that do not consider future state transitions. In
contrast, model-based RL algorithms attempt to extend stochastic control
approaches by learning the dynamics and reward function from data. Recall that stochastic control calls on strategies that evaluate different actions
on a carefully handcrafted forward model of dynamics (i.e., known transition probabilities) to finally execute the reward-maximizing action. Under
this terminology, all active inference agents are model-based.

A.1.3 Modeling Exploration. Exploratory behavior—which can improve
reward maximization in the long run—is implemented differently in the
two approaches. In most cases, RL implements a simple form of explo-
ration by incorporating randomness in decision making (Tokic & Palm,
2011; Wilson et al., 2014), where the level of randomness may or may not


change over time as a function of uncertainty. In other cases, RL incorporates ad hoc information bonuses in the reward function or other decision-making objectives to build in directed exploratory drives (e.g., upper-confidence-bound algorithms or Thompson sampling). In contrast, directed
exploration emerges naturally within active inference through interactions
between the risk and ambiguity terms in the expected free energy (Da Costa
et al., 2020; Schwartenbeck et al., 2019). This addresses the explore-exploit
dilemma and confers the agent with artificial curiosity (Friston, Lin, et al.,
2017; Schmidhuber, 2010; Schwartenbeck et al., 2019; Still & Precup, 2012),
as opposed to the need to add ad hoc information bonus terms (Tokic &
Palm, 2011). We expand on this relationship in appendix A.3.
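As a concrete illustration of this decomposition, the sketch below computes the expected free energy of candidate actions as risk plus ambiguity on a small, hypothetical POMDP, with preferences derived from a reward function as C_β(s) ∝ exp(βR(s)). All numbers, the action names, and the predicted state distributions are illustrative assumptions, not quantities from this article.

# Minimal sketch (hypothetical toy POMDP, not the paper's code): expected free
# energy of an action as risk (KL from preferences) plus ambiguity
# (expected entropy of the likelihood), with C defined from a reward function.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

R = np.array([0.0, 1.0])                  # reward over the two hidden states
beta = 4.0                                # inverse temperature; beta -> inf gives pure reward seeking
C = softmax(beta * R)                     # preference distribution over states, C_beta(s) ∝ exp(beta R(s))

A = np.array([[0.9, 0.2],                 # likelihood P(o | s): columns index states
              [0.1, 0.8]])
Q_s_given_a = {                           # predicted state distribution under each action (assumed given)
    "stay":  np.array([0.8, 0.2]),
    "shift": np.array([0.3, 0.7]),
}

def expected_free_energy(q_s):
    risk = np.sum(q_s * (np.log(q_s) - np.log(C)))      # KL[Q(s|a) || C(s)]
    H_o_given_s = -np.sum(A * np.log(A), axis=0)        # entropy of P(o|s) for each state s
    ambiguity = np.sum(q_s * H_o_given_s)               # E_Q(s|a)[H[P(o|s)]]
    return risk + ambiguity

G = {a: expected_free_energy(q) for a, q in Q_s_given_a.items()}
print(min(G, key=G.get))   # the action that minimizes expected free energy

The risk term draws the agent toward preferred (rewarding) states, while the ambiguity term favors actions whose outcomes are informative about the underlying state; no separate exploration bonus is added by hand.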

A.1.4 Control and Learning as Inference. Active inference integrates state
estimation, learning, decision making, and motor control under the single
objective of minimizing free energy (Da Costa et al., 2020). Indeed, active
inference extends previous work on the duality between inference and control (Kappen et al., 2012; Rawlik et al., 2013; Todorov, 2008; Toussaint, 2009)
to solve motor control problems via approximate inference (i.e., planning
as inference: Attias, 2003; Botvinick & Toussaint, 2012; Friston et al., 2012,
2009; Millidge, Tschantz, Seth, et al., 2020). Therefore, some of the closest RL methods to active inference are control as inference, also known as
maximum entropy RL (Levine, 2018; Millidge, Tschantz, Seth, et al., 2020;
Ziebart, 2010), though one major difference is in the choice of decision-making objective. Loosely speaking, these aforementioned methods minimize the risk term of the expected free energy, while active inference also
minimizes ambiguity.

Useful Features of Active Inference

1. Active inference allows great flexibility and transparency when mod-
eling behavior. It affords explainable decision making as a mixture of
information- and reward-seeking policies that are explicitly encoded
(and evaluated in terms of expected free energy) in the generative
model as priors, which are specified by the user (Da Costa, Lanillos,
et al., 2022). As we have seen, the kind of behavior that can be pro-
duced includes the optimal solution to the Bellman equation.

2. Active inference accommodates deep hierarchical generative models
combining both discrete and continuous state-spaces (Friston, Parr,
et al., 2017; Friston et al., 2018; Parr et al., 2021).

3. The expected free energy objective optimized during planning sub-
sumes many approaches used to describe and simulate decision mak-
ing in the physical, engineering, and life sciences, affording it various
interesting properties as an objective (see figure 3 and Friston et al.,
2021). Por ejemplo, exploratory and exploitative behavior are canon-
ically integrated, which finesses the need for manually incorporating


ad hoc exploration bonuses in the reward function (Da Costa, Tenka,
et al., 2022).

4. Active inference goes beyond state-action policies that predominate
in traditional RL to sequential policy optimization. In sequential pol-
icy optimization, one relaxes the assumption that the same action is
optimal given a particular state and acknowledges that the sequen-
tial order of actions may matter. This is similar to the linearly solvable
MDP formulation presented by Todorov (2006, 2009), where transi-
tion probabilities directly determine actions and an optimal policy
specifies transitions that minimize some divergence cost. This way
of approaching policies is perhaps most apparent in terms of explo-
ration. Put simply, it is clearly better to explore and then exploit than
the converse. Because expected free energy is a functional of beliefs,
exploration becomes an integral part of decision making—in contrast
with traditional RL approaches that try to optimize a reward function
of states. In other words, active inference agents will explore until
enough uncertainty is resolved for reward-maximizing, goal-seeking
imperatives to start to predominate.

Such advantages should motivate future research to better characterize
the environments in which these properties offer useful advantages—such
as where performance benefits from learning and planning at multiple tem-
poral scales and from the ability to select policies that resolve both state and
parameter uncertainty.

A.2 Reward Learning. Given the focus on relating active inference to
the objective of maximizing reward, it is worth briefly illustrating how active inference can learn the reward function from data and its potential connections to representative RL approaches. One common approach for active
inference to learn a reward function (Smith, Schwartenbeck, Stewart, et al.,
2020; Smith, Taylor, et al., 2022) is to set preferences over observations rather
than states, which corresponds to assuming that inferences over states given
outcomes are accurate:

D_KL[Q(s⃗ | a⃗, õ) ‖ C(s⃗)]        [Risk (states)]
  = D_KL[Q(o⃗ | a⃗, õ) ‖ C(o⃗)]        [Risk (outcomes)]
    + E_{Q(o⃗|a⃗,õ)}[ D_KL[Q(s⃗ | o⃗, õ, a⃗) ‖ P(s⃗ | o⃗)] ]        [≈ 0]
  ≈ D_KL[Q(o⃗ | a⃗, õ) ‖ C(o⃗)]        [Risk (outcomes)],

that is, equality holds whenever the free energy minimum is reached
(see equation 5.2). Then one sets the preference distribution such that the


observations designated as rewards are most preferred. In the zero temperature limit (see equation 4.3), preferences only assign mass to reward-maximizing observations. When formulated in this way, the reward signal
is treated as sensory data, as opposed to a separate signal from the environment. When one sets allowable actions (controllable state transitions)
to be fully deterministic such that the selection of each action will transition the agent to a given state with certainty, the emerging dynamics are
such that the agent chooses actions to resolve uncertainty about the probability of observing reward under each state. Thus, learning the reward
probabilities of available actions amounts to learning the likelihood matrix
P(o_t | s_t) := o_t · A s_t, where A is a stochastic matrix. This is done by setting a
prior a over A, that is, a matrix of nonnegative components, the columns
of which are Dirichlet priors over the columns of A. The agent then learns
by accumulating Dirichlet parameters. Explicitly, at the end of a trial or
episode, one sets (Da Costa et al., 2020; Friston et al., 2016)

a ← a + Σ_{τ=0}^{T} o_τ ⊗ Q(s_τ | o_{0:T}).        (A.1)

In equation A.1, Q(s_τ | o_{0:T}) is seen as a vector of probabilities over the state-space S, corresponding to the probability of having been in one or another
state at time τ after having gathered observations throughout the trial. This
rule simply amounts to counting observed state-outcome pairs, which is
equivalent to state-reward pairs when the observation modalities correspond to reward.
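A minimal sketch of this counting rule is given below. It assumes one-hot (reward) observations and posterior state beliefs from a single completed trial; the array shapes and values are illustrative placeholders, not the implementation used in the cited work.

# Minimal sketch of the Dirichlet update in equation A.1: accumulate
# observation-state co-occurrence counts in the Dirichlet parameters "a"
# over the likelihood matrix A (hypothetical sizes and data).
import numpy as np

n_obs, n_states, T = 2, 2, 5
a = np.ones((n_obs, n_states))                    # Dirichlet prior over the columns of A

# One-hot observations o_tau and posterior state beliefs Q(s_tau | o_{0:T})
# from a completed trial (values are illustrative).
o = np.eye(n_obs)[np.random.randint(n_obs, size=T + 1)]        # shape (T+1, n_obs)
Q_s = np.random.dirichlet(np.ones(n_states), size=T + 1)       # shape (T+1, n_states)

for tau in range(T + 1):
    a += np.outer(o[tau], Q_s[tau])               # a <- a + o_tau ⊗ Q(s_tau | o_{0:T})

A_expected = a / a.sum(axis=0, keepdims=True)     # posterior expectation of P(o|s), column-stochastic
print(A_expected)

As counts accumulate, the Dirichlet distribution over each column of A concentrates, which is how the agent's uncertainty about the reward probabilities is gradually resolved.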

One should not conflate this approach with the update rule consisting of

accumulating state-observation counts in the likelihood matrix,

A ← A + Σ_{τ=0}^{T} o_τ ⊗ Q(s_τ | o_{0:T}),        (A.2)

and then normalizing its columns to sum to one when computing probabilities. The latter simply approximates the likelihood matrix A by accumulating the number of observed state-outcome pairs. This is distinct from the
approach outlined above, which encodes uncertainty over the matrix A, as
a probability distribution over possible distributions P(o_t | s_t). The agent
is initially very unconfident about A, which means that it does not place
high-probability mass on any specification of P(o_t | s_t). This uncertainty is
gradually resolved by observing state-observation (or state-reward) pairs.
Computationally, it is a general fact of Dirichlet priors that an increase
in elements of a causes the entropy of P(o_t | s_t) to decrease. As the terms
added in equation A.1 are always positive, one choice of distribution


PAG(ot | st )—which best matches available data and prior beliefs—is ulti-
mately singled out. En otras palabras, the likelihood mapping is learned.

The update rule consisting of accumulating state-observation counts in
the likelihood matrix (see equation A.2) (i.e., not incorporating Dirichlet
priors) bears some similarity to off-policy learning algorithms such as Q-learning. In Q-learning, the objective is to find the best action given the current observed state. To do this, the Q-learning agent accumulates values for
state-action pairs with repeated observation of rewarding or punishing action outcomes—much like state-observation counts. This allows it to learn
the Q-value function that defines a reward-maximizing policy.
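For comparison, a standard tabular Q-learning update is sketched below; the environment interface and the hyperparameters are hypothetical placeholders rather than a setup from this article. Where equation A.1 accumulates Dirichlet counts that quantify uncertainty about a likelihood mapping, Q-learning caches scalar state-action values and relies on added randomness (here, epsilon-greedy) for exploration.

# Minimal sketch of the standard tabular Q-learning update (not code from the
# paper); states, actions, and hyperparameters are illustrative placeholders.
import numpy as np

n_states, n_actions = 4, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.95, 0.1      # learning rate, discount, exploration rate

def update(s, a, r, s_next):
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

def act(s):
    # epsilon-greedy action selection: random exploration with probability epsilon
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(Q[s].argmax())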

As always in partially observed environments, we cannot guarantee that
the true likelihood mapping will be learned in practice. Smith et al. (2019)
provides examples where, although not in an explicit reward-learning context, learning the likelihood can be more or less successful in different situations. Learning the true likelihood fails when the inference over states
is inaccurate, such as when using too severe a mean-field approximation
to the free energy (Blei et al., 2017; Parr et al., 2019; Tanaka, 1999), which
causes the agent to misinfer states and thereby accumulate Dirichlet parameters in the wrong locations. Intuitively, this amounts to jumping to conclusions too quickly.

Remark 8. If so desired, reward learning in active inference can also be
equivalently formulated as learning transition probabilities P(s_{t+1} | s_t, a_t).
In this alternative setup (as exemplified in Sales et al. (2019)), mappings
between reward states and reward outcomes in A are set as identity matrices, and the agent instead learns the probability of transitioning to states
that deterministically generate preferred (rewarding) observations given
the choice of each action sequence. The transition probabilities under each
action are learned in a similar fashion as above (see equation A.1), by accumulating counts on a Dirichlet prior over P(s_{t+1} | s_t, a_t). See Da Costa et al.,
2020, appendix, for details.

Given the model-based Bayesian formulation of active inference, more
direct links can be made between the active inference approach to reward
learning described above and other Bayesian model-based RL approaches.
For such links to be realized, the Bayesian RL agent would be required to
have a prior over a prior (e.g., a prior over the reward function prior or
transition function prior). One way to implicitly incorporate this is through
Thompson sampling (Ghavamzadeh et al., 2016; Russo & Van Roy, 2014,
2016; Russo et al., 2017). While that is not the focus of this article, future
work could further examine the links between reward learning in active
inference and model-based Bayesian RL schemes.

A.3 Solving the Exploration-Exploitation Dilemma. An important
distinction between active inference and reinforcement learning schemes
is how they solve the exploration-exploitation dilemma.


The exploration-exploitation dilemma (Berger-Tal et al., 2014) arises
whenever an agent has incomplete information about its environment, such
as when the environment is partially observed or the generative model has
to be learned. The dilemma is then about deciding whether to execute actions aiming to collect reward based on imperfect information about the
environment or to execute actions aiming to gather more information—allowing the agent to reap more reward in the future. Intuitively, it is always best to explore and then exploit, but optimizing this trade-off can be
difficult.

Active inference balances exploration and exploitation through the risk
and ambiguity terms that are jointly minimized in the expected free energy.
This balance is context sensitive and can be adjusted by modifying the agent’s preferences (Da Costa, Lanillos, et al., 2022). In turn,
the expected free energy is obtained from a description of agency in biological systems derived from physics (Barp et al., 2022; Friston et al.,
2022).

Modern RL algorithms integrate exploratory and exploitative behavior
in many different ways. One option is curiosity-driven rewards to encourage exploration. Maximum entropy RL and control-as-inference make
decisions by minimizing a KL divergence to the target distribution (Eysenbach & Levine, 2019; Haarnoja et al., 2017, 2018; Levine, 2018; Todorov, 2008;
Ziebart et al., 2008), which combines reward maximization with maximum
entropy over states. This is similar to active inference on MDPs (Millidge,
Tschantz, Seth, et al., 2020). Similarly, the model-free Soft Actor-Critic
algorithm (Haarnoja et al., 2018) maximizes both expected reward and
entropy. This outperforms other state-of-the-art algorithms in continuous
control environments and has been shown to be more sample efficient than
its reward-maximizing counterparts (Haarnoja et al., 2018). Hyper (Zintgraf
et al., 2021) proposes reward maximization alongside minimizing uncertainty over both external states and model parameters. Bayes-adaptive
RL (Guez et al., 2013a, 2013b; Ross et al., 2008, 2011; Zintgraf et al., 2020)
provides policies that balance exploration and exploitation with the aim
of maximizing reward. Thompson sampling provides a way to balance
exploiting current knowledge to maximize immediate performance and accumulating new information to improve future performance (Russo et al.,
2017). This reduces to optimizing dual objectives, reward maximization and
information gain, similar to active inference on POMDPs. Empirically, Sajid,
Ball, et al. (2021) demonstrated that an active inference agent and a Bayesian
model-based RL agent using Thompson sampling exhibit similar behavior
when preferences are defined over outcomes. They also highlighted that
when the reward signal is completely removed from the environment, the
two agents both select policies that maximize some sort of information
gain.

In general, how these approaches differ in their treatment of the
exploration-exploitation dilemma, both in theory and in practice, remains largely unexplored.


Appendix B: Proofs

B.1 Proof of Proposition 1. Note that a Bellman optimal state-action
policy Π* is a maximal element according to the partial ordering ≤. Existence thus consists of a simple application of Zorn’s lemma. Zorn’s lemma
states that if any increasing chain

Π_1 ≤ Π_2 ≤ Π_3 ≤ . . .        (B.1)

has an upper bound that is a state-action policy, then there is a maximal
element Π*.

Given the chain equation B.1, we construct an upper bound. We enumerate A × S × T by (α_1, σ_1, t_1), . . . , (α_N, σ_N, t_N). Then the state-action policy
sequence

Π_n(α_1 | σ_1, t_1),   n = 1, 2, 3, . . .

is bounded within [0, 1]. By the Bolzano-Weierstrass theorem, there exists a sub-sequence Π_{n_k}(α_1 | σ_1, t_1), k = 1, 2, 3, . . . that converges. Similarly,
Π_{n_k}(α_2 | σ_2, t_2) is also a bounded sequence, and by Bolzano-Weierstrass, it
has a sub-sequence Π_{n_{k_j}}(α_2 | σ_2, t_2) that converges. We repeatedly take sub-sequences until N. To ease notation, call the resulting sub-sequence Π_m,
m = 1, 2, 3, . . .

With this, we define Π̂ = lim_{m→∞} Π_m. It is straightforward to see that Π̂
is a state-action policy:

Π̂(α | σ, t) = lim_{m→∞} Π_m(α | σ, t) ∈ [0, 1],        (α, σ, t) ∈ A × S × T,
Σ_{α∈A} Π̂(α | σ, t) = lim_{m→∞} Σ_{α∈A} Π_m(α | σ, t) = 1,        (σ, t) ∈ S × T.

To show that Π̂ is an upper bound, take any Π in the original chain of
state-action policies, equation B.1. Then by the definition of an increasing
sub-sequence, there exists an index M ∈ N such that ∀k ≥ M: Π_k ≥ Π. Since
limits commute with finite sums, we have v_{Π̂}(s, t) = lim_{m→∞} v_{Π_m}(s, t) ≥
v_{Π_k}(s, t) ≥ v_Π(s, t) for any (s, t) ∈ S × T. Thus, by Zorn’s lemma, there exists a Bellman optimal state-action policy Π*. □

B.2 Proof of Proposition 2. 1) ⇒ 2): We only need to show assertion
(b). By contradiction, suppose that ∃(s, a) ∈ S × A such that Π(a | s, 0) > 0
and

E_Π[R(s_{1:T}) | s_0 = s, a_0 = a] < max_{a∈A} E_Π[R(s_{1:T}) | s_0 = s, a_0 = a].

We let α′ be the Bellman optimal action at state s and time 0 defined as

α′ := arg max_{a∈A} E_Π[R(s_{1:T}) | s_0 = s, a_0 = a].

Then we let Π′ be the same state-action policy as Π except that Π′(· | s, 0)
assigns α′ deterministically. Then

v_Π(s, 0) = Σ_{a∈A} E_Π[R(s_{1:T}) | s_0 = s, a_0 = a] Π(a | s, 0)
          < max_{a∈A} E_Π[R(s_{1:T}) | s_0 = s, a_0 = a]
          = E_{Π′}[R(s_{1:T}) | s_0 = s, a_0 = α′] Π′(α′ | s, 0)
          = Σ_{a∈A} E_{Π′}[R(s_{1:T}) | s_0 = s, a_0 = a] Π′(a | s, 0)
          = v_{Π′}(s, 0).

So Π is not Bellman optimal, which is a contradiction.

1) ⇐ 2): We only need to show that Π maximizes v_Π(s, 0), ∀s ∈ S. By
contradiction, there exists a state-action policy Π′ and a state s ∈ S such that

v_Π(s, 0) < v_{Π′}(s, 0)
⟺ Σ_{a∈A} E_Π[R(s_{1:T}) | s_0 = s, a_0 = a] Π(a | s, 0) < Σ_{a∈A} E_{Π′}[R(s_{1:T}) | s_0 = s, a_0 = a] Π′(a | s, 0).

By (b), the left-hand side equals

max_{a∈A} E_Π[R(s_{1:T}) | s_0 = s, a_0 = a].

Unpacking the expression on the right-hand side,

Σ_{a∈A} E_{Π′}[R(s_{1:T}) | s_0 = s, a_0 = a] Π′(a | s, 0)
  = Σ_{a∈A} Σ_{σ∈S} E_{Π′}[R(s_{1:T}) | s_1 = σ] P(s_1 = σ | s_0 = s, a_0 = a) Π′(a | s, 0)
  = Σ_{a∈A} Σ_{σ∈S} (E_{Π′}[R(s_{2:T}) | s_1 = σ] + R(σ)) P(s_1 = σ | s_0 = s, a_0 = a) Π′(a | s, 0)
  = Σ_{a∈A} Σ_{σ∈S} (v_{Π′}(σ, 1) + R(σ)) P(s_1 = σ | s_0 = s, a_0 = a) Π′(a | s, 0).        (B.2)

Since Π is Bellman optimal when restricted to {1, . . . , T}, we have v_{Π′}(σ, 1) ≤ v_Π(σ, 1), ∀σ ∈ S. Therefore,

Σ_{a∈A} Σ_{σ∈S} (v_{Π′}(σ, 1) + R(σ)) P(s_1 = σ | s_0 = s, a_0 = a) Π′(a | s, 0)
  ≤ Σ_{a∈A} Σ_{σ∈S} (v_Π(σ, 1) + R(σ)) P(s_1 = σ | s_0 = s, a_0 = a) Π′(a | s, 0).

Repeating the steps above equation B.2, but in reverse order, yields

Σ_{a∈A} E_{Π′}[R(s_{1:T}) | s_0 = s, a_0 = a] Π′(a | s, 0) ≤ Σ_{a∈A} E_Π[R(s_{1:T}) | s_0 = s, a_0 = a] Π′(a | s, 0).

However,

Σ_{a∈A} E_Π[R(s_{1:T}) | s_0 = s, a_0 = a] Π′(a | s, 0) ≤ max_{a∈A} E_Π[R(s_{1:T}) | s_0 = s, a_0 = a],

which is a contradiction. □

B.3 Proof of Proposition 3. We first prove that state-action policies Π
defined as in equation 2.2 are Bellman optimal by induction on T.

T = 1:

Π(a | s, 0) > 0 ⟺ a ∈ arg max_a E[R(s_1) | s_0 = s, a_0 = a],        ∀s ∈ S

is a Bellman optimal state-action policy as it maximizes the total reward
possible in the MDP.

Let T > 1 be finite and suppose that the proposition holds for MDPs with
a temporal horizon of T − 1. This means that

Π(a | s, T − 1) > 0 ⟺ a ∈ arg max_a E[R(s_T) | s_{T−1} = s, a_{T−1} = a],        ∀s ∈ S,
Π(a | s, T − 2) > 0 ⟺ a ∈ arg max_a E_Π[R(s_{T−1:T}) | s_{T−2} = s, a_{T−2} = a],        ∀s ∈ S,
⋮
Π(a | s, 1) > 0 ⟺ a ∈ arg max_a E_Π[R(s_{2:T}) | s_1 = s, a_1 = a],        ∀s ∈ S,

is a Bellman optimal state-action policy on the MDP restricted to times 1 to
T. Therefore, since


Π(a | s, 0) > 0 ⟺ a ∈ arg max_a E_Π[R(s_{1:T}) | s_0 = s, a_0 = a],        ∀s ∈ S,

Proposition 2 allows us to deduce that Π is Bellman optimal.

We now show that any Bellman optimal state-action policy satisfies the
backward induction algorithm, equation 2.2.

Suppose by contradiction that there exists a state-action policy Π that is
Bellman optimal but does not satisfy equation 2.2. Say, there exists (a, s, t) ∈ A × S × T, t < T, such that Π(a | s, t) > 0 and a ∉ arg max_{α∈A} E_Π[R(s_{t+1:T}) | s_t = s, a_t = α].

This implies

E_Π[R(s_{t+1:T}) | s_t = s, a_t = a] < max_{α∈A} E_Π[R(s_{t+1:T}) | s_t = s, a_t = α].

Let ã ∈ arg max_α E_Π[R(s_{t+1:T}) | s_t = s, a_t = α]. Let Π̃ be a state-action policy such that Π̃(· | s, t) assigns ã ∈ A deterministically and such that Π̃ = Π
otherwise. Then we can contradict the Bellman optimality of Π as follows:

v_Π(s, t) = E_Π[R(s_{t+1:T}) | s_t = s]
          = Σ_{α∈A} E_Π[R(s_{t+1:T}) | s_t = s, a_t = α] Π(α | s, t)
          < max_{α∈A} E_Π[R(s_{t+1:T}) | s_t = s, a_t = α]
          = E_Π[R(s_{t+1:T}) | s_t = s, a_t = ã]
          = E_{Π̃}[R(s_{t+1:T}) | s_t = s, a_t = ã]
          = Σ_{α∈A} E_{Π̃}[R(s_{t+1:T}) | s_t = s, a_t = α] Π̃(α | s, t)
          = v_{Π̃}(s, t). □

B.4 Proof of Lemma 1.

lim_{β→+∞} arg min_{a⃗} D_KL[Q(s⃗ | a⃗, s_t) ‖ C_β(s⃗)]
  = lim_{β→+∞} arg min_{a⃗} −H[Q(s⃗ | a⃗, s_t)] + E_{Q(s⃗|a⃗,s_t)}[−log C_β(s⃗)]
  = lim_{β→+∞} arg min_{a⃗} −H[Q(s⃗ | a⃗, s_t)] − β E_{Q(s⃗|a⃗,s_t)}[R(s⃗)]
  = lim_{β→+∞} arg max_{a⃗} H[Q(s⃗ | a⃗, s_t)] + β E_{Q(s⃗|a⃗,s_t)}[R(s⃗)]
  ⊆ lim_{β→+∞} arg max_{a⃗} β E_{Q(s⃗|a⃗,s_t)}[R(s⃗)]
  = arg max_{a⃗} E_{Q(s⃗|a⃗,s_t)}[R(s⃗)].

The inclusion follows from the fact that, as β → +∞, a minimizer of the
expected free energy has to maximize E_{Q(s⃗|a⃗,s_t)}[R(s⃗)]. Among such action sequences, the expected free energy minimizers are those that maximize the
entropy of future states H[Q(s⃗ | a⃗, s_t)]. □

B.5 Proof of Theorem 1. When T = 1, the only action is a_0. We fix an
arbitrary initial state s_0 = s ∈ S. By proposition 2, a Bellman optimal state-action policy is fully characterized by an action a*_0 that maximizes immediate reward:

a*_0 ∈ arg max_{a∈A} E[R(s_1) | s_0 = s, a_0 = a].

Recall that by remark 5, this expectation stands for return under the transition probabilities of the MDP:

a*_0 ∈ arg max_{a∈A} E_{P(s_1|a_0=a,s_0=s)}[R(s_1)].

Since transition probabilities are assumed to be known (see equation 3.1),
this reads

a*_0 ∈ arg max_{a∈A} E_{Q(s_1|a_0=a,s_0=s)}[R(s_1)].

On the other hand,

a_0 ∈ lim_{β→+∞} arg max_{a∈A} exp(−G(a | s_t)) = lim_{β→+∞} arg min_{a∈A} G(a | s_t).

By lemma 1, this implies

a_0 ∈ arg max_{a∈A} E_{Q(s_1|a_0=a,s_0=s)}[R(s_1)],

which concludes the proof. □

B.6 Proof of Theorem 2. We prove this result by induction on the temporal horizon T of the MDP.

The proof of the theorem when T = 1 can be seen from the proof of theorem 1. Now suppose that T > 1 is finite and that the theorem holds for
MDPs with a temporal horizon of T − 1.


Our induction hypothesis says that Q(· | ·), as defined in equation 4.8,
is a Bellman optimal state-action policy on the MDP restricted to times τ =
1, . . . , T. Therefore, by proposition 2, we only need to show that the action
a_0 selected under active inference satisfies

a_0 ∈ arg max_{a∈A} E_Q[R(s⃗) | s_0, a_0 = a].

This is simple to show as

arg max_{a∈A} E_Q[R(s⃗) | s_0, a_0 = a]
  = arg max_{a∈A} E_{P(s⃗|a_{1:T},a_0=a,s_0) Q(a⃗|s_{1:T})}[R(s⃗)]        (by remark 4)
  = arg max_{a∈A} E_{Q(s⃗,a⃗|a_0=a,s_0)}[R(s⃗)]        (as the transitions are known)
  = lim_{β→+∞} arg max_{a∈A} E_{Q(s⃗,a⃗|a_0=a,s_0)}[β R(s⃗)]
  ⊇ lim_{β→+∞} arg max_{a∈A} E_{Q(s⃗,a⃗|a_0=a,s_0)}[β R(s⃗)] + H[Q(s⃗ | a⃗, a_0 = a, s_0)]
  = lim_{β→+∞} arg min_{a∈A} E_{Q(s⃗,a⃗|a_0=a,s_0)}[−log C_β(s⃗)] − H[Q(s⃗ | a⃗, a_0 = a, s_0)]        (by equation 4.1)
  = lim_{β→+∞} arg min_{a∈A} E_{Q(s⃗,a⃗|a_0=a,s_0)} D_KL[Q(s⃗ | a⃗, a_0 = a, s_0) ‖ C_β(s⃗)]
  = lim_{β→+∞} arg min_{a∈A} G(a_0 = a | s_0)        (by equation 4.6).

Therefore, the action a_0 selected under active inference is that of a Bellman
optimal state-action policy on finite temporal horizons. Moreover, the inclusion
follows from the fact that if there are multiple actions that maximize expected reward, the one selected under active inference maximizes the
entropy of beliefs about future states. □

B.7 Proof of Proposition 4. Unpacking the zero temperature limit,

lim_{β→+∞} arg min_{a⃗} G(a⃗ | õ)
  = lim_{β→+∞} arg min_{a⃗} D_KL[Q(s⃗ | a⃗, õ) ‖ C_β(s⃗)] + E_{Q(s⃗|a⃗,õ)} H[P(o⃗ | s⃗)]
  = lim_{β→+∞} arg min_{a⃗} −H[Q(s⃗ | a⃗, õ)] + E_{Q(s⃗|a⃗,õ)}[−log C_β(s⃗)] + E_{Q(s⃗|a⃗,õ)} H[P(o⃗ | s⃗)]
  = lim_{β→+∞} arg min_{a⃗} −H[Q(s⃗ | a⃗, õ)] − β E_{Q(s⃗|a⃗,õ)}[R(s⃗)] + E_{Q(s⃗|a⃗,õ)} H[P(o⃗ | s⃗)]        (by equation 4.1)


  ⊆ lim_{β→+∞} arg max_{a⃗} β E_{Q(s⃗|a⃗,õ)}[R(s⃗)]
  = arg max_{a⃗} E_{Q(s⃗|a⃗,õ)}[R(s⃗)].

The inclusion follows from the fact that as β → +∞, a minimizer of
the expected free energy has first and foremost to maximize E_{Q(s⃗|a⃗,õ)}[R(s⃗)].
Among such action sequences, the expected free energy minimizers are
those that maximize the entropy of (beliefs about) future states H[Q(s⃗ | a⃗, õ)]
and resolve ambiguity about future outcomes by minimizing E_{Q(s⃗|a⃗,õ)} H[P(o⃗ | s⃗)]. □

Acknowledgments

We thank Dimitrije Markovic and Quentin Huys for providing helpful feed-
back during the preparation of the manuscript.

Funding Information

L.D. is supported by the Fonds National de la Recherche, Luxembourg
(Project code: 13568875). N.S. is funded by the Medical Research Coun-
cil (MR/S502522/1) and a 2021-2022 Microsoft PhD Fellowship. K.F. is sup-
ported by funding for the Wellcome Centre for Human Neuroimaging
(Ref: 205103/Z/16/Z), a Canada-U.K. Artificial Intelligence Initiative (Ref:
ES/T01279X/1), and the European Union’s Horizon 2020 Framework Pro-
gramme for Research and Innovation under the Specific Grant Agreement
945539 (Human Brain Project SGA3). R.S. is supported by the William K.
Warren Foundation, the Well-Being for Planet Earth Foundation, the Na-
tional Institute for General Medical Sciences (P20GM121312), and the Na-
tional Institute of Mental Health (R01MH123691). This publication is based
on work partially supported by the EPSRC Centre for Doctoral Training
in Mathematics of Random Systems: Analysis, Modelling and Simulation
(EP/S023925/1).

Author Contributions

L.D.: conceptualization, proofs, writing: first draft, review and editing. N.S.,
T.P., K.F., R.S.: conceptualization, writing: review and editing.

References

Adams, R. A., Stephan, K. E., Brown, H. R., Frith, C. D., & Friston, K. J. (2013). The computational anatomy of psychosis. Frontiers in Psychiatry, 4. 10.3389/fpsyt.2013.00047


Adda, J., & Cooper, R. W. (2003). Dynamic economics: Quantitative methods and applications. MIT Press.
Attias, H. (2003). Planning by probabilistic inference. In Proceedings of the 9th Int. Workshop on Artificial Intelligence and Statistics.
Barlow, H. B. (1961). Possible principles underlying the transformations of sensory messages. MIT Press.
Barlow, H. B. (1974). Inductive inference, coding, perception, and language. Perception, 3(2), 123–134. 10.1068/p030123, PubMed: 4457815
Barp, A., Da Costa, L., França, G., Friston, K., Girolami, M., Jordan, M. I., & Pavliotis, G. A. (2022). Geometric methods for sampling, optimisation, inference and adaptive agents. In F. Nielsen, A. S. R. Srinivasa Rao, & C. Rao (Eds.), Geometry and statistics. Elsevier.
Barto, A., Mirolli, M., & Baldassarre, G. (2013). Novelty or surprise? Frontiers in Psychology, 4. 10.3389/fpsyg.2013.00907
Barto, A., & Sutton, R. (1992). Reinforcement learning: An introduction. MIT Press.
Beal, M. J. (2003). Variational algorithms for approximate Bayesian inference. PhD diss., University of London.
Bellman, R. E. (1957). Dynamic programming. Princeton University Press.
Bellman, R. E., & Dreyfus, S. E. (2015). Applied dynamic programming. Princeton University Press.
Berger, J. O. (1985). Statistical decision theory and Bayesian analysis (2nd ed.). Springer-Verlag.
Berger-Tal, O., Nathan, J., Meron, E., & Saltz, D. (2014). The exploration-exploitation dilemma: A multidisciplinary framework. PLOS One, 9(4), e95693. 10.1371/journal.pone.0095693
Bertsekas, D. P., & Shreve, S. E. (1996). Stochastic optimal control: The discrete time case. Athena Scientific.
Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.
Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518), 859–877. 10.1080/01621459.2017.1285773
Botvinick, M., & Toussaint, M. (2012). Planning as inference. Trends in Cognitive Sciences, 16(10), 485–488. 10.1016/j.tics.2012.08.006, PubMed: 22940577
Çatal, O., Nauta, J., Verbelen, T., Simoens, P., & Dhoedt, B. (2019). Bayesian policy selection using active inference. http://arxiv.org/abs/1904.08149
Çatal, O., Verbelen, T., Nauta, J., Boom, C. D., & Dhoedt, B. (2020). Learning perception and planning with deep active inference. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 3952–3956). 10.1109/ICASSP40776.2020.9054364
Çatal, O., Verbelen, T., Van de Maele, T., Dhoedt, B., & Safron, A. (2021). Robot navigation as hierarchical active inference. Neural Networks, 142, 192–204. 10.1016/j.neunet.2021.05.010
Champion, T., Bowman, H., & Grześ, M. (2021). Branching time active inference: Empirical study and complexity class analysis. http://arxiv.org/abs/2111.11276
Champion, T., Da Costa, L., Bowman, H., & Grześ, M. (2021). Branching time active inference: The theory and its generality. http://arxiv.org/abs/2111.11107


Cullen, M., Davey, B., Friston, K. J., & Moran, R. J. (2018). Active inference in OpenAI Gym: A paradigm for computational investigations into psychiatric illness. Biological Psychiatry: Cognitive Neuroscience and Neuroimaging, 3(9), 809–818. 10.1016/j.bpsc.2018.06.010, PubMed: 30082215
Da Costa, L., Lanillos, P., Sajid, N., Friston, K., & Khan, S. (2022). How active inference could help revolutionise robotics. Entropy, 24(3), 361. 10.3390/e24030361
Da Costa, L., Parr, T., Sajid, N., Veselic, S., Neacsu, V., & Friston, K. (2020). Active inference on discrete state-spaces: A synthesis. Journal of Mathematical Psychology, 99, 102447. 10.1016/j.jmp.2020.102447
Da Costa, L., Tenka, S., Zhao, D., & Sajid, N. (2022). Active inference as a model of agency. Workshop on RL as a Model of Agency. https://www.sciencedirect.com/science/article/pii/S0022249620300857
Daw, N. D., O’Doherty, J. P., Dayan, P., Seymour, B., & Dolan, R. J. (2006). Cortical substrates for exploratory decisions in humans. Nature, 441(7095), 876–879. 10.1038/nature04766, PubMed: 16778890
Dayan, P., & Daw, N. D. (2008). Decision theory, reinforcement learning, and the brain. Cognitive, Affective, and Behavioral Neuroscience, 8(4), 429–453. 10.3758/CABN.8.4.429
Deci, E., & Ryan, R. M. (1985). Intrinsic motivation and self-determination in human behavior. Springer.
Eysenbach, B., & Levine, S. (2019). If MaxEnt RL is the answer, what is the question? arXiv:1910.01913.
Fountas, Z., Sajid, N., Mediano, P. A. M., & Friston, K. (2020). Deep active inference agents using Monte-Carlo methods. http://arxiv.org/abs/2006.04176
Friston, K., Da Costa, L., Sajid, N., Heins, C., Ueltzhöffer, K., Pavliotis, G. A., & Parr, T. (2022). The free energy principle made simpler but not too simple. http://arxiv.org/abs/2201.06387
Friston, K., Da Costa, L., Hafner, D., Hesp, C., & Parr, T. (2021). Sophisticated inference. Neural Computation, 33(3), 713–763. 10.1162/neco_a_01351, PubMed: 33626312
Friston, K., FitzGerald, T., Rigoli, F., Schwartenbeck, P., O’Doherty, J., & Pezzulo, G. (2016). Active inference and learning. Neuroscience and Biobehavioral Reviews, 68, 862–879. 10.1016/j.neubiorev.2016.06.022, PubMed: 27375276
Friston, K., FitzGerald, T., Rigoli, F., Schwartenbeck, P., & Pezzulo, G. (2017). Active inference: A process theory. Neural Computation, 29(1), 1–49. 10.1162/NECO_a_00912, PubMed: 27870614
Friston, K., Samothrakis, S., & Montague, R. (2012). Active inference and agency: Optimal control without cost functions. Biological Cybernetics, 106(8), 523–541. 10.1007/s00422-012-0512-8, PubMed: 22864468
Friston, K. J., Daunizeau, J., & Kiebel, S. J. (2009). Reinforcement learning or active inference? PLOS One, 4(7), e6421. 10.1371/journal.pone.0006421
Friston, K. J., Daunizeau, J., Kilner, J., & Kiebel, S. J. (2010). Action and behavior: A free-energy formulation. Biological Cybernetics, 102(3), 227–260. 10.1007/s00422-010-0364-z, PubMed: 20148260
Friston, K. J., Lin, M., Frith, C. D., Pezzulo, G., Hobson, J. A., & Ondobaka, S. (2017). Active inference, curiosity and insight. Neural Computation, 29(10), 2633–2683. 10.1162/neco_a_00999, PubMed: 28777724


Friston, K. J., Parr, T., & de Vries, B. (2017). The graphical brain: Belief propagation and active inference. Network Neuroscience, 1(4), 381–414. 10.1162/NETN_a_00018, PubMed: 29417960
Friston, K. J., Rosch, R., Parr, T., Price, C., & Bowman, H. (2018). Deep temporal models and active inference. Neuroscience and Biobehavioral Reviews, 90, 486–501. 10.1016/j.neubiorev.2018.04.004, PubMed: 29747865
Fudenberg, D., & Tirole, J. (1991). Game theory. MIT Press.
Gershman, S. J. (2018). Deconstructing the human algorithms for exploration. Cognition, 173, 34–42. 10.1016/j.cognition.2017.12.014, PubMed: 29289795
Gershman, S. J., & Niv, Y. (2010). Learning latent structure: Carving nature at its joints. Current Opinion in Neurobiology, 20(2), 251–256. 10.1016/j.conb.2010.02.008, PubMed: 20227271
Ghavamzadeh, M., Mannor, S., Pineau, J., & Tamar, A. (2016). Bayesian reinforcement learning: A survey. arXiv:1609.04436.
Guez, A., Silver, D., & Dayan, P. (2013a). Scalable and efficient Bayes-adaptive reinforcement learning based on Monte-Carlo tree search. Journal of Artificial Intelligence Research, 48, 841–883. 10.1613/jair.4117
Guez, A., Silver, D., & Dayan, P. (2013b). Efficient Bayes-adaptive reinforcement learning using sample-based search. http://arxiv.org/abs/1205.3109
Haarnoja, T., Tang, H., Abbeel, P., & Levine, S. (2017). Reinforcement learning with deep energy-based policies. arXiv:1702.08165.
Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. CoRR, abs/1801.01290. http://arxiv.org/abs/1801.01290
Huys, Q. J. M., Eshel, N., O’Nions, E., Sheridan, L., Dayan, P., & Roiser, J. P. (2012). Bonsai trees in your head: How the Pavlovian system sculpts goal-directed choices by pruning decision trees. PLOS Computational Biology, 8(3), e1002410. 10.1371/journal.pcbi.1002410
Itti, L., & Baldi, P. (2009). Bayesian surprise attracts human attention. Vision Research, 49(10), 1295–1306. 10.1016/j.visres.2008.09.007, PubMed: 18834898
Jaynes, E. T. (1957a). Information theory and statistical mechanics. Physical Review, 106(4), 620–630. 10.1103/PhysRev.106.620
Jaynes, E. T. (1957b). Information theory and statistical mechanics. II. Physical Review, 108(2), 171–190. 10.1103/PhysRev.108.171
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1998). An introduction to variational methods for graphical models. In M. I. Jordan (Ed.), Learning in graphical models (pp. 105–161). Springer Netherlands. 10.1007/978-94-011-5014-9_5
Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1), 99–134. 10.1016/S0004-3702(98)00023-X
Kahneman, D., & Tversky, A. (1979). Prospect theory: An analysis of decision under risk. Econometrica, 47(2), 263–291. 10.2307/1914185
Kappen, H. J., Gómez, V., & Opper, M. (2012). Optimal control as a graphical model inference problem. Machine Learning, 87(2), 159–182. 10.1007/s10994-012-5278-7
Klyubin, A. S., Polani, D., & Nehaniv, C. L. (2008). Keep your options open: An information-based driving principle for sensorimotor systems. PLOS One, 3(12), e4018. 10.1371/journal.pone.0004018


Lally, N., Huys, Q. J. M., Eshel, N., Faulkner, P., Dayan, P., & Roiser, J. P. (2017). The neural basis of aversive Pavlovian guidance during planning. Journal of Neuroscience, 37(42), 10215–10229. 10.1523/JNEUROSCI.0085-17.2017, PubMed: 28924006
Lanillos, P., Pages, J., & Cheng, G. (2020). Robot self/other distinction: Active inference meets neural networks learning in a mirror. In Proceedings of the European Conference on Artificial Intelligence.
Levine, S. (2018, May 20). Reinforcement learning and control as probabilistic inference: Tutorial and review. http://arxiv.org/abs/1805.00909
Lindley, D. V. (1956). On a measure of the information provided by an experiment. Annals of Mathematical Statistics, 27(4), 986–1005. 10.1214/aoms/1177728069
Linsker, R. (1990). Perceptual neural organization: Some approaches based on network models and information theory. Annual Review of Neuroscience, 13(1), 257–281. 10.1146/annurev.ne.13.030190.001353, PubMed: 2183677
MacKay, D. J. C. (2003, September 25). Information theory, inference and learning algorithms. Cambridge University Press.
Maisto, D., Gregoretti, F., Friston, K., & Pezzulo, G. (2021, March 25). Active tree search in large POMDPs. http://arxiv.org/abs/2103.13860
Marković, D., Stojić, H., Schwöbel, S., & Kiebel, S. J. (2021). An empirical evaluation of active inference in multi-armed bandits. Neural Networks, 144, 229–246. 10.1016/j.neunet.2021.08.018
Mazzaglia, P., Verbelen, T., & Dhoedt, B. (2021). Contrastive active inference. https://openreview.net/forum?id=5t5FPwzE6mq
Millidge, B. (2019, March 11). Implementing predictive processing and active inference: Preliminary steps and results. PsyArXiv. 10.31234/osf.io/4hb58
Millidge, B. (2020). Deep active inference as variational policy gradients. Journal of Mathematical Psychology, 96, 102348. 10.1016/j.jmp.2020.102348
Millidge, B. (2021). Applications of the free energy principle to machine learning and neuroscience. http://arxiv.org/abs/2107.00140
Millidge, B., Tschantz, A., & Buckley, C. L. (2020, April 21). Whence the expected free energy? http://arxiv.org/abs/2004.08128
Millidge, B., Tschantz, A., Seth, A. K., & Buckley, C. L. (2020). On the relationship between active inference and control as inference. In T. Verbelen, P. Lanillos, C. L. Buckley, & C. De Boom (Eds.), Active inference (pp. 3–11). Springer.
Miranda, M. J., & Fackler, P. L. (2002, September 1). Applied computational economics and finance. MIT Press.
Mirza, M. B., Adams, R. A., Mathys, C., & Friston, K. J. (2018). Human visual exploration reduces uncertainty about the sensed world. PLOS One, 13(1), e0190429. 10.1371/journal.pone.0190429
Oliver, G., Lanillos, P., & Cheng, G. (2021). An empirical study of active inference on a humanoid robot. IEEE Transactions on Cognitive and Developmental Systems, PP(99), 1–1. 10.1109/TCDS.2021.3049907
Optican, L. M., & Richmond, B. J. (1987). Temporal encoding of two-dimensional patterns by single units in primate inferior temporal cortex. III. Information theoretic analysis. Journal of Neurophysiology, 57(1), 162–178. 10.1152/jn.1987.57.1.162, PubMed: 3559670


Oudeyer, P.-Y., & Kaplan, F. (2007). What is intrinsic motivation? A typology of computational approaches. Frontiers in Neurorobotics, 1, 6. 10.3389/neuro.12.006.2007

Parr, T. (2019). The computational neurology of active vision (PhD diss.). University College London.

Parr, T., Limanowski, J., Rawji, V., & Friston, K. (2021). The computational neurology of movement under active inference. Brain, 144(6), 1799–1818. 10.1093/brain/awab085, PubMed: 33704439

Parr, T., Markovic, D., Kiebel, S. J., & Friston, K. J. (2019). Neuronal message passing using mean-field, Bethe, and marginal approximations. Scientific Reports, 9(1), 1889. 10.1038/s41598-018-38246-3

Parr, T., Pezzulo, G., & Friston, K. J. (2022, March 29). Active inference: The free energy principle in mind, brain, and behavior. MIT Press.

Paul, A., Sajid, N., Gopalkrishnan, M., & Razi, A. (2021, August 27). Active inference for stochastic control. http://arxiv.org/abs/2108.12245

Pavliotis, G. A. (2014). Stochastic processes and applications: Diffusion processes, the Fokker-Planck and Langevin equations. Springer.

Pearl, J. (1998). Graphical models for probabilistic and causal reasoning. In P. Smets (Ed.), Quantified representation of uncertainty and imprecision (pp. 367–389). Springer Netherlands.

Pezzato, C., Ferrari, R., & Corbato, C. H. (2020). A novel adaptive controller for robot manipulators based on active inference. IEEE Robotics and Automation Letters, 5(2), 2973–2980. 10.1109/LRA.2020.2974451

Pio-Lopez, L., Nizard, A., Friston, K., & Pezzulo, G. (2016). Active inference and robot control: A case study. Journal of the Royal Society Interface, 13(122), 20160616. 10.1098/rsif.2016.0616

Puterman, M. L. (2014, August 28). Markov decision processes: Discrete stochastic dynamic programming. Wiley.

Rahme, J., & Adams, R. P. (2019, June 24). A theoretical connection between statistical physics and reinforcement learning. http://arxiv.org/abs/1906.10228

Rawlik, K., Toussaint, M., & Vijayakumar, S. (2013). On stochastic optimal control and reinforcement learning by approximate inference. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence. https://www.aaai.org/ocs/index.php/IJCAI/IJCAI13/paper/view/6658

Ross, S., Chaib-draa, B., & Pineau, J. (2008). Bayes-adaptive POMDPs. In J. C. Platt, D. Koller, Y. Singer, & S. T. Roweis (Eds.), Advances in neural information processing systems, 20 (pp. 1225–1232). Curran. http://papers.nips.cc/paper/3333-bayes-adaptive-pomdps.pdf

Ross, S., Pineau, J., Chaib-draa, B., & Kreitmann, P. (2011). A Bayesian approach for learning and planning in partially observable Markov decision processes. Journal of Machine Learning Research, 12 (2011).

Russo, D., & Van Roy, B. (2014). Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4), 1729–1770. 10.1287/moor.2014.0650

Russo, D., & Van Roy, B. (2016). An information-theoretic analysis of Thompson sampling. Journal of Machine Learning Research, 17(1), 2442–2471.

Russo, D., Van Roy, B., Kazerouni, A., Osband, I., & Wen, Z. (2017). A tutorial on Thompson sampling. arXiv:1707.02038.

Sajid, N., Ball, P. J., Parr, T., & Friston, K. J. (2021). Active inference: Demystified and compared. Neural Computation, 33(3), 674–712. 10.1162/neco_a_01357, PubMed: 33400903

Sajid, N., Holmes, E., Costa, L. D., Price, C., & Friston, K. (2022). A mixed generative model of auditory word repetition. 10.1101/2022.01.20.477138

Sajid, N., Tigas, P., Zakharov, A., Fountas, Z., & Friston, K. (2021, July 18). Exploration and preference satisfaction trade-off in reward-free learning. http://arxiv.org/abs/2106.04316

Sales, A. C., Friston, K. J., Jones, M. W., Pickering, A. E., & Moran, R. J. (2019). Locus coeruleus tracking of prediction errors optimises cognitive flexibility: An active inference model. PLOS Computational Biology, 15(1), e1006267. 10.1371/journal.pcbi.1006267

Sancaktar, C., van Gerven, M., & Lanillos, P. (2020, May 29). End-to-end pixel-based deep active inference for body perception and action. http://arxiv.org/abs/2001.05847

Sargent, R. W. H. (2000). Optimal control. Journal of Computational and Applied Mathematics, 124(1), 361–371. 10.1016/S0377-0427(00)00418-0

Schmidhuber, J. (2006). Developmental robotics, optimal artificial curiosity, creativity, music, and the fine arts. Connection Science, 18(2), 173–187. 10.1080/09540090600768658

Schmidhuber, J. (2010). Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development, 2(3), 230–247. 10.1109/TAMD.2010.2056368

Schneider, T., Belousov, B., Abdulsamad, H., & Peters, J. (2022, June 1). Active inference for robotic manipulation. 10.48550/arXiv.2206.10313

Schulz, E., & Gershman, S. J. (2019). The algorithmic architecture of exploration in the human brain. Current Opinion in Neurobiology, 55, 7–14. 10.1016/j.conb.2018.11.003, PubMed: 30529148

Schwartenbeck, P., FitzGerald, T. H. B., Mathys, C., Dolan, R., & Friston, K. (2015). The dopaminergic midbrain encodes the expected certainty about desired outcomes. Cerebral Cortex, 25(10), 3434–3445. 10.1093/cercor/bhu159, PubMed: 25056572

Schwartenbeck, P., FitzGerald, T. H. B., Mathys, C., Dolan, R., Kronbichler, M., & Friston, K. (2015). Evidence for surprise minimization over value maximization in choice behavior. Scientific Reports, 5, 16575. 10.1038/srep16575

Schwartenbeck, P., Passecker, J., Hauser, T. U., FitzGerald, T. H., Kronbichler, M., & Friston, K. J. (2019). Computational mechanisms of curiosity and goal-directed exploration. eLife, 45.

Shoham, Y., Powers, R., & Grenager, T. (2003). Multi-agent reinforcement learning: A critical survey. Computer Science Department, Stanford University.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., . . . Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484–489. 10.1038/nature16961, PubMed: 26819042

Smith, R., Friston, K. J., & Whyte, C. J. (2022). A step-by-step tutorial on active inference and its application to empirical data. Journal of Mathematical Psychology, 107, 102632. 10.1016/j.jmp.2021.102632

Smith, R., Khalsa, S. S., & Paulus, M. P. (2021). An active inference approach to dissecting reasons for nonadherence to antidepressants. Biological Psychiatry: Cognitive Neuroscience and Neuroimaging, 6(9), 919–934. 10.1016/j.bpsc.2019.11.012, PubMed: 32044234

Smith, R., Kirlic, N., Stewart, J. L., Touthang, J., Kuplicki, R., Khalsa, S. S., . . . Aupperle, R. L. (2021). Greater decision uncertainty characterizes a transdiagnostic patient sample during approach-avoidance conflict: A computational modeling approach. Journal of Psychiatry and Neuroscience, 46(1), E74–E87. 10.1503/jpn.200032

Smith, R., Kirlic, N., Stewart, J. L., Touthang, J., Kuplicki, R., McDermott, T. J., . . . Aupperle, R. L. (2021). Long-term stability of computational parameters during approach-avoidance conflict in a transdiagnostic psychiatric patient sample. Scientific Reports, 11(1), 11783. 10.1038/s41598-021-91308-x

Smith, R., Kuplicki, R., Feinstein, J., Forthman, K. L., Stewart, J. L., Paulus, M. P., . . . Khalsa, S. S. (2020). A Bayesian computational model reveals a failure to adapt interoceptive precision estimates across depression, anxiety, eating, and substance use disorders. PLOS Computational Biology, 16(12), e1008484. 10.1371/journal.pcbi.1008484

Smith, R., Kuplicki, R., Teed, A., Upshaw, V., & Khalsa, S. S. (2020, September 29). Confirmatory evidence that healthy individuals can adaptively adjust prior expectations and interoceptive precision estimates. 10.1101/2020.08.31.275594

Smith, R., Mayeli, A., Taylor, S., Al Zoubi, O., Naegele, J., & Khalsa, S. S. (2021). Gut inference: A computational modeling approach. Biological Psychology, 164, 108152. 10.1016/j.biopsycho.2021.108152

Smith, R., Schwartenbeck, P., Parr, T., & Friston, K. J. (2019). An active inference model of concept learning. bioRxiv:633677. 10.1101/633677

Smith, R., Schwartenbeck, P., Parr, T., & Friston, K. J. (2020). An active inference approach to modeling structure learning: Concept learning as an example case. Frontiers in Computational Neuroscience, 14. 10.3389/fncom.2020.00041

Smith, R., Schwartenbeck, P., Stewart, J. L., Kuplicki, R., Ekhtiari, H., & Paulus, M. P. (2020). Imprecise action selection in substance use disorder: Evidence for active learning impairments when solving the explore-exploit dilemma. Drug and Alcohol Dependence, 215, 108208. 10.1016/j.drugalcdep.2020.108208

Smith, R., Taylor, S., Stewart, J. L., Guinjoan, S. M., Ironside, M., Kirlic, N., . . . Paulus, M. P. (2022). Slower learning rates from negative outcomes in substance use disorder over a 1-year period and their potential predictive utility. Computational Psychiatry, 6(1), 117–141. 10.5334/cpsy.85

Still, S., & Precup, D. (2012). An information-theoretic approach to curiosity-driven reinforcement learning. Theory in Biosciences, 131(3), 139–148. 10.1007/s12064-011-0142-z, PubMed: 22791268

Stolle, M., & Precup, D. (2002). Learning options in reinforcement learning. Lecture Notes in Computer Science, 212–223. Springer. 10.1007/3-540-45622-8_16

Stone, J. V. (2015, February 1). Information theory: A tutorial introduction. Sebtel Press.

Stone, J. V. (2019). Artificial intelligence engines: A tutorial introduction to the mathematics of deep learning. Sebtel Press.

Sun, Y., Gomez, F., & Schmidhuber, J. (2011, March 29). Planning to be surprised: Optimal Bayesian exploration in dynamic environments. http://arxiv.org/abs/1103.5708

Tanaka, T. (1999). A theory of mean field approximation. In S. Solla, T. Leen, & K. Müller (Eds.), Advances in neural information processing systems, 11. MIT Press.

Tervo, D. G. R., Tenenbaum, J. B., & Gershman, S. J. (2016). Toward the neural implementation of structure learning. Current Opinion in Neurobiology, 37, 99–105. 10.1016/j.conb.2016.01.014, PubMed: 26874471

Todorov, E. (2006). Linearly-solvable Markov decision problems. In Advances in neural information processing systems, 19. MIT Press. https://papers.nips.cc/paper/2006/hash/d806ca13ca3449af72a1ea5aedbed26a-Abstract.html

Todorov, E. (2008). General duality between optimal control and estimation. In Proceedings of the 47th IEEE Conference on Decision and Control (pp. 4286–4292). 10.1109/CDC.2008.4739438

Todorov, E. (2009). Efficient computation of optimal actions. Proceedings of the National Academy of Sciences, 106(28), 11478–11483. 10.1073/pnas.0710743106

Tokic, M., & Palm, G. (2011). Value-difference based exploration: Adaptive control between epsilon-greedy and Softmax. In J. Bach & S. Edelkamp (Eds.), KI 2011: Advances in artificial intelligence (pp. 335–346). Springer. 10.1007/978-3-642-24455-1_33

Toussaint, M. (2009). Robot trajectory optimization using approximate inference. In Proceedings of the 26th Annual International Conference on Machine Learning (pp. 1049–1056). 10.1145/1553374.1553508

Tschantz, A., Baltieri, M., Seth, A. K., & Buckley, C. L. (2019, November 24). Scaling active inference. http://arxiv.org/abs/1911.10601

Tschantz, A., Millidge, B., Seth, A. K., & Buckley, C. L. (2020). Reinforcement learning through active inference. http://arxiv.org/abs/2002.12636

Tschantz, A., Seth, A. K., & Buckley, C. L. (2020). Learning action-oriented models through active inference. PLOS Computational Biology, 16(4), e1007805. 10.1371/journal.pcbi.1007805

van den Broek, B., Wiegerinck, W., & Kappen, B. (2010). Risk sensitive path integral control. https://arxiv.org/ftp/arxiv/papers/1203/1203.3523.pdf

van der Himst, O., & Lanillos, P. (2020). Deep active inference for partially observable MDPs. 10.1007/978-3-030-64919-7_8

Von Neumann, J., & Morgenstern, O. (1944). Theory of games and economic behavior. Princeton University Press.

Wainwright, M. J., & Jordan, M. I. (2007). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2), 1–305. 10.1561/2200000001

Wilson, R. C., Bonawitz, E., Costa, V. D., & Ebitz, R. B. (2021). Balancing exploration and exploitation with information and randomization. Current Opinion in Behavioral Sciences, 38, 49–56. 10.1016/j.cobeha.2020.10.001, PubMed: 33184605

Wilson, R. C., Geana, A., White, J. M., Ludwig, E. A., & Cohen, J. D. (2014). Humans use directed and random exploration to solve the explore–exploit dilemma. Journal of Experimental Psychology: General, 143(6), 2074–2081. 10.1037/a0038199

Xu, H. A., Modirshanechi, A., Lehmann, M. P., Gerstner, W., & Herzog, M. H. (2021). Novelty is not surprise: Human exploratory and adaptive behavior in sequential decision-making. PLOS Computational Biology, 17(6), e1009070. 10.1371/journal.pcbi.1009070

Zermelo, E. (1913). Über eine Anwendung der Mengenlehre auf die Theorie des Schachspiels. https://www.mathematik.uni-muenchen.de/∼spielth/artikel/Zermelo.pdf

Ziebart, B. (2010). Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Carnegie Mellon University.

Ziebart, B. D., Maas, A. L., Bagnell, J. A., & Dey, A. K. (2008). Maximum entropy inverse reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence.

Zintgraf, L., Shiarlis, K., Igl, M., Schulze, S., Gal, Y., Hofmann, K., & Whiteson, S. (2020, February 27). VariBAD: A very good method for Bayes-adaptive deep RL via meta-learning. http://arxiv.org/abs/1910.08348

Zintgraf, L. M., Feng, L., Lu, C., Igl, M., Hartikainen, K., Hofmann, K., & Whiteson, S. (2021). Exploration in approximate hyper-state space for meta reinforcement learning. In International Conference on Machine Learning (pp. 12991–13001).

Received September 13, 2022; accepted December 17, 2022.
