Imaginative Reinforcement Learning: Computational
Principles and Neural Mechanisms
Samuel J. Gershman, Jimmy Zhou, and Cody Kommers
Abstract
■ Imagination enables us not only to transcend reality but also to
learn about it. In the context of reinforcement learning, an agent
can rationally update its value estimates by simulating an internal
model of the environment, provided that the model is accurate. In
a series of sequential decision-making experiments, we investigated
the impact of imaginative simulation on subsequent decisions. We
found that imagination can cause people to pursue imagined paths,
even when these paths are suboptimal. This bias is systematically
related to participants' optimism about how much reward they
expect to receive along imagined paths; providing feedback strongly
attenuates the effect. The imagination effect can be captured by a
reinforcement learning model that includes a bonus added onto
imagined rewards. Using fMRI, we show that a network of regions
associated with valuation is predictive of the imagination effect.
These results suggest that imagination, although a powerful tool
for learning, is also susceptible to motivational biases. ■
INTRODUCTION
Imagination is a fertile source of knowledge. Philoso-
phers and scientists routinely use thought experiments
to explore their mental models of the world and thereby
make “discoveries” in the absence of new experience.
Lucretius inferred the infinitude of space by picturing
himself throwing spears at the boundary of the universe,
and Einstein discovered relativity by picturing himself
riding on a beam of light.
Imagination has also been put to practical use in computer
science. Niyogi, Girosi, and Poggio (1998) described how an
image classifier could be fed training examples synthesized by
applying mental transformations to a set of objects. For example,
suppose you were training a classifier to recognize faces. You
might only have a single image for a given face, but in the real
world, faces appear in many orientations and positions. If you
have access to a 3-D model of the face, then you can mentally
apply transformations that preserve identity (e.g., rotating the
face). Each transformation yields a new image with the same
label and more training data for the classifier.
A similar idea was applied to reinforcement learning by
Sutton (1990): A model of the environment can be used
to simulate training data (transitions and rewards) for a
computationally cheap “model-free” learning algorithm
that updates a set of cached value estimates (future re-
ward expectations). In this architecture, the same learn-
ing algorithm operates on both real and simulated
experiences. The key advantage is that a model-based
action policy can be approximated without computation-
ally expensive model-based algorithms like tree search or
dynamic programming; the model-free cached values
map directly to a policy without additional computation.
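To make the Dyna idea concrete, the sketch below (a minimal Python illustration, not code from the studies reported here; the state names, left/right action set, and parameter values are assumptions) applies one and the same temporal difference update to real transitions and to transitions replayed from a learned internal model.

```python
# Minimal Dyna-style sketch: one model-free update rule for both real and simulated experience.
import random
from collections import defaultdict

alpha, gamma, n_sim = 0.1, 0.95, 10      # learning rate, discount factor, simulated updates per real step
Q = defaultdict(float)                   # cached values, keyed by (state, action)
model = {}                               # learned model: (state, action) -> (reward, next_state)

def td_update(s, a, r, s_next, a_next):
    """Model-free (SARSA-style) update of a cached value."""
    delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    Q[(s, a)] += alpha * delta

def learn_from_step(s, a, r, s_next, a_next):
    """Learn from a real transition, then from n_sim imagined transitions."""
    td_update(s, a, r, s_next, a_next)
    model[(s, a)] = (r, s_next)          # record the experience in the internal model
    for _ in range(n_sim):               # "imagination": replay model-generated experience
        (sim_s, sim_a), (sim_r, sim_s_next) = random.choice(list(model.items()))
        sim_a_next = max(['left', 'right'], key=lambda b: Q[(sim_s_next, b)])
        td_update(sim_s, sim_a, sim_r, sim_s_next, sim_a_next)
```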
These examples illustrate how learning systems can be
integrated with imaginative simulation to acquire knowl-
edge in the absence of new experience. However, there
is relatively little direct evidence that the brain uses imag-
ination in this way.
Indirect evidence for the role of imaginative simulation
in reinforcement learning comes from a series of retro-
spective revaluation experiments (Gershman, Markman,
& Otto, 2014). In these experiments, human participants
learned conflicting policies at different stages of a se-
quential decision task and were then tested for revalua-
tion of the policy learned earlier in the task. A period of
quiet rest before the test phase enhanced retrospective
revaluation, consistent with the idea that model-free
cached values can be updated via offline simulation. This
finding cannot be explained by pure model-based or
model-free accounts of learning or even by stochastic
mixtures of the two (Daw, Gershman, Seymour, Dayan,
& Dolan, 2011); it appears to require a particular kind
of cooperative interaction between the systems.
In this article, we take a closer look at the role of imag-
inative simulation in reinforcement learning. We asked
human participants to perform a sequential decision task
with dynamic rewards, while intermittently having them
imagine particular paths through the state space.
Although participants do not gain any information from
these imagination trials, the trials have a potent effect on their
subsequent decision behavior, influencing them to
pursue imagined paths that are in fact suboptimal. We
show that this bias arises in part because participants are
optimistic about the amount of reward they will receive
in imagined states; the bias is reduced when participants
are given feedback about the true reward. A simple rein-
forcement learning model with an “imagination bonus”
can capture the bias. Using fMRI, we find that the bias
is associated with activation in medial pFC and OFC, consistent
with the role of those regions in reward expectations.
Taken together, these findings suggest that imagination
can drive reinforcement learning, although it can fall prey
to miscalibrated reward expectations.
METHODS
Participants
Twenty healthy volunteers (10 women; mean age = 25.45 years,
SD = 4.5 years) participated in the scanning portion of this
study. These same 20 individuals also participated in a
behavioral session to determine their eligibility for
participation in the scanning portion. Participants gave
informed consent before both sessions. The study was approved
by the ethics committee of Harvard University. Participants
earned $35 for the scanning session and $10 for the behavioral
session, plus a performance-based bonus in both.
In addition, we recruited 230 human participants using the
Amazon Mechanical Turk Web service. All participants gave
informed consent and were paid for their participation. This
study was also approved by the ethics committee of Harvard
University.
Design and Procedure: fMRI Experiment
The following describes the task that participants per-
formed in the scanning experiment. There were two
kinds of trials: “decision” trials and “imagination” trials
(Figure 1). A block consisted of five decision trials followed
by one imagination trial with the addition of a single
decision trial at the end, because we were particularly
interested in the decision trials immediately after an
imagination trial. A run consisted of eight blocks.
Participants performed five runs in the scanner. Most
participants performed all five runs, but some deviated from
this number: a few completed fewer runs because of
experimental glitches (two participants: two runs; one
participant: three runs; two participants: four runs), and
some initial participants completed more runs when we were
first piloting the experiment (three participants: six runs;
one participant: eight runs).
Figure 1. Experimental design. The design of the fMRI study is shown here, which is identical to the design of the behavioral studies except that the
timing was faster in the behavioral studies (see Methods) and they used a continuous reward prediction rather than a binary judgment. (A) On
decision trials, participants traversed the state space by making a sequence of two decisions, followed by a reward in the terminal state. The state
transition diagram is shown on the right; colored arrows correspond to available actions in each state. (B) Example reward sequence. Each line
corresponds to a terminal state. (C) On imagination trials, participants performed the sequence of actions necessary to arrive at a particular terminal
state (shown on the right) and then predicted whether the reward would be greater or less than 0 in that terminal state. In the behavioral studies,
participants made a continuous numerical reward prediction.
In decision trials, participants made two consecutive
decisions of left or right and received feedback after each
decision. These left or right decisions allowed the partic-
ipant to navigate different states. Each trial began with
the same start state. There were two intermediate states
(one for left, one for right) and four terminal states (左边
or right from either of the second-level states). 这些
states were represented by black and white pictures of
objects or scenes. The transitions between states were de-
terministic. We showed participants the transition struc-
ture of these states before the start of the experiment.
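As an illustration, this deterministic transition structure can be encoded as a small lookup table (the state names below are ours, introduced only for the sketches in this article):

```python
# Illustrative encoding of the deterministic transition structure: a start state, two
# intermediate states (one per first-level choice), and four terminal states.
transitions = {
    ('start', 'left'): 'mid_L',
    ('start', 'right'): 'mid_R',
    ('mid_L', 'left'): 'term_LL',
    ('mid_L', 'right'): 'term_LR',
    ('mid_R', 'left'): 'term_RL',
    ('mid_R', 'right'): 'term_RR',
}
```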
The decision trials began with the participant seeing
the first state and receiving a prompt for a forced-choice,
two-alternative (left or right) decision. Participants had
1.5 sec to make this decision. If participants failed to
make a decision, then they were shown a fixation cross
during the remaining time allotted for the trial (8 sec
from onset of the first picture to the end of final feed-
back). After the first decision, participants were given re-
ward feedback and shown the picture associated with the
intermediate state (one of two possible states depending
on whether they chose left or right). The reward feed-
back after the first decision was always 0 and was shown
for 1.5 sec. Participants were then prompted to make
another forced-choice left/right decision. They had
1.5 sec to make this decision. Again, if they failed to make
a decision, they were shown a fixation cross during the
remaining time allotted for the trial. After they made their
second decision, participants were given reward feedback
and shown the picture associated with the terminal state
they had selected. The feedback lingered for 1.5 sec before
participants were shown a fixation cross for 2–4 sec of
jitter, after which the next trial would begin.
The underlying rewards were predetermined for each
trial, independent of the path chosen by the participant.
The underlying reward structure defines the ground-
truth optimal path. Rewards were randomly generated
at the time of each new block. Rewards were symmetri-
cally distributed, such that the highest and lowest re-
wards were on the same branch of the path structure
(e.g., the highest and lowest could be associated with
the two terminal states reachable from the left intermedi-
ate state) and the average expected reward was the same
at both intermediate states. The highest reward was sam-
pled from a uniform distribution between 15 和 25. 这
two intermediate rewards were sampled from a uniform
distribution between 0 和 10. The lowest reward was
sampled from a uniform distribution between −15 and
−5. Rewards reset, on average, every 10 trials (chosen
uniformly from 8 to 12). These rewards drifted according
to a Gaussian random walk (SD = 0.5) until the next reset
occurred. We chose this distribution, which was biased to
yield positive rewards on average, so that participants
would not get frustrated by experiencing a large number
of losses. For some participants (n = 39), the mean re-
wards of the left and right branches of the tree were
matched (i.e., the sum of the highest and lowest rewards
was about equal to the sum of the two middle rewards).
For the rest of the participants, the rewards were un-
matched. These reward sequences were qualitatively sim-
ilar, so we collapsed across the different sequence types.
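For concreteness, the reward schedule described above can be sketched as follows (a reconstruction from the description rather than the original task code; indexing the four terminal states 0 to 3, with 0 and 1 on one branch and 2 and 3 on the other, is an assumption):

```python
# Sketch of the reward schedule: four terminal-state rewards with the highest and lowest
# on the same branch, drifting by a Gaussian random walk and resetting every 8-12 trials.
import numpy as np

rng = np.random.default_rng(0)

def new_block_rewards():
    """Sample four terminal rewards; the highest and lowest share a branch."""
    highest = rng.uniform(15, 25)
    middle = rng.uniform(0, 10, size=2)
    lowest = rng.uniform(-15, -5)
    rewards = np.empty(4)
    extreme_branch = rng.integers(0, 2)                   # which branch gets the highest/lowest pair
    rewards[2 * extreme_branch: 2 * extreme_branch + 2] = [highest, lowest]
    rewards[2 * (1 - extreme_branch): 2 * (1 - extreme_branch) + 2] = middle
    return rewards

def reward_sequence(n_trials):
    """Rewards drift by a Gaussian random walk (SD = 0.5) and reset every 8-12 trials."""
    rewards, schedule = new_block_rewards(), []
    next_reset = rng.integers(8, 13)
    for t in range(n_trials):
        if t == next_reset:
            rewards = new_block_rewards()
            next_reset = t + rng.integers(8, 13)
        schedule.append(rewards.copy())
        rewards = rewards + rng.normal(0, 0.5, size=4)
    return np.array(schedule)                             # shape (n_trials, 4)
```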
In imagination trials, participants were shown the pic-
ture representing the start state and the picture rep-
resenting one of the terminal states, with an arrow
pointing from the start state to the terminal state. The
terminal state was selected at random from one of the
three states that did not offer the highest reward. Partic-
ipants were asked to imagine the sequence of actions
that would take them from the start state to the indicated
terminal state and then to indicate the appropriate se-
quence of left or right decisions (e.g., press left and right
or left and left). Participants had 4 sec to indicate the cor-
rect path, and 2–4 sec of jitter followed after indicating
the imagined path. There was no fixation cross if the par-
ticipants failed to make the decisions. Participants were
then asked to predict whether the imagined path would
yield a reward that was more or less than zero. They had
2.5 sec to respond and then were given 2–4 sec of jitter
before the onset of the next decision trial.
We first recruited participants to participate in the be-
havioral portion of the experiment outside the scanner.
In this behavioral session, a run consisted of eight blocks
with the addition of a single decision trial at the end.
Each participant performed four runs. Participants prac-
ticed the task for one run before beginning the actual
实验. After the participant had completed the
behavioral session, we invited them to return for the
scanning portion if their data showed an increased prob-
ability of selecting the imagined path on the decision tri-
als immediately after the imagined trials (the basis for the
effect in Experiment 1). We had 35 participants partici-
pate in this behavioral portion of the task, 15 of whom
were excluded from scanning because either they did
not show the effect or they declined our invitation to re-
turn for the scanning session (8 of 35 participants did not
show the effect and were excluded from scanning accordingly;
7 of 35 participants declined the invitation to return
for scanning session). Although we selected participants
for scanning on the basis of the imagination effect, we still
found a significant effect on average when analyzing all
35 participants. More generally, the choice behavior re-
ported in the Results section was quantitatively and qual-
itatively unchanged when including all 35 participants.
Individual trials were excluded from the behavioral and
model analyses if participants failed to reach a terminal
state (i.e., they did not make two decisions).
Design and Procedure: Behavioral Experiments
Experiment 1 featured the same experimental paradigm
as the scanning experiment described above, except that
participants made continuous (numerical) predictions in
the imagination trial. Individual trials were excluded if
participants made a prediction with an absolute value
greater than or equal to 25. 此外, participants were
required to indicate the correct imagined path before
moving onto the next trial. 例如, if the correct
decision sequence was left and then right, they were
prompted to repeat the decision sequence until they se-
lected the correct one. The time constraints described in
the scanning experiment were relaxed in these experi-
ments. A block consisted of five decision trials and one
imagination trial with the addition of a single decision
trial at the end. Each participant performed 31 blocks.
Experiment 2 was the same as Experiment 1 described
above, except that, after participants had made their
predictions, they received veridical feedback about the
reward associated with the imagined path.
Experiment 3 was the same as Experiment 1 described
above, except that participants were asked neither to
imagine the path nor to indicate the sequence of deci-
sions to get there. They only made a prediction about
the value of a given terminal state.
Computational Model Fitting and Comparison
We fit the four computational models described in the
Results section to the choice data from the decision trials.
Maximum likelihood estimates of each parameter were
obtained for each participant individually using nonlinear
optimization (MATLAB’s fmincon function) with five ran-
dom initializations to avoid local optima; the parameter
estimates achieving the highest likelihood across the ran-
dom initializations were used in subsequent analyses. We
placed the following bounds on the parameters: inverse
temperature [0,10], learning rate [0,1], eligibility trace
[0,1], imagination bonus [0,20], and forgetting decay
[1,3]. No transformations were applied to the parameters
during model fitting.
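The fitting procedure can be sketched as follows (an illustrative Python analogue of the MATLAB pipeline, substituting SciPy's bounded optimizer for fmincon; `neg_log_likelihood` is a placeholder for a given model's choice likelihood and is not specified here):

```python
# Sketch of per-participant maximum likelihood fitting with random restarts and the
# parameter bounds listed above; `data` is assumed to be a per-trial sequence of choices.
import numpy as np
from scipy.optimize import minimize

bounds = [(0, 10),   # inverse temperature
          (0, 1),    # learning rate
          (0, 1),    # eligibility trace
          (0, 20),   # imagination bonus
          (1, 3)]    # forgetting decay

def fit_participant(neg_log_likelihood, data, n_restarts=5, seed=0):
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        x0 = [rng.uniform(lo, hi) for lo, hi in bounds]   # random initialization within bounds
        res = minimize(neg_log_likelihood, x0, args=(data,),
                       method='L-BFGS-B', bounds=bounds)
        if best is None or res.fun < best.fun:
            best = res                                    # keep the highest-likelihood solution
    k, n = len(bounds), len(data)
    bic = k * np.log(n) + 2 * best.fun                    # BIC from the maximized log likelihood
    log_evidence = -0.5 * bic                             # approximation used for model comparison
    return best.x, bic, log_evidence
```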
Models were compared using random effects Bayesian
model comparison (Rigoux, Stephan, Friston, & Daunizeau,
2014), which estimates the frequency of each model class
in the population. The input to this procedure is the log
model evidence for each participant, which we approx-
imated using −0.5 × BIC, where BIC is the Bayesian Infor-
mation Criterion. We used the exceedance probability (the
posterior probability that a particular model is more
frequent in the population than the other models under
consideration) as a model comparison metric.
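For readers wishing to reproduce this step, the sketch below implements the variational scheme that random effects Bayesian model selection is built on, taking the participants-by-models matrix of approximate log evidences (−0.5 × BIC) as input; the additional step that turns exceedance probabilities into protected exceedance probabilities (Rigoux et al., 2014) is omitted here.

```python
# Sketch of random effects Bayesian model selection: variational updates of a Dirichlet
# posterior over model frequencies, followed by Monte Carlo estimation of exceedance
# probabilities (the probability that each model is the most frequent in the population).
import numpy as np
from scipy.special import digamma

def bms(log_evidence, n_iter=100, n_samples=100000, seed=0):
    n, k = log_evidence.shape
    alpha = np.ones(k)                                  # Dirichlet prior over model frequencies
    for _ in range(n_iter):
        log_u = log_evidence + digamma(alpha) - digamma(alpha.sum())
        log_u -= log_u.max(axis=1, keepdims=True)       # numerical stability
        g = np.exp(log_u)
        g /= g.sum(axis=1, keepdims=True)               # posterior model assignment per participant
        alpha = 1 + g.sum(axis=0)                       # updated Dirichlet counts
    samples = np.random.default_rng(seed).dirichlet(alpha, size=n_samples)
    exceedance = np.bincount(samples.argmax(axis=1), minlength=k) / n_samples
    return alpha, exceedance
```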
fMRI Data Acquisition
Neuroimaging data were collected using a 3-T Siemens
Magnetom Prisma MRI scanner (Siemens Healthcare,
Erlangen, Germany) with the vendor’s 32-channel head
coil. Anatomical images were collected with a T1-weighted
multiecho MPRAGE sequence (176 sagittal slices; repe-
tition time = 2530 毫秒; echo times = 1.64, 3.50, 5.36,
和 7.22 毫秒; flip angle = 7°; 1-mm3 voxels; field of
view = 256 毫米). All BOLD data were collected via a
T2*-weighted EPI pulse sequence that employed multi-
band RF pulses and Simultaneous Multi-Slice (SMS) ac-
quisition (Xu et al., 2013; Feinberg et al., 2010; Moeller
et al., 2010). For the six task runs, the EPI parameters
were as follows: 69 interleaved axial–oblique slices
(25° toward coronal from AC–PC alignment), repetition
time = 2000 毫秒, echo time = 35 毫秒, flip angle = 80°,
2.2-mm3 voxels, field of view = 207 毫米, and SMS = 3.
The SMS-EPI acquisitions used the CMRR-MB pulse
sequence from the University of Minnesota.
fMRI Data Preprocessing and Analysis
Data preprocessing and statistical analyses were per-
formed using SPM12 (Wellcome Department of Imaging
Neuroscience, London, UK). Functional (EPI) image vol-
umes were realigned to correct for small movements oc-
curring between scans. This process generated an aligned
set of images and a mean image per participant. Each
participant’s T1-weighted structural MRI was then co-
registered to the mean of the realigned images and seg-
mented to separate out the gray matter, which was
normalized to the gray matter in a template image based
on the Montreal Neurological Institute reference brain.
Using the parameters from this normalization process,
the functional images were normalized to the Montreal
Neurological Institute template (resampled voxel size =
2 mm isotropic) and smoothed with an 8-mm FWHM
Gaussian kernel. A high-pass filter of 1/128 Hz was used
to remove low-frequency noise, and a first-order auto-
regressive model was used to correct for temporal
autocorrelations.
We defined two general linear models (GLMs) to ana-
lyze the fMRI data. Both GLMs included stimulus events
(cues and outcomes) as impulse regressors convolved
with the canonical hemodynamic response function
(HRF). In GLM1, a boxcar regressor was defined over
the entire imagination trial epoch and then convolved
with the canonical HRF. Separate regression coefficients
were estimated for imagination trials that were followed
by a choice of the imagined path and imagination trials that
were followed by a choice of the optimal path. In GLM2,
the temporal difference prediction error from the imagi-
nation + forgetting model was entered as a parametric
modulator of the outcome events on decision trials and
orthogonalized with respect to the outcome event re-
gressor and convolved with the canonical HRF.
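As an illustration of how such a parametric regressor is constructed (the actual analysis used SPM12; the double-gamma canonical HRF and the function names below are a simplified stand-in):

```python
# Sketch: impulses at outcome onsets, scaled by the mean-centered parametric modulator
# (e.g., the trial-wise prediction error), convolved with a canonical HRF and sampled
# at the scan times.
import numpy as np
from scipy.stats import gamma

def canonical_hrf(tr, duration=32.0):
    """Double-gamma approximation to the canonical HRF (peak ~6 s, undershoot ~16 s)."""
    t = np.arange(0, duration, tr)
    return gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6.0

def modulated_regressor(onsets, modulator, n_scans, tr):
    """Build an HRF-convolved parametric modulator regressor."""
    signal = np.zeros(n_scans)
    values = np.asarray(modulator, dtype=float) - np.mean(modulator)   # mean-center
    for onset, value in zip(onsets, values):
        signal[int(round(onset / tr))] += value                        # impulse at the event
    return np.convolve(signal, canonical_hrf(tr))[:n_scans]
```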
Figure 2. Imagination trials
affect subsequent decisions.
(A) Participants are more likely
to take the imagined path
after an imagination trial than
before an imagination trial and
correspondingly less likely to
take the optimal path. They are
also slightly more likely to take a
path that is neither optimal nor
imagined. (乙) Providing reward
feedback on imagination
trials strongly attenuates the
imagination effect. (C) Asking
participants to make reward
predictions without imagining
the action sequence also strongly
attenuates the imagination
effect. (D) Comparison of effects
across experiments. The y axis
shows the difference in
probability of choosing the
imagined path after and before
an imagination trial. Error bars
denote SEM. Expt = experiment.
Group-level results were analyzed using t contrasts
with cluster-based FWE thresholding at the whole-brain
level ( p < .05) using a cluster-forming threshold of
p < .001.
For the ventral striatum analysis, we used a bilateral
anatomical mask taken from the automated anatomical
labeling atlas (Tzourio-Mazoyer et al., 2002).
RESULTS
Behavioral Results
Human participants (N = 87) performed a reinforcement
learning task in which they navigated through a sequence
of states to maximize rewards (Figure 1A). Rewards were
only delivered in the terminal states, and the reward mag-
nitudes changed dynamically (Figure 1B), such that par-
ticipants had to continually update their policy and
exploring the decision tree. In addition to these “deci-
sion” trials, participants intermittently performed “imagi-
nation” trials in which they were asked to first enter the
sequence of actions that would take them to a particular
terminal state and then to make a prediction about how
much reward they would obtain in that state (Figure 1C).
The key question we asked was how imagination trials
affected behavior on subsequent decision trials. A partic-
ipant’s choice of path on a decision trial can be broken
down into three categories: the objectively optimal path,
the previously imagined path, and the two other possible
paths, which are neither optimal nor imagined. Critically,
we asked participants to imagine paths that were always
suboptimal, setting up a conflict between optimal and
imagined paths. We found that participants were more
likely to choose the imagined path after an imagination
trial compared with before an imagination trial (t(86) =
8.46, p < .0001; Figure 2A) and correspondingly less likely
to choose the optimal path (t(86) = 11.5, p < .0001).
Participants were also more likely to choose an “other”
path (t(86) = 5.28, p < .0001), suggesting the possibility
that participants simply forgot the optimal path because
of memory interference from the imagination trial, as op-
posed to being systematically biased toward the imagined
path. However, the shift toward the imagined path was
marginally stronger than the shift toward the other paths
(t(86) = 1.88, p = .06). We will address the question of
forgetting further using computational modeling in the
next section.
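For concreteness, the measure behind these comparisons, the change in the probability of choosing the imagined path from the decision trial before to the decision trial after an imagination trial, can be computed as in the sketch below (the trial data structure and field names are illustrative assumptions, not the original analysis code):

```python
# Sketch of the imagination effect and its group-level test.
import numpy as np
from scipy.stats import ttest_1samp

def imagination_effect(blocks):
    """`blocks`: list of dicts with keys 'before' and 'after' (paths chosen on the decision
    trials flanking an imagination trial) and 'imagined' (the path probed on that trial)."""
    p_before = np.mean([b['before'] == b['imagined'] for b in blocks])
    p_after = np.mean([b['after'] == b['imagined'] for b in blocks])
    return p_after - p_before

def group_test(per_participant_effects):
    """One-sample t test of the per-participant imagination effects against zero."""
    return ttest_1samp(per_participant_effects, 0.0)
```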
We next explored several variations of our paradigm. In
Experiment 2 (n = 46), participants received feedback
about the true rewards after their predictions on imagina-
tion trials. This attenuated the imagination effect (change
in probability of choosing the imagined path after an
imagination trial) relative to Experiment 1 (t(131) =
4.05, p < .0001; Figure 2D), but the effect was still mar-
ginally significant (t(45) = 2.02, p = .05; Figure 2B). The
imagination effect was significantly smaller than the
change in probability of choosing one of the “other”
paths (t(45) = 4.03, p < .001), and the magnitude of this
“other” effect was comparable with Experiment 1, indicat-
ing that reward feedback selectively reduced the imagina-
tion effect without affecting the “other” effect.
In Experiment 3 (n = 97), participants made reward
predictions (without feedback) but did not enter the
path that would take them to the specified terminal state.
We hypothesized that this experiment would reduce the
demands on imaginative simulation. The imagination
effect was again attenuated relative to Experiment 1
(t(182) = 3.81, p < .001; Figure 2D) but significantly
greater than 0 (t(96) = 4.2, p < .0001; Figure 2C). There
was no significant difference in the size of the imagina-
tion effect between Experiments 2 and 3 ( p = .31).
One clue about the nature of the underlying mecha-
nisms comes from inspection of the reward predictions
themselves (Figure 3A): Participants are systematically
miscalibrated across all three experiments ( p < .0001),
estimating the rewards to be greater than they actually
are. In other words, reward predictions are optimistic,
even when reward feedback is provided in Experiment
2 (although the miscalibration is significantly reduced
relative to Experiment 1; t(131) = 2.35, p < .05). This
miscalibration is predictive of behavior on subsequent
decision trials in Experiment 1: The imagination effect
is significantly greater after positively miscalibrated (opti-
mistic) imagination trials compared with negatively mis-
calibrated (pessimistic) trials (t(77) = 3.91, p < .001;
Figure 3B), although it is still significantly greater than
0 after negatively miscalibrated trials (t(77) = 5.00, p <
.0001).
To summarize so far, the imagination effect depends
on both reward feedback and imaginative simulation.
An important (but not exclusive) contributing factor is
the prevalence of miscalibrated reward predictions, such
that imaginative simulation combined with optimistic
reward predictions increases the probability of choosing
the imagined path.
Computational Modeling
To disentangle the different possible mechanisms driving
the imagination effect, we fit a family of reinforcement
learning models to choice behavior. All of these models
have in common the well-accepted idea that cached
values are updated using temporal difference learning
(Daw et al., 2011; Gläscher, Daw, Dayan, & O’Doherty,
2010; Seymour et al., 2004; Schultz, Dayan, & Montague,
1997). In addition, the models assume that the same
learning algorithm applies to imagined paths and rewards.
The critical differences between the models lie in how
imagined rewards are distorted and whether cached
values can be forgotten.
Figure 3. Miscalibration of reward predictions. (A) Participants are
optimistic (positively miscalibrated) about expected reward in imagined
states. This optimism is reduced, but not eliminated, by reward
feedback. (B) Participants are more likely to switch to the imagined
path when they are positively miscalibrated compared with when they
are negatively miscalibrated. Error bars denote SEM. Neg = negative;
Pos = positive.
Cached values encode estimates of expected discounted
future return in a lookup table. Specifically, we define the
Q value of taking action a in state s as

$$Q(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0} = s,\ a_{0} = a\right]$$
where r_t is the reward received at time t and γ is a
discount factor that down-weights distal rewards. The
temporal difference learning algorithm (specifically the
SARSA algorithm; see Sutton & Barto, 1998) updates a
cached value estimate Q̂_t(s, a) according to the prediction
error

$$\delta_{t} = r_{t} + \gamma \hat{Q}_{t}(s_{t+1}, a_{t+1}) - \hat{Q}_{t}(s_{t}, a_{t}).$$
This same update can apply to both experienced and
imagined state–action pairs, a key insight of Sutton’s (1990)
Dyna architecture. We refer to this as the “standard”
model. Note that, although we focus on model-free
learning algorithms in this article, our data can also be
accounted for by model-based variants. We do not explore
these here because they make essentially the same
predictions.
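A minimal sketch of the standard model applied to a single decision trial of this task is given below (the state names match the transition encoding sketched in the Methods; the eligibility trace and softmax choice rule correspond to the fitted parameters listed there, and Q is assumed to be a collections.defaultdict(float)):

```python
# Sketch of the "standard" model on one decision trial: softmax choices with inverse
# temperature beta, SARSA updates with learning rate alpha, discount gamma, and
# eligibility trace lambda_.
import numpy as np

def softmax_choice(Q, state, actions, beta, rng):
    prefs = np.array([beta * Q[(state, a)] for a in actions])
    probs = np.exp(prefs - prefs.max())
    probs /= probs.sum()
    return actions[rng.choice(len(actions), p=probs)]

def run_trial(Q, transitions, rewards, alpha, beta, gamma, lambda_, rng):
    """Traverse start -> intermediate -> terminal state, updating cached values online."""
    trace = {}
    s = 'start'
    a = softmax_choice(Q, s, ['left', 'right'], beta, rng)
    for step in range(2):
        s_next = transitions[(s, a)]
        r = 0.0 if step == 0 else rewards[s_next]     # reward is delivered only at the terminal state
        a_next = softmax_choice(Q, s_next, ['left', 'right'], beta, rng) if step == 0 else None
        target = r + (gamma * Q[(s_next, a_next)] if a_next is not None else 0.0)
        delta = target - Q[(s, a)]                    # prediction error
        trace[(s, a)] = 1.0                           # mark the current state-action pair as eligible
        for key in trace:                             # credit all eligible pairs
            Q[key] += alpha * delta * trace[key]
            trace[key] *= gamma * lambda_
        s, a = s_next, a_next
    return Q
```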
We consider two modifications of the standard model.
In the “forgetting” model, all the Q values are decayed
toward 0 by a factor ω. This captures the idea that the
imagination trial can lead to forgetting of the Q values,
independent of any effect of imagination per se. In the
“imagination bonus” model (“imagination” model for
short), reward predictions are distorted by a fixed addi-
tive bias, ε. This captures the idea that imagination can
be contaminated by optimistic or pessimistic beliefs
about unknown rewards. Finally, we considered a hybrid
of these two extended models (the “imagination +
forgetting” model), which includes both parameters.
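One way the two extensions might be implemented is sketched below. The exact parameterizations are not spelled out in the text, so two assumptions are made explicit: cached values are divided by the forgetting decay ω (which is at least 1, consistent with its [1, 3] bound), and the imagination bonus ε is added to the imagined reward before the same temporal difference update is applied along the imagined path.

```python
# Sketch of an imagination-trial update under the imagination + forgetting model.
# Q is assumed to be a collections.defaultdict(float); imagined_path is a list of
# (state, action) pairs from the start state to the probed terminal state.
def imagination_trial_update(Q, imagined_path, imagined_reward,
                             alpha, gamma, epsilon, omega):
    for key in Q:                       # forgetting: every cached value decays toward 0
        Q[key] /= omega
    target = imagined_reward + epsilon  # imagination bonus: optimistic distortion of the reward
    for (s, a) in reversed(imagined_path):
        delta = target - Q[(s, a)]      # same temporal difference update as for real experience
        Q[(s, a)] += alpha * delta
        target = gamma * Q[(s, a)]      # bootstrap the target back along the imagined path
    return Q
```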
Parameters were estimated by fitting the model to the
choice data from the decision trials (see Methods for de-
tails). We found that the imagination + forgetting model
could qualitatively capture the pattern of experimental
results (Figure 4A), and random effects Bayesian model
comparison favored this model over the other variants
(protected exceedance probability of .94; Figure 4B).
As an additional test of the models, we matched their
reward predictions on the imagination trials to the empir-
ical data (note that the models were not fit to these data).
The average correlation between model and empirical re-
ward predictions for the imagination + forgetting model
was .57 ± .03 SEM (Figure 4C). After Fisher z transform-
ing to approximate a normally distributed random vari-
able, this correlation was significantly larger than the
correlation for the forgetting model (t(86) = 3.18, p <
.005). Thus, the reward prediction analysis recapitulates
the results of the Bayesian model comparison, supporting
the imagination + forgetting model as the best quantitative
account of our behavioral data among the alternatives we
considered.
Figure 4. Computational modeling. (A) A reinforcement learning
model that includes both an imagination bonus and a forgetting
parameter can reproduce the pattern of choice behavior in Experiment 1
(compare with Figure 2A). (B) Bayesian model comparison favors
the imagination + forgetting model over models with forgetting only,
imagination only, or a standard model (neither imagination nor
forgetting). The x axis represents the protected exceedance probability
(Rigoux et al., 2014)—the probability that a particular model is more
frequent in the population compared with all other models under
consideration. (C) Models fit to decision trial data correlate with reward
predictions on imagination trials in Experiment 1. The imagination +
forgetting model has a significantly higher correlation compared with
the next best model (forgetting only).
Neuroimaging Results
A separate group of participants (n = 20) completed our
task while their brains were scanned with fMRI. We first
asked whether neural activity during imagination trials
could predict whether imagined or optimal paths would
be taken on the subsequent decision trial. The contrast
between subsequently imagined versus subsequently
optimal paths revealed a striking dissociation between
several brain regions (Figure 5A). Medial pFC, OFC, and
lateral temporal cortex showed greater activity during
imagination trials that led to choosing the imagined
path on the next decision trial, compared with trials that
led to choosing the optimal path. The reverse contrast
showed greater activity in regions of the parietal cortex as
well as precuneus, fusiform gyrus, and calcarine sulcus.
Motivated by data indicating involvement of the hippo-
campus in imaginative simulation (Buckner, 2010), we
tested the a priori hypothesis that the hippocampus
would show greater activity for the imagined versus opti-
mal contrast. The hippocampus showed weak bilateral
activation for imagined > optimal (Figure 5B), although
this effect did not survive small-volume correction within
an anatomically defined ROI.
Reward prediction errors derived from temporal differ-
ence models reliably correlate with BOLD signal in the
ventral striatum (Daw et al., 2011; Gläscher et al., 2010;
Seymour et al., 2004). This is the case in our study as well
(Figure 6A). Crucially, Bayesian model comparison applied
to the ventral striatum strongly favored the imagination +
forgetting model (exceedance probability of .99; Figure 6B).
Thus, the neural and behavioral model comparisons provide
converging evidence for a model in which imagination
both decays and distorts cached values.
Figure 5. Brain regions showing greater BOLD activity during imagination trials before choosing the imagined path, compared with trials before
choosing the optimal path. (A) Medial pFC, OFC, and lateral temporal cortex were activated more strongly for imagined > optimal, whereas
inferior parietal, occipital, precuneus, and fusiform regions were activated more strongly for optimal > imagined. Results are thresholded at p < .05,
cluster FWE. (B) Bilateral activation in anatomical hippocampus ROI for optimal > imagined, p < .001, uncorrected. Note that this activation
did not survive small-volume correction.
Figure 6. Neural model comparison in the ventral striatum. (A) Temporal difference prediction errors correlated with BOLD activity in
the ventral striatum, p < .001, uncorrected. The prediction error regressor was derived from the imagination + forgetting model. (B) Bayesian
model comparison within an anatomically defined ventral striatum ROI favored the imagination + forgetting model.
DISCUSSION
Whereas learning from experience has figured prominently
in computational theories of reinforcement
learning, learning from imagination remains poorly under-
stood. Our experiments provide novel insights into the
contribution of imagination, demonstrating that people
will shift their policies toward imagined paths, even
when these are objectively suboptimal. A key factor in
this “imagination effect” is the miscalibration of reward
predictions: People are consistently optimistic about
how much reward they expect to receive in imagined
states and are more likely to take imagined paths when
they are more optimistic. This optimism can be captured
in reinforcement learning models that learn from both
experience and imagination (Gershman et al., 2014;
Sutton, 1990). Our fMRI data provide converging evidence
for such models, showing that classical value-coding regions,
such as ventromedial cortex and OFC, are more active dur-
ing imagination trials that lead to subsequently choosing the
imagined path.
Two main conclusions can be drawn from our findings.
First, they argue against a plausible alternative hypoth-
esis that imagination is cognitively encapsulated from
learning—a kind of “transcendent” use of the imagina-
tion (cf. Kind & Kung, 2016). This hypothesis would pre-
dict that the imagination trials should have no influence
on subsequent decision-making, contrary to our findings.
Instead, they support the “instructive” use of imagina-
tion, whereby an agent can learn new things about the
world purely through acts of imagination. Philosophers
have long debated the epistemic status of such acts, in
particular, whether imagination can produce genuinely
new knowledge (Sorensen, 1992), but regardless of the
answer to this question, our findings demonstrate empir-
ically that imagination can guide reinforcement learning.
The second conclusion is that imaginative simulation is
susceptible to optimism bias (Sharot, 2011). This sug-
gests that, although learning from the imagination is a
powerful tool for going beyond limited experience, it is
susceptible to, and may even amplify, certain cognitive
biases.
One limitation of our study is that we cannot entirely
rule out a demand effect where the participant assumes
that the experimenter is implicitly recommending a
destination in the imagination trials. However, this possi-
bility does not explain why participants are sometimes
negatively miscalibrated (i.e., pessimistic) and why this
miscalibration predicts the imagination effect. Moreover,
it does not explain why participants sometimes chose the
nonimagined/nonoptimal path. Nonetheless, these
observations do not exclude the possibility that demand
effects are exerting an influence on behavior in our task;
further control experiments will be necessary to deci-
sively rule out demand effects.
Acquiring Knowledge through Imagination
Our findings dovetail with several other lines of research
on the role of imagination in learning. Motor skills can
improve after a rest period without additional training
(Korman, Raz, Flash, & Karni, 2003; Walker, Brakefield,
Morgan, Hobson, & Stickgold, 2002), and reactivating
memories during sleep can enhance subsequent task
performance (Oudiette & Paller, 2013). Explicit mental
practice tasks have yielded similar results (Tartaglia,
Bamert, Mast, & Herzog, 2009; Wohldmann, Healy, &
Bourne, 2007; Driskell, Copper, & Moran, 1994).
Mast and Kosslyn (2002) provide a striking example of
learning from imagination in the domain of visual percep-
tion. They presented participants with an ambiguous im-
age whose alternative interpretation was only revealed
after rotating it. Critically, participants could discover this
alternative interpretation by mentally rotating the image,
indicating that imagery is sufficient for discovering new
information about the world.
Similar processes may underlie ubiquitous (yet still
mysterious) animal learning phenomena such as sponta-
neous recovery and latent inhibition (Ludvig, Mirian,
Kehoe, & Sutton, 2017). Another animal learning phe-
nomenon that may lend itself to this analysis is “paradox-
ical enhancement of fear” (Rohrbaugh & Riccio, 1970):
Animals conditioned to associate a tone and a shock will
increase their fear after being presented with a single iso-
lated tone, despite the fact that this presentation is oper-
ationally an extinction trial and would be expected to
decrease fear. This finding might be accommodated by
positing that the animal is learning from the reinforcing
effects of an imagined shock.
Interactions between Model-based and Model-free
Reinforcement Learning
The current standard theory of reinforcement learning in
the brain depicts two systems (one model-based and one
model-free) locked in competition for control of behav-
ior (Kool, Cushman, & Gershman, 2016; Dolan & Dayan,
2013; Daw et al., 2011; Daw, Niv, & Dayan, 2005). Con-
siderable evidence supports this theory, including the
fact that the systems can be independently manipulated
both neurally (Smittenaar, FitzGerald, Romei, Wright, &
Dolan, 2013; Wunderlich, Smittenaar, & Dolan, 2012;
Balleine & Dickinson, 1998) and behaviorally (Otto,
Gershman, Markman, & Daw, 2013).
Despite its success, the competitive theory is incom-
plete; other lines of research indicate that several forms
of cooperation between the systems also occur (see
Kool, Cushman, & Gershman, in press, for a review).
The model-free system may select goals for the model-
based system to pursue (Cushman & Morris, 2015) or
provide value estimates for approximate model-based
planning (Keramati, Smittenaar, Dolan, & Dayan, 2016).
Imaginative reinforcement learning is based on the idea
that influence can flow in the opposite direction, with the
model-based system supplying simulations for training
the model-free system (Gershman et al., 2014; Pezzulo,
Rigoli, & Chersi, 2013; Sutton, 1990).
Neural Substrates of Imaginative
Reinforcement Learning
Several previous studies have examined the neural corre-
lates of imagination during reward-based tasks. Bray,
Shimojo, and O’Doherty (2010) asked participants to
either experience or imagine rewards in the scanner,
finding that medial OFC was active for both experienced
and imagined rewards. This same region was sensitive to
hypothetical rewards in a Pavlovian conditioning task,
along with the midbrain, which parametrically tracked
expectations about the amount of hypothetical reward
(Miyapuram, Tobler, Gregorios-Pippas, & Schultz,
2012). Finally, Bulganin and Wittmann (2015) found that
imagination of rewarding personal events activated the
striatum, midbrain, and hippocampus as well as increased
functional connectivity between these regions.
Johnson and Redish (2005) have suggested that place
cells in the hippocampus may act as the neural substrate
for a simulation engine. The key evidence for this hy-
pothesis comes from studies showing that place cells
replay sequences of visited locations during rest and sleep
(see Carr, Jadhav, & Frank, 2011, for a review). Many
human brain imaging studies have also implicated the
hippocampus in imaginative simulation (Buckner,
2010). Consistent with these prior results, we found weak
evidence that hippocampal activity predicted whether
imagined paths would be subsequently taken, with the
caveat that this effect did not survive correction for
multiple comparisons.
In addition to the hippocampus, our analyses revealed
a collection of regions involved in imaginative effects on
decision-making. Broadly speaking, relatively anterior
regions (medial pFC, OFC, and lateral temporal cortex)
predicted the choice of the imagined path, whereas rel-
atively posterior regions (parietal and occipital cortex,
precuneus, fusiform gyrus, and calcarine sulcus) pre-
dicted the choice of the optimal path. A perhaps overly
simplistic functional division would be into anterior
regions dedicated to evaluating the motivational conse-
quences of decisions and posterior regions dedicated
to simulating the perceptual consequences of decisions.
Some of these same regions have been implicated in
several different forms of prospection (Spreng, Mar, &
Kim, 2009).
Prior studies have found that inferior parietal cortex and
precuneus predict correct rejection of imagined informa-
tion during memory retrieval (Kensinger & Schacter,
2006; Gonsalves et al., 2004). In some cases, false memo-
ries are associated with activity in ventromedial pFC
(Kensinger & Schacter, 2006), consistent with our neuro-
imaging results. However, no prior studies have directly ex-
amined the neural processes involved in imagination
during reinforcement learning.
Bug or Feature?
Is imagination useful or hurtful? Clearly, the ability to
imagine certain scenarios without actually experiencing
them can be useful, perhaps even indispensable in the
real world. Most of us do not need to experience killing
someone to know that it has undesirable consequences.
Moreover, simulating such scenarios can exert a powerful
effect on psychophysiological measures of aversion
(Cushman, Gray, Gaffey, & Mendes, 2012), suggesting
that acts of imagination approach the potency of real
experience.
On the other hand, we have demonstrated that imag-
ination falls prey to the well-known optimism bias
(Sharot, 2011), and this in turn influences subsequent
decisions. Our findings are also closely related to another
bias: imagination inflation, the observation that simply
imagining an event can increase one’s judgment of its
likelihood. For example, participants asked to imagine
either Gerald Ford or Jimmy Carter winning the 1976
presidential race subsequently rated the imagined event
as more likely (Carroll, 1978). In essence, our main find-
ing is a reinforcement learning version of imagination
inflation, whereby imagining an event increases one’s
judgment of its value.
Thus, overzealous use of the imagination could easily
go awry. As philosophers have recognized (Kind & Kung,
Gershman, Zhou, and Kommers
2111
D
o
w
n
l
o
a
d
e
d
f
r
o
m
l
l
/
/
/
/
j
f
/
t
t
i
t
.
:
/
/
h
t
t
p
:
/
D
/
o
m
w
i
n
t
o
p
a
r
d
c
e
.
d
s
f
i
r
o
l
m
v
e
h
r
c
p
h
a
d
i
i
r
r
e
.
c
c
t
.
o
m
m
/
j
e
o
d
u
c
n
o
/
c
a
n
r
a
t
r
i
t
i
c
c
l
e
e
-
p
-
d
p
d
2
f
9
/
1
2
2
9
/
2
1
1
2
0
/
3
2
1
1
9
0
5
3
3
/
2
1
6
6
7
8
o
6
c
7
n
7
_
7
a
/
_
j
0
o
1
c
1
n
7
0
_
a
p
_
d
0
1
b
1
y
7
g
0
u
.
e
p
s
t
d
o
f
n
b
0
y
7
S
M
e
I
p
T
e
m
L
i
b
b
e
r
r
a
2
r
0
2
i
3
e
s
/
j
f
/
t
.
u
s
e
r
o
n
1
7
M
a
y
2
0
2
1
2016; Sorensen, 1992), the instructive use of the imagina-
tion is critically dependent on its obedience to con-
straints imposed by the real world. If imagination can
be untethered from these constraints, then we may find
ourselves mistakenly using it to transcend reality rather
than to learn about it.
Acknowledgments
This project was made possible through grant support from the
National Institutes of Health (CRCNS R01-1207833). This work
involved the use of instrumentation supported by the NIH
Shared Instrumentation Grant Program, grant number
S10OD020039. We acknowledge the University of Minnesota
Center for Magnetic Resonance Research for use of the multi-
band-EPI pulse sequences. We are grateful to Bradley Doll for
sharing his stimuli, to Florian Froehlich for helping to collect
data, and to Adam Morris for comments on a previous draft
of the article.
Reprint requests should be sent to Samuel J. Gershman,
Department of Psychology, Harvard University, Room 295.05,
52 Oxford St., Cambridge, MA 02138, or via e-mail: gershman@
fas.harvard.edu.
REFERENCES
Balleine, B. W., & Dickinson, A. (1998). Goal-directed
instrumental action: Contingency and incentive learning and
their cortical substrates. Neuropharmacology, 37, 407–419.
Bray, S., Shimojo, S., & O’Doherty, J. P. (2010). Human medial
orbitofrontal cortex is recruited during experience of
imagined and real rewards. Journal of Neurophysiology, 103,
2506–2512.
Buckner, R. L. (2010). The role of the hippocampus in
prediction and imagination. Annual Review of Psychology,
61, 27–48.
Bulganin, L., & Wittmann, B. C. (2015). Reward and novelty
enhance imagination of future events in a motivational-
episodic network. PLoS One, 10, e0143477.
Carr, M. F., Jadhav, S. P., & Frank, L. M. (2011). Hippocampal
replay in the awake state: A potential physiological substrate
of memory consolidation and retrieval. Nature
Neuroscience, 14, 147–153.
Carroll, J. S. (1978). The effect of imagining an event on
expectations for the event: An interpretation in terms of the
availability heuristic. Journal of Experimental Social
Psychology, 14, 88–96.
Cushman, F., & Morris, A. (2015). Habitual control of goal
selection in humans. Proceedings of the National Academy
of Sciences, U.S.A., 112, 13817–13822.
Cushman, F. A., Gray, K., Gaffey, A., & Mendes, W. (2012).
Simulating murder: The aversion to harmful action. Emotion,
12, 2–7.
Daw, N. D., Gershman, S. J., Seymour, B., Dayan, P., & Dolan,
R. J. (2011). Model-based influences on humans’ choices
and striatal prediction errors. Neuron, 69, 1204–1215.
Daw, N. D., Niv, Y., & Dayan, P. (2005). Uncertainty-based
competition between prefrontal and dorsolateral striatal
systems for behavioral control. Nature Neuroscience, 8,
1704–1711.
Dolan, R. J., & Dayan, P. (2013). Goals and habits in the brain.
Neuron, 80, 312–325.
Driskell, J. E., Copper, C., & Moran, A. (1994). Does mental
practice enhance performance? Journal of Applied
Psychology, 79, 481–492.
Feinberg, D. A., Moeller, S., Smith, S. M., Auerbach, E.,
Ramanna, S., Gunther, M., et al. (2010). Multiplexed echo
planar imaging for subsecond whole brain FMRI and fast
diffusion imaging. PLoS One, 5, e15710.
Gershman, S. J., Markman, A. B., & Otto, A. R. (2014).
Retrospective revaluation in sequential decision making: A
tale of two systems. Journal of Experimental Psychology:
General, 143, 182–194.
Gläscher, J., Daw, N., Dayan, P., & O’Doherty, J. P. (2010).
States versus rewards: Dissociable neural prediction error
signals underlying model-based and model-free reinforcement
learning. Neuron, 66, 585–595.
Gonsalves, B. D., Reber, P. J., Gitelman, D. R., Parrish, T. B.,
Mesulam, M. M., & Paller, K. A. (2004). Neural evidence that
vivid imagining can lead to false remembering. Psychological
Science, 15, 655–660.
Johnson, A., & Redish, A. D. (2005). Hippocampal replay
contributes to within session learning in a temporal
difference reinforcement learning model. Neural Networks,
18, 1163–1171.
Kensinger, E. A., & Schacter, D. L. (2006). Neural processes
underlying memory attribution on a reality-monitoring task.
Cerebral Cortex, 16, 1126–1133.
Keramati, M., Smittenaar, P., Dolan, R. J., & Dayan, P.
(2016). Adaptive integration of habits into depth-limited
planning defines a habitual-goal-directed spectrum.
Proceedings of the National Academy of Sciences, U.S.A.,
113, 12868–12873.
Kind, A., & Kung, P. (2016). Knowledge through imagination.
New York: Oxford University Press.
Kool, W., Cushman, F. A., & Gershman, S. J. (2016). When does
model-based control pay off? PLoS Computational Biology,
12, e1005090.
Kool, W., Cushman, F. A., & Gershman, S. J. (in press).
Competition and cooperation between multiple
reinforcement learning systems. In R. W. Morris, A. Bornstein,
& A. Shenhav (Eds.), Goal-directed decision making:
Computations and neural circuits. New York: Elsevier.
Korman, M., Raz, N., Flash, T., & Karni, A. (2003). Multiple shifts
in the representation of a motor sequence during the
acquisition of skilled performance. Proceedings of the
National Academy of Sciences, U.S.A., 100, 12492–12497.
Ludvig, E. A., Mirian, M. S., Kehoe, E. J., & Sutton, R. S. (2017).
Associative learning from replayed experience. http://www.
biorxiv.org/content/early/2017/01/16/100800.
Mast, F. W., & Kosslyn, S. M. (2002). Visual mental images can
be ambiguous: Insights from individual differences in spatial
transformation abilities. Cognition, 86, 57–70.
Miyapuram, K. P., Tobler, P. N., Gregorios-Pippas, L., & Schultz,
W. (2012). BOLD responses in reward regions to hypothetical
and imaginary rewards. Neuroimage, 59, 1692–1699.
Moeller, S., Yacoub, E., Olman, C. A., Auerbach, E., Strupp, J.,
Harel, N., et al. (2010). Multiband multislice GE-EPI at 7 tesla,
with 16-fold acceleration using partial parallel imaging with
application to high spatial and temporal whole-brain fMRI.
Magnetic Resonance in Medicine, 63, 1144–1153.
Niyogi, P., Girosi, F., & Poggio, T. (1998). Incorporating prior
information in machine learning by creating virtual examples.
Proceedings of the IEEE, 86, 2196–2209.
Otto, A. R., Gershman, S. J., Markman, A. B., & Daw, N. D.
(2013). The curse of planning: Dissecting multiple
reinforcement-learning systems by taxing the central
executive. Psychological Science, 24, 751–761.
Oudiette, D., & Paller, K. A. (2013). Upgrading the sleeping
brain with targeted memory reactivation. Trends in Cognitive
Sciences, 17, 142–149.
Pezzulo, G., Rigoli, F., & Chersi, F. (2013). The mixed
instrumental controller: Using value of information to
2112
Journal of Cognitive Neuroscience
Volume 29, Number 12
D
o
w
n
l
o
a
d
e
d
f
r
o
m
l
l
/
/
/
/
j
t
t
f
/
i
t
.
:
/
/
h
t
t
p
:
/
D
/
o
m
w
i
n
t
o
p
a
r
d
c
e
.
d
s
f
i
r
o
l
m
v
e
h
r
c
p
h
a
d
i
i
r
r
e
.
c
c
t
.
o
m
m
/
j
e
o
d
u
c
n
o
/
c
a
n
r
a
t
r
i
t
i
c
c
l
e
e
-
p
-
d
p
d
2
f
9
/
1
2
2
9
/
2
1
1
2
0
/
3
2
1
1
9
0
5
3
3
/
2
1
6
6
7
8
o
6
c
7
n
7
_
7
a
/
_
j
0
o
1
c
1
n
7
0
_
a
p
_
d
0
1
b
1
y
7
g
0
u
.
e
p
s
t
d
o
f
n
b
0
y
7
S
M
e
I
p
T
e
m
L
i
b
b
e
r
r
a
2
r
0
2
i
3
e
s
/
j
.
f
/
t
u
s
e
r
o
n
1
7
M
a
y
2
0
2
1
combine habitual choice and mental simulation. Frontiers in
Psychology, 4, 92.
Rigoux, L., Stephan, K. E., Friston, K. J., & Daunizeau, J. (2014).
Bayesian model selection for group studies—Revisited.
Neuroimage, 84C, 971–985.
Rohrbaugh, M., & Riccio, D. (1970). Paradoxical enhancement
of learned fear. Journal of Abnormal Psychology, 75, 210–216.
Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural
substrate of prediction and reward. Science, 275, 1593–1599.
Seymour, B., O’Doherty, J. P., Dayan, P., Koltzenburg, M., Jones,
A. K., Dolan, R. J., et al. (2004). Temporal difference models
describe higher-order learning in humans. Nature, 429,
664–667.
Sharot, T. (2011). The optimism bias. New York: Pantheon.
Smittenaar, P., FitzGerald, T. H. B., Romei, V., Wright, N. D., &
Dolan, R. J. (2013). Disruption of dorsolateral prefrontal
cortex decreases model-based in favor of model-free control
in humans. Neuron, 80, 914–919.
Sorensen, R. E. (1992). Thought experiments. Oxford: Oxford
University Press.
Spreng, R. N., Mar, R. A., & Kim, A. S. N. (2009). The common
neural basis of autobiographical memory, prospection,
navigation, theory of mind, and the default mode: A quantitative
meta-analysis. Journal of Cognitive Neuroscience, 21, 489–510.
Sutton, R. S. (1990). Integrated architectures for learning,
planning, and reacting based on approximating dynamic
programming. In M. Morgan (Ed.), Proceedings of the
Seventh International Conference on Machine Learning
(pp. 216–224). San Francisco: Morgan Kaufmann.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An
introduction. Cambridge, MA: MIT Press.
Tartaglia, E. M., Bamert, L., Mast, F. W., & Herzog, M. H. (2009).
Human perceptual learning by mental imagery. Current
Biology, 19, 2081–2085.
Tzourio-Mazoyer, N., Landeau, B., Papathanassiou, D., Crivello,
F., Etard, O., Delcroix, N., et al. (2002). Automated
anatomical labeling of activations in SPM using a macroscopic
anatomical parcellation of the MNI MRI single-subject brain.
Neuroimage, 15, 273–289.
Walker, M. P., Brakefield, T., Morgan, A., Hobson, J. A., &
Stickgold, R. (2002). Practice with sleep makes perfect:
Sleep-dependent motor skill learning. Neuron, 35,
205–211.
Wohldmann, E. L., Healy, A. F., & Bourne, L. E., Jr. (2007).
Pushing the limits of imagination: Mental practice for learning
sequences. Journal of Experimental Psychology: Learning,
Memory, and Cognition, 33, 254–261.
Wunderlich, K., Smittenaar, P., & Dolan, R. (2012). Dopamine
enhances model-based over model-free choice behavior.
Neuron, 75, 418–424.
Xu, J., Moeller, S., Auerbach, E. J., Strupp, J., Smith, S. M.,
Feinberg, D. A., et al. (2013). Evaluation of slice accelerations
using multiband echo planar imaging at 3T. Neuroimage, 83,
991–1001.