Emergent Solutions to High-Dimensional
Multitask Reinforcement Learning
Stephen Kelly
Department of Computer Science, Dalhousie University, 6050 University Avenue,
Halifax, NS, B3H 4R2, Canada
skelly@cs.dal.ca
Malcolm I. Heywood
Department of Computer Science, Dalhousie University, 6050 University Avenue,
Halifax, NS, B3H 4R2, Canada
mheywood@cs.dal.ca
doi:10.1162/EVCO_a_00232
Astratto
Algorithms that learn through environmental interaction and delayed rewards, or re-
inforcement learning (RL), increasingly face the challenge of scaling to dynamic, high-
dimensional, and partially observable environments. Significant attention is being paid
to frameworks from deep learning, which scale to high-dimensional data by decompos-
ing the task through multilayered neural networks. While effective, the representation
is complex and computationally demanding. In this work, we propose a framework
based on genetic programming which adaptively complexifies policies through inter-
action with the task. We make a direct comparison with several deep reinforcement
learning frameworks in the challenging Atari video game environment as well as more
traditional reinforcement learning frameworks based on a priori engineered features.
Results indicate that the proposed approach matches the quality of deep learning while
being a minimum of three orders of magnitude simpler with respect to model com-
plexity. This results in real-time operation of the champion RL agent without recourse
to specialized hardware support. Inoltre, the approach is capable of evolving solu-
tions to multiple game titles simultaneously with no additional computational cost. In
this case, agent behaviours for an individual game as well as single agents capable of
playing all games emerge from the same evolutionary run.
Keywords
Emergent modularity, cooperative coevolution, genetic programming, rinforzo
apprendimento, multitask learning.
1
introduzione
Reinforcement learning (RL) is an area of machine learning in which an agent develops a
decision-making policy through direct interaction with a task environment. Specifically,
the agent observes the environment and suggests an action based on the observation,
repeating the process until a task end state is encountered. The end state provides a
reward signal that characterizes quality of the policy, or the degree of success/failure.
The policy’s objective is therefore to select actions that maximize this long-term reward.
In real-world applications of RL, the agent is likely to observe the environment
through a high-dimensional sensory interface (per esempio., a video camera). This potentially
implies that: (1) RL agents need to be able to assess large amounts of “low-level” in-
formation; (2) complete information about the environment is often not available from
Manuscript received: 11 Luglio 2017; revised: 22 May 2018 E 1 Giugno 2018; accepted: 6 Giugno 2018.
© 2018 Istituto di Tecnologia del Massachussetts.
Pubblicato sotto Creative Commons
Attribuzione 4.0 Unported (CC BY 4.0) licenza.
Evolutionary Computation 26(3): 347–380
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
/
/
e
D
tu
e
v
C
o
UN
R
T
io
C
e
–
P
D
l
F
/
/
/
/
2
6
3
3
4
7
1
5
5
2
3
7
5
e
v
C
o
_
UN
_
0
0
2
3
2
P
D
.
/
F
B
sì
G
tu
e
S
T
T
o
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
S. Kelly and M. IO. Heywood
a single observation; E (3) extended interactions and sparse rewards are common,
requiring the agent to make thousands of decisions before receiving enough feedback
to assess the quality of the policy. That said, the potential applications for RL are vast
and diverse, from autonomous robotics (Kober and Peters, 2012) to video games (Szita,
2012), thus motivating research into RL frameworks that are general enough to be ap-
plied to a variety of environments without the use of application specific features.
Addressing dynamic, high-dimensional, and partially observable tasks in RL has
recently received significant attention on account of: (1) the availability of a convenient
video game emulator supporting hundreds of titles, such as the Arcade Learning Envi-
ronment (ALE) (Bellemare, Naddaf et al., 2012); E, (2) human competitive results from
deep learning (per esempio., Mnih et al., 2015). ALE defines state, (cid:2)S(T ), in terms of direct screen
capture, while actions are limited to those of the original Atari console. Così, Imparare-
ing agents interact with games via the same interface experienced by human players.
In sampling 49 game titles, each designed to be interesting and challenging for human
players, task environments with a wide range of properties are identified. As such, each
game title requires a distinct RL policy that is capable of maximizing the score over the
course of the game.
In this work, we introduce a genetic programming (GP) framework that specifi-
cally addresses challenges in scaling RL to real-world tasks while maintaining minimal
model complexity. The algorithm uses emergent modularity (Nolfi, 1997) to adaptively
complexify policies through interaction with the task environment. A team of programs
represents the basic behavioural module (Lichodzijewski and Heywood, 2008B), or a
mapping from state observation to an action. In sequential decision-making tasks, each
program within a team defines a unique bidding behaviour (Sezione 3.2), such that pro-
grams cooperatively select one action from the team relative to the current state obser-
vation at each time step.
Evolution begins with a population of simple teams, Figure 1a, which are then fur-
ther developed by adding, removing, and modifying individual programs. This work
extends previous versions of an earlier (symbiotic) approach to GP teaming (Lichodz-
ijewski and Heywood, 2011; Doucette et al., 2012; Kelly et al., 2012; Kelly and Hey-
legna, 2014B, 2014UN) to enable emergent behavioural modularity from a single cycle of
evolution by adaptively recombining multiple teams into variably deep/wide directed
graph structures, or Tangled Program Graphs (TPG)1 (Figure 1b). The behaviour of each
program, complement of programs per team, complement of teams per graph, and the
connectivity within each graph are all emergent properties of an open-ended evolution-
ary process. The benefits of this approach are twofold:
1. A single graph of teams, or policy graph, may eventually evolve to include hun-
dreds of teams, where each represents a simple, specialized behaviour (Figura
1B). Tuttavia, mapping a state observation to an action requires traversing only
one path through the graph from root (team) to leaf (action). Così, the repre-
sentation is capable of compartmentalizing many behaviours and recalling only
those relevant to the current environmental conditions. This allows TPG to scale
to complex, high-dimensional task environments while maintaining a relatively
low computational cost per decision.
2. The programs in each team will collectively index a small, unique subset of the
state space. As multiteam policy graphs emerge, only specific regions of the state
1Source code is available at https://web.cs.dal.ca/∼mheywood/Code/index.html
348
Evolutionary Computation Volume 26, Numero 3
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
/
/
e
D
tu
e
v
C
o
UN
R
T
io
C
e
–
P
D
l
F
/
/
/
/
2
6
3
3
4
7
1
5
5
2
3
7
5
e
v
C
o
_
UN
_
0
0
2
3
2
P
D
/
.
F
B
sì
G
tu
e
S
T
T
o
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Emergent High-Dimensional Multitask Reinforcement Learning
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
/
/
e
D
tu
e
v
C
o
UN
R
T
io
C
e
–
P
D
l
F
/
/
/
/
2
6
3
3
4
7
1
5
5
2
3
7
5
e
v
C
o
_
UN
_
0
0
2
3
2
P
D
/
.
F
B
sì
G
tu
e
S
T
T
o
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Figura 1: TPG policies. Decision making in each time step (frame) begins at the root
team (black node) and follows the edge with the winning program bid (produzione) until an
atomic action (Atari Joystick Position) is reached. The initial population contains only
single-team polices (UN). Multiteam graphs emerge as evolution progresses (B).
space that are important for decision making will be indexed by the graph as a
whole. Così, emergent modularity allows the policy to simultaneously decom-
pose the task spatially and behaviourally, detecting important regions of the state
space and optimizing the decisions made in different regions. This minimizes
the requirement for a priori crafting task specific features, and lets TGP perform
both feature construction and policy discovery simultaneously.
Unlike deep learning, the proposed TPG framework takes an explicitly emergent,
developmental approach to policy identification. Our interest is whether we can con-
struct policy graph topologies “bottom-up” that match the quality of deep learning so-
lutions without the corresponding complexity. Specifically, deep learning assumes that
the neural architecture is designed a priori, with the same architecture employed for
each game title. Così, deep learning always performs millions of calculations per deci-
sion. TPG, on the other hand, has the potential to tune policy complexity to each task
Evolutionary Computation Volume 26, Numero 3
349
S. Kelly and M. IO. Heywood
ambiente, or game title, requiring only ≈ 1000 calculations per decision in the most
complex case, and ≈ 100 calculations in the simpler cases.
In short, the aim of this work is to demonstrate that much simpler solutions can
be discovered to dynamic, high-dimensional, and partially observable environments in
RL without making any prior decisions regarding model complexity. As a consequence,
the computational costs typically associated with deep learning are avoided without im-
pacting on the quality of the resulting policies, questo è, the cost of training and deploying
a solution is now much lower. Solutions operate in real time without any recourse to
multicore or GPU hardware platforms, thus potentially simplifying the developmen-
tal/deployment overhead in posing solutions to challenging RL tasks.
Relative to our earlier work, we: (1) extend the single title comparison of 20 titles
with two comparator algorithms (Kelly and Heywood, 2017UN) to include all 49 Atari
game titles and eight comparator algorithms (Sezione 5); E (2) demonstrate that mul-
titask performance can be extended from 3 to at least 5 game titles per policy and, unlike
the earlier work, does not necessitate a Pareto objective formulation (Kelly and Hey-
legna, 2017UN), just elitism (Sezione 7).
2 Background
2.1
Task Environment
The Arcade Learning Environment (ALE) (Bellemare, Naddaf et al., 2012) is an Atari
2600 video game emulator designed specifically to benchmark RL algorithms. The ALE
allows RL agents to interact with hundreds of classic video games using the same in-
terface as experienced by human players. Questo è, an RL agent is limited to interacting
with the game using state, (cid:2)S(T ), as defined by the game screen, E 18 discrete (atomic)
actions, questo è, the set of Atari console paddle directions including “no action,” in com-
bination with/without the fire button. Each game screen is defined by a 210 × 160 pixel
matrix with 128 potential colours per pixel, refreshed at a frame rate of 60 Hz. In prac-
tice, the raw screen frames are preprocessed prior to being presented to an RL agent
(see Section 2.2 for a summary of approaches assumed to date, and Section 4.1 for the
specific approach assumed in this work).
È interessante notare, important game entities often appear intermittently over sequential
frames, creating visible screen flicker. This is a common technique game designers used
to work around memory limitations in the original Atari hardware. Tuttavia, it presents
a challenge for RL because it implies that Atari game environments are partially observ-
able. That is to say, a single frame rarely depicts the complete game state.
Inoltre, agents stochastically skip screen frames with probability p = 0.25, con
the previous action being repeated on skipped frames (Bellemare, Naddaf et al., 2012;
Hausknecht and Stone, 2015). This is a default setting in ALE, and aims to limit artificial
agents to roughly the same reaction time as a human player as well as introducing an
additional source of stochasticity. A single episode of play lasts a maximum of 18,000
frames, not including skipped frames.
2.2 RL under ALE Tasks
Historically, approaches to RL have relied on a priori designed task specific state rep-
resentations (attributes). This changed with the introduction of the Deep Q-Network
(DQN) (Mnih et al., 2015). DQN employs a deep convolutional neural network archi-
tecture to encode a representation directly from screen capture (thus a task-specific rep-
resentation). A multilayer perceptron is simultaneously trained from this representation
350
Evolutionary Computation Volume 26, Numero 3
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
/
/
e
D
tu
e
v
C
o
UN
R
T
io
C
e
–
P
D
l
F
/
/
/
/
2
6
3
3
4
7
1
5
5
2
3
7
5
e
v
C
o
_
UN
_
0
0
2
3
2
P
D
/
.
F
B
sì
G
tu
e
S
T
T
o
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Emergent High-Dimensional Multitask Reinforcement Learning
to estimate a value function (the action selector) through Q-learning. Image preprocess-
ing was still necessary and took the form of down sampling the original 210 × 160 RGB
frame data to 84 × 84 and extracting the luminance channel. Inoltre, a temporal slid-
ing window was assumed in which the input to the first convolution layer was actually
a sequence of the four most recently appearing frames. This reduced the partial observ-
ability of the task, as all the game state should now be visible.
In assuming Q-learning, DQN is an off-policy method, for which one of the most
critical elements is support for replay memory. As such, performance might be sen-
sitive to the specific content of this memory (the “memories” replayed are randomly
sampled). The General Reinforcement Learning Architecture (Gorila) extended the ap-
proach of DQN with a massively parallel distributed infrastructure (100s of GPUs) A
support the simultaneous development of multiple DQN learners (Nair et al., 2015).
The contributions from the distributed learners periodically update a central “parame-
ter server” that ultimately represents the solution. Gorila performed better than DQN
on most game titles, but not in all cases, indicating that there are possibly still sensitiv-
ities to replay memory content.
Q-learning is also known to potentially result in action values that are excessively
high. Such “overestimations” were recently shown to be associated with inaccuracies in
the action values, where this is likely to be the norm during the initial stages of training
(van Hasselt et al., 2016). The solution proposed by van Hasselt et al. (2016) for address-
ing this issue was to introduce two sets of weights, one for action selection and one for
policy evaluation. This was engineered into the DQN architecture by associating the
two roles with DQN’s online network and target network respectively.2 The resulting
Double DQN framework improved on the original DQN results for more than half of
IL 49 game titles from the ALE task.
Most recently, on-policy methods (per esempio., Sarsa) have appeared in which multiple in-
dependent policies are trained in parallel (Mnih et al., 2016). Each agent’s experience of
the environment is entirely independent (no attempt is made to enforce the centraliza-
tion of memory/experience). This means that the set of RL agents collectively experi-
ence a wider range of states. The resulting evaluation under the Atari task demonstrated
significant reductions to computational requirements3 and better agent strategies. Quello
said, in all cases, the deep learning architecture is specified a priori and subject to prior
parameter tuning on a subset of game titles.
Neuro-evolution represents one of the most widely investigated techniques within
the context of agent discovery for games. Hausknecht et al. (2014) performed a com-
parison of different neuro-evolutionary frameworks under two state representations:
game title specific objects versus screen capture. Preprocessing for screen capture took
the form of down sampling the original 210 × 160 RGB frame data to produce eight
“substrates” of dimension 16 × 21 = 336; each substrate corresponding to one of the
eight colours present in a SECAM representation (provided by the ALE). If the colour
is present in the original frame data, it appears at a corresponding substrate node.
Hausknecht et al. (2014) compared Hyper-NEAT, NEAT, and two simpler schemes for
evolving neural networks under the suite of Atari game titles. Hyper-NEAT provides
a developmental approach for describing large neural network architectures efficiently,
while NEAT provides a scheme for discovering arbitrary neural topologies as well as
2The “online” network in DQN maintains the master copy of the MLP, whereas the target network
is updated during “experience replay” (Mnih et al., 2015).
3A 16-core CPU as opposed to a GPU.
Evolutionary Computation Volume 26, Numero 3
351
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
/
/
e
D
tu
e
v
C
o
UN
R
T
io
C
e
–
P
D
l
F
/
/
/
/
2
6
3
3
4
7
1
5
5
2
3
7
5
e
v
C
o
_
UN
_
0
0
2
3
2
P
D
/
.
F
B
sì
G
tu
e
S
T
T
o
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
S. Kelly and M. IO. Heywood
weight values, beginning with a single fully connected neuron. NEAT was more ef-
fective under the low-dimensional object representation, whereas Hyper-NEAT was
preferable for the substrate representation.
Finalmente, Liang et al. (2016) revisit the design of task specific state information using
a hypothesis regarding the action of the convolutional neuron in deep learning. This re-
sulted in a state space in the order of 110 million attributes when applied to Atari screen
capture, but simplified decision making to a linear model. Così, an RL agent could be
identified using the on-policy Temporal Difference method of Sarsa. In comparison to
deep learning, the computational requirements for training and deployment are con-
siderably lower, but the models produced are only as good as the ability to engineer
appropriate attributes.
2.3 Multitask RL under ALE
The approaches reviewed in Section 2.2 assumed that a single RL policy was trained
on each game title. Conversely, multitask RL (MTRL) attempts to take this further and
develop a single RL agent that is able to play multiple game titles. As such, MTRL is a
step towards “artificial general intelligence,” and represents a much more difficult task
for at least two reasons: (1) RL agents must not “forget” any of their policy for playing a
previous game while learning a policy for playing a new game, E (2) during test, an
RL agent must be able to distinguish between game titles without recourse to additional
state information.
To date, two deep learning approaches have been proposed for this task. Parisotto
et al. (2015) first learn each game title independently and then use this to train a single
architecture for playing multiple titles. More recently Kirkpatrick et al. (2016) proposed
a modification to Double DQN in which subsets of weights (particularly in the MLP)
are associated with different tasks and subject to lower learning rates than weights not
already associated with previously learned tasks. They were able to learn to play up
A 6 game titles at a level comparable with the original DQN (trained on each title in-
dependently), albeit when the game titles are selected from the set of games for which
DQN was known to perform well on.
3 Tangled Program Graphs
Modular task decomposition through teaming has been a recurring theme with ge-
netic programming. Previous studies examined methods for combining the contribu-
tion from individual team members (Brameier and Banzhaf, 2001), enforcing island
models (Imamura et al., 2003), or interchanging team-wise versus individual-wise selec-
zione (Thomason and Soule, 2007). A common limitation of such schemes was a require-
ment to pre-specify the number of programs appearing within a team. Inoltre, even
when team complement is evolved in an open ended way, it has previously been neces-
sary to define fitness at both the level of the individual program and team (per esempio., Wu and
Banzhaf, 2011). Such limitations need to be addressed in order to facilitate completely
open ended approaches to evolution.
3.1
Evolving Teams of Programs
Enabling the evolution of the number and complement of programs per team in an
open manner was previously addressed in part through the use of a bidding metaphor
(Lichodzijewski and Heywood, 2008UN), in which case programs represent action, UN, E
context, P, independently. Questo è, each program defines the context for a single discrete
action, or a ∈ {UN} where A denotes the set of task specific atomic actions. Actions are
352
Evolutionary Computation Volume 26, Numero 3
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
/
/
e
D
tu
e
v
C
o
UN
R
T
io
C
e
–
P
D
l
F
/
/
/
/
2
6
3
3
4
7
1
5
5
2
3
7
5
e
v
C
o
_
UN
_
0
0
2
3
2
P
D
/
.
F
B
sì
G
tu
e
S
T
T
o
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Emergent High-Dimensional Multitask Reinforcement Learning
assigned to the program at initialization and potentially modified by variation oper-
ators during evolution. A linear program representation is assumed4 in which a reg-
ister level transfer language supports the 4 arithmetic operators, cosine, logarithmic,
the exponential operation, and a conditional statement (see Algorithm 1). The linear
representation facilitates skipping “intron” code, where this can potentially represent
60–70% of program instructions (Brameier and Banzhaf, 2007). Naturally, determining
which of the available state variables are actually used in the program, as well as the
number of instructions and their operations, are both emergent properties of the evolu-
tionary process. After execution, register R[0] represents the “bid” or “confidence” for
the program’s action, UN, relative to the currently observed state, (cid:2)S(T ). A team maps each
state observation, (cid:2)S(T ), to a single action by executing all team members (programs) rel-
ative to (cid:2)S(T ), and then choosing the action of the highest bidder. If programs were not
organized into teams, in which case all programs within the same population would
compete for the right to suggest their action, it is very likely that degenerate individu-
COME (programs that bid high for every state), would disrupt otherwise effective bidding
strategies.
Adaptively building teams of programs is addressed here through the use of a
symbiotic relation between a team population and a program population; hereafter
“TeamGP” (Lichodzijewski and Heywood, 2008B). Each individual of the team popula-
tion represents an index to some subset of the program population (see Figure 2a). Team
individuals therefore assume a variable length representation in which each individual
is stochastically initialized with [2, . . . , ω] pointers to programs from the program pop-
ulation. The only constraint is that there must be at least two different actions indexed
by the complement of programs within the same team. The same program may appear
in multiple teams, but must appear in at least one team to survive between consecutive
generations.
Performance (cioè., fitness) is expressed only at the level of teams, and takes the form
of the task dependent objective(S). After evaluating the performance of all teams, IL
worst Rgap teams are deleted from the team population. After team deletion, any pro-
gram that fails to be indexed by any team must have been associated with the worst
performing teams, hence is also deleted. This avoids the need to make arbitrary deci-
sions regarding the definition of fitness at the team versus program level (which gener-
ally take the form of task specific heuristics, thus limiting the applicability of the model
4Any GP representation could be employed; the important innovation is that context and action are
represented independently.
Evolutionary Computation Volume 26, Numero 3
353
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
/
/
e
D
tu
e
v
C
o
UN
R
T
io
C
e
–
P
D
l
F
/
/
/
/
2
6
3
3
4
7
1
5
5
2
3
7
5
e
v
C
o
_
UN
_
0
0
2
3
2
P
D
.
/
F
B
sì
G
tu
e
S
T
T
o
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
S. Kelly and M. IO. Heywood
Figura 2: Subplot (UN) illustration of the symbiotic relation between Team and Program
populations. Task fitness is only expressed at the level of a team. Each team defines
a unique set of pointers to some subset of individuals from the program population.
Multiple programs may have the same action, as the associated context for the action is
defined by the program. Legal teams must sample at least two different actions. Subplot
(B) atomic action mutated into an index to a team. There is now one less root team in
the Team population.
to specific application domains). Following the deletion of the worst teams, new teams
are introduced by sampling, cloning, and modifying Rgap surviving teams. Naturally, if
there is a performance benefit in smaller/larger teams and/or different program com-
plements, this will be reflected in the surviving team–program complements (Lichodzi-
jewski and Heywood, 2010), questo è, team–program complexity is a developmental trait.
3.2
Evolving Graphs of Teams
Evolution begins with a program population in which program actions are limited to the
task specific (atomic) actions (Figures 1a and 2a). In order to provide for the evolution
of hierarchically organized code under a completely open ended process of evolution
(cioè., emergent behavioural modularity), program variation operators are allowed to
introduce actions that index other teams within the team population. To do so, when a
program’s action is modified, it has a probability (patomic) of referencing either a different
atomic action or another team. Così, variation operators have the ability to incrementally
construct multiteam policy graphs (Figures 1b and 2b).
Each vertex in the graph is a team, while each team member, or program, repre-
sents one outgoing edge leading either to another team or an atomic action. Decision-
making in a policy graph begins at the root team, where each program in the team will
produce one bid relative to the current state, (cid:2)S(T ). Graph traversal then follows the pro-
gram/edge with the largest bid, repeating the bidding process for the same (cid:2)S(T ) at every
team/vertex along the path until an atomic action is encountered. Così, given some
state of the environment at time step t, the policy graph computes one path from root to
atomic action, where only a subset of programs in the graph (cioè., those in teams along
the path) require execution. Algorithm 2 details the process for evaluating the TPG indi-
vidual, which is repeated at every frame, (cid:2)S(T ), until an end-of-game state is encountered
and fitness for the policy graph can be determined.
As multiteam policy graphs emerge, an increasingly tangled web of connectivity
develops between the team and program populations. The number of unique solutions,
or policy graphs, at any given generation is equal to the number of root nodes (cioè., teams
354
Evolutionary Computation Volume 26, Numero 3
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
/
/
e
D
tu
e
v
C
o
UN
R
T
io
C
e
–
P
D
l
F
/
/
/
/
2
6
3
3
4
7
1
5
5
2
3
7
5
e
v
C
o
_
UN
_
0
0
2
3
2
P
D
/
.
F
B
sì
G
tu
e
S
T
T
o
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Emergent High-Dimensional Multitask Reinforcement Learning
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
/
/
e
D
tu
e
v
C
o
UN
R
T
io
C
e
–
P
D
l
F
/
/
/
/
2
6
3
3
4
7
1
5
5
2
3
7
5
e
v
C
o
_
UN
_
0
0
2
3
2
P
D
/
.
F
B
sì
G
tu
e
S
T
T
o
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
that are not referenced as any program’s action) in the team population. Only these root
teams are candidates to have their fitness evaluated, and are subject to modification by
the variation operators.
In each generation, Rgap of the root teams are deleted and replaced by offspring of
the surviving roots. The process for generating team offspring uniformly samples and
clones a root team, then applies mutation-based variation operators to the cloned team
which remove, add, and mutate some of its programs.
The team generation process introduces new root nodes until the number of roots
in the population reaches Rsize. The total number of sampling steps for generating off-
spring fluctuates, as root teams (along with the lower policy graph) are sometimes “sub-
sumed” by a new team. Conversely, graphs can be separated, Per esempio, through
program action mutation, resulting in new root nodes/policies. This implies that after
initialization, both team and program population size varies. Inoltre, while the
number of root teams remains fixed, the number of teams that become “archived” as
internal nodes (cioè., a library of reusable code) fluctuates.
Limiting evaluation, selection, and variation to root teams only has two critical ben-
efits: (1) the cost of evaluation and the size of the search space remains low because only
a fraction of the team population (root teams) represent unique policies to be evaluated
and modified in each generation and (2) since only root teams are deleted, introduced,
or modified, policy graphs are incrementally developed from the bottom up. As such,
lower-level complex structures within a policy graph are protected as long as they con-
tribute to an overall strong policy.
In summary, the teaming GP framework of Lichodzijewski and Heywood (2010) È
extended to allow policy graphs to emerge, defining the inter-relation between teams.
As programs composing a team typically index different subsets of the state space (cioè.,
Evolutionary Computation Volume 26, Numero 3
355
S. Kelly and M. IO. Heywood
the screen in the case of ALE), the resulting policy graph will incrementally adapt, In-
dexing more or less of the state space and defining the types of decisions made in differ-
ent regions. Finalmente, Kelly et al. (2018) provide an additional pictorial summary of the
TPG algorithm.
3.2.1 Neutrality Test
When variation operators introduce changes to a program, there is no guarantee that
the change will: (1) result in a behavioural change, E (2) even if a behavioural change
risultati, it will be unique relative to the current set of programs. Point 1 is still useful as
it results in the potential for multiple code changes to be incrementally built up before
they appear, or neutral networks (Brameier and Banzhaf, 2007). Tuttavia, this can also
result in wasted evaluation cycles because there is no functional difference relative to
the parent. Given that fitness evaluation is expensive, we therefore test for behavioural
uniqueness. Specifically, 50 of the most recent state observations are retained in a global
archive, O (cid:2)S(T ) ∈ {tlast − 49, . . . , tlast }. When a program is modified or a new program is
created, its bid for each state in the archive is compared against the bid of every program
in the current population. As long as all 50 bid values from the new program are not
within τ of all bids from any other program in the current population, the new program
is accepted. If the new program fails the test, then another instruction is mutated and the
test repeated. We note that such a process has similarities with the motivation of novelty
search (Lehman and Stanley, 2011), questo è, a test for different outcomes. Tuttavia, as this
process appears at a program, there is no guarantee that this will result in any novel
behaviour when it appears in a team and it is still fitness at the level of team/agents
that determines survival.
4
Experimental Methodology
For comparative purposes, evaluation of TPG will assume the same general approach
as established in the original DQN evaluation (Mnih et al., 2015). Così, we assume the
same subset of 49 Atari game titles and, post training, test the champion TPG agent
under 30 test episodes initialized with a stochastically selected number of initial no-op
actions (described in Section 5.1). This will provide us with the widest range of previous
results for comparative purposes.5 Five independent TPG runs are performed per game
titolo, where this appears to reflect most recent practice for deep learning results.6
The same parameterization for TPG was used for all games (Sezione 4.2). The only
information provided to the agents was the number of atomic actions available for each
game, the preprocessed screen frame during play (Sezione 4.1), and the final game score.
Each policy graph was evaluated in 5 game episodes per generation, up to a maximum
Di 10 game episodes per lifetime. Fitness for each policy graph is simply the average
game score over all episodes. A single champion policy for each game was identified as
that with the highest training reward at the end of evolution.
4.1
State Space Screen Capture
Based on the observation that the visual input has a lot of redundant information (cioè.,
visual game content is designed for maximizing entertainment value, as opposed to a
5An alternative test scenario has also appeared in which the RL agent takes over from game state
identified by a human player in an attempt to introduce further diversity into RL agent start state
selection (Nair et al., 2015; Mnih et al., 2016).
6The original DQN results only reflected a single run per title (Mnih et al., 2015).
356
Evolutionary Computation Volume 26, Numero 3
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
/
/
e
D
tu
e
v
C
o
UN
R
T
io
C
e
–
P
D
l
F
/
/
/
/
2
6
3
3
4
7
1
5
5
2
3
7
5
e
v
C
o
_
UN
_
0
0
2
3
2
P
D
/
.
F
B
sì
G
tu
e
S
T
T
o
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Emergent High-Dimensional Multitask Reinforcement Learning
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
/
/
e
D
tu
e
v
C
o
UN
R
T
io
C
e
–
P
D
l
F
/
/
/
/
2
6
3
3
4
7
1
5
5
2
3
7
5
e
v
C
o
_
UN
_
0
0
2
3
2
P
D
.
/
F
B
sì
G
tu
e
S
T
T
o
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Figura 3: Screen quantization steps, reducing the raw Atari pixel matrix (UN) A 1344 dec-
imal state variables (C) using a checkered subsampling scheme (B).
need to convey content with a minimal amount of information), we adopt a quantization
approach to preprocessing. The following two-step procedure is applied to every game
frame:
1. A checkered pattern mask is used to sample 50% of the pixels from the raw game
screen (see Figure 3b). Each remaining pixel assumes the 8-colour SECAM en-
coding. The SECAM encoding is provided by ALE as an alternative to the de-
fault NTSC 128-colour format. Uniformly skipping 50% of the raw screen pixels
improves the speed of feature retrieval while having minimal effect on the fi-
nal representation, since important game entities are usually larger than a single
pixel.
2. The frame is subdivided into a 42 × 32 grid.7 Each grid tile is described by a sin-
gle byte, in which each bit encodes the presence of one of eight SECAM colours
within that tile. The final quantized screen representation includes each tile byte
as a decimal value, so defining a state state space (cid:2)S(T ) Di 42 × 32 = 1,344 decimal
features in the range of 0–255, visualized in Figure 3c for the game Up ’N Down
at time step (frame) T.
This state representation is inspired by the Basic method defined in Bellemare, Naddaf
et al. (2012). Note, Tuttavia, that this method does not use a priori background detection
or pairwise combinations of features.
7Implies that the original 210 × 160 screen is divided by 5.
Evolutionary Computation Volume 26, Numero 3
357
S. Kelly and M. IO. Heywood
In comparison to the DQN approach (Mnih et al., 2015; Nair et al., 2015), no attempt
is made to design out the partially observable properties of game content (see discussion
of Section 2.2). Inoltre, the deep learning architecture’s three layers of convolution
filters reduce the down sampled 84 × 84 = 7,056 pixel space to a dimension of 3,136
before applying a fully connected multilayer perceptron (MLP).8 It is the combination
of convolution layer and MLP that represents the computational cost of deep learning.
Naturally, this imparts a fixed computational cost of learning as the entire DQN archi-
tecture is specified a priori (Sezione 6.3).
In contrasto, TPG evolves a decision-making agent from a 1,344 dimensional space.
In common with the DQN approach, no feature extraction is performed as part of the
preprocessing step, just a quantization of the original frame data. Implicit in this is an
assumption that the state space is highly redundant. TPG therefore perceives the state
spazio, (cid:2)S(T ) (Figure 3c), as read-only memory. Each TPG program then defines a po-
tentially unique subset of inputs from (cid:2)S(T ) for incorporation into their decision-making
processi. The emergent properties of TPG are then required to develop the complexity of
a solution, or policy graph, with programs organized into teams and teams into graphs.
Così, rather than assuming that all screen content contributes to decision making, IL
approach adopted by TPG is to adaptively subsample from the quantized image space.
The specific subset of state variables sampled within each agent policy is an emergent
property, discovered through interaction with the task environment alone. The impli-
cations of assuming such an explicitly emergent approach on computational cost will
be revisited in Section 6.3.
4.2
TPG Parameterization
Deploying population-based algorithms can be expensive on account of the number of
parameters and inter-relation between different parameters. In this work, no attempt
has been made to optimize the parameterization (Vedi la tabella 1); instead we carry over a
basic parameterization from experience with evolving single teams under a supervised
learning task (Lichodzijewski and Heywood, 2010).
Three basic categories of parameter are listed: Neutrality test (Sezione 3.2.1), Team
population, and Program population (Figura 2). In the case of the Team population, IL
biggest parameter decisions are the population size (how many teams to simultaneously
support), and how many candidate solutions to replace at each generation (Rgap). IL
parameters controlling the application of the variation operators common to earlier in-
stances of TeamGP (pmd , pma, pmm, pmn) also assume the values used under supervised
learning tasks (Lichodzijewski and Heywood, 2010). Conversely, patomic represents a
parameter specific to TPG, where this defines the relative chance of mutating an action
to an atomic action versus a pointer to a team (Sezione 3.2).
Likewise, the parameters controlling properties of the Program population as-
sume values used for TeamGP as applied to supervised learning tasks for all but
maxP rogSize. In essence this has been increased to the point where it is unlikely to be
encountered during evolution. The caption of Algorithm 1 summarizes the instruction
set and representation adopted for programs.
The computational limit for TPG is defined in terms of a computational resource
time constraint. Così, experiments ran on a shared cluster with a maximum runtime
Di 2 weeks per game title. The nature of some games allowed for >800 generations,
8For a tutorial on estimating the size of filters in deep learning architectures see http://cs231n
.github.io/convolutional-networks/
358
Evolutionary Computation Volume 26, Numero 3
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
/
/
e
D
tu
e
v
C
o
UN
R
T
io
C
e
–
P
D
l
F
/
/
/
/
2
6
3
3
4
7
1
5
5
2
3
7
5
e
v
C
o
_
UN
_
0
0
2
3
2
P
D
.
/
F
B
sì
G
tu
e
S
T
T
o
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Emergent High-Dimensional Multitask Reinforcement Learning
Tavolo 1: Parameterization of TPG.
Neutrality test (Sezione 3.2.1)
Number of historical samples in diversity test
Threshold for bid uniqueness (τ )
Team population
Number of (root) teams in initial population (Rsize)
Number of root nodes that can be replaced per generation (Rgap)
Probability of deleting or adding a program (pmd , pma)
Max. initial team size (ω)
Prob. of creating a new program (pmm)
Prob. of changing a program action (pmn)
Prob. of defining an action as a team or atomic action (patomic)
Program population
Total number of registers per program (numRegisters)
Max. number of instructions a program may take (maxProgSize)
Prob. of deleting or adding an instruction within a program (pdelete, padd )
Prob. of mutating an instruction within a program (pmutate)
Prob. of swapping a pair of instructions within a program (pswap)
50
10−4
360
50%
0.7
5
0.2
0.1
0.5
8
96
0.5
1.0
1.0
while others limited evolution to a few hundred. No attempt was made to parallelize
execution within each run (cioè., the TPG code base executes as a single thread), the clus-
ter merely enabled each run to be made simultaneously. Incidentally, the DQN results
required 12–14 days per game title on a GPU computing platform (Nair et al., 2015).
5
Single-Task Learning
This section documents TPG’s ability to build decision-making policies in the ALE from
the perspective of domain-independent AI, questo è, discovering policies for a variety of
ALE game environments with no task-specific parameter tuning. Before presenting de-
tailed results, we provide an overview of training performance for TPG on the suite of
49 ALE titles common to most benchmarking studies (Sezione 2.2). Figura 4 illustrates
average TPG training performance (across the 5 runs per game title) as normalized rel-
ative to DQN’s test score (100%) and random play (0%) (Mnih et al., 2015). The random
agent simply selects actions with uniform probability at each game frame.9 Under test
conditions, TPG exceeds DQN’s score in 27 games (Figure 4a), while DQN maintains the
highest score in 21 games (Figure 4b). Così, TPG and DQN are broadly comparable from
a performance perspective, each matching/beating the other in a subset of game envi-
ronments. Infatti, there is no statistical difference between TPG and DQN test scores
over all 49 games (Sezione 5.1). Tuttavia, TPG produces much simpler solutions in all
cases, largely due to its emergent modular representation, which automatically scales
through interaction with the task environment. That is to say, concurrent to learning a
9Normalized score is calculated as 100 × (TPG score − random play score)/(DQN score − random
play score). Normalizing scores makes it possible to plot TPG’s progress relative to multiple games
together regardless of the scoring scheme in different games, and facilitates making a direct comparison
with DQN.
Evolutionary Computation Volume 26, Numero 3
359
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
/
/
e
D
tu
e
v
C
o
UN
R
T
io
C
e
–
P
D
l
F
/
/
/
/
2
6
3
3
4
7
1
5
5
2
3
7
5
e
v
C
o
_
UN
_
0
0
2
3
2
P
D
.
/
F
B
sì
G
tu
e
S
T
T
o
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
S. Kelly and M. IO. Heywood
Figura 4: TPG training curves, each normalized relative to DQN’s score in the same
game (100%) and random play (0%): (UN) shows curves for the 27 games in which TPG
ultimately exceeded the level of DQN under test conditions and (B) shows curves for the
21 games in which TPG did not reach DQN level during test. Note that in several games
TPG began with random policies (generation 1) that exceeded the level of DQN. Note
that these are training scores averaged over 5 episodes in the given game title, and are
thus not as robust as DQN’s test score used for normalization. Also, these policies were
often degenerate. Per esempio, in Centipede, it is possible to get a score of 12,890 by
selecting the “up-right-fire” action in every frame. While completely uninteresting, Questo
strategy exceeds the test score reported for DQN (8,390) and the reported test score for
a human professional video game tester (11,963) (Mnih et al., 2015). Regardless of their
starting point, TPG policies improve throughout evolution to become more responsive
and interesting. Note also that in Video Pinball, TPG exceeded DQN’s score during
training but not under test. The curve for Montezuma’s Revenge is not pictured, a game
in which neither algorithm scores any points.
strategy for gameplay, TPG explicitly answers the question of: (1) what to index from the
state representation for each game; E (2) what components from other candidate policies
to potentially incorporate within a larger policy. Conversely, DQN assumes a particu-
lar architecture, based on a specific deep learning–MLP combination, in which all state
information always contributes.
5.1 Competency under the Atari Learning Environment
The quality of TPG policies is measured under the same test conditions as used for DQN,
or the average score over 30 episodes per game title with different initial conditions
and a maximum of 18,000 frames per game (Mnih et al., 2015; Nair et al., 2015). Diverse
initial conditions are achieved by forcing the agent to select “no action” for the first no-
op frames of each test game, where no-op ∈ [0, 30], selected with uniform probability at
360
Evolutionary Computation Volume 26, Numero 3
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
/
/
e
D
tu
e
v
C
o
UN
R
T
io
C
e
–
P
D
l
F
/
/
/
/
2
6
3
3
4
7
1
5
5
2
3
7
5
e
v
C
o
_
UN
_
0
0
2
3
2
P
D
.
/
F
B
sì
G
tu
e
S
T
T
o
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Emergent High-Dimensional Multitask Reinforcement Learning
the start of each game.10 Since some game titles derive their random seed from initial
player actions, the stochastic no-op ensures a different seed for each test game. Stochastic
frame skipping, discussed in Section 2.1, implies variation in the random seeds and a
stochastic environment during gameplay. Both frame skipping and no-op are enforced
in this work to ensure a stochastic environment and fair comparison to DQN. Likewise,
the available actions per game are also assumed to be known.11
Two sets of comparator algorithm are considered:
• Screen capture state: construct models from game state, (cid:2)S(T ), defined in terms
of some form of screen capture input.12 These include the original DQN deep
learning results (Mnih et al., 2015), DQN as deployed through a massive dis-
tributed search (Nair et al., 2015), double DQN (van Hasselt et al., 2016), E
hyper-NEAT (Hausknecht et al., 2014). While the original DQN report empha-
sized comparison with a human professional game tester (Mnih et al., 2015),
we avoid such a comparison here primarily because the human results are not
reproducible.
• Engineered features: define game state, (cid:2)S(T ), in terms of features designed
a priori; così, significantly simplifying the task of finding effective policies
for game play, but potentially introducing unwanted biases. Specifically, IL
Hyper-NEAT and NEAT results use hand crafted “Object” features specific to
each game title in which different “substrates” denote the presence and loca-
tion of different classes of object (see Hausknecht et al., 2014 and the discussion
of Section 2.2). The Blob-PROST results assume features designed from an at-
tempt to reverse engineer the process performed by DQN (Liang et al., 2016).
The resulting state space is a vector of ≈110 × 106 attributes from which a lin-
ear RL agent is constructed (Sarsa). Finalmente, the best performing Sarsa RL agent
(Conti-Sarsa) is included from the DQN study (Mnih et al., 2015) where this as-
sumes the availability of “contingency awareness” features (Bellemare, Veness
et al., 2012B).
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
/
/
e
D
tu
e
v
C
o
UN
R
T
io
C
e
–
P
D
l
F
/
/
/
/
2
6
3
3
4
7
1
5
5
2
3
7
5
e
v
C
o
_
UN
_
0
0
2
3
2
P
D
.
/
In each case TPG based on screen capture will be compared to the set of comparator
models across a common set of 49 Atari game titles. Statistical significance will be as-
sessed using the Friedman test, where this is a nonparametric form of ANOVA (Demšar,
2006; Japkowicz and Shah, 2011). Specifically, parametric hypothesis tests assume com-
mensurability of performance measures. This would imply that averaging results across
multiple game titles makes sense. Tuttavia, given that the score step size and types of
property measured in each title are typically different, then averaging Null test perfor-
mance across multiple titles is no longer commensurable. Conversely, the Friedman test
establishes whether or not there is a pattern to the ranks. Rejecting the Null hypothesis
implies that there is a pattern, and the Nemenyi post hoc test can be applied to assess
the significance (Demšar, 2006; Japkowicz and Shah, 2011).
F
B
sì
G
tu
e
S
T
T
o
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
10Some game titles will be more affected than others. Per esempio, titles such as Ms. Pac-Man play
a song for the first ≈70 games frames while the agent’s actions are ignored (thus no-op has no effect),
while the agent takes control immediately in other game titles.
11The study of Liang et al. (2016) questions this assumption, but finds that better performance re-
sulted when RL agents were constructed with the full action space.
12Reviewed in Section 2.2 for comparator algorithms and detailed in Section 4.1 for TPG.
Evolutionary Computation Volume 26, Numero 3
361
S. Kelly and M. IO. Heywood
In the case of RL agents derived from screen capture state information (Tavolo 7,
= 21.41 which for the purposes of the
Appendix A), the Friedman test returns a χ 2
F
Null hypothesis has an equivalent value from the F-distribution of FF = 5.89 (Demšar,
2006). The corresponding critical value F (α = 0.01, 4, 192) È 3.48, hence the Null hy-
pothesis is rejected. Applying the post hoc Nemenyi test (α = 0.05) provides a critical
difference of 0.871. Così, relative to the best ranked algorithm (Gorila), only Hyper-
NEAT is explicitly identified as outside the set of equivalently performing algorithms
(O 2.63 + 0.871 < 3.87). This conclusion is also borne out by the number of game ti-
tles for which each RL agent provides best case performance; Hyper-NEAT provides 4
best case game titles, whereas TPG, Double DQN and Gorila return at least 11 best title
scores each (Table 7, Appendix A).
Repeating the process for the comparison of TPG13 to RL agents trained under
= 80.59 and an equiv-
hand crafted features (Table 8), the Friedman test returns a χ 2
F
alent value from the F-distribution of FF = 33.52. The critical value is unchanged as the
number of models compared and game titles is unchanged, hence the Null hypothesis
is rejected. Likewise the critical difference from the post hoc Nemenyi test (α = 0.05) is
also unchanged, 0.871. This time only the performance of the Conti-Sarsa algorithm is
identified as significantly worse (or 2.16 + 0.871 < 4.76).
In summary, these results mean that despite TPG having to develop all the architec-
tural properties of a solution, TPG is still able to provide an RL agent that performs as
well as current state of the art. Conversely, DQN assumes a common prespecified deep
learning topology consisting of millions of weights. Likewise, Hyper-NEAT assumes a
pre-specified model complexity of ≈900,000 weights, irrespective of game title. As will
become apparent in the next section, TPG is capable of evolving policy complexities
that reflect the difficulty of the task.
6
Simplicity through Emergent Modularity
The simplest decision making entity in TPG is a single team of programs (Figure 1a),
representing a standalone behaviour which maps quantized pixel state to output (ac-
tion). Policies are initialized in their simplest form: as a single team with between 2 and
ω programs. Each initial team will subsample a typically unique portion of the available
(input) state space. Throughout evolution, search operators will develop team/program
complement and may incrementally combine teams to form policy graphs. However,
policies will complexify only when/if simpler solutions are outperformed. Thus, solu-
tion complexity is an evolved property driven by interaction with the task environment.
By compartmentalizing decision making over multiple independent modules (teams),
and incrementally combining modules into policy graphs, TPG is able to simultane-
ously learn which regions of the input space are important for decision making and
discover an appropriate decision-making policy.
6.1
Behavioural Modularity
Emergent behavioural modularity in the development of TPG solutions can be visual-
ized by plotting the number of teams incorporated into the champion policy graph as
a function of generation (see Figure 5a). Development is nonmonotonic, and the speci-
ficity of team compliment as a function of game environment is readily apparent. For
example, a game such as Asteroids may see very little in the way of increases to team
complement as generations increase. Conversely, Ms. Pac-Man, which is known to be
13TPG still assumes screen capture state.
362
Evolutionary Computation Volume 26, Number 3
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
/
e
d
u
e
v
c
o
a
r
t
i
c
e
-
p
d
l
f
/
/
/
/
2
6
3
3
4
7
1
5
5
2
3
7
5
e
v
c
o
_
a
_
0
0
2
3
2
p
d
/
.
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Emergent High-Dimensional Multitask Reinforcement Learning
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
/
e
d
u
e
v
c
o
a
r
t
i
c
e
-
p
d
l
f
/
/
/
/
2
6
3
3
4
7
1
5
5
2
3
7
5
e
v
c
o
_
a
_
0
0
2
3
2
p
d
/
.
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Figure 5: Emergent modularity. (a) Development of the number of teams per champion
policy graph as a function of generation and game title. The run labeled “Rand” reflects
the number of teams per policy when selection pressure is removed, confirming that
module emergence is driven by selective pressure rather than drift or other potential
biases. Black circles indicate the total number of teams in each champion policy, while
× symbols indicate the average number of teams visited to make each single decision
during testing. (b) Development of the proportion of input space indexed by champion
policies. Black circles indicate the total proportion indexed by each champion policy,
while × symbols indicate the average proportion observed to make each single decision
during testing. For clarity, only the 27 game titles with TPG agent performance ≥DQN
are depicted.
a complex task (Pepels and Winands, 2012; Schrum and Miikkulainen, 2016), saw the
development of a policy graph incorporating ≈200 teams. Importantly, making a deci-
sion in any single time step requires following one path from the root team to atomic
action. Thus, the cost in mapping a single game frame to an atomic action is not linearly
correlated to the graph size. For example, while the number of teams in the Alien policy
was ≈60, on average only 4 teams were visited per graph traversal during testing (see ×
symbols in Figure 5a). Indeed, while the total number of teams in champion TPG policy
graphs ranges from 7 (Asteroids) to 300 (Bowling), the average number of teams visited
per decision is typically less than 5 (Figure 5a).
6.2
Evolving Adapted Visual Fields
Each Atari game represents a unique graphical environment, with important events
occurring in different areas of the screen, at different resolutions, and from different
perspectives (e.g., global maze view versus first-person shooter). Part of the challenge
with high-dimensional visual input data is determining what information is relevant to
the task. Naturally, as TPG policy graphs develop, they will incrementally index more
of the state space. This is likely one reason why they grow more in certain environments.
Evolutionary Computation Volume 26, Number 3
363
S. Kelly and M. I. Heywood
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
/
e
d
u
e
v
c
o
a
r
t
i
c
e
-
p
d
l
f
/
/
/
/
2
6
3
3
4
7
1
5
5
2
3
7
5
e
v
c
o
_
a
_
0
0
2
3
2
p
d
/
.
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
Figure 6: Adapted Visual Field (AVF) of champion TPG policy graph in Up ’N Down.
Black regions indicate areas of the screen not indexed. (a) Shows the raw game screen.
(b) Shows the preprocessed state space, where each decimal state variable (0–255) is
mapped to a unique colour. (c) Shows the AVF for a single team along the active path
through the policy graph at this time step, while (d) shows the AVF for the policy graph
as a whole. Both AVFs exhibit patterns of sensitivity consistent with important regular-
ities of the environment, specifically the zigzagging track.
Figure 5b plots the proportion of input space indexed by champion policy graphs
throughout development, where this naturally correlates with the policy graph devel-
opment shown in Figure 5a. Thus, the emergent developmental approach to model
building in TPG can also be examined from the perspective of the efficiency with which
information from the state space (cid:2)s(t ) is employed. In essence, TPG policies have the ca-
pacity to develop their own Adapted Visual Fields (AVF). While the proportion of the
visual field (input space) covered by a policy’s AVF ranges from about 10% (Asteroids)
to 100% (Bowling), the average proportion required to make each decision remains low,
or less than 30% (see × symbols Figure 5b).
Figure 6 provides an illustration of the AVF as experienced by a single TPG team (c)
versus the AVF for an entire champion TPG policy graph (d) in the game “Up ‘N Down.”
This is a driving game in which the player steers a dune buggy along a track that zigzags
vertically up and down the screen. The goal is to collect flags along the route and avoid
hitting other vehicles. The player can smash opponent cars by jumping on them using
the joystick fire button, but loses a turn upon accidentally hitting a car or jumping off the
track. TPG was able to exceed the level of DQN in Up ’N Down (test games consistently
ended due to the 18,000 frame limit rather than agent error) with a policy graph that
indexed only 42% of the screen in total, and an average 12% of the screen per decision
364
Evolutionary Computation Volume 26, Number 3
Emergent High-Dimensional Multitask Reinforcement Learning
(see column %SP in Table 3). The zigzagging patterns that constitute important game
areas are clearly visible in the policy’s AVF. In this case, the policy learned a simplified
sensor representation well tailored to the task environment. It is also apparent that in
the case of the single TPG team, the AVF does not index state information from a specific
local region, but instead samples from a diverse spatial range across the entire image
(Figure 6c).
In order to provide more detail, column %SP in Table 3 gives the percent of state
space (screen) indexed by the policy as a whole. Maze tasks, in which the goal involves
directing an avatar through a series of 2-D mazes (e.g., Bank Heist, Ms. Pac-Man, Ven-
ture) typically require near-complete screen coverage in order to navigate all regions of
the maze, and relatively high-resolution is important to distinguish various game en-
tities from maze walls. However, while the policy as a whole may index most of the
screen, the modular nature of the representation implies that no more than 27% of the
indexed space is considered before making each decision (Table 3, column %SP), sig-
nificantly improving the runtime complexity of the policy. Furthermore, adapting the
visual field implies that extensive screen coverage is only used when necessary. Indeed,
in 10 of the 27 games for which TPG exceeded the score of DQN, it did so while index-
ing less that 50% of the screen, further minimizing the number of instructions required
per decision.
In summary, while the decision-making capacity of the policy graph expands
through environment-driven complexification, the modular nature of a graph repre-
sentation implies that the cost of making each decision, as measured by the number of
teams/programs which require execution, remains relatively low. Section 6.3 investi-
gates the issue of computational cost to build solutions, and Section 6.4 will consider
the cost of decision making post-training.
6.3 Computational Cost
The budget for model building in DQN was to assume a fixed number of decision mak-
ing frames per game title (50 million). The cost of making each decision in deep learn-
ing is also fixed a priori, a function of the preprocessed image (Section 6.3) and the
complexity of a multilayer perceptron (MLP). Simply put, the former provides an en-
coding of the original state space into a lower-dimensional space; the latter represents
the decision-making mechanism.
As noted in Section 4.2, TPG runs are limited to a fixed computational time of 2
weeks per game title. However, under TPG the cost of decision making is variable as
solutions do not assume a fixed topology. We can now express computational cost in
terms of the cost to reach the DQN performance threshold (27 game titles), and the
typical cost over the two-week period (remaining 21 game titles). Specifically, let T be
the generation at which a TPG run exceeds the performance of DQN. P (t ) denotes the
number of policies in the population at generation t. Let i(t ) be the average number of
instructions required by each policy to make a decision, and let f (t ) be the total number
of frames observed over all policies at generation t; then the total number of operations
t=1 P (t ) ×
required by TPG to discover a decision-making policy for each game is
i(t ) × f (t ). When viewed step-wise, this implies that computational cost can increase or
decrease relative to the previous generation, depending on the complexity of evaluating
TPG individuals (which are potentially all unique topologies).
(cid:2)
T
Figure 7 plots the number of instructions required for each game over all decision-
making frames observed by agents during training. Figure 7a characterizes computa-
tional cost in terms of solutions to the 27 game titles that reached the DQN performance
Evolutionary Computation Volume 26, Number 3
365
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
/
/
e
d
u
e
v
c
o
a
r
t
i
c
e
-
p
d
l
f
/
/
/
/
2
6
3
3
4
7
1
5
5
2
3
7
5
e
v
c
o
_
a
_
0
0
2
3
2
p
d
.
/
f
b
y
g
u
e
s
t
t
o
n
0
8
S
e
p
e
m
b
e
r
2
0
2
3
S. Kelly and M. I. Heywood
Figure 7: Number of operations per frame (y-axis) over all game frames observed dur-
ing training (x-axis). (a) Shows the subset of games up to the point where TPG exceeded
DQN test score. (b) Shows games for which TPG did not reach DQN test score. Black
diamonds denote the most complex cases, with text indicating the cumulative number
of operations required to train each algorithm up to that point. DQN’s architecture is
fixed a priori, thus cumulative computational cost at each frame is simply a sum over
the number of operations executed up to that frame. TPG’s complexity is adaptive, thus
producing a unique development curve and max operations for each game title. Frame
limit for DQN was 50 million (5 × 107). Frame limit for TPG, imposed by a cluster re-
source time constraint of 2 weeks, is only reached in (b).
threshold, that is, the computational cost of reaching the DQN performance threshold.
Conversely, Figure 7b illustrates the computational cost for games that never reached
the DQN performance threshold, that is, terminated at the 2-week limit. As such, this
is representative of the overall cost of model building in TPG for the ALE task given a
two-week computational budget. In general, cost increases with an increasing number
of (decision-making) frames, but the cost benefit of the nonmonotonic, adaptive nature
of the policy development is also apparent.
It is also readily apparent that TPG typically employed more than the DQN budget
for decision-making frames (5 × 107). However, the cost of model construction is also
a function of the operations per decision. For example, the parameterization adopted
by DQN results in an MLP hidden layer consisting of 1,605,632 weights, or a total com-
putational cost in the order of 0.8 × 1014 over all 50,000,000 training frames. The total
cost of TPG model building is 4 × 1011 in the worst case (Figure 7a). Thus, the cost of
the MLP step, without incorporating the cost of performing the deep learning convolu-
tion encoding (>3 million calculations at layer 1 for the parameterization of Mnih et al.,
2015), exceeds TPG by several orders of magnitude. Inoltre, this does not reflect the
additional cost of performing a single weight update.
366
Evolutionary Computation Volume 26, Numero 3
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
/
/
e
D
tu
e
v
C
o
UN
R
T
io
C
e
–
P
D
l
F
/
/
/
/
2
6
3
3
4
7
1
5
5
2
3
7
5
e
v
C
o
_
UN
_
0
0
2
3
2
P
D
/
.
F
B
sì
G
tu
e
S
T
T
o
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Emergent High-Dimensional Multitask Reinforcement Learning
Tavolo 2: Wall-clock time for making each decision and memory requirement.
Method
Decisions per sec
Frames per sec
Memory
DQN
Blob-PROST
TPG
5
56–300
758–2853
20
280–1500
3032–11412
9.8 GB
50 MB–9 GB
118 MB–146 MB†
†Values for TPG reflect the memory utilized to support the entire population whereas only one champion
agent is deployed post training, questo è, tens to hundreds of kilobytes. TPG wall-clock time is measured on a
2.2-GHz Intel Xeon E5-2650 CPU.
6.4 Cost of Real-Time Decision Making
Tavolo 2 summarizes the cost/resource requirement when making decisions post train-
ing, questo è, the cost of deploying an agent. Liang et al. (2016) report figures for the mem-
ory and wall clock time on a 3.2-GHz Intel Core i7-4790S CPU. Computational cost for
DQN is essentially static due to a fixed architecture being assumed for all games. Blob-
PROST complexity is a function of the diversity of colour pallet in the game title. Ap-
parently the 9 GB number was the worst case, con 3.7 GB representing the next largest
memory requirement. It is apparent that TPG solutions are typically 2 A 3 orders of
magnitude faster than DQN and an order of magnitude faster than Blob-PROST.
TPG model complexity is an evolved trait (Sezione 6) and only a fraction of the
resulting TPG graph is ever visited per decision. Tavolo 3 provides a characterization of
this in terms of three properties of champion teams (as averaged over the 5 champions
per game title, one champion per run):
• Teams (Tm)—both the average total number of teams per champion and cor-
responding average number of teams visited per decision.
• Instructions per decision (Ins)—the average number of instructions executed
per agent decision. Note that as a linear genetic programming representation
is assumed, most intron code can be readily identified and skipped for the pur-
poses of program execution (Brameier and Banzhaf, 2007). Così, “Ins” reflects
the code actually executed.
• Proportion of visual field (%SP)—the proportion of the state space (Sezione 4.1)
indexed by the entire TPG graph versus that actually indexed per decision. Questo
reflects the fact that GP individuals, unlike deep learning or Blob-PROST, are
never forced to explicitly index all of the state space. Instead the parts of the
state space utilized per program is an emergent property (discussed in detail
in Section 6.2).
It is now apparent that on average only 4 teams require evaluation per decision (val-
ues in parentheses in Tm column, Tavolo 3). This also means that decisions are typically
made on the basis of 3–27% of the available state space (values in parentheses in %SP
column, Tavolo 3). Likewise, the number of instructions executed is strongly dependent
on the game title. The TPG agent for Time Pilot executed over a thousand instructions
per action, whereas the TPG agent for Asteroids only executed 96. In short, piuttosto che
having to assume a fixed decision-making topology with hundreds of thousands of
Evolutionary Computation Volume 26, Numero 3
367
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
/
/
e
D
tu
e
v
C
o
UN
R
T
io
C
e
–
P
D
l
F
/
/
/
/
2
6
3
3
4
7
1
5
5
2
3
7
5
e
v
C
o
_
UN
_
0
0
2
3
2
P
D
.
/
F
B
sì
G
tu
e
S
T
T
o
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
S. Kelly and M. IO. Heywood
Tavolo 3: Characterizing overall TPG complexity. Tm denotes the total number of teams
in champions versus the average number of teams visited per decision (value in paren-
theses). Ins denotes the average number of instructions executed to make each decision.
%SP denotes the total proportion of the state space covered by the policy versus (value
in parentheses).
Title
Tm
Ins
%SP
Title
Tm
Ins
%SP
Alien
Assault
Asteroids
Bank Heist
Beam Rider
Boxing
Centipede
Crazy Climber
Double Dunk
Fishing Derby
Frostbite
Gravitar
Ice Hockey
Kangaroo
Kung-Fu Master
Ms. Pac-Man
Pong
Q*Bert
Road Runner
Seaquest
Star Gunner
Time Pilot
Up ’N Down
Video Pinball
Zaxxon
67(4)
37(4)
7(2)
94(3)
115(4)
102(6)
36(4)
150(3)
10(2)
33(3)
45(4)
49(4)
29(4)
52(4)
31(2)
197(5)
12(4)
255(8)
86(7)
58(4)
17(4)
189(5)
28(3)
38(3)
81(4)
498
420
96
532
443
1156
587
1076
98
472
434
499
442
877
137
603
283
2346
1169
579
516
1134
425
399
613
68(13) Amidar
50(14) Asterix
Atlantis
10(4)
Battle Zone
75(15)
Bowling
83(13)
79(26)
Breakout
48(18) C. Command
99(28) Demon Attack
Enduro
20(3)
50(15)
Freeway
53(13) Gopher
62(14) H.E.R.O.
39(14)
64(21) Krull
44(5) M’s Revenge
95(19) Name This Game
20(10)
Private Eye
99(46) River Raid
74(27) Robotank
60(15)
35(15)
95(27)
42(12) Venture
55(13) Wizard of Wor
68(16)
Space Invader
Tennis
Tutankham
James Bond
132(6)
77(5)
39(4)
15(2)
300(4)
6(3)
49(4)
19(3)
24(3)
18(4)
4(2)
96(5)
41(4)
62(4)
403(2)
93(3)
71(7)
7(3)
42(2)
68(4)
3(2)
36(2)
77(6)
23(4)
1066
739
939
191
927
158
464
311
381
296
156
979
973
376
722
361
761
286
252
624
71
464
1262
433
87(23)
73(20)
64(22)
24(6)
100(26)
8(6)
54(15)
26(8)
37(10)
26(11)
8(5)
75(24)
59(22)
58(10)
100(22)
77(13)
64(17)
15(9)
47(8)
65(17)
5(3)
58(14)
74(23)
36(12)
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
/
/
e
D
tu
e
v
C
o
UN
R
T
io
C
e
–
P
D
l
F
/
/
/
/
2
6
3
3
4
7
1
5
5
2
3
7
5
e
v
C
o
_
UN
_
0
0
2
3
2
P
D
/
.
F
B
sì
G
tu
e
S
T
T
o
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
parameters, TPG is capable of discovering emergent representations appropriate for
each task.
7 Multitask Learning
Up to this point we have demonstrated the ability of TPG to build strong single-task
decision-making policies, where a unique policy graph was trained from scratch for
each Atari game title. This section reports on TPG’s ability to learn multiple tasks si-
multaneously, or multitask reinforcement learning (MTRL). MTRL operates with the
same state representation as single-task learning. Questo è, state variables consist of raw
screen capture with no additional information regarding which game is currently being
played. Inoltre, the full Atari action set is available to agents at all times.
When the TPG population is trained on multiple Atari games simultaneously, UN
single run can produce multiple champion policies, one for each game title, that match
or exceed the level of play reported for DQN. In some cases, a multitask policy (cioè., UN
single policy graph capable of playing multiple titles) also emerges that plays all games
368
Evolutionary Computation Volume 26, Numero 3
Emergent High-Dimensional Multitask Reinforcement Learning
Tavolo 4: Task groups used in multitask reinforcement learning experiments. Each group
represents a set of games to be learned simultaneously (see Section 7.1).
3-Title Groups
Game
5-Title Groups
3.1
3.2
3.3
3.4
3.5
Alien
Asteroids
Bank Heist
Battle Zone
Bowling
Centipede
Chopper Command
Fishing Derby
Frostbite
Kangaroo
Krull
Kung-Fu Master
Ms. Pac-Man
Private Eye
Time Pilot
5.1
5.2
5.3
at the level of DQN. Inoltre, the training cost for TPG under MTRL is no greater
than task-specific learning, and the complexity of champion multitask TPG policies is
still significantly less than task-specific solutions from deep learning.
7.1
Task Groups
While it is possible to categorize Atari games by hand in order to support incremental
apprendimento (Braylan et al., 2015), no attempt was made here to organize game groups based
on perceived similarity or multitask compatibility. Such a process would be labour in-
tensive and potentially misleading, as each Atari game title defines its own graphical
ambiente, colour scheme, physics, objective(S), and scoring scheme. Inoltre,
joystick actions are not necessarily correlated between game titles. Per esempio, IL
“down” joystick position generally causes the avatar to move vertically down the screen
in maze games (per esempio., Ms. Pac-Man, Alien), but might be interpreted as “pull-up” in fly-
ing games (Zaxxon), or even cause a spaceship avatar to enter hyperspace, disappearing
and reappearing at a random screen location (Asteroids).
In order to investigate TPG’s ability to learn multiple Atari game titles simultane-
ously, a variety of task groupings, questo è, specific game titles to be learned simultane-
ously, are created from the set of games for which single-task runs of TPG performed
BENE. Relative to the four comparison algorithms which use a screen capture state rep-
resentation, TPG achieved the best reported test score in 15 del 49 Atari game titles
considered (Tavolo 7, Appendix A). Così, task groupings for MTRL can be created in an
unbiased way by partitioning the list of 15 titles in alphabetical order. Specifically, Tavolo
4 identifies 5 groups of 3 games each, E 3 groups of 5 games each.
7.2
Task Switching
As in single-task learning, each policy is evaluated in 5 episodes per generation. How-
ever, under MTRL, new policies are first evaluated in one episode under each game title
in the current task group. Thereafter, the game title for each training episode is selected
Evolutionary Computation Volume 26, Numero 3
369
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
/
/
e
D
tu
e
v
C
o
UN
R
T
io
C
e
–
P
D
l
F
/
/
/
/
2
6
3
3
4
7
1
5
5
2
3
7
5
e
v
C
o
_
UN
_
0
0
2
3
2
P
D
/
.
F
B
sì
G
tu
e
S
T
T
o
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
S. Kelly and M. IO. Heywood
with uniform probability from the set of titles in the task group. The maximum training
episodes for each policy is 5 episodes under each game title. For each consecutive block
Di 10 generations, one title is selected with uniform probability to be the active title for
which selective pressure is applied. Così, while a policy may store the final score from
up to 5 training episodes for each title, fitness at any given generation is the average
score over up to 5 episodes in the active title only. Così, selective pressure is explicitly
applied only relative to a single game title. Tuttavia, stochastically switching the active
title at regular intervals throughout evolution implies that a policy’s long-term survival
is dependent on a level of competence in all games.
7.3
Elitism
There is no multiobjective fitness component in the formulation of MTRL proposed
in this work. Tuttavia, a simple form of elitism is used to ensure the population as
a whole never entirely forgets any individual game title. As such, the single policy with
the best average score in each title is protected from deletion, regardless of which ti-
tle is currently active for selection. Note that this simple form of elitism does not pro-
tect multitask policies, which may not have the highest score for any single task, Ma
are able to perform relatively well on multiple tasks. Failing to protect multitask poli-
cies became problematic under the methodology of our first MTRL study (Kelly and
Heywood, 2017B). Così, a simple form of multitask elitism is employed in this work.
The elite multitask team is identified in each generation using the following two-step
procedure:
1. Normalize each policy’s mean score on each task relative to the rest of the cur-
rent population. Normalized score for team tmi on task tj , or scn(tmi, tj ), is cal-
culated as (sc(tmi, tj ) − scmin(tj ))/(scmax (tj ) − scmin(tj )), where sc(tmi, tj ) is the
mean score for team tmi on task tj and scmin,max (tj ) are the population-wide min
and max mean scores for task tj .
2. Identify the multitask elite policy as that with the highest minimum normal-
ized score over all tasks. Relative to all root teams in the current population, R,
the elite multitask team is identified as tmi ∈ R | ∀tmk ∈ R : min(scn(tmi, T{1..N}) >
min(scn(tmk, T{1..N}), where min(scn(tmi, T{1..N}) is the minimum normalized score
for team tmi over all tasks in the game group and n denotes the number of titles
in the group.
Così, in each generation, elitism identifies 1 champion team for each game title and 1
multitask champion, where elite teams are protected from deletion in that generation.
7.4
Parameterization
The parameterization used for TPG under multitask reinforcement learning is identical
to that described in Table 1 with the exception of Rsize parameter, or the number of root
teams to maintain in the population. Under MTRL, the population size was reduced
A 90 (1/4 of the size used under single-task learning) in order to speed up evolution
and allow more task switching cycles to occur throughout the given training period.14
A total of 5 independent runs were conducted for each task group in Table 4. Multitask
elite teams represent the champions from each run at any point during development.
14As under single-task experiments, the computational limit for MTRL is defined in terms of a com-
putational time constraint. Experiments ran on a shared cluster with a maximum runtime of 1 week
per run.
370
Evolutionary Computation Volume 26, Numero 3
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
/
/
e
D
tu
e
v
C
o
UN
R
T
io
C
e
–
P
D
l
F
/
/
/
/
2
6
3
3
4
7
1
5
5
2
3
7
5
e
v
C
o
_
UN
_
0
0
2
3
2
P
D
.
/
F
B
sì
G
tu
e
S
T
T
o
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Emergent High-Dimensional Multitask Reinforcement Learning
Tavolo 5: Summary of multitask learning results over all task groups. MT and ST report
test scores for the single best multitask (MT) and single-task (ST) policy for each game
group over all 5 independent runs. Scores that match or exceed the test score reported
for DQN in Mnih et al. (2015) are highlighted in grey (the MT score for Krull in group
5.3 È 90% of DQN’s score, and is considered a match).
MT
ST
Group
Game
Group
MT
ST
864
2176
1085
36166
197
13431.2
3173.3
−66.6
2900.7
11200
4921.3
25600
3067.7
14665
7846.7
1494.3
2151
1085
37100
197
22374.6
3266.7
−38.4
4216
10940
17846.7
42480
3164.7
14734.7
8193.3
3.1
3.2
3.3
3.4
3.5
Alien
Asteroids
Bank Heist
Battle Zone
Bowling
Centipede
Chopper Command
Fishing Derby
Frostbite
Kangaroo
Krull
Kung-Fu Master
Ms. Pac-Man
Private Eye
Time Pilot
5.1
5.2
5.3
346.7
1707
724
11800
107
9256.9
1450
24.967
2087.3
11200
3644.3
25393.3
3312.3
4000
7270
759
2181.3
724
30466.7
212
20480.2
2716.7
27.7
4283.3
11893.3
6099.7
34273.3
3430
15000
8570
Post training, the final champions from each run are subject to the same test procedure
as identified in Section 4 for each game title.
7.5 MTRL Performance
Figura 8 reports the MTRL training and test performance for TPG relative to game group
5.3, where all TPG scores are normalized relative to scores reported for DQN in Mnih
et al. (2015). By generation ≈750, the best multitask policy is able to play all 5 game titles
at the level reported for DQN.15 Under test, the multitask champion (cioè., a single policy
that plays all game titles at a high level) exceeds DQN in 4 del 5 games, while reaching
Sopra 90% of DQN’s score in the remaining title (Krull) (Figure 8b). Note that in the case
of task group 5.3, only one run produced a multitask policy capable of matching DQN
in all 5 compiti.
While the primary focus of MTRL is to produce multitask policies, a byproduct of
the methodology employed here (cioè., task switching and elitism rather than multiobjec-
tive methods) is that each run also produces high-quality single-task policies (cioè., poli-
cies that excel at one game title). Test results for these game-specific specialists, Quale
are simply the 5 elite single-task policies at the end of evolution, is reported in Figure
8C. While not as proficient as policies trained on a single task (Sezione 5.1), at least one
single-task champion emerges from MTRL in task group 5.3 that matches or exceeds
the score from DQN in each game title.
Tavolo 5 provides a summary overview of test scores for the champion multi-task
and single-task policy relative to each game group. Test scores that match or exceed
15Note that training scores reported for TPG in Figure 8a are averaged from a max of 5 episodes in
each game title, and are thus not as robust as the test scores reported in Figure 8b.
Evolutionary Computation Volume 26, Numero 3
371
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
/
/
e
D
tu
e
v
C
o
UN
R
T
io
C
e
–
P
D
l
F
/
/
/
/
2
6
3
3
4
7
1
5
5
2
3
7
5
e
v
C
o
_
UN
_
0
0
2
3
2
P
D
.
/
F
B
sì
G
tu
e
S
T
T
o
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
S. Kelly and M. IO. Heywood
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
/
/
e
D
tu
e
v
C
o
UN
R
T
io
C
e
–
P
D
l
F
/
/
/
/
2
6
3
3
4
7
1
5
5
2
3
7
5
e
v
C
o
_
UN
_
0
0
2
3
2
P
D
/
.
Figura 8: TPG multitask reinforcement learning results for game group 5.3. Each run
identifies one elite multitask policy per generation. The training performance of this
policy relative to each game title is plotted in (UN), where each curve represents the mean
score in each game title for the single best multitask policy over all 5 independent runs.
Note that multitask implies that the scores reported at each generation are all from the
same policy. Test scores for the final multitask champion from each of 5 runs is plotted
In (B), with the single best in black. Test scores for the single-task champions from each
run are plotted in (C). Note that single-task implies the scores are potentially all from
different policies. All TPG scores are normalized relative to DQN’s score in the same
game (100%) and a random agent (0%). Training scores in (UN) represent the policy’s
average score over a max of 5 episodes in each title. Test scores in (B) E (C) are the
average game score over 30 test episodes in the given game title. (The line connecting
points in (B) emphasizes that scores are from the same multi-task policy.) DQN scores
are from Mnih et al. (2015).
F
B
sì
G
tu
e
S
T
T
o
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
that of DQN are highlighted in grey. For the 3-title groups, TPG produced multitask
champions capable of playing all 3 game titles in groups 3.2, 3.4, E 3.5, while the
multitask champions learned 2/3 titles in group 3.1 and only 1/3 titles in group 3.3.
372
Evolutionary Computation Volume 26, Numero 3
Emergent High-Dimensional Multitask Reinforcement Learning
For the 5-title groups, TPG produced multitask champions capable of playing all 5 ti-
tles in group 5.3, 4/5 titles in group 5.2, E 3/5 titles in group 5.1. It seems that Alien
and Chopper Command are two game titles that TPG had difficulty learning under the
MTRL methodology adopted here (neither multitask nor single-task policies emerged
for either game title). È interessante notare, while Fishing Derby was difficult to learn when
grouped with Frostbite and Chopper Command (group 3.3), adding 2 additional game
titles to the task switching procedure (cioè., group 5.2) seems to have been helpful to
learning Fishing Derby. Note that test scores from policies developed under the MTRL
methodology are generally not as high as scores achieved through single-task learning
for the same game titles (Sezione 5.1). This is primarily due to the extra challenge of
learning multiple task simultaneously. Tuttavia, it is important to note that the popu-
lation size for MTRL experiments was 1/4 of that used for single-task experiments and
the computational budget for MTRL was half that of single-task experiments. Infatti,
the MTRL results here represent a proof of concept for TPG’s multitask ability rather
than an exhaustive study of its full potential.
7.6 Modular Task Decomposition
Problem decomposition takes place at two levels in TPG: (1) program level, in which
individual programs within a team each define a unique context for deploying a single
action; E (2) team level, in which individual teams within a policy graph each define
a unique program complement, and therefore represent a unique mapping from state
observation to action. Inoltre, since each program typically indexes only a small por-
tion of the state space, the resulting mapping will be sensitive to a specific region of
the state space. This section examines how modularity at the team-level supports the
development of multitask policies.
As TPG policy graphs develop, they will subsume an increasing number of stan-
dalone decision-making modules (teams) into a hierarchical decision-making structure.
Recall from Section 3.2 that only root teams are subject to modification by variation op-
erators. Così, teams that are subsumed as interior nodes of a policy graph undergo no
modification. This property allows a policy graph to avoid (quickly) unlearning tasks
that were experienced in the past under task switching but are not currently the ac-
tive task. This represents an alternative approach to avoiding “catastrophic forgetting”
(Kirkpatrick et al., 2016) during the continual, sequential learning of multiple tasks. IL
degree to which individual teams specialize relative to each objective experienced dur-
ing evolution (questo è, the game titles in a particular game group) can be characterized
by looking at which teams contribute to decision making at least once during testing,
relative to each game title.
Figura 9 shows the champion multitask TPG policy graph from the group 3.2 exper-
iment. The Venn diagram indicates which teams are visited at least once while playing
each game, over all test episodes. Naturally, the root team contributes to every decision
(Node marked ABC in the graph, center of Venn diagram). Five teams contribute to
playing both Bowling and Centipede (Node marked AB in the graph), while the rest of
the teams specialize for a specific game title (Node marked A in the graph). In short,
both generalist and specialist teams appear within the same policy and collectively de-
fine a policy capable of playing multiple game titles.
7.6.1 Complexity of Multitask Policy Graphs
Tavolo 6 reports the average number of teams, instructions, and proportion of state space
contributing to each decision for the multitask champion during testing. È interessante notare,
Evolutionary Computation Volume 26, Numero 3
373
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
/
/
e
D
tu
e
v
C
o
UN
R
T
io
C
e
–
P
D
l
F
/
/
/
/
2
6
3
3
4
7
1
5
5
2
3
7
5
e
v
C
o
_
UN
_
0
0
2
3
2
P
D
.
/
F
B
sì
G
tu
e
S
T
T
o
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
S. Kelly and M. IO. Heywood
Figura 9: Champion multitask TPG policy graph from the group 3.2 experiment. De-
cision making in a policy graph begins at the root node (ABC) and follows one path
through the graph until an atomic action (joystick position) is reached (see Algorithm
2). The Venn diagram indicates which teams are visited while playing each game, Sopra
all test episodes. Note that only graph nodes (teams and programs) that contributed to
decision making during test are shown.
Tavolo 6: Complexity of champion multitask policy graphs from each game group in
which all tasks were covered by a single policy. The cost of making each decision is
relative to the average number of teams visited per decision (Tm), average number of
instructions executed per decision (Ins), and proportion of state space indexed per de-
cision (%SP). TPG wall-clock time is measured on a 2.2-GHz Intel Xeon E5-2650 CPU.
Group
Title
Tm Ins %SP Decisions per sec
3.2
3.4
3.5
5.3
Battle Zone
Bowling
Centipede
Kangaroo
Krull
Kung-Fu Master
Ms. Pac-Man
Private Eye
Time Pilot
Krull
Kung-Fu Master
Ms. Pac-Man
Private Eye
Time Pilot
3
4
2
2
2
2
3
4
5
5
2
5
3
4
413
499
595
200
394
512
532
804
869
782
455
673
481
657
11
15
15
6
11
12
14
18
19
18
13
16
13
16
2687
2922
2592
3141
2502
2551
2070
1857
1982
1832
2342
1989
2192
2306
374
Evolutionary Computation Volume 26, Numero 3
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
/
/
e
D
tu
e
v
C
o
UN
R
T
io
C
e
–
P
D
l
F
/
/
/
/
2
6
3
3
4
7
1
5
5
2
3
7
5
e
v
C
o
_
UN
_
0
0
2
3
2
P
D
/
.
F
B
sì
G
tu
e
S
T
T
o
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Emergent High-Dimensional Multitask Reinforcement Learning
even for an evolved multitask policy graph (cioè., post-training), the number of instruc-
tions executed depends on the game in play, Per esempio, ranging from 200 in Kangaroo
A 512 in Kung-Fu Master for the Group 3.4 champion. While the complexity/cost of
decision making varies depending on the game in play, the average number of instruc-
tions per decision for the group 5.3 champion is 610, not significantly different from
the average of 602 required by task-specific policies when playing the same games (Vedere
Tavolo 3). Inoltre, the group 5.3 champion multitask policy averaged 1832–2342 de-
cisions per second during testing, which is significantly faster than single-task policies
from both DQN and Blob-PROST (Vedi la tabella 2). Finalmente, as the parameterization for TPG
under MTRL is identical to task-specific experiments with a significantly smaller popu-
lation size (90 vs. 360), and the number of generations is similar in both cases,16 we can
conclude that the cost of development is not significantly greater under MTRL.
8 Conclusione
Applying RL directly to high-dimensional decision-making tasks has previously been
demonstrated using both neuro-evolution and multiple deep learning architectures.
To do so, neuro-evolution assumed an a priori parameterization for model complex-
ity whereas deep learning had the entire architecture pre-specified. Inoltre, evolving
the deep learning architectures only optimizes the topology. The convolution operation
central to providing the encoded representation remains, and it is this operation that
results in the computational overhead of deep learning architectures. In this work, an
entirely emergent approach to evolution, or Tangled Program Graphs, is proposed in
which solution topology, state space indexing, and the types of action actually utilized
are all evolved in an open ended manner.
We demonstrate that TPG is able to evolve solutions to a suite of 49 Atari game titles
that generally match the quality of those discovered by deep learning at a fraction of the
model complexity. To do so, TPG begins with single teams of programs and incremen-
tally discovers a graph of interconnectivity, potentially linking hundreds of teams by the
time competitive solutions are found. Tuttavia, as each team can only have one action
(per state), very few of the teams composing a TPG solution are evaluated in order to
make each decision. This provides the basis for efficient real-time operation without re-
course to specialized computing hardware. We also demonstrate a simple methodology
for multitask learning with the TPG representation, in which the champion agent can
play multiple games titles from direct screen capture, all at the level of deep learning,
without incurring any additional training cost or solution complexity.
Future work is likely to continue investigating MTRL under increasingly high-
dimensional task environments. One promising development is that TPG seems to be
capable of policy discovery in VizDoom and ALE directly from the frame buffer (cioè.,
without the quantization procedure in Section 4.1) (per esempio., Kelly et al., 2018; Smith and
Heywood, 2018). That said, there are many more open issues, such as finding the rel-
evant diversity mechanisms for tasks such as Montezuma’s Revenge and providing
efficient memory mechanisms that would enable agents to extend beyond the reactive
models they presently assume.
16MTRL runs lasted 200–750 generations, which is roughly the range of generations reached for the
single-task runs (see Figure 4a).
Evolutionary Computation Volume 26, Numero 3
375
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
/
/
e
D
tu
e
v
C
o
UN
R
T
io
C
e
–
P
D
l
F
/
/
/
/
2
6
3
3
4
7
1
5
5
2
3
7
5
e
v
C
o
_
UN
_
0
0
2
3
2
P
D
/
.
F
B
sì
G
tu
e
S
T
T
o
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
S. Kelly and M. IO. Heywood
Ringraziamenti
S. Kelly gratefully acknowledges support from the Nova Scotia Graduate Scholarship
program. M. Heywood gratefully acknowledges support from the NSERC Discovery
program. All runs were completed on cloud computing infrastructure provided by
ACENET, the regional computing consortium for universities in Atlantic Canada. IL
TPG code base is not in any way parallel, but in adopting ACENET the five independent
runs for each of the 49 games were conducted in parallel.
Riferimenti
Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. (2012UN). The arcade learning environ-
ment: An evaluation platform for general agents. Journal of Artificial Intelligence Research,
47:253–279.
Bellemare, M. G., Veness, J., and Bowling, M. (2012B). Investigating contingency awareness us-
ing Atari 2600 games. In Proceedings of the AAAI Conference on Artificial Intelligence, pag. 864–
871.
Brameier, M., and Banzhaf, W. (2001). Evolving teams of predictors with linear genetic program-
ming. Genetic Programming and Evolvable Machines, 2(4):381–407.
Brameier, M., and Banzhaf, W. (2007). Linear genetic programming. 1st ed. New York: Springer.
Braylan, A., Hollenbeck, M., Meyerson, E., and Miikkulainen, R. (2015). Reuse of neural modules
for general video game playing. Retrieved from arXiv:1512.01537.
Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine
Learning Research, 7(1):1–30.
Doucette, J. A., Lichodzijewski, P., and Heywood, M. IO. (2012). Hierarchical task decomposition
through symbiosis in reinforcement learning. In Proceedings of the ACM Genetic and Evolu-
tionary Computation Conference, pag. 97–104.
Hausknecht, M., Lehman, J., Miikkulainen, R., and Stone, P. (2014). A neuroevolution approach to
general Atari game playing. IEEE Transactions on Computational Intelligence and AI in Games,
6(4):355–366.
Hausknecht, M., and Stone, P. (2015). The impact of determinism on learning Atari 2600 games.
In AAAI Workshop on Learning for General Competency in Video Games.
Imamura, K., Soule, T., Heckendorn, R. B., and Foster, J. UN. (2003). Behavioural diversity and
probabilistically optimal GP ensemble. Genetic Programming and Evolvable Machines, 4(3):235–
254.
Japkowicz, N., and Shah, M. (2011). Evaluating learning algorithms. Cambridge: Cambridge Uni-
versity Press.
Kelly, S., and Heywood, M. IO. (2014UN). Genotypic versus behavioural diversity for teams of pro-
grams under the 4-v-3 keepaway soccer task. In Proceedings of the AAAI Conference on Artificial
Intelligenza, pag. 3110–3111.
Kelly, S., and Heywood, M. IO. (2014B). On diversity, teaming, and hierarchical policies: Obser-
vations from the keepaway soccer task. In European Conference on Genetic Programming, pag.
75–86. Lecture Notes in Computer Science, Vol. 8599.
Kelly, S., and Heywood, M. IO. (2017UN). Emergent tangled graph representations for Atari game
playing agents. In European Conference on Genetic Programming, pag. 64–79. Lecture Notes in
Computer Science, Vol. 10196.
376
Evolutionary Computation Volume 26, Numero 3
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
/
/
e
D
tu
e
v
C
o
UN
R
T
io
C
e
–
P
D
l
F
/
/
/
/
2
6
3
3
4
7
1
5
5
2
3
7
5
e
v
C
o
_
UN
_
0
0
2
3
2
P
D
.
/
F
B
sì
G
tu
e
S
T
T
o
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Emergent High-Dimensional Multitask Reinforcement Learning
Kelly, S., and Heywood, M. IO. (2017B). Multi-task learning in Atari video games with emergent
tangled program graphs. In Proceedings of the Genetic and Evolutionary Computation Conference,
pag. 195–202.
Kelly, S., Lichodzijewski, P., and Heywood, M. IO. (2012). On run time libraries and hierarchical
symbiosis. In IEEE Congress on Evolutionary Computation, pag. 3245–3252.
Kelly, S., Smith, R., and Heywood, M. IO. (2018). Emergent policy discovery for visual reinforce-
ment learning through tangled program graphs: A tutorial. In Genetic programming theory
and practice XVI. New York: Springer.
Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, UN. A., Milan, K.,
Quan, J., Ramalho, T., Grabska-Barwinska, A., Hassabis, D., Clopath, C., Kumaran, D., E
Hadsell, R. (2016). Overcoming catastrophic forgetting in neural networks. Retrieved from
arXiv:1612.00796.
Kober, J., and Peters, J. (2012). Reinforcement learning in robotics: A survey. In M. Wiering and
M. van Otterio (Eds.), Reinforcement learning, pag. 579–610. New York: Springer.
Lehman, J., and Stanley, K. O. (2011). Abandoning objectives: Evolution through the search for
novelty alone. Evolutionary Computation, 19(2):189–223.
Liang, Y., Machado, M. C., Talvitie, E., and Bowling, M. (2016). State of the art control of Atari
games using shallow reinforcement learning. In Proceedings of the ACM International Confer-
ence on Autonomous Agents and Multiagent Systems, pag. 485–493.
Lichodzijewski, P., and Heywood, M. IO. (2008UN). Coevolutionary bid-based genetic programming
for problem decomposition in classification. Genetic Programming and Evolvable Machines,
9:331–365.
Lichodzijewski, P., and Heywood, M. IO. (2008B). Managing team-based problem solving with sym-
biotic bid-based genetic programming. In Proceedings of the ACM Genetic and Evolutionary
Computation Conference, pag. 863–870.
Lichodzijewski, P., and Heywood, M. IO. (2010). Symbiosis, complexification and simplicity un-
der GP. In Proceedings of the ACM Genetic and Evolutionary Computation Conference, pag. 853–
860.
Lichodzijewski, P., and Heywood, M. IO. (2011). The Rubik cube and GP temporal sequence learn-
ing: An initial study. In Genetic programming theory and practice VIII, chapter 3, pag. 35–54.
New York: Springer.
Mnih, V., Badia, UN. P., Mirza, M., Graves, A., Harley, T., Lillicrap, T. P., Silver, D., and Kavukcuoglu,
K. (2016). Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd
International Conference on Machine Learning, pag. 1928–1937.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, UN. A., Veness, J., Bellemare, M. G., Graves, A., Ried-
miller, M., Fidjeland, UN. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I.,
King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control
through deep reinforcement learning. Nature, 518(7540):529–533.
Nair, A., Srinivasan, P., Blackwell, S., Alcicek, C., Fearon, R., de Maria, A., Panneershelvam, V.,
Suleyman, M., Beattie, C., Petersen, S., Legg, S., Mnih, V., Kavukcuoglu, K., and Silver, D.
(2015). Massively parallel methods for deep reinforcement learning. In International Confer-
ence on Machine Learning—Deep Learning Workshop.
Nolfi, S. (1997). Using emergent modularity to develop control systems for mobile robots. Adaptive
Behavior, 5(3–4):343–363.
Parisotto, E., Ba, l. J., and Salakhutdinov, R. (2015). Actor-mimic: Deep multitask and transfer
reinforcement learning. Retrieved from arXiv:1511.06342.
Evolutionary Computation Volume 26, Numero 3
377
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
/
/
e
D
tu
e
v
C
o
UN
R
T
io
C
e
–
P
D
l
F
/
/
/
/
2
6
3
3
4
7
1
5
5
2
3
7
5
e
v
C
o
_
UN
_
0
0
2
3
2
P
D
.
/
F
B
sì
G
tu
e
S
T
T
o
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
S. Kelly and M. IO. Heywood
Pepels, T., and Winands, M. H. M. (2012). Enhancements for Monte-Carlo tree search in Ms Pac-
Man. In IEEE Symposium on Computational Intelligence in Games, pag. 265–272.
Schrum, J., and Miikkulainen, R. (2016). Discovering multimodal behavior in Ms. Pac-Man
through evolution of modular neural networks. IEEE Transactions on Computational Intelli-
gence and AI in Games, 8(1):67–81.
Smith, R. J., and Heywood, M. IO. (2018). Scaling tangled program graphs to visual reinforcement
learning in Vizdoom. In European Conference on Genetic Programming, pag. 135–150. Lecture
Notes in Computer Science, Vol. 10781.
Szita, IO. (2012). Reinforcement learning in games. In M. Wiering and M. van Otterio (Eds.), Rein-
forcement learning, pag. 539–577. New York: Springer.
Thomason, R., and Soule, T. (2007). Novel ways of improving cooperation and performance in en-
semble classifiers. In Proceedings of the ACM Genetic and Evolutionary Computation Conference,
pag. 1708–1715.
van Hasselt, H., Guez, A., and Silver, D. (2016). Deep reinforcement learning with double Q-
apprendimento. In Proceedings of the AAAI Conference on Artificial Intelligence, pag. 2094–2100.
Wu, S., and Banzhaf, W. (2011). Rethinking multilevel selection in genetic programming. Nel professionista-
ceedings of the ACM Genetic and Evolutionary Computation Conference, pag. 1403–1410.
Appendix A: Comparator Tables
Test performance over all 49 game titles is split between two comparator groups (Sezione
5.1), those assuming screen capture state information (Vedi la tabella 7) and those assuming
hand-crafted state (Vedi la tabella 8). TPG uses screen capture state in both cases.
Tavolo 7: Average game score of best agent under test conditions for TPG along with
comparator algorithms in which screen capture represents state information. Numbers
in bold represent best score on each game title. Source information for comparitor al-
gorithms is as follows: DQN (Mnih et al., 2015), Gorila (Nair et al., 2015), Double DQN
(van Hasselt et al., 2016), and Hyper-NEAT (Hausknecht and Stone, 2015).
Game
TPG
DQN
Gorila
Double DQN
Hyper-NEAT
Alien
Amidar
Assault
Asterix
Asteroids
Atlantis
Bank Heist
Battle Zone
Beam Rider
Bowling
Boxing
Breakout
Centipede
C. Command
Crazy Climber
Demon Attack
3,382.7
398.4
2,422
2,400
3,050.7
89,653
1,051
47,233.4
1,320.8
223.7
76.5
12.8
34,731.7
7,070
8,367
2,920.4
3,069.3
739.5
3,359.6
6,011.7
1,629.3
85,950
429.7
26,300
6,845.9
42.4
71.8
401.2
8,309.4
6,686.7
114,103.3
9,711.2
2,621.53
1,189.7
1,450.4
6,433.3
1,047.7
100,069
609
25,266.7
3,302.9
54
94.9
402.2
8,432.3
4,167.5
85,919.1
13,693.1
2,907.3
702.1
5,022.9
15,150
930.3
64,758
728.3
25,730
7,654
70.5
81.7
375
4,139.4
4,653
101,874
9,711.9
1,586
184.4
912.6
2,340
1,694
61,260
214
36,200
1,412.8
135.8
16.4
2.8
25,275.2
3,960
0
3,590
378
Evolutionary Computation Volume 26, Numero 3
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
/
/
e
D
tu
e
v
C
o
UN
R
T
io
C
e
–
P
D
l
F
/
/
/
/
2
6
3
3
4
7
1
5
5
2
3
7
5
e
v
C
o
_
UN
_
0
0
2
3
2
P
D
/
.
F
B
sì
G
tu
e
S
T
T
o
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
Emergent High-Dimensional Multitask Reinforcement Learning
Game
TPG
DQN
Gorila
Double DQN
Hyper-NEAT
Tavolo 7: Continued.
Double Dunk
Enduro
Fishing Derby
Freeway
Frostbite
Gopher
Gravitar
H.E.R.O.
Ice Hockey
James Bond
Kangaroo
Krull
Kung-Fu Master
M’s Revenge
Ms. Pac-Man
Name This Game
Pong
Private Eye
Q*Bert
River Raid
Road Runner
Robotank
Seaquest
Space Invader
Star Gunner
Tennis
Time Pilot
Tutankham
Up ’N Down
Venture
Video Pinball
Wizard of Wor
Zaxxon
Avg. Rank (Ri)
2
125.9
49
28.9
8,144.4
581.4
786.7
16,545.4
10
3,120
14,780
12,850.4
43,353.4
0
5,156
3,712
6
15,028.3
2,245
3,884.7
27,410
22.9
1,368
1,597.2
1,406.7
0
13,540
128
34,416
576.7
37,954.4
5,196.7
6,233.4
2.74
−18.1
301.8
−0.8
30.3
328.3
8,520
306.7
19,950.3
−1.6
576.7
6,740
3,804.7
23,270
0
2,311
7,256.7
18.9
1,787.6
10,595.8
8,315.7
18,256.7
51.6
5,286
1,975.5
57,996.7
−1.6
5,946.7
186.7
8,456.3
380
42,684.1
3,393.3
4,976.7
3.11
−10.6
114.9
20.2
11.7
605.2
5,279
1,054.6
14,913.9
−0.6
605
2,547.2
7,882
27,543.3
4.3
3,233.5
6,182.2
18.3
748.6
10,815.6
8,344.8
51,008
36.4
13,169.1
1,883.4
19,145
10.9
10,659.3
245
12,561.6
1,245
157,550.2
13,731.3
7,129.3
2.63
−6.3
319.5
20.3
31.8
241.5
8,215.4
170.5
20,357
−2.4
438
13,651
4,396.7
29,486
0
3,210
6,997.1
21
670.1
14,875
12,015
48,377
46.7
7,995
3,154
65,188
1.7
7,964
190.6
16,769.9
0
70,009
5,204
10,182
2.64
2
93.6
−49.8
29
2,260
364
370
5,090
10.6
5,660
800
12,601.4
7,720
0
3,408
6,742
−17.4
10,747.4
695
2,616
3,220
43.8
716
1,251
2,720
0
7,340
23.6
43,734
1,188
0
3,360
3,000
3.87
Tavolo 8: Average game score of best agent under test conditions for TPG (screen capture)
along with comparator algorithms based on prior object/feature identification. Numbers in
bold represent best score on each game title. Source information for comparitor algo-
rithms is as follows: Blob-PROST (Liang et al., 2016), Hyper-NEAT (Hausknecht and
Stone, 2015), NEAT (Hausknecht and Stone, 2015), and Conti-Sarsa (Mnih et al., 2015).
Game
TPG
Blob-PROST
Hyper-NEAT
NEAT
Conti-Sarsa
Alien
Amidar
Assault
Asterix
3,382.7
398.4
2,422
2,400
4,886.6
825.6
1,829.3
2,965.5
2,246
218.8
2,396
2,550
4,320
325.2
2,717.2
1,490
103.2
183.6
537
1,332
Evolutionary Computation Volume 26, Numero 3
379
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
/
/
e
D
tu
e
v
C
o
UN
R
T
io
C
e
–
P
D
l
F
/
/
/
/
2
6
3
3
4
7
1
5
5
2
3
7
5
e
v
C
o
_
UN
_
0
0
2
3
2
P
D
/
.
F
B
sì
G
tu
e
S
T
T
o
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3
S. Kelly and M. IO. Heywood
Game
TPG
Blob-PROST
Hyper-NEAT
NEAT
Conti-Sarsa
Tavolo 8: Continued.
Asteroids
Atlantis
Bank Heist
Battle Zone
Beam Rider
Bowling
Boxing
Breakout
Centipede
C. Command
Crazy Climber
Demon Attack
Double Dunk
Enduro
Fishing Derby
Freeway
Frostbite
Gopher
Gravitar
H.E.R.O.
Ice Hockey
James Bond
Kangaroo
Krull
Kung-Fu Master
M’s Revenge
Ms. Pac-Man
Name This Game
Pong
Private Eye
Q*Bert
River Raid
Road Runner
Robotank
Seaquest
Space Invader
Star Gunner
Tennis
Time Pilot
Tutankham
Up ’N Down
Venture
Video Pinball
Wizard of Wor
Zaxxon
Avg. Rank (Ri)
3,050.7
89,653
1,051
47,233.4
1,320.8
223.7
76.5
12.8
34,731.7
7,070
8,367
2,920.4
2
125.9
49
28.9
8,144.4
581.4
786.7
16,545.4
10
3,120
14,780
12,850.4
43,353.4
0
5,156
3,712
6
15,028.3
2,245
3,884.7
27,410
22.9
1,368
1,597.2
1,406.7
0
13,540
128
34,416
576.7
3,794.4
5,196.7
6233.4
2.81
2,229.9
42,937.7
793.6
37,850
2,965.5
91.1
98.3
190.3
21,137
4,898.9
81,016
2,166
−4.1
299.1
−28.8
32.6
4,534
7,451.1
1,709.7
20,273.1
22.8
1,030.5
9,492.8
33,263.4
51,007.6
2,508.4
5,917.9
7,787
20.5
100.3
14,449.4
14,583.3
41,828
34.4
2,278
889.8
1,651.6
0
5,429.5
217.7
41,257.8
1,397
21,313
5,681.2
11,721.8
2.16
220
44,200
1,308
37,600
1,443.2
250.4
91.6
40.8
33,326.6
8,120
12,840
3,082
4
112.8
−37
29.6
2,226
6,252
1,990
3,638
9
12,730
4,880
23,890.2
47,820
0
3,830
8,346
4
15,045.2
810
4,736
14,420
42.4
2,508
1,481
4,160
0.2
15,640
110
6,818
400
82,646
3,760
4,680
2.81
4,144
126,260
380
45,000
1,900
231.6
92.8
43.6
22,469.6
4,580
25,060
3,464
10.8
133.8
−43.8
30.8
1,452
6,029
2,840
3,894
3.8
2,380
12,800
20,337.8
87,340
340
4,902
7,084
15.2
1,926.4
1,935
4,718
9,600
18
944
1,481
9,580
1.2
14,320
142.4
10,220
340
253,986
17,700
6,460
2.47
89
852.9
67.4
16.2
1,743
36.4
9.8
6.1
4,647
16.9
149.8
0
−16
159.4
−85.1
19.7
180.9
2,368
429
7,295
−3.2
354.1
8.8
3,341
29,151
259
1,227
2,247
−17.4
86
960.3
2,650
89.1
12.4
675.5
267.9
9.4
0
24.9
98.2
2,449
0.6
19,761
36.9
21.4
4.76
380
Evolutionary Computation Volume 26, Numero 3
l
D
o
w
N
o
UN
D
e
D
F
R
o
M
H
T
T
P
:
/
/
D
io
R
e
C
T
.
M
io
T
.
/
/
e
D
tu
e
v
C
o
UN
R
T
io
C
e
–
P
D
l
F
/
/
/
/
2
6
3
3
4
7
1
5
5
2
3
7
5
e
v
C
o
_
UN
_
0
0
2
3
2
P
D
.
/
F
B
sì
G
tu
e
S
T
T
o
N
0
8
S
e
P
e
M
B
e
R
2
0
2
3