A Survey of Text Games for Reinforcement Learning Informed by
Natural Language
Philip Osborne
Department of Computer Science
University of Manchester
United Kingdom
philiposbornedata@gmail.com
Heido Nõmm
Department of Computer Science
University of Manchester
United Kingdom
heidonomm@gmail.com
André Freitas
Department of Computer Science
University of Manchester
United Kingdom
andre.freitas@manchester.ac.uk
Abstract
Reinforcement Learning has shown success in a number of complex virtual environments. However, many challenges still exist towards solving problems with natural language as a core component. Interactive Fiction Games (or Text Games) are one such problem type that offer a set of safe, partially observable environments where natural language is required as part of the Reinforcement Learning solution. Therefore, this survey's aim is to assist in the development of new Text Game problem settings and solutions for Reinforcement Learning informed by natural language. Specifically, this survey: 1) introduces the challenges in Text Game Reinforcement Learning problems, 2) outlines the generation tools for rendering Text Games and the subsequent environments generated, and 3) compares the agent architectures currently applied to provide a systematic review of benchmark methodologies and opportunities for future researchers.
1 Introduction
Language is often used by humans to abstract,
transfer, and communicate knowledge of their de-
cision making when completing complex tasks.
However, traditional Reinforcement Learning (RL) methods, such as the prominent neural agents introduced by Mnih et al. (2013) and Silver et al. (2016), are limited to single-task environments defined and solved without any language.
Luketina et al. (2019) specified that further
studies are required for improving solutions on
problems that necessitate the use of language.
The authors also note that language may improve
solutions on problems that are traditionally solved
without language. This includes the use of neural
agents pre-trained on natural language corpora
transferring syntactic and semantic information
to future tasks. An example is the use of the
similarities between the tasks of chopping and
cutting a carrot in CookingWorld (Trischler et al.,
2019)—that is, completing one should allow you
to transfer the decision making to the other.
To aid in developing solutions on these prob-
lems, we pose Text Games as a testing environ-
ment as they simulate complex natural language
problems in controllable settings. In other words,
researchers can generate the Text Games with
limitations to evaluate any number of the specific
challenges given by Dulac-Arnold et al. (2019)
and Luketina et al. (2019). Detailed descriptions of
the possible controls (defined as ‘handicaps’) are
provided in Section 2.4. Likewise, the challenges
with currently posed solutions are provided in
Section 2.3 and are summarized by the following:
• Partial observability – observed information
is only a limited view of the underlying truth
(shown in Figure 1).
• Large state space – balancing exploration to
new states against exploiting known paths.
• Large and sparse action space – language increases the number of actions as there are multiple ways to describe the same input.
• Long-term credit assignment – learning which actions are important when reward
signals might only be received at the end of many successive steps.
• Understanding parser feedback and language acquisition – how additional language may be grounded into the problem itself.
• Commonsense reasoning – how to utilize contextual knowledge of the problem to improve the solutions.
• Knowledge representation – using graph representations of knowledge for improved planning.

Figure 1: Sample gameplay from a fantasy Text Game as given by Narasimhan et al. (2015), where the player takes the action 'Go East' to cross the bridge.

Text Games are also safe in that we are not making decisions on a 'live' system (i.e., one that affects the real world). This allows agents to explore freely without limitations and is often a requirement for training any Reinforcement Learning agent. Furthermore, this has also been used to evaluate methods with unique goals such as Yuan et al. (2018), who wanted to maximize exploration to new states and achieved this by adding a reward to the first encounter of an unseen state.

So far, research has mostly been performed independently, with many authors generating their own environments to evaluate their proposed architectures. This lack of uniformity in the environments makes comparisons between authors challenging, and structuring recent work is essential for systematic comparability.

Formally, this survey provides the first systematic review of the challenges posed by the generation tools designed for Text Game evaluation. Furthermore, we also provide details on the currently generated environments and the resultant RL agent architectures used to solve them. This acts as a complement to prior studies on the challenges of real-world RL and of RL informed by natural language by Dulac-Arnold et al. (2019) and Luketina et al. (2019), respectively.

2 Text Games and Reinforcement Learning

Text Games are turn-based games that interpret and execute player commands to update the current position within the game environment. The current state is provided to the player in language and the player must take actions by entering textual commands that are parsed and confirmed to be valid or not. Many of these games were designed by human developers for human players based on real-world logic with a clear definition of how a player may win the game (e.g., Zork [Anderson et al., 1980]). In many cases, common knowledge is used to complete the task efficiently as, for example, a key encountered at one point will be needed to progress through locked doors later in the game.

This creates a complex set of decisions that are connected by contextual knowledge of how objects interact with each other to reach a clear goal. Reinforcement Learning is a natural solution, as the problem can be defined by a sequence of states dependent on the player's actions with a reward signal defined on the game's win condition. However, an agent will also need to find an efficient solution to the linguistic challenges that a human player would experience when exploring through the text-based game.

In this section, we first formalize the Reinforcement Learning model before introducing the Text Game generators, challenges, and possible controls to provide the essential background to the published methods introduced in later sections.

2.1 Environment Model

Reinforcement Learning is a framework that enables agents to reason about sequential decision making problems as an optimization process (Sutton and Barto, 1998). Reinforcement Learning requires a problem to be formulated as a Markov Decision Process (MDP) defined by a tuple ⟨S, A, T, R, γ⟩, where S is the set of states, A is the set of actions, T is the transition probability
function, R the reward signal, and γ the discount factor.
Given an environment defined by an MDP, the goal of an agent is to determine a policy π(a|s) specifying the action to be taken in any state that maximizes the expected discounted cumulative return $\sum_{k=0}^{\infty} \gamma^{k} r_{k+1}$.
In Text Games, however, the environment states are never directly observed; rather, textual feedback is provided after entering a command. As specified by Côté et al. (2018), a Text Game is a discrete-time Partially Observable Markov Decision Process defined by ⟨S, A, T, Ω, O, R, γ⟩, where we now have the addition of the set of observations (Ω) and a set of conditional observation probabilities O. Specifically, the function O selects from the environment state what information to show to the agent given the command entered to produce each observation ot ∈ Ω.
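To make this interaction loop concrete, the following minimal sketch shows how an agent repeatedly receives a textual observation, issues a command, and accumulates the discounted return; the `TextGameEnv`-style `reset`/`step` interface and the `Agent` object are illustrative assumptions rather than the API of TextWorld or Jericho.

```python
# Minimal sketch of the agent-environment loop for a Text Game POMDP.
# `env` and `agent` are hypothetical stand-ins, not the API of a specific framework.

def play_episode(env, agent, gamma=0.9, max_steps=100):
    obs = env.reset()                          # textual observation o_0, not the true state s_0
    total_return, discount = 0.0, 1.0
    for _ in range(max_steps):
        command = agent.act(obs)               # e.g., "go east" or "take key"
        obs, reward, done = env.step(command)  # parser feedback and reward signal
        total_return += discount * reward      # accumulates sum_k gamma^k r_{k+1}
        discount *= gamma
        if done:
            break
    return total_return
```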
Per esempio, instead of being provided with
[X, sì] coordinates or the room’s name to define
the current state, you are instead given a descrip-
tion of your location as ‘You an in a room with
a toaster, a sink, fridge and oven.’. One might
deduce in this example that the player is likely
in a kitchen but it raises an interesting idea that
another room may contain one of these objects
and not be the kitchen (per esempio., fridges/freezers are
also often located in utility rooms and sinks are in
bathrooms). The challenge of partial observability
is expanded on in Section 2.3.
2.2 Text Game Generation
Before we introduce the challenges and handicaps
it is important to understand how Text Games are
generated as this defines much of the game’s com-
plexity in both decision making and the language
used. There are two main generation tools for applying agents to interactive fiction games: TextWorld and Jericho.
TextWorld (Côté et al., 2018) is a logic engine to create new game worlds, populating them with objects and generating quests that define the goal states (and subsequent reward signals). It has been used as the generation tool for Treasure Hunter (Côté et al., 2018), Coin Collector (Yuan et al., 2018), CookingWorld (Trischler et al., 2019), and the QAit Dataset (Yuan et al., 2019).
Jericho (Hausknecht et al., 2019a) was more recently developed as a tool for supporting a set of human-made interactive fiction games that cover a range of genres. These include titles such as Zork and Hitchhiker's Guide to the Galaxy. Unsupported games can also be played through Jericho but will not have the point-based scoring system that defines the reward signals. Jericho has been used as the generator tool for CALM (Yao et al., 2020) and Jericho QA (Ammanabrolu et al., 2020a).
Environments built from the TextWorld generative system have their complexity bounded by the set of objects available. For example, Côté et al. (2018) introduce 10 objects including the logic rules for doors, containers, and keys, where complexity can be increased by introducing more objects and rules into the generation process. The most challenging environments are defined in Jericho, as these contain 57 real Interactive Fiction games that have been designed by humans, for humans. Specifically, these environments contain more complexity in the form of stochasticity, unnatural interactions and unknown objectives—difficulties originally created to trick and hamper players.
More detailed descriptions of the environments
generated from these are provided in Section 3.
2.3 Challenges and Posed Solutions
The design and partially observed representation
of Text Games creates a set of natural challenges
related to Reinforcement Learning. Inoltre,
a set of challenges specific to language under-
standing and noted by both Côté et al. (2018) and Hausknecht et al. (2019a) are given in detail in this section.
Partial Observability The main challenge for
agents solving Textual Games is the environ-
ment’s partial observability; when observed infor-
mation is only representative of a small part of
the underlying truth. The connection between the two is often unknown and can require extensive exploration and failure before an agent learns how its observations relate to its actions.
A related additional challenge is that of causality: an agent may move from one state to the next without completing the prerequisites of future states. For example, an agent is required to use a lantern necessary to light its way but may have to backtrack to previous states if this has not been obtained yet; an operation that becomes more complex the further into the game the agent has to backtrack, as the length of its trajectory increases.
Handcrafted reward functions have been shown to work for easier settings, like Coin Collector (Yuan et al., 2018), but more challenging Text Games can require more nuanced approaches. Go-Explore has been used to find high-reward trajectories and discover under-explored states (Madotto et al., 2020; Ammanabrolu et al., 2020a), where more advanced states are given higher priority over states seen earlier in the game by a weighted random exploration strategy. Ammanabrolu et al. (2020a) have expanded on this with a modular policy approach aimed at noticing bottleneck states with a combination of a patience parameter and intrinsic motivation for new knowledge. The agent would learn a chain of policies and backtrack to previous states upon getting stuck. Heuristic-based approaches have been used by Hausknecht et al. (2019b) to restrict navigational commands to only after all other interactive commands have been exhausted.
Leveraging past information has been proven to
improve model performance as it limits the partial
observability aspect of the games. Ammanabrolu
and Hausknecht (2020) propose using a dynami-
cally learned Knowledge Graph (KG) with a novel
graph mask to only fill out templates with entities
already in the learned KG.
Large State Space Whereas Textual Games, like all RL problems, require some form of exploration to find better solutions, some papers focus specifically on countering the natural overfitting of RL by actively encouraging exploration to unobserved states in new environments. For example, Yuan et al. (2018) achieved this by setting a reward signal with a bonus for encountering a new state for the first time. This removes the agent's capability for high-level contextual knowledge of the environment in favor of simply searching for unseen states.
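A minimal sketch of this kind of first-visit exploration bonus is given below; using a hash of the raw observation text as a stand-in for the hidden state is an assumption made purely for illustration.

```python
# Sketch of a first-visit exploration bonus in the spirit of Yuan et al. (2018):
# an extra reward is granted the first time an (approximate) state is encountered.

class ExplorationBonus:
    def __init__(self, bonus=1.0):
        self.bonus = bonus
        self.seen = set()

    def shape(self, observation: str, reward: float) -> float:
        # The true state is hidden, so the observation text stands in for it here.
        key = hash(observation)
        if key not in self.seen:
            self.seen.add(key)
            reward += self.bonus
        return reward
```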
Subsequent work by Côté et al. (2018)—Treasure Hunter—has expanded on this by increasing the complexity of the environment with additional obstacles such as locked doors that need color-matching keys, requiring basic object affordance and generalization ability. In a similar vein to Treasure Hunter, where in the worst case agents have to traverse all states to achieve the objective, the location and existence settings of QAit (Yuan et al., 2019) require the same with the addition of stating the location or existence of an object in the generated game.
to the challenge of Exploration vs Exploitation
that is commonly referenced in all RL literature
(Sutton and Barto, 1998).
Large, Combinatorial, and Sparse Action
Spaces Without any restrictions on length or se-
mantics, RL agents aiming to solve games in this
domain face the problem of an unbounded action
spazio. Early works limited the action phrases to
two word sentences for a verb-object, more re-
cently combinatory action spaces are considered
that include action phrases with multiple verbs and
objects. A commonly used method for handling
combinatory action spaces has been to limit the
agent to picking a template T and then filling in the
blanks with entity extraction (Hausknecht et al.,
2019UN; Ammanabrolu et al., 2020UN; Guo et al.,
2020).
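The sketch below illustrates this two-step scheme of picking a template and filling its blanks with extracted entities; the template strings, the `___` blank marker, and the entity list are illustrative placeholders rather than the exact format used by any of the cited systems.

```python
from itertools import permutations

# Sketch of a template-based action space: pick a template, then fill its blanks
# with extracted entities. The templates and entities below are illustrative only.
TEMPLATES = ["take ___", "open ___", "put ___ in ___"]
ENTITIES = ["key", "chest", "lantern"]

def candidate_actions(templates=TEMPLATES, entities=ENTITIES):
    actions = []
    for template in templates:
        n_blanks = template.count("___")
        for combo in permutations(entities, n_blanks):
            action = template
            for entity in combo:
                action = action.replace("___", entity, 1)  # fill one blank at a time
            actions.append(action)
    return actions

# candidate_actions() yields e.g. "take key", "open chest", "put key in chest", ...
```

Even with only a handful of templates and entities the candidate set grows quickly, which is why the cited approaches pair this enumeration with learned template selection and entity extraction.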
Two other approaches have been used: (io) ac-
tion elimination (Zahavy et al., 2018; Jain et al.,
2020), E (ii) generative models (Tao et al., 2018;
Yao et al., 2020). The first aims to use Deep Re-
inforcement Learning with an Action Elimination
Network for approximation of the admissibility
function: whether the action taken changes the un-
derlying game state or not. The second has been
used in limited scope with pointer softmax mod-
els generating commands over a fixed vocabulary
and the recent textual observation. The CALM
generative model, leveraging a fine-tuned GPT-2
for textual games, has proved to be competitive
against models using valid action handicaps.
Long-Term Credit Assignment Assigning rewards to actions can be difficult in situations where the reward signals are sparse. Specifically, positive rewards might only be obtained at the successful completion of the game. However, environments where an agent is unlikely to finish the game through random exploration provide rewards for specific subtasks, such as in Murugesan et al. (2020a), Trischler et al. (2019), and Hausknecht et al. (2019a). The reward signal structured in this way also aligns with hierarchical approaches such as in Adolphs and Hofmann (2020). Lastly, to overcome the challenges presented by reward sparsity, various hand-crafted reward signals have been experimented with (Yuan et al., 2018; Ammanabrolu et al., 2020a).
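As a concrete illustration of such hand-crafted shaping, the sketch below adds intermediate rewards the first time named subtasks are completed on top of a sparse win signal; the subtask names and values are hypothetical rather than taken from a specific benchmark.

```python
# Sketch of dense subtask rewards layered on top of a sparse win signal.
# The subtask names and values are hypothetical placeholders.

SUBTASK_REWARDS = {"found_kitchen": 1.0, "took_knife": 1.0, "chopped_carrot": 2.0}

def shaped_reward(sparse_reward, completed_subtasks, already_rewarded):
    reward = sparse_reward                   # e.g., a large reward only when the quest is finished
    for task in completed_subtasks:
        if task in SUBTASK_REWARDS and task not in already_rewarded:
            reward += SUBTASK_REWARDS[task]  # intermediate credit for the first completion
            already_rewarded.add(task)
    return reward
```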
Understanding Parser Feedback and Language
Acquisition LIGHT (Urbanek et al., 2019) is a
crowdsourced platform for the experimentation of
grounded dialogue in fantasy settings. It differs
from the previous action-oriented environments
by requiring dialogue with humans, embodied
agents, and the world itself as part of the quest
completion. The authors design LIGHT to in-
vestigate how ‘a model can both speak and act
grounded in perception of its environment and
dialogue from other speakers’.
Ammanabrolu et al. (2020b) extended this by providing a system that incorporates '1) large-scale language modelling based commonsense reasoning pre-training to imbue the agent with relevant priors and 2) a factorized action space of commands and dialogue'. Furthermore, evaluation can be performed against a collected dataset of held-out human demonstrations.
Commonsense Reasoning and Affordance Extraction As part of their semantic interpretation, Textual Games require some form of commonsense knowledge to be solved, for example, modeling the association between actions and associated objects (opening doors instead of cutting them, or the fact that taking items allows the agent to use them later on in the game). Various environments have been proposed for testing procedural knowledge in more distinct domains and to assess the agent's generalization abilities.
For example, Trischler et al. (2019) proposed the 'First TextWorld Problems' competition with the intention of setting a challenge requiring more planning and memory than previous benchmarks. To achieve this, the competition featured 'thousands of unique game instances generated using the TextWorld framework to share the same overarching theme—an agent is hungry in a house and has a goal of cooking a meal from gathered ingredients'. The agents therefore face a task that is more hierarchical in nature, as cooking requires abstract instructions that entail a sequence of high-level actions on objects that are solved as sub-problems. Furthermore, TW-Commonsense (Murugesan et al., 2020a) is explicitly built around agents leveraging prior commonsense knowledge for object affordance and detection of out-of-place objects.
Two pre-training datasets have been proposed
that form the evaluation goal for specialized mod-
ules of RL agents. The ClubFloyd dataset1 provides
1http://www.allthingsjacq.com/interactive
fiction.html#clubfloyd.
human playthroughs of 590 different text-based games, allowing researchers to build priors and pre-train generative action generators. Likewise, the Jericho-QA (Ammanabrolu et al., 2020a) dataset provides context at a specific timestep in various classical IF games supported by Jericho, and a list of questions, enabling pre-training of QA systems in the domain of Textual Games. The authors also used the dataset for fine-tuning a pre-trained LM for building a question-answering-based KG.
Lastly, ALFWorld (Shridhar et al., 2020) offers
a new dimension by enabling the learning of a
general policy in a textual environment and then
testing and enhancing it in an embodied environ-
ment with a common latent structure. The general
tasks also require commonsense reasoning for the agent to make connections between items and attributes, for example, (sink, ''clean''), (lamp, ''light''). Likewise, the attribute setting of the QAit (Yuan et al., 2019) environment demands that agents understand attribute affordances (cuttable, edible, cookable) to find and alter (cut, eat, cook) objects in the environment.
Safety constraints are often required to ensure that an agent does not make costly, damaging, and irreparable mistakes when learning (Dulac-Arnold et al., 2019). A common method is to utilize Constrained MDPs that restrict the agent's exploration, but defining the restriction requires a specification of which states need to be avoided. Alternatively, Hendrycks et al. (2021) adjust the reward function to avoid bad actions as well as encourage positive actions for more humanistic decision-making behaviors. They achieve this by training the agent on a dataset labelled with a numeric scale to denote the morality value of any action, thus providing commonsense knowledge to the agent. The annotations were completed by a group of computer science graduate and undergraduate students over a period of 6 months, emphasizing the scale of the challenge of labelling data such as this.
Knowledge Representation Although this is not a challenge of the environment itself, knowledge representation has been a focus of many solutions and we therefore include it here: given the partially observed state representation, there are challenges in accurately representing the game's underlying truth in this way.
It can be specified that at any given time step,
the game’s state can be represented as a graph
that captures observation entities (player, objects,
locations, etc.) as vertices and the relationship be-
tween them as edges. As Text Games are partially
observable, an agent can track its belief of the en-
vironment into a knowledge graph as it discovers
Esso, eventually converging to an accurate represen-
tation of the entire game state (Ammanabrolu and
Riedl, 2019).
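A minimal sketch of this belief-graph bookkeeping is given below; the triple extractor is assumed to be supplied externally (e.g., rule-based or OpenIE-style) and the relation format is illustrative.

```python
# Sketch of maintaining a belief knowledge graph as a set of (subject, relation, object)
# triples accumulated from observations; `extract_triples` is a placeholder for any
# extractor (rule-based, OpenIE-style, or learned).

class BeliefGraph:
    def __init__(self):
        self.triples = set()   # e.g., ("player", "in", "kitchen"), ("key", "on", "table")

    def update(self, observation: str, extract_triples):
        for triple in extract_triples(observation):
            self.triples.add(triple)

    def neighbours(self, entity: str):
        # Entities one edge away from `entity`, useful for pruning actions or graph masks.
        outgoing = {o for s, _, o in self.triples if s == entity}
        incoming = {s for s, _, o in self.triples if o == entity}
        return outgoing | incoming
```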
In contrast to methods that encode their entire KG into a single vector (as shown in Ammanabrolu et al., 2020, and Ammanabrolu et al., 2020a), Xu et al. (2020) suggest an intuitive approach of using multiple sub-graphs with different semantic meanings for multi-step reasoning. Previous approaches have relied on predefined rules and Stanford's Open Information Extraction (Angeli et al., 2015) for deriving information from observations for KG construction. Adhikari et al. (2020) have instead built an agent that is capable of designing and updating its belief graph without supervision.
Guo et al. (2020) re-frame the Text Games problem by considering observations as passages in a multi-passage reading comprehension task. They use object-centric past-observation retrieval to enhance current state representations with relevant past information and then apply attention to draw focus to correlations for action-value prediction.
2.4 Handicaps
With the challenges introduced, we may now
consider the limitations that can be imposed by
the generation tools to reduce the complexity of
each problem. These handicaps are typically used
to limit the scope of the challenges being faced at
any one time for more rigorous comparisons and
for simplicity.
It has been noted that an explicit advantage of TextWorld's (Côté et al., 2018) generative functionality is that it can be used to focus on a desired subset of challenges. For example, the size of the state space can be controlled, as can how many commands are required in order to reach the goal. Evaluation of specific generalizability measures can also be improved by controlling the training vs testing variations.
The partial observability of the state can also be controlled by augmenting the agent's observations. It is possible for the environment to provide
all information about the current game state and
therefore reduce the amount an agent must explore
to determine the world, relationships, and objects
contained within.
Inoltre, the complexity of the language
itself can be reduced by restricting the agent’s
vocabulary to in-game words only or the verbs to
only those understood by the parser. The gram-
mar can be further simplified by replacing object
names with symbolic tokens. It is even possible
for the language generation to be avoided com-
pletely by converting every generated game into a
choice-based game where actions at each timestep
are defined by a list of pre-defined commands to
choose from.
Rewards can be simplified with more immediate rewards during training, based on the environment state transitions and the known ground-truth winning policy, rather than simply a sparse reward at the end of a quest as normally provided.
Actions are defined by text commands of at least one word. The interpreter can accept any sequence of characters but will only recognize a tiny subset, and moreover only a fraction of these will change the state of the world. The action space is therefore enormous and so two simplifying assumptions are made (illustrated in the sketch below):
– Word-level Commands are sequences of at most L words taken from a fixed vocabulary V.
– Syntax Commands have the following structure: verb [noun phrase [adverb phrase]], where [. . . ] indicates that the sub-string is optional.
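As a concrete illustration of these two assumptions, the sketch below accepts a command only if it is at most L words long, starts with a verb, and draws every word from the fixed vocabulary V; the word lists and the limit are placeholder values.

```python
# Sketch of the two simplifying assumptions on commands: at most L words drawn from
# a fixed vocabulary V, following a verb [noun phrase [adverb phrase]] shape.
VOCAB = {"take", "open", "go", "red", "key", "door", "east", "carefully"}   # example V
VERBS = {"take", "open", "go"}                                              # example verbs
MAX_WORDS = 4                                                               # example L

def is_valid_command(command: str) -> bool:
    words = command.lower().split()
    if not words or len(words) > MAX_WORDS:
        return False
    if words[0] not in VERBS:                    # commands must start with a verb
        return False
    return all(word in VOCAB for word in words)  # every word must come from V

# is_valid_command("take red key") -> True; is_valid_command("grab key") -> False
```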
Jericho (Hausknecht et al., 2019a) similarly has a set of possible simplifying steps to reduce the environment's complexity. Most notably, each environment provides agents with the set of valid actions in each game's state. It achieves this by executing a candidate action and looking for the resulting changes to the world-object-tree (sketched after the list below). To further reduce the difficulty of the games, optional handicaps can be used:
• Fixed random seed to enforce determinism
• Use of load, save functionality
• Use of game-specific templates and vocabu-
lary
• Use of world object tree as an auxiliary state
representation or method for detection player
location and objects
• Use of world-change-detection to identify
valid actions
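The last handicap, world-change detection, can be sketched as follows: each candidate action is executed from a saved state and kept only if it alters the underlying world. The `save_state`/`restore_state`/`world_changed` methods below are hypothetical stand-ins for the facilities a Jericho-like engine exposes, not its exact API.

```python
# Sketch of valid-action detection via world-change detection: try each candidate
# from the current state and keep those that alter the underlying game state.
# The env methods used here are hypothetical stand-ins, not a specific library's API.

def detect_valid_actions(env, candidates):
    valid = []
    snapshot = env.save_state()
    for action in candidates:
        env.step(action)
        if env.world_changed():          # did the world-object-tree change?
            valid.append(action)
        env.restore_state(snapshot)      # roll back before trying the next candidate
    return valid
```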
Name | Task description
Zork I (Anderson et al., 1980) | Collect the Twenty Treasures of Zork
Treasure Hunter (Côté et al., 2018) | Collect varying items in a home
Coin Collector (Yuan et al., 2018) | Collect a coin in a home
FTWP/CookingWorld (Trischler et al., 2019) | Select and combine varying items in a kitchen
Jericho's SoG (Hausknecht et al., 2019a) | A set of classical Text Games
QAit (Yuan et al., 2019) | Three QA-type settings
TW-Home (Ammanabrolu and Riedl, 2019) | Find varying objects in a home
TW-Commonsense (Murugesan et al., 2020a) | Collect and move varying items in a home
TW-Cook (Adhikari et al., 2020) | Gather and process cooking ingredients
TextWorld KG (Zelinka et al., 2019) | Constructing KGs from Text Game observations
ClubFloyd (Yao et al., 2020) | Human playthroughs of various Text Games
Jericho-QA (Ammanabrolu et al., 2020a) | QA from context strings

Table 1: Environments for Textual Games and their tasks. The original table additionally reports, per environment: the evaluation setting ((S)ingle, (J)oint, (ZS) Zero-Shot); GEN, engine support for generation of new games; #Diffic., the number of difficulty settings; the maximum numbers of rooms and objects per game; |AVS|, the size of the action-verb space (NS = not specified); len(ot), the mean number of tokens in the observation ot; |V|, the vocabulary size; and max |quest|, the length of the optimal trajectory.
Jericho also introduces a set of possible restrictions to the action space.
• Template-based action spaces separate the problem of picking actions into two: (i) picking a template (e.g., ''take ___ from ___''); (ii) filling the template with objects (e.g., ''apple'', ''fridge''). Essentially this reduces the issue to verb selection and contextualised entity extraction.
• Parser-based action spaces require the agent to generate a command word-per-word, sometimes following pre-specified structures similar to (verb, object 1, modifier, object 2).
• Choice-based action spaces require agents to rank a predefined set of actions without any option for ''creativity'' from the model itself.
Lastly, the observation space may be enhanced with the outputs of bonus commands such as ''look'' and ''inventory''. These are commands that the agent can issue on its own, but they are not counted as actual steps in the exploration process, which could otherwise be costly and produce risks in the real world.
3 Benchmark Environments and Agents
Thus far, the majority of researchers have independently generated their own environments with TextWorld (Côté et al., 2018). As the field moves towards more uniformity in evaluation, a clear overview of which environments have already been generated, their design goals, and the benchmark approaches is needed.
Tavolo 1 shows the publicly available environ-
ments and datasets. Much of the recent research
has been published within the past few years
(2018–2020). Jericho's Suite of Games (SoG) (Hausknecht et al., 2019a) is a collection of 52 games and therefore has its results averaged across all games included in evaluation.
We find that many of the environments focus
on exploring increasingly complex environments
to ‘collect’ an item of some kind ranging from
a simple coin to household objects. This is due
to the TextWorld generator’s well defined logic
rules for such objects but can also be a limitation
on the possible scope of the environment.
When evaluating an agent's performance, three types of evaluation settings are typically considered:
• Single Game settings evaluate agents in the same game under the same conditions,
• Joint settings evaluate agents trained on the same set of games that typically share some similarities in the states seen,
• Zero-Shot settings evaluate agents on games completely unseen in training.
The difficulty settings of the environments are
defined by the complexity of the challenges that
the agents are required to overcome. In most
cases, these have been limited to just a few levels.
However, CookingWorld defines the challenge by a set of key variables that can be changed separately or in combination and therefore does not offer clear discrete difficulty settings.
The max number of rooms and objects depends heavily on the type of task. Coin Collector, for example, has only 1 object to find, as the complexity comes from traversing a number of rooms. Alternatively, CookingWorld has a limited number of rooms and instead focuses on a large number of objects to consider.
Figure 2: Overview of the architecture structure of agents applied to a simple Text Game example.
Name | Encoder | Action Selector | Tasks
Ammanabrolu et al. (2020a) | GRU | A2C | Zork1
Xu et al. (2020) | GRU | A2C | JSoG
Ammanabrolu and Hausknecht (2020) | GRU | A2C | JSoG
Murugesan et al. (2020b) | GRU | A2C | TW-Commonsense
Yao et al. (2020) | GRU | DRRN | JSoG
Adolphs and Hofmann (2020) | Bi-GRU | A2C | CW
Guo et al. (2020) | Bi-GRU | DQN | JSoG
Xu et al. (2020) | TF | DRRN | JSoG
Yin and May (2020) | TF, LSTM | DSQN | CW, TH
Adhikari et al. (2020) | R-GCN, TF | DDQN | TW-Cook
Zahavy et al. (2018) | CNN | DQN | Zork1
Ammanabrolu and Riedl (2019) | LSTM | DQN | TW-Home
He et al. (2016) | BoW | DRRN | other
Narasimhan et al. (2015) | LSTM | DQN | other
Yin et al. (2020) | BERT | DQN | FTWP, TH
Madotto et al. (2020) | LSTM | Seq2Seq | CC, CW

Table 2: Overview of recent architectural trends, listing each agent's state/action encoder, action selector, and evaluation tasks. The original table additionally reports, per agent, the use of a knowledge graph (KG; DL: dynamically learned, CS: commonsense), pretrained Transformers (PTF), pretraining (TS: task specific), and the action space (AS; TB: template-based, PB: parser-based, CB: choice-based).
Zork is naturally the most complex in this regard, as an adaptation of a game designed for human players. Likewise, the complexity of the vocabulary depends heavily on the task, but it is clear that the environments limit the number of action verbs for simplicity.
3.1 Agent Architectures
A Deep-RL agent's architecture (see Figure 2) consists of two core components: (i) a state encoder and (ii) an action scorer (Mnih et al., 2013). The first is used to encode game information such as observations, game feedback, and KG representations into state approximations. The encoded information is then used by an action selection agent to estimate the value of actions in each state.
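A stripped-down sketch of this two-component design is given below, pairing a GRU observation encoder with a scorer over candidate action encodings (in the spirit of DRRN-style agents); the dimensions, tokenisation, and scoring function are placeholder choices rather than any published architecture.

```python
import torch
import torch.nn as nn

# Sketch of the two-component agent: (i) a state encoder over observation tokens and
# (ii) an action scorer that values each candidate command. All sizes are illustrative.

class TextGameAgent(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.obs_encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.act_encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, obs_ids, candidate_action_ids):
        # obs_ids: (1, T_obs) token ids; candidate_action_ids: list of (1, T_act) tensors.
        _, obs_h = self.obs_encoder(self.embed(obs_ids))        # (1, 1, H) state summary
        scores = []
        for act_ids in candidate_action_ids:
            _, act_h = self.act_encoder(self.embed(act_ids))    # (1, 1, H) action summary
            scores.append((obs_h * act_h).sum())                # Q(o, a) as a dot product
        return torch.stack(scores)                              # one value per candidate
```

The chosen action would then be the highest-scoring (or sampled) candidate, trained with a DQN- or A2C-style objective as in the works compared in Table 2.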
Table 2 provides an overview of recent architectural trends for comparison. We find that the initial papers in 2015 used the standard approaches of LSTM or Bag of Words for the encoder and a Deep Q-Network (DQN) for the action selector. More recent developments have been experimenting with both parts towards improved results. Notably, a range of approaches for the encoding have been introduced, with Gated Recurrent Units (GRUs) having become the most common in 2020. Note that there have been fewer variations in the choice of action selector, where either Actor-Critic (A2C) or a DQN is typically used. Furthermore, the use of KGs and pre-trained Transformers is limited, and many of the works that use these were published in 2020.
Most of the agents were applied to either Text-
World/CookingWorld or Jericho. We typically
find that alternative environments align to consistent setups; either the authors create a new game that mimics simple real-world rules (similar to TextWorld) or they apply their methods to well-known pre-existing games (similar to Jericho). Specifically, Narasimhan et al. (2015) used two self-generated games: 'Home World', which mimics the environment of a typical house, and 'Fantasy World', which is more challenging and akin to a role-playing game. Alternatively, He et al. (2016) used a deterministic Text Game, 'Saving John', and a larger-scale stochastic Text Game, 'Machine of Death', both pre-existing from a public library.
Encoders used include both simplistic state encodings in the form of Bag of Words (BoW) and recurrent modules such as the GRU (Cho et al., 2014), Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997), Transformer (TF) (Vaswani et al., 2017), and Relational Graph Convolutional Network (R-GCN) (Schlichtkrull et al., 2018). Recently, Advantage Actor-Critic (A2C) (Mnih et al., 2016) has gained popularity as the action selection method, alongside variants of the Deep Q-Network (Mnih et al., 2013) such as the Deep Reinforcement Relevance Network (DRRN) (He et al., 2016), Double DQN (DDQN) (Hasselt et al., 2016), and Deep Siamese Q-Network (DSQN) (Yin and May, 2020).
Task-Specific Pre-Training entails heuristically establishing a setting in which a submodule learns priors before interacting with training data. For example, Adolphs and Hofmann (2020) pretrain on a collection of food items to improve generalizability to unseen objects, and Adhikari et al. (2020) pretrain a KG constructor on trajectories of games similar to those the agent is trained on. Chaudhury et al. (2020) showed that training an agent on a pruned observation space, where the semantically least relevant tokens are removed in each episode, improves generalizability to unseen domains whilst also improving sample efficiency due to requiring fewer training games. Furthermore, Jain et al. (2020) propose learning different action-value functions for all possible scores in a game, thus effectively learning separate value functions for each subtask of a whole.
Lastly, the action-space typically varies be-
tween template- or choice-based depending on the
type of task. Only two papers have considered a
parser based approach: Narasimhan et al. (2015)
and Madotto et al. (2020).
3.2 Benchmark Results
The following section summarizes some of the
results published thus far as a means to review
the performance of the baselines as they are pre-
sented by the original papers to be used for future
comparisons.
Treasure Hunter was introduced by Côté et al. (2018) as part of the TextWorld formalization, inspired by a classic problem of navigating a maze to find a specific object. The agent and objects are placed in a randomly generated map. A coloured object near the agent's start location provides an indicator, given in the welcome message, of which object to obtain. A straightforward reward signal is defined as positive for obtaining the correct object and negative for an incorrect object, with a limited number of turns available.
Increasing difficulties are defined by the number of rooms, quest length, and number of locked doors and containers. Levels 1 to 10 have only 5 rooms, no doors, and an increasing quest length from 1 to 5. Levels 11 to 20 have 10 rooms, include doors and containers that may need to be opened, and quest length increasing from 2 to 10. Lastly, levels 21 to 30 have 20 rooms, locked doors and containers that may need to be unlocked and opened, and quest length increasing from 3 to 20.
For the evaluation, two state-of-the-art agents
(BYU [Fulda et al., 2017] and Golovin [Kostka
et al., 2017]) were compared to a choice-based
random agent in a zero-shot evaluation setting.
For completeness, each was applied to 100 gen-
erated games at varying difficulty levels up to a
maximum of 1,000 steps. The results of this are
shown in Table 3, where we note that the random agent performs best, but this is due to the choice-based method removing the complexity of the compositional properties of language. The authors noted that it may not be directly comparable to the other agents but provides an indication of the difficulty of the tasks.
CookingWorld, the second well-established environment, developed using TextWorld by Trischler et al. (2019), is used in joint and zero-shot settings, which both enable testing for generalization. The former entails training and evaluating the agent on the same set of games, while the latter uses an unseen test set upon evaluation. The fixed dataset provides 4,400 training, 222 validation, and 514 test games with 222 different types of games in various game difficulty settings; the current best results are shown in Table 4.
Difficulty | Random Avg. Score | Random Avg. Steps | BYU Avg. Score | BYU Avg. Steps | Golovin Avg. Score | Golovin Avg. Steps
level 1 | 0.35 | 9.85 | 0.75 | 85.18 | 0.78 | 18.16
level 5 | −0.16 | 19.43 | −0.33 | 988.72 | −0.35 | 135.67
level 10 | −0.14 | 20.74 | −0.04 | 1000 | −0.05 | 609.16
level 11 | 0.30 | 43.75 | 0.02 | 992.10 | 0.04 | 830.45
level 15 | 0.27 | 63.78 | 0.01 | 998 | 0.03 | 874.32
level 20 | 0.21 | 74.80 | 0.02 | 962.27 | 0.04 | 907.67
level 21 | 0.39 | 91.15 | 0.04 | 952.78 | 0.09 | 928.83
level 25 | 0.26 | 101.67 | 0.00 | 974.14 | 0.04 | 931.57
level 30 | 0.26 | 108.38 | 0.04 | 927.37 | 0.74 | 918.88

Table 3: Results of agents applied to Treasure Hunter's one-life tasks (Côté et al., 2018).
Metric | DSQN | LeDeepChef | Go-Explore Seq2Seq | BERT-NLU-SE | LSTM-DQN | DRRN
Single | – | – | 88.1% | – | 52.1% | 80.9%
Joint | – | – | 56.2% | – | 3.1% | 22.9%
Zero-Shot | 58% | 69.3% | 51% | 77% | 2% | 22.2%
Treasure Hunter (OoD) | 42% | – | – | 57% | – | –
% won games (Zero-Shot) | – | – | 46.6% | 71% | 3.2% | 8.3%
avg #steps (Zero-Shot) | –/100 | 43.9/100 | 24.3/50 | –/100 | 48.5/50 | 38.8/50
#training steps | 10^7 | 13,200 | NS | Tchr: 5 × 10^5, Stdn: 10^7 | NS | NS
Admissible Actions | NS | N | N | Y | Y | Y
Curriculum Learning | N | N | Y | Y | N | N
Imitation Learning | N | N | Y | Y | N | N
Training time | NS | NS | NS | Tchr: 35 days, Stdn: 5 days | NS | NS

Table 4: Results on the public CookingWorld dataset. OoD, out of domain; NS, not specified.
However, there are some variances in the split of training and test games, making it challenging to compare the results. Most models were trained using the games split of 4,400/514 specified before, but some split it as 3,960/440 and some do not state the split at all. Table 4 shows the results of this comparison with a percentage score relative to the maximum reward.
Jericho’s Suite of Games was defined by
Hausknecht et al. (2019UN) and has been used for
testing agents in a single game setting. The suite
provides 57 games, of which a variation of 32 are
used due to excessive complexity, therefore the
agents are initialized, trained, and evaluated on
each of them separately. Agents are assessed over
the final score, che è, by convention, averaged
over the last 100 episodes (which in turn is usually
averaged over 5 agents with different initializa-
tion seeds). The current state of the art results on
the Jericho game-set can be seen in Table 5 Dove
numeric results are shown relative to the Max-
imum Reward with the aggregated average per-
centage shown in the final row.2
The difficulty of each game has been summarized in full by Hausknecht et al. (2019a) but is not included within this paper due to its size. The authors categorize difficulty with the features of template action space size, solution length, average steps per reward, stochastic, dialog, darkness, nonstandard actions, and inventory limit. However, it has been noted that some games have
ever, it has been noted that some games have
specific challenges not captured by these alone
that make them more difficult. For example ‘‘9:05
poses a difficult exploration problem as this game
features only a single terminal reward indicating
success or failure at the end of the episode’’.
4 Opportunities for Future Work
The previous sections summarized the generation tools for Text Games, the subsequent environments created thus far, and the architectures posed as solutions to these problems.
2 NAIL is a rule-based agent and is noted to emphasize the capabilities of learning-based models.
Table 5: The current state-of-the-art results on the Jericho game-set, comparing CALM-DRRN, SHA-KG, Trans-v-DRRN, MPRC-DQN, KG-A2C, TDQN, DRRN, and NAIL per game (905, acorncourt, advent, advland, afflicted, anchor, awaken, balances, deephome, detective, dragon, enchanter, gold, inhumane, jewel, karn, library, ludicorp, moonlit, omniquest, pentari, reverb, snacktime, sorcerer, spellbrkr, spirit, temple, tryst205, yomomma, zenon, zork1, zork3, ztuu), along with each method's handicaps, training steps, and training time where stated. Blue: likely to be solved in the near future. Orange: progress is likely, but significant scientific progress is necessary to solve. Red: very difficult even for humans, unthinkable for current RL. MaxR, maximum possible reward; |T|, number of templates (e.g., put the ___ in ___); |V|, size of vocabulary set.
The clearest improvement opportunities come from approaches that directly address the challenges that are yet to be well solved, which are given in detail alongside the current benchmark environments and solutions in Section 2.3.
The primary challenge of partial observability
exists in all Text Games, with some solutions using
handcrafted reward functions (Yuan et al., 2018)
or knowledge graphs to leverage past information
(Ammanabrolu and Hausknecht, 2020). Addition-
alleato, many recent works address this challenge
in their agent architecture by utilizing trans-
former models pre-trained on natural language
corpora to introduce background knowledge, così
reducing the amount that needs to be observed
directly. This has been supported by work pub-
lished in the NLP community with methods such
as BERT (Yin et al., 2020), GPT-2 (Yao et al.,
2020), and ALBERT (Ammanabrolu et al., 2020UN)
and continued developments from this community
will support the advancements of future agent’s
architectures.
One of the most interesting challenges is how
agents learn to use commonsense reasoning about
the problems and whether this can be used to gen-
eralize across tasks, other games, and even into the
real world. Per esempio, an agent that can learn
how a key is used in a Text Game setting would
have reasoning beyond any agent simply trained
to color match keys based on reward signals alone
(e.g., a robot using visual inputs). This has been
noted as an important part of the planning stage
and could further take advantage of work on dyna-
mically learned knowledge graphs that have be-
come common in recent works (such as Das et al.,
2019). Specifically, this enables the use of pre-
trained knowledge graphs on readily available text
corpora alongside the commonsense understand-
ing of previously seen tasks before training on a
new environment on which it may be expensive
to collect data (e.g., implementing a robot into a new user's home). However, language acquisition in these environments is still an open problem, but with developments on Urbanek et al.'s (2019) environment, specifically designed for grounding language, this will become closer to being solved.
Another benefit of language understanding and knowledge graphs (particularly in the planning stage) is that they can be used for interpretability. This is not a challenge unique to Text Games or Reinforcement Learning, as calls for research have been made in similar machine learning domains (Chu et al., 2020). A notable methodology
that is worth considering is Local Interpretable
Model-Agnostic Explanations (LIME), introduced
by Ribeiro et al. (2016). Recent work has been
published that specifically considers interpretabil-
ity for RL (Peng et al., 2021) whereby the authors
designed their agent to ‘think out loud’ as it was
making decisions on Text Games and evaluated
with human participants.
Lastly, from the contributions analyzed in this survey, only 5 papers report the amount of time and resources their methods needed for training. Continuing the trend of publishing these specifications is essential in making results more reproducible and applicable in applied settings. Reducing the amount of resources and training time required as a primary motive allows for the domain to be practical as a solution and also accessible, given that not all problems allow for unlimited training samples.
5 Conclusion
Many of the challenges faced within Text Games
also exist in other language-based Reinforcement
Learning problems, highlighted by Luketina et al.
(2019). Likewise, these challenges also exist in
‘real-world’ settings noted by Dulac-Arnold et al.
(2019) including: limited data, safety constraints,
and/or a need for interpretability.
We find that most Text Games methodologies
focus on either an agent’s ability to learn effi-
ciently or generalize knowledge. A continued de-
velopment in this direction is the primary motive
for much of the recent research and could lead
to solutions that work sufficiently well on these
real-world problem settings for the primary chal-
lenge of limited and/or expensive data. Likewise,
specific work on interpretability (such as that of
Peng et al., 2021) and language acquisition may
transfer to problems that interact in the real-world
directly with humans.
To this end, this study has summarized recent research developments for interactive fiction games that have generated a set of environments and tested their architectures for generalization and overfitting capabilities. With the comparisons
made in this work and a uniform set of envi-
ronments and baselines, new architectures can be
developed and then systematically evaluated for
improving results within Text Games and subse-
quently other Reinforcement Learning problems
that include language.
References
Ashutosh Adhikari, Xingdi Yuan, Marc-Alexandre Côté, Mikuláš Zelinka, Marc-Antoine Rondeau, Romain Laroche, Pascal Poupart, Jian Tang, Adam Trischler, and Will Hamilton. 2020. Learning dynamic belief graphs to generalize on text-based games. Advances in Neural Information Processing Systems, 33.

Leonard Adolphs and T. Hofmann. 2020. LeDeepChef: Deep reinforcement learning agent for families of text-based games. In AAAI. https://doi.org/10.1609/aaai.v34i05.6228

Prithviraj Ammanabrolu and Matthew J. Hausknecht. 2020. Graph constrained reinforcement learning for natural language action spaces. International Conference on Learning Representations.

Prithviraj Ammanabrolu and Mark O. Riedl. 2019. Playing text-adventure games with graph-based deep reinforcement learning. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3557–3565. https://doi.org/10.18653/v1/N19-1358

Prithviraj Ammanabrolu, Ethan Tien, Matthew Hausknecht, and Mark O. Riedl. 2020a. How to avoid being eaten by a grue: Structured exploration strategies for textual worlds. ArXiv, abs/2006.07409.

Prithviraj Ammanabrolu, Jack Urbanek, Margaret Li, Arthur Szlam, Tim Rocktäschel, and J. Weston. 2020b. How to motivate your dragon: Teaching goal-driven agents to speak and act in fantasy worlds. ArXiv, abs/2010.00685. https://doi.org/10.18653/v1/2021.naacl-main.64
Tim Anderson, Marc Blank, Bruce Daniels, E
Dave Lebling. 1980. Zork: The great under-
ground empire – part i.
Gabor Angeli, Melvin Jose Johnson Premkumar,
e Christopher D. Equipaggio. 2015. Leveraging
linguistic structure for open domain informa-
tion extraction. In Proceedings of the 53rd
Annual Meeting of the Association for Com-
putational Linguistics and the 7th International
Joint Conference on Natural Language Process-
ing (Volume 1: Documenti lunghi), pages 344–354,
Beijing, China. Associazione per il calcolo
Linguistica. https://doi.org/10.3115
/v1/P15-1034
Subhajit Chaudhury, Daiki Kimura, Kartik
Talamadupula, Michiaki Tatsubori, Asim
Munawar, and Ryuki Tachibana. 2020. Boot-
strapped q-learning with context relevant ob-
servation pruning to generalize in text-based
IL 2020 Contro-
games.
ference on Empirical Methods in Natural
Language Processing, EMNLP 2020, Online,
pages 3002–3008. https://doi.org/10
.18653/v1/2020.emnlp-main.241
Negli Atti di
Kyunghyun Cho, B. V. Merrienboer, C¸ aglar
G¨ulc¸ehre, Dzmitry Bahdanau, Fethi Bougares,
Holger Schwenk, and Yoshua Bengio. 2014.
Learning phrase representations using rnn
encoder-decoder for statistical machine transla-
zione. Empirical Methods in Natural Language
in lavorazione, abs/1406.1078.
Eric Chu, Nabeel Gillani, and Sneha Priscilla
Makini. 2020. Games for fairness and in-
terpretability. In Companion Proceedings of
the Web Conference 2020, WWW ’20,
pages 520–524, New York, NY, USA. Asso-
ciation for Computing Machinery. https://
doi.org/10.1145/3366424.3384374
Marc-Alexandre Cˆot´e,
´Akos K´ad´ar, Xingdi
Yuan, Ben Kybartas, Tavian Barnes, Emery
Fine, J. Moore, Matthew J. Hausknecht, Layla
El Asri, Mahmoud Adada, Wendy Tay, E
Adam Trischler. 2018. Textworld: A learning
environment for text-based games. Workshop
on Computer Games, pages 41–75. https://
doi.org/10.1007/978-3-030-24337-1 3
Rajarshi Das, Tsendsuren Munkhdalai, Xingdi
Yuan, Adam Trischler, and Andrew McCallum.
2019. Building dynamic knowledge graphs
from text using machine reading comprehen-
sion. ICLR.
Gabriel Dulac-Arnold, Daniel Mankowitz, E
Todd Hester. 2019. Challenges of real-world
IL
rinforzo
36th International Conference on Machine
Apprendimento.
apprendimento. Proceedings of
Nancy Fulda, Daniel Ricks, Ben Murdoch, E
David Wingate. 2017. What can you do with
a rock? Affordance extraction via word em-
beddings. arXiv preprint arXiv: 1703.03429.
https://doi.org/10.24963/ijcai.2017
/144
Xiaoxiao Guo, M. Yu, Yupeng Gao, Chuang Gan,
Murray Campbell, and S. Chang. 2020. Interac-
tive fiction game playing as multi-paragraph
reading comprehension with reinforcement
apprendimento. EMNLP.
H. V. Hasselt, UN. Guez, and D. Silver. 2016. Deep
reinforcement learning with double q-learning.
AAAI. https://doi.org/10.1609/aaai
.v30i1.10295
Matthew Hausknecht, Prithviraj Ammanabrolu,
Marc-Alexandre Cˆot´e, and Xingdi Yuan.
2019UN. Interactive fiction games: A colossal
adventure. ArXiv, abs/1909.05398.
Matthew Hausknecht, Ricky Loynd, Greg Yang,
Adith Swaminathan, and Jason D. Williams.
2019B. Nail: A general interactive fiction agent.
arXiv preprint arXiv:1902.04259.
Ji He, Jianshu Chen, Xiaodong He, Jianfeng Gao,
Lihong Li, Li Deng, and Mari Ostendorf. 2016.
learning with a natural
Deep reinforcement
language action space. Association for Com-
Linguistica putazionale (ACL). https://doi
.org/10.18653/v1/P16-1153
Dan Hendrycks, Mantas Mazeika, Andy Zou, Sahil Patel, Christine Zhu, Jesus Navarro, Dawn Song, Bo Li, and Jacob Steinhardt. 2021. What would jiminy cricket do? Towards agents that behave morally. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
Vishal Jain, William Fedus, Hugo Larochelle,
Doina Precup, and Marc G. Bellemare. 2020.
Algorithmic improvements for deep reinforce-
ment learning applied to interactive fiction.
In AAAI, pages 4328–4336. https://doi
.org/10.1609/aaai.v34i04.5857
Bartosz Kostka, Jaroslaw Kwiecieli, Jakub Kowalski, and Pawel Rychlikowski. 2017. Text-based adventures of the Golovin AI agent. Computational Intelligence and Games (CIG), pages 181–188. https://doi.org/10.1109/CIG.2017.8080433
Jelena Luketina, Nantas Nardelli, Gregory Farquhar, Jakob Foerster, Jacob Andreas, Edward Grefenstette, Shimon Whiteson, and Tim Rocktäschel. 2019. A survey of reinforcement learning informed by natural language. arXiv e-prints. https://doi.org/10.24963/ijcai.2019/880
Andrea Madotto, Mahdi Namazifar,
Joost
Huizinga, Piero Molino, Adrien Ecoffet,
Huaixiu Zheng, Alexandros Papangelis, Dian
Yu, Chandra Khatri,
and Gokhan Tur.
2020. Exploration based language learning
for text-based games. IJCAI. https://doi
.org/10.24963/ijcai.2020/207
Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, PMLR 2016, pages 1928–1937.
Volodymyr Mnih, Koray Kavukcuoglu, David
Silver, Alex Graves, Ioannis Antonoglou, Daan
Wierstra, and Martin A. Riedmiller. 2013.
Playing Atari with deep reinforcement learning.
CoRR, abs/1312.5602.
Keerthiram Murugesan, Mattia Atzeni, Pavan Kapanipathi, Pushkar Shukla, Sadhana Kumaravel, Gerald Tesauro, Kartik Talamadupula, Mrinmaya Sachan, and Murray Campbell. 2020a. Text-based RL agents with commonsense knowledge: New challenges, environments and baselines. ArXiv, abs/2010.03790.
Keerthiram Murugesan, Mattia Atzeni, Pushkar
Shukla, Mrinmaya Sachan, Pavan Kapanipathi,
and Kartik Talamadupula. 2020b. Enhanc-
ing text-based reinforcement learning agents
with commonsense knowledge. arXiv preprint
arXiv:2005.00811.
Karthik Narasimhan, Tejas Kulkarni, and Regina
Barzilay. 2015. Language understanding for
text-based games using deep reinforcement
apprendimento. EMNLP. https://doi.org/10
.18653/v1/D15-1001
Xiangyu Peng, Mark O. Riedl, and Prithviraj Ammanabrolu. 2021. Inherently explainable reinforcement learning in natural language. ArXiv, abs/2112.08907.
Marco Tulio Ribeiro, Sameer Singh, and Carlos
Guestrin. 2016. ‘‘why should I trust you?’’:
Explaining the predictions of any classifier.
In Proceedings of the 22nd ACM SIGKDD
International Conference on Knowledge Dis-
covery and Data Mining, San Francisco, CA,
USA, August 13-17, 2016, pages 1135–1144.
https://doi.org/10.1145/2939672
.2939778
Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pages 593–607. Springer. https://doi.org/10.1007/978-3-319-93417-4_38
Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew J. Hausknecht. 2020. Alfworld: Aligning text and embodied environments for interactive learning. ArXiv, abs/2010.03768.
D. Silver, Aja Huang, Chris J. Maddison, A. Guez, L. Sifre, George van den Driessche, Julian
Schrittwieser, Ioannis Antonoglou, Vedavyas
Panneershelvam, Marc Lanctot, S. Dieleman,
Dominik Grewe, John Nham, Nal Kalchbrenner,
Ilya Sutskever, T. Lillicrap, M. Leach, K.
Kavukcuoglu, T. Graepel, and Demis Hassabis.
2016. Mastering the game of go with deep
neural networks and tree search. Nature,
529:484–489. https://doi.org/10.1038
/nature16961
Richard S. Sutton and Andrew G. Barto. 1998.
Introduction to Reinforcement Learning, 1st
edition. MIT Press, Cambridge, MA, USA.
Ruo Yu Tao, Marc-Alexandre Côté, Xingdi Yuan,
and Layla El Asri. 2018. Towards solving
text-based games by producing adaptive action
spazi. ArXiv, abs/1812.00855.
Adam Trischler, Marc-Alexandre Cˆot´e, and Pedro
Lima. 2019. First textworld problems, the com-
petition: Using text-based games to advance
capabilities of ai agents.
Jack Urbanek, Angela Fan, Siddharth Karamcheti, Saachi Jain, Samuel Humeau, Emily Dinan, Tim Rocktäschel, Douwe Kiela, Arthur Szlam, and J. Weston. 2019. Learning to speak and act in a fantasy text adventure game. ArXiv, abs/1903.03094. https://doi.org/10.18653/v1/D19-1062
Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural In-
formation Processing Systems, 30:5998–6008.
Y. Xu, l. Chen, M. Fang, Y. Wang, and C. Zhang.
2020. Deep reinforcement learning with trans-
formers for text adventure games. In 2020 IEEE
Conference on Games (CoG), pages 65–72.
https://doi.org/10.1109/CoG47356
.2020.9231622
Yunqiu Xu, Meng Fang, Ling Chen, Yali Du,
Joey Tianyi Zhou, and Chengqi Zhang. 2020.
Deep reinforcement
learning with stacked
hierarchical attention for text-based games.
Advances in Neural Information Processing
Sistemi, 33.
Shunyu Yao, Rohan Rao, Matthew Hausknecht,
and Karthik Narasimhan. 2020. Keep calm and
explore: Language models for action generation
in text-based games. In Empirical Methods in
Elaborazione del linguaggio naturale (EMNLP).
Xusen Yin and Jonathan May. 2020. Zero-shot
learning of text adventure games with sentence-
level semantics. arXiv preprint arXiv:2004
.02986.
Xusen Yin, R. Weischedel, and Jonathan May.
2020. Learning to generalize for sequential de-
cision making. EMNLP. https://doi.org
/10.18653/v1/2020.findings-emnlp
.273
Xingdi Yuan, Marc-Alexandre Côté, Jie Fu, Zhouhan Lin, Christopher Pal, Yoshua Bengio, and Adam Trischler. 2019. Interactive language learning by question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2796–2813. https://doi.org/10.18653/v1/D19-1280
Xingdi Yuan, Marc-Alexandre Côté, Alessandro
Sordoni, Romain Laroche, Remi Tachet des
Combes, Matthew Hausknecht, and Adam
Trischler. 2018. Counting to explore and gen-
eralize in text-based games. 35th International
Conference on Machine Learning, Exploration
in Reinforcement Learning Workshop.
Tom Zahavy, Matan Haroush, Nadav Merlis,
Daniel J. Mankowitz, and Shie Mannor. 2018.
Learn what not to learn: Action elimination with
deep reinforcement learning. In Advances in
Neural Information Processing Systems 2018,
pages 3562–3573.
Mikuláš Zelinka, Xingdi Yuan, Marc-Alexandre Côté, R. Laroche, and Adam Trischler. 2019.
Building dynamic knowledge graphs from text-
based games. ArXiv, abs/1910.09532.