Fundamentals of Machine Learning
Winter Semester 2018/2019
U. Köthe, Jakob Kruse
ullrich.koethe@iwr.uni-heidelberg.de
Final Project
Reinforcement Learning for Bomberman
Deadline: 25.3.2019
In this year's final project you will use reinforcement learning techniques to train an agent to play the classical game Bomberman.
In our setting, the game is played by up to four agents in discrete, but simultaneous time steps. Your agent can move around, drop bombs or stand still. Crates can be cleared by well-placed bombs and will sometimes drop coins, which you can collect for points. The deciding factor for the final score, however, is to blow up opposing agents, and to avoid getting blown up yourself. To keep things simple, special items and power-ups are not available in this version of the game.
After the project deadline, we will hold a tournament between all trained agents with real prizes. Tournament performance will be a factor in the final grade as well, although the quality of your approach (as described in the report and code) will carry more weight.
Regulations
Submission
Your submission for the final project should consist of the following:

- The agent code to be used in the competition, including all trained weights, in a subdirectory of agent_code (see details below).
- A PDF report of your approach. Aim for a length of about 10 pages per team member and indicate after headings who is responsible for each subsection.
- The URL of a public repository containing your entire code base, which must be mentioned in the report.

Solutions not involving machine learning will be rejected. Zip all files into a single archive with naming convention (sorted alphabetically by last names)

lastname1-firstname1_lastname2-firstname2_final-project.zip

or (if you work in a team of three)

lastname1-firstname1_lastname2-firstname2_lastname3-firstname3_final-project.zip

and upload it to Moodle before the given deadline.
Development
As will be detailed later on, your agent's code will be run in its own separate process. To not
interfere with interprocess communication, and to share resources fairly between competing agents,
you are not allowed to use multiprocessing in your nal agent. However, multiprocessing during
training is perfectly ne.
We will distinguish classical and neural-network solutions. The choice among the two is entirely up
to you. Neural networks will be executed on the CPU during ocial games, but you may use GPUs
during training. If agent performance between these approaches diers too much, we will play the
tournament in two separate leagues.
Playing around with the provided framework is allowed, and may in fact be necessary to facilitate fast training of your agent. However, be aware that the final agent code will simply be plugged into our original version of the framework; any changes you made to other parts will not be present during games.
Discussions with other teams are very much encouraged, and trained agents (not training code!) may
be exchanged among teams to serve as sparring partners. Just keep in mind that in the tournament
you will compete for prizes, so you may want to keep your best ideas to yourself :)
The internet provides plenty of material on all aspects of reinforcement learning. Study some of it to learn more about RL and get inspiration. The use of free software libraries (e.g. pytorch) is allowed, if they can be installed from an official repository like pip or conda, but you must not copy-paste any existing solution in whole or in part. Plagiarism will lead to a failed grade.
1 Setup
You can find the framework for this project on Github:

git clone https://github.com/ukoethe/bomberman_rl

It contains the game environment with all APIs you need for reinforcement learning, as well as example code for a simple rule-based agent that does not learn anything. Running main.py from the command line should let you watch a set of games between four of these simple agents. Direct any questions about the framework to Jakob Kruse (jakob.kruse@iwr.uni-heidelberg.de).
As a game engine we are using the pygame package, which you can install into your existing conda environment using pip:

pip install pygame

Other than that, we assume a standard Python 3 installation with numpy, scipy and sklearn. Clearly indicate at the beginning of your report which additional libraries we need to install.
2 General setting and rules of the game
The game is played in discrete steps by one to four agents, who are represented by robot heads in different colors. Because of the random elements, multiple episodes of the game will be played to determine a winner by total score. In one step of a game episode, each robot can either move one tile horizontally or vertically, drop a bomb or wait. Movement is restricted to empty (i.e. black) tiles; stone walls, crates, bombs and other agents block movement. Coins are an exception, as they are collected when moving onto their tile. While the placement of stone walls is the same for every round, the distribution of crates and the hiding places of coins differ each time. Agents always start in one of the board's corners, but it is randomly determined which.
Once a bomb is dropped, it will detonate after four steps and create an explosion that extends three
tiles up, down, left and right (so you can just outrun the explosion). The explosion destroys crates
and agents, but will stop at stone walls and does not reach around corners. It lingers for two time
steps, and agents running into it during that period are still defeated. Agents can only drop a new
bomb after their previous one has exploded.
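If you want to compute which tiles a given bomb threatens, e.g. as an input feature for your agent, the rules above translate directly into code. The following is a small sketch of our own (not part of the framework), assuming the game state's 'arena' array (described in section 4) is indexed as arena[x, y] with -1 marking stone walls:

def blast_zone(arena, bx, by, radius=3):
    # Own helper, not provided by the framework: tiles hit by a bomb at (bx, by).
    # Per the rules above, the blast extends up to three tiles in each direction
    # and only stone walls (arena value -1) stop it.
    tiles = [(bx, by)]
    for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)]:
        for step in range(1, radius + 1):
            x, y = bx + dx * step, by + dy * step
            if arena[x, y] == -1:
                break
            tiles.append((x, y))
    return tiles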
A fixed number of coins is hidden at random positions in each episode. When the crate concealing a coin is blown up, the coin becomes visible and can be collected. Collecting a coin gains the agent one point, while blowing up an opponent is worth five points.
Every episode ends after 400 steps. The agent taking the highest average thinking time during an episode is deducted one point, to encourage efficient implementations. In addition, there is a fixed time limit of 0.5 seconds per step for agents to arrive at their decisions (assume that your agent will have exclusive access to one core of an Intel i7-8700K processor and can use up to 8 GB of RAM when we run the tournament). After this time, tardy agent processes are interrupted and whatever actions they have picked by this time are performed.
The exact numbers used for all these rules can be found in settings.py. They may be subject to change until seven days before the deadline, in case you inform us about major problems with the current settings. You will be notified if any of the rules are adapted.
3 Tasks your agents will have to solve
We are trying this project setting for the first time and don't know how difficult learning of the full game will be. We thus define preliminary tasks to help your method evolve from simple to complex and ease debugging. The tasks are subsets of each other, so an agent that can handle task 3 should also be able to solve 1 and 2. Configuration instructions for these tasks are given in section 5.

1. On a game board without any crates, collect a number of revealed coins as quickly as possible. This task does not require dropping any bombs. The agent should learn how to navigate the board efficiently.

2. On a game board with randomly placed crates, find all hidden coins and collect them within the step limit. The agent must drop bombs to destroy the crates. It should learn how to use bombs without killing itself, while not forgetting efficient navigation.

3. On a game board with crates, hold your own against one or more opposing agents and fight for the highest score.
4 Framework structure and interface for your agent
In reinforcement learning, we typically distinguish between the agent and the environment it has
to interact with. Let us start with the environment here.
Environment
The game world and logic is defined in environment.py in the class BombeRLeWorld. It keeps track of the board and all game objects, can run a step of the game, start a new round and render everything to the screen. What's most interesting for you is that it keeps track of the agents playing the game via objects of the Agent class defined in agents.py. In addition to position, score etc., each Agent object contains a handle to a separate process which will run your custom code. This process is also defined in agents.py as AgentProcess.
Agent
The agent process runs a loop that interacts with the main game loop. Before it enters the loop, it imports your custom code from a file called callbacks.py within a subdirectory of agent_code. Your script must provide two functions which will be called by the agent process at the appropriate times:
The function setup(self) is called once, before the first round starts, to give you a place to initialize everything you need. The self argument is the same as in an instance method: a persistent object that will be passed to all subsequent callbacks as well. That means you can assign values or objects to it which you may need later on:
def setup(self):
    self.model = MyModel()
    ...
    self.model.set_initial_guess(x)
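Since all trained weights have to ship inside your agent's subdirectory, setup(self) is also the natural place to load them. Below is a minimal sketch of our own; the file name my_model.pkl, the MyModel fallback and the use of pickle are assumptions, not framework requirements:

import os
import pickle

def setup(self):
    # Resolve the path relative to callbacks.py so it does not depend on the
    # working directory; 'my_model.pkl' is a hypothetical file name.
    model_path = os.path.join(os.path.dirname(__file__), 'my_model.pkl')
    if os.path.isfile(model_path):
        with open(model_path, 'rb') as f:
            self.model = pickle.load(f)
        self.logger.info('Loaded trained model')
    else:
        self.model = MyModel()  # placeholder: start untrained
        self.logger.info('No saved model found, starting from scratch')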
The function act(self) is called once every step to give your agent the opportunity to figure out the best next move. The available actions are 'UP', 'DOWN', 'LEFT', 'RIGHT', 'BOMB' and 'WAIT', with the latter being the default. To choose an action, simply assign the respective string to self.next_action within your act(self) implementation:
def act(self):
    ...  # think
    self.next_action = 'RIGHT'
    ...  # think more, if time limit has not been reached
    self.next_action = 'BOMB'  # seems to be better
    ...  # you can update your decision as often as you want
Once your agent is sure about the best decision, simply return from act(self). If act(self) doesn't return within the time limit, the game engine will interrupt its execution. In either case, the current (possibly default) value of self.next_action will be executed in the next step (this strategy is generally called an any-time algorithm).
In order to make informed decisions on what to do, you need to know about the agent's environment. The current state of the game world is stored within each agent before act(self) is called. It is a dictionary, which you can access as self.game_state, and has the following entries:

'step'         The number of steps in the episode so far, starting at 1.
'arena'        A 2D numpy array describing the tiles of the game board. Its entries are 1 for crates, -1 for stone walls and 0 for free tiles.
'self'         A tuple (x, y, n, b) describing your own agent. x and y are its coordinates on the board, n its name and b ∈ {0, 1} a flag indicating if the 'BOMB' action is possible (i.e. no own bomb is currently ticking).
'others'       A list of tuples like the one above for all opponents that are still in the game.
'bombs'        A list of tuples (x, y, t) of coordinates and countdowns for all active bombs.
'explosions'   A 2D numpy array stating, for each tile, for how many steps an explosion will be present. Where there is no explosion, the value is 0.
'coins'        A list of coordinates (x, y) for all currently collectable coins.
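As an illustration of how these entries can be used, here is a small sketch of our own for act(self) that derives a simple feature (the Manhattan distance to the nearest visible coin) from the game state; everything beyond the dictionary keys listed above is our own choice:

def act(self):
    # Read a few entries of the game state (see the table above).
    arena = self.game_state['arena']              # 1 crates, -1 walls, 0 free
    x, y, name, bomb_ready = self.game_state['self']
    coins = self.game_state['coins']              # list of (x, y) tuples

    # Example feature: Manhattan distance to the nearest visible coin.
    if coins:
        nearest = min(abs(cx - x) + abs(cy - y) for cx, cy in coins)
        self.logger.debug(f'Nearest coin is {nearest} tiles away')

    # A real agent would decide based on its model; 'WAIT' is just the default.
    self.next_action = 'WAIT'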
When the game is run in training mode, there are two additional callbacks: reward_update(self) is called once after each but the final step for an agent, i.e. after the actions have been executed and their consequences are known. This information is needed to collect training data and fill an experience buffer. The other callback is end_of_episode(self), which is very similar to the previous one, but only called once per agent after the last step of an episode. This is where most of your learning should take place, as you have knowledge of the whole episode. Note that neither of these functions has a time limit.
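To make this concrete, here is a rough sketch of our own (not prescribed by the framework) in which act(self) remembers the features and action it used, reward_update(self) appends the resulting transition to a buffer, and end_of_episode(self) refits a regression model. extract_features, reward_from_events, self.model and self.experience_buffer are all hypothetical and would be set up in setup(self); self.events is described below:

def reward_update(self):
    # self.last_features and self.last_action must be stored by your own
    # act(self); they are not provided by the framework.
    reward = reward_from_events(self.events)   # hypothetical helper, see below
    self.experience_buffer.append((self.last_features, self.last_action, reward))

def end_of_episode(self):
    # Refit the (hypothetical) regression model on all collected transitions.
    features, actions, rewards = zip(*self.experience_buffer)
    self.model.fit(features, actions, rewards)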
In both callbacks, self.events will hold a list of game events that transpired in the previous step and are relevant to your agent. You can use these as a basis for auxiliary rewards and penalties to speed up training. They will not be available when the game is run outside of training mode. All available events are stored in the constant e, which you can import from settings.py:

from settings import e

They are defined as follows:
e.MOVED_LEFT             Successfully moved one tile to the left.
e.MOVED_RIGHT            Successfully moved one tile to the right.
e.MOVED_UP               Successfully moved one tile up.
e.MOVED_DOWN             Successfully moved one tile down.
e.WAITED                 Intentionally didn't act at all.
e.INTERRUPTED            Got interrupted for taking too much time.
e.INVALID_ACTION         Picked a non-existent action or one that couldn't be executed.
e.BOMB_DROPPED           Successfully dropped a bomb.
e.BOMB_EXPLODED          Own bomb dropped earlier on has exploded.
e.CRATE_DESTROYED        A crate was destroyed by own bomb.
e.COIN_FOUND             A coin has been revealed by own bomb.
e.COIN_COLLECTED         Collected a coin.
e.KILLED_OPPONENT        Blew up an opponent.
e.KILLED_SELF            Blew up self.
e.GOT_KILLED             Got blown up by an opponent's bomb.
e.OPPONENT_ELIMINATED    Opponent got blown up by someone else.
e.SURVIVED_ROUND         End of round reached and agent still alive.
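For instance, the hypothetical reward_from_events helper used in the earlier sketch could map these events to auxiliary rewards; the numerical values below are made up and only loosely follow the scoring rules (one point per coin, five per opponent):

from settings import e

def reward_from_events(events):
    # Own sketch of auxiliary reward shaping; the values are examples to tune.
    rewards = {
        e.COIN_COLLECTED: 1.0,
        e.KILLED_OPPONENT: 5.0,
        e.CRATE_DESTROYED: 0.1,
        e.INVALID_ACTION: -0.1,
        e.KILLED_SELF: -5.0,
        e.GOT_KILLED: -5.0,
    }
    return sum(rewards.get(event, 0.0) for event in events)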
5 Putting it all together
In order to train a new agent, you must create a subdirectory within agent_code with your agent's name; this name will also identify your agent during the tournament. Let's go with my_agent as an example. Within agent_code/my_agent/, you put your script callbacks.py as described above. Other custom files, such as trained model parameters, must also be stored in this directory!
You can put the agent into the game by passing its name to the game world's constructor in main.py. To have it play against two instances of our rule-based example agent and one agent that chooses random actions, change the code like this:
world = BombeRLeWorld([
    ('my_agent', True),
    ('simple_agent', False),
    ('simple_agent', False),
    ('random_agent', False)
], save_replay=False)
The Boolean values after each agent name indicate whether this agent will be run in training mode, i.e. whether its reward_update(self) and end_of_episode(self) functions will be called. You can also specify several agents of your own, or even the same agent multiple times, in order to train your agents by self-play. If you choose this strategy, it may be helpful to add a random element to the choice of actions to avoid quick convergence to a bad local optimum.
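One common way to add such a random element is epsilon-greedy action selection. A brief sketch of our own; predict_best is a hypothetical method of your model, and the exploration rate epsilon would typically be decayed over the course of training:

import numpy as np

ACTIONS = ['UP', 'DOWN', 'LEFT', 'RIGHT', 'BOMB', 'WAIT']

def choose_action(agent, epsilon=0.1):
    # With probability epsilon pick a uniformly random action (exploration),
    # otherwise follow the current model (exploitation).
    if np.random.rand() < epsilon:
        return np.random.choice(ACTIONS)
    return agent.model.predict_best(agent.game_state)

Inside act(self) you would then set self.next_action = choose_action(self).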
For efficient training, you should of course turn off the graphical user interface. To do so, simply change the value of 'gui' in settings.py to False. The game will then skip the rendering and run as fast as the agents can make decisions. Many of the other settings should not be changed, but you can safely play around with the following:
'update_interval'           Minimum time for a step of the game, in seconds.
'turn_based'                If True, wait for a key press before executing each step.
'n_rounds'                  Number of episodes to play in one session.
'save_replay'               If True, record the game to a file in replays/ (see below).
'make_video_from_replay'    If True, automatically render a video during a replay.
'crate_density'             What fraction of the board should be occupied by crates.
'max_steps'                 Maximum number of steps per episode.
'stop_if_not_training'      If True, end the episode as soon as all agents running in training mode have been eliminated.
'timeout'                   Maximum time for agents to decide on an action, in seconds.
'log_agent_code'            Logging level for your agent's code, see below.
These settings can be accessed from the code by again importing from settings.py:

from settings import s
print(s.timeout)
To test an agent on task 1 or 2 as outlined in section 3, pass only that agent to the game world's constructor in main.py. For task 1, additionally set 'crate_density' in settings.py to 0.

6 Some hints to get you started
The self object passed to your callback functions comes with logging functionality that you can use to monitor and debug what your agent is up to:

self.logger.info(f"Choosing an action for step {self.game_state['step']}...")
self.logger.debug("This is only logged if s.log_agent_code == logging.DEBUG")

All logged messages will appear in a file named after your agent in your subdirectory. Continuing with the example from earlier, that would be agent_code/my_agent/logs/my_agent.log.
agent_code/simple_agent/ contains code for a rule-based agent that plays the game reasonably well. Look at this agent's act(self) callback to see an example of how reading the game state, logging and choosing an action are done. You can adapt this agent as a training opponent, or even use it to create training data while your own agent is still struggling.
If you have s.save_replay enabled, a new replay file will be saved in replays/ for each episode. To watch it back, change the world initialization in main.py to

world = ReplayWorld('replay_filename_without_extension')

and run the game like normal. You can watch the playback at a step speed and fps different from the original game by changing the values in settings.py. If s.make_video_from_replay is enabled and you have ffmpeg installed with libx264 or libvpx-vp9 codecs, what you see will automatically be encoded as a video file and saved to screenshots/.
There is also an option to control one of the agents via keyboard. To do so, change one of the agents passed to the game world to

('user_agent', False)

and use the arrow keys to move around, Space to drop a bomb and Return to wait. This is clearly not an efficient way to create training data for your agent, but it helps you develop an intuition for the game's behaviour. This might, for example, be useful for designing auxiliary rewards or learning curricula.
7 Report and code repository

Put all your code into a public Github or Bitbucket repository and include the URL in your report. The zip file to be uploaded on Moodle shall only contain the subdirectory of agent_code containing your fully trained player, along with the report.
Your report should consist of roughly ten pages per team member, not counting the title page etc. The first section shall describe the reinforcement learning method and regression model you finally implemented, including all crucial design choices. You may also describe approaches you tried and abandoned later, including the reasons. The second section should describe your training process, including all tricks employed to speed it up (e.g. self-play strategy, design of auxiliary rewards, prioritization of experience replay and so on). The third section shall report experimental results (e.g. training progress diagrams), describe interesting observations, and discuss the difficulties you faced and how you overcame them. The final section shall give an outlook on how you would improve your agent if you had more time, and how we can improve the game setup for next year.