Fundamentals of Machine Learning, Winter Semester 2018/2019
U. Köthe, Jakob Kruse (ullrich.koethe@iwr.uni-heidelberg.de)

Final Project: Reinforcement Learning for Bomberman
Deadline: 25.3.2019

In this year's final project you will use reinforcement learning techniques to train an agent to play the classical game Bomberman. In our setting, the game is played by up to four agents in discrete, but simultaneous time steps. Your agent can move around, drop bombs or stand still. Crates can be cleared by well-placed bombs and will sometimes drop coins, which you can collect for points. The deciding factor for the final score, however, is to blow up opposing agents, and to avoid getting blown up yourself. To keep things simple, special items and power-ups are not available in this version of the game.

After the project deadline, we will hold a tournament between all trained agents with real prizes. Tournament performance will be a factor in the final grade as well, although the quality of your approach (as described in the report and code) will carry more weight.

Regulations

Submission

Your submission for the final project should consist of the following:

- The agent code to be used in the competition, including all trained weights, in a subdirectory of agent_code (see details below).
- A PDF report of your approach. Aim for a length of about 10 pages per team member and indicate after headings who is responsible for each subsection.
- The URL of a public repository containing your entire code base, which must be mentioned in the report.

Solutions not involving machine learning will be rejected. Zip all files into a single archive with the naming convention (sorted alphabetically by last names)

    lastname1-firstname1_lastname2-firstname2_final-project.zip

or (if you work in a team of three)

    lastname1-firstname1_lastname2-firstname2_lastname3-firstname3_final-project.zip

and upload it to Moodle before the given deadline.

Development

As will be detailed later on, your agent's code will be run in its own separate process. To not interfere with interprocess communication, and to share resources fairly between competing agents, you are not allowed to use multiprocessing in your final agent. However, multiprocessing during training is perfectly fine (see the sketch at the end of this subsection).

We will distinguish classical and neural-network solutions. The choice between the two is entirely up to you. Neural networks will be executed on the CPU during official games, but you may use GPUs during training. If agent performance between these approaches differs too much, we will play the tournament in two separate leagues.

Playing around with the provided framework is allowed, and may in fact be necessary to facilitate fast training of your agent. However, be aware that the final agent code will simply be plugged into our original version of the framework; any changes you made to other parts will not be present during games.

Discussions with other teams are very much encouraged, and trained agents (not training code!) may be exchanged among teams to serve as sparring partners. Just keep in mind that in the tournament you will compete for prizes, so you may want to keep your best ideas to yourself :)

The internet provides plenty of material on all aspects of reinforcement learning. Study some of it to learn more about RL and get inspiration. The use of free software libraries (e.g. pytorch) is allowed, if they can be installed from an official repository like pip or conda, but you must not copy-paste any existing solution in whole or in part. Plagiarism will lead to a failed grade.
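Since multiprocessing is only off-limits inside the final agent, one way to speed up training could be to play several headless games in parallel worker processes and merge the collected experience afterwards. The following is only a minimal sketch of that idea; run_training_episode is a hypothetical stand-in for "play one headless game and return its transitions", stubbed out here so the example is self-contained, and is not something the framework provides.

    # Sketch: collecting experience from several training games in parallel.
    import multiprocessing as mp
    import random

    def run_training_episode(seed):
        # Hypothetical placeholder: play one headless game and return its
        # transitions. Here it only produces dummy (state, action, reward) tuples.
        random.seed(seed)
        return [("state", "action", random.random())]

    if __name__ == "__main__":
        with mp.Pool(processes=4) as pool:
            episodes = pool.map(run_training_episode, range(16))
        # Flatten the per-episode lists into one experience buffer.
        buffer = [transition for episode in episodes for transition in episode]
        print(f"Collected {len(buffer)} transitions")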
1 Setup

You can find the framework for this project on Github:

    git clone https://github.com/ukoethe/bomberman_rl

It contains the game environment with all APIs you need for reinforcement learning, as well as example code for a simple rule-based agent that does not learn anything. Running main.py from the command line should let you watch a set of games between four of these simple agents. Direct any questions about the framework to Jakob Kruse (jakob.kruse@iwr.uni-heidelberg.de).

As a game engine we are using the pygame package, which you can install into your existing conda environment using pip:

    pip install pygame

Other than that, we assume a standard Python 3 installation with numpy, scipy and sklearn. Clearly indicate at the beginning of your report which additional libraries we need to install.

2 General setting and rules of the game

The game is played in discrete steps by one to four agents, who are represented by robot heads in different colors. Because of the random elements, multiple episodes of the game will be played to determine a winner by total score.

In one step of a game episode, each robot can either move one tile horizontally or vertically, drop a bomb or wait. Movement is restricted to empty (i.e. black) tiles: stone walls, crates, bombs and other agents block movement. Coins are an exception, as they are collected when moving onto their tile. While the placement of stone walls is the same for every round, the distribution of crates and the hiding places of coins differ each time. Agents always start in one of the board's corners, but it is randomly determined which.

Once a bomb is dropped, it will detonate after four steps and create an explosion that extends three tiles up, down, left and right (so you can just outrun the explosion). The explosion destroys crates and agents, but will stop at stone walls and does not reach around corners. It lingers for two time steps, and agents running into it during that period are still defeated. Agents can only drop a new bomb after their previous one has exploded.

A fixed number of coins is hidden at random positions in each episode. When the crate concealing a coin is blown up, the coin becomes visible and can be collected. Collecting a coin gains the agent one point, while blowing up an opponent is worth five points.

Every episode ends after 400 steps. The agent taking the highest average thinking time during an episode is deducted one point to encourage efficient implementations. In addition, there is a fixed time limit of 0.5 seconds per step for agents to arrive at their decisions. (Assume that your agent will have exclusive access to one core of an Intel i7-8700K processor and can use up to 8 GB of RAM when we run the tournament.) After this time, tardy agent processes are interrupted and whatever actions they have picked by this time are performed.

The exact numbers used for all these rules can be found in settings.py. They may be subject to change until seven days before the deadline, in case you inform us about major problems with the current settings. You will be notified if any of the rules are adapted.
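To make the explosion rules concrete, here is a small illustrative helper (not part of the framework) that computes which tiles a bomb at (x, y) would hit, assuming the arena uses the same encoding as the game state described in section 4 (-1 for stone walls, 0 for free tiles, 1 for crates) and can be indexed as arena[x, y]:

    import numpy as np

    def blast_tiles(arena, x, y, radius=3):
        """Illustrative only: tiles hit by a bomb dropped at (x, y).

        The blast extends up to `radius` tiles in each direction, stops at
        stone walls (-1) and does not reach around corners.
        """
        tiles = [(x, y)]
        for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)]:
            for dist in range(1, radius + 1):
                i, j = x + dx * dist, y + dy * dist
                if arena[i, j] == -1:  # stone wall blocks the blast
                    break
                tiles.append((i, j))
        return tiles

    # Example: a small arena whose border consists of stone walls.
    arena = np.zeros((7, 7), dtype=int)
    arena[0, :] = arena[-1, :] = arena[:, 0] = arena[:, -1] = -1
    print(blast_tiles(arena, 1, 1))

Such a helper can be handy both for hand-crafted features and for checking whether a tile your agent wants to step onto is in danger.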
3 Tasks your agents will have to solve

We are trying this project setting for the first time and don't know how difficult learning the full game will be. We therefore define preliminary tasks to help your method evolve from simple to complex and to ease debugging. The tasks are subsets of each other, so an agent that can handle task 3 should also be able to solve tasks 1 and 2. Configuration instructions for these tasks are given in section 5.

1. On a game board without any crates, collect a number of revealed coins as quickly as possible. This task does not require dropping any bombs. The agent should learn how to navigate the board efficiently.

2. On a game board with randomly placed crates, find all hidden coins and collect them within the step limit. The agent must drop bombs to destroy the crates. It should learn how to use bombs without killing itself, while not forgetting efficient navigation.

3. On a game board with crates, hold your own against one or more opposing agents and fight for the highest score.

4 Framework structure and interface for your agent

In reinforcement learning, we typically distinguish between the agent and the environment it has to interact with. Let us start with the environment here.

Environment

The game world and logic are defined in environment.py in the class BombeRLeWorld. It keeps track of the board and all game objects, can run a step of the game, start a new round and render everything to the screen. What's most interesting for you is that it keeps track of the agents playing the game via objects of the Agent class defined in agents.py. In addition to position, score etc., each Agent object contains a handle to a separate process which will run your custom code. This process is also defined in agents.py as AgentProcess.

Agent

The agent process runs a loop that interacts with the main game loop. Before it enters the loop, it imports your custom code from a file called callbacks.py within a subdirectory of agent_code. Your script must provide two functions which will be called by the agent process at the appropriate times:

The function setup(self) is called once, before the first round starts, to give you a place to initialize everything you need. The self argument is a persistent object that will be passed to all subsequent callbacks as well, just like the self argument of an instance method. That means you can assign values or objects to it which you may need later on:

    def setup(self):
        self.model = MyModel()
        ...
        self.model.set_initial_guess(x)

The function act(self) is called once every step to give your agent the opportunity to figure out the best next move. The available actions are 'UP', 'DOWN', 'LEFT', 'RIGHT', 'BOMB' and 'WAIT', with the latter being the default. To choose an action, simply assign the respective string to self.next_action within your act(self) implementation:

    def act(self):
        ...  # think
        self.next_action = 'RIGHT'
        ...  # think more, if time limit has not been reached
        self.next_action = 'BOMB'  # seems to be better
        ...  # you can update your decision as often as you want

Once your agent is sure about the best decision, simply return from act(self). If act(self) doesn't return within the time limit, the game engine will interrupt its execution. In either case, the current (possibly default) value of self.next_action will be executed in the next step (this strategy is generally called an any-time algorithm).
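As one illustration of this any-time pattern, an act(self) implementation might commit to a cheap default first and only then consult a learned model, based on the self.game_state dictionary described next. In the following sketch, self.model and its score_actions method are hypothetical placeholders you would create yourself in setup(self), and the 10% random exploration is just an illustrative choice for training.

    import random

    ACTIONS = ['UP', 'DOWN', 'LEFT', 'RIGHT', 'BOMB', 'WAIT']

    def act(self):
        # Commit to a cheap, safe fallback immediately in case we get interrupted.
        self.next_action = 'WAIT'

        # Hypothetical model call: one score per action, computed from the game state.
        scores = self.model.score_actions(self.game_state)

        # Occasionally pick a random action to keep exploring during training.
        if random.random() < 0.1:
            self.next_action = random.choice(ACTIONS)
        else:
            self.next_action = max(zip(ACTIONS, scores), key=lambda pair: pair[1])[0]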
In order to make informed decisions on what to do, you need to know about the agent's environment. The current state of the game world is stored within each agent before act(self) is called. It is a dictionary, which you can access as self.game_state, and has the following entries:

    'step'          The number of steps in the episode so far, starting at 1.
    'arena'         A 2D numpy array describing the tiles of the game board. Its entries are 1 for crates, -1 for stone walls and 0 for free tiles.
    'self'          A tuple (x, y, n, b) describing your own agent. x and y are its coordinates on the board, n its name and b ∈ {0, 1} a flag indicating if the 'BOMB' action is possible (i.e. no own bomb is currently ticking).
    'others'        A list of tuples like the one above for all opponents that are still in the game.
    'bombs'         A list of tuples (x, y, t) of coordinates and countdowns for all active bombs.
    'explosions'    A 2D numpy array stating, for each tile, for how many steps an explosion will be present. Where there is no explosion, the value is 0.
    'coins'         A list of coordinates (x, y) for all currently collectable coins.

When the game is run in training mode, there are two additional callbacks: reward_update(self) is called once after each but the final step for an agent, i.e. after the actions have been executed and their consequences are known. This information is needed to collect training data and fill an experience buffer. The other callback is end_of_episode(self), which is very similar to the previous one, but only called once per agent after the last step of an episode. This is where most of your learning should take place, as you have knowledge of the whole episode. Note that neither of these functions has a time limit.

In both callbacks, self.events will hold a list of game events that transpired in the previous step and are relevant to your agent. You can use these as a basis for auxiliary rewards and penalties to speed up training. They will not be available when the game is run outside of training mode. All available events are stored in the constant e defined in settings.py, which you can import:

    from settings import e

They are defined as follows:

    e.MOVED_LEFT             Successfully moved one tile to the left.
    e.MOVED_RIGHT            Successfully moved one tile to the right.
    e.MOVED_UP               Successfully moved one tile up.
    e.MOVED_DOWN             Successfully moved one tile down.
    e.WAITED                 Intentionally didn't act at all.
    e.INTERRUPTED            Got interrupted for taking too much time.
    e.INVALID_ACTION         Picked a non-existent action or one that couldn't be executed.
    e.BOMB_DROPPED           Successfully dropped a bomb.
    e.BOMB_EXPLODED          Own bomb dropped earlier on has exploded.
    e.CRATE_DESTROYED        A crate was destroyed by own bomb.
    e.COIN_FOUND             A coin has been revealed by own bomb.
    e.COIN_COLLECTED         Collected a coin.
    e.KILLED_OPPONENT        Blew up an opponent.
    e.KILLED_SELF            Blew up self.
    e.GOT_KILLED             Got blown up by an opponent's bomb.
    e.OPPONENT_ELIMINATED    Opponent got blown up by someone else.
    e.SURVIVED_ROUND         End of round reached and agent still alive.
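These events lend themselves to simple auxiliary reward shaping. The sketch below translates a few of them into numeric rewards inside reward_update(self) and stores a transition for later learning. The reward values are arbitrary illustrative choices, and self.transitions, self.last_state and self.last_action are bookkeeping attributes you would have to initialize and update yourself (e.g. in setup(self) and act(self)); they are not provided by the framework.

    from settings import e

    # Illustrative auxiliary rewards: tune these for your own training setup.
    REWARDS = {
        e.COIN_COLLECTED: 1.0,
        e.KILLED_OPPONENT: 5.0,
        e.CRATE_DESTROYED: 0.1,
        e.INVALID_ACTION: -0.5,
        e.KILLED_SELF: -5.0,
    }

    def reward_update(self):
        # Sum the rewards for all events that occurred in the previous step.
        reward = sum(REWARDS.get(event, 0.0) for event in self.events)

        # Append a transition (old state, action taken, reward, new state) to a
        # buffer kept on self, then remember the current state for the next step.
        self.transitions.append((self.last_state, self.last_action, reward, self.game_state))
        self.last_state = self.game_state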
5 Putting it all together

In order to train a new agent, you must create a subdirectory within agent_code with your agent's name; this name will also identify your agent during the tournament. Let's go with my_agent as an example. Within agent_code/my_agent/, you put your script callbacks.py as described above. Other custom files, such as trained model parameters, must also be stored in this directory!

You can put the agent into the game by passing its name to the game world's constructor in main.py. To have it play against two instances of our rule-based example agent and one agent that chooses random actions, change the code like this:

    world = BombeRLeWorld([
        ('my_agent', True),
        ('simple_agent', False),
        ('simple_agent', False),
        ('random_agent', False)
    ], save_replay=False)

The Boolean values after each agent name indicate whether this agent will be run in training mode, i.e. whether its reward_update(self) and end_of_episode(self) functions will be called. You can also specify several agents of your own, or even the same agent multiple times, in order to train your agents by self-play. If you choose this strategy, it may be helpful to add a random element to the choice of actions to avoid quick convergence to a bad local optimum.

For efficient training, you should of course turn off the graphical user interface. To do so, simply change the value of 'gui' in settings.py to False. The game will then skip the rendering and run as fast as the agents can make decisions. Many of the other settings should not be changed, but you can safely play around with the following:

    'update_interval'           Minimum time for a step of the game, in seconds.
    'turn_based'                If True, wait for a key press before executing each step.
    'n_rounds'                  Number of episodes to play in one session.
    'save_replay'               If True, record the game to a file in replays/ (see below).
    'make_video_from_replay'    If True, automatically render a video during a replay.
    'crate_density'             What fraction of the board should be occupied by crates.
    'max_steps'                 Maximum number of steps per episode.
    'stop_if_not_training'      If True, end the episode as soon as all agents running in training mode have been eliminated.
    'timeout'                   Maximum time for agents to decide on an action, in seconds.
    'log_agent_code'            Logging level for your agent's code, see below.

These settings can be accessed from the code by again importing from settings.py:

    from settings import s
    print(s.timeout)

To test an agent on tasks 1 or 2 outlined in section 3, pass only that agent to the game world's constructor in main.py. For task 1, additionally set 'crate_density' in settings.py to 0.
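For instance, continuing with the my_agent example above, a task 1 training setup might look like this (with 'crate_density' set to 0 in settings.py, as just described):

    world = BombeRLeWorld([
        ('my_agent', True)
    ], save_replay=False)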
6 Some hints to get you started

The self object passed to your callback functions comes with logging functionality that you can use to monitor and debug what your agent is up to:

    self.logger.info(f"Choosing an action for step {self.game_state['step']}...")
    self.logger.debug("This is only logged if s.log_agent_code == logging.DEBUG")

All logged messages will appear in a file named after your agent in your subdirectory. Continuing with the example from earlier, that would be agent_code/my_agent/logs/my_agent.log.

agent_code/simple_agent/ contains code for a rule-based agent that plays the game reasonably well. Look at this agent's act(self) callback to see an example of how reading the game state, logging and choosing an action are done. You can adapt this agent as a training opponent, or even use it to create training data while your own agent is still struggling.

If you have s.save_replay enabled, a new replay file will be saved in replays/ for each episode. To watch it back, change the world initialization in main.py to

    world = ReplayWorld('replay_filename_without_extension')

and run the game like normal. You can watch the playback at a step speed different from the original game by changing the fps value in settings.py. If s.make_video_from_replay is enabled and you have ffmpeg installed with the libx264 or libvpx-vp9 codecs, what you see will automatically be encoded as a video file and saved to screenshots/.

There is also an option to control one of the agents via keyboard. To do so, change one of the agents passed to the game world to

    ('user_agent', False)

and use the arrow keys to move around, Space to drop a bomb and Return to wait. This is clearly not an efficient way to create training data for your agent, but it helps you develop an intuition for the game's behaviour. This might, for example, be useful for designing auxiliary rewards or learning curricula.

7 Report and code repository

Put all your code into a public Github or Bitbucket repository and include the URL in your report. The zip file to be uploaded on Moodle shall only contain the subdirectory of agent_code containing your fully trained player, along with the report.

Your report should consist of roughly ten pages per team member, not counting the title page etc. The first section shall describe the reinforcement learning method and regression model you finally implemented, including all crucial design choices. You may also describe approaches you tried and abandoned later, including the reasons. The second section should describe your training process, including all tricks employed to speed it up (e.g. self-play strategy, design of auxiliary rewards, prioritization of experience replay and so on). The third section shall report experimental results (e.g. training progress diagrams), describe interesting observations, and discuss the difficulties you faced and how you overcame them. The final section shall give an outlook on how you would improve your agent if you had more time, and how we can improve the game setup for next year.