Instructions
User Manual:
Open the PDF directly: View PDF .
Page Count: 5
Download | |
Open PDF In Browser | View PDF |
Artificial Intelligence – Programming Assignment 3 UNIZG FER, academic year 2017/18 Handed out: 14.5.2018. Due: 27.5.2018. at 23:59 Introduction In this lab assignment you will solve problems from the domain of reinforcement learning. The code you will be using can be downloaded as a zip archive at the web page of the university here or at the github repository of the class here. In the scope of this lab assignment, you can use the autograder, which will give you adequate feedback regarding the correctness of your implementation. You run the autograder by typing python autograder.py. If you want to run the autograder for a specific question, do it as follows: python autograder.py -q q1. The code for the project consists of Python files, a part of which you just need to read and understand, a part of which you need to modify yourselves and a part that you can ignore. After you download and unpack the zip archive, you will see a list of files and folders which we briefly describe below: Files you will edit: valueIterationAgents.py qlearningAgents.py Files you should study: mdp.py learningAgents.py util.py gridworld.py featureExtractors.py Helper files you can ignore: environment.py graphicsUtils.py graphicsGridworldDisplay.py crawler.py graphicsCrawlerDisplay.py autograder.py The logic of the value iteration algorithm The logic of Q-learning based algorithms Description of the general markov decision process Logic describing the agents Helper classes and methods GridWorld description File that extracts features for approximate Q-learning Abstract class for reinforcement learning problems Helper functions for graphic display Graphic display of the GridWorld Crawler logic Graphic display of the crawler Helper file for running tests The code for the lab assignments was adapted from the course ”Intro to AI” at Berkeley, which allowed the usage of their Pacman environment for other universities for educational purposes. The code is written in Python 2.7, which you require an interpreter for in order to run the lab assignment. After you’ve downloaded and unpacked the code and positioned your console in the subdirectory where you have unpacked the archive, you can test the GridWorld by manually controlling the movement with arrows by typing: 1 Artificial Intelligence – Programming Assignment 3 python gridworld.py -m GridWorld is a world where your movement is made difficult by the fact that you only have an 80% chance to actually do the action that you wanted to do, and a 20% chance to make a random action. The noise can be modified with the parameter ’-n’, such as: python gridworld.py -m -n 0.5 In which case we set the noise rather high, to 50%. This parameter as well as the rest of them can be seen by calling the parameter ’-h (help)’. The output is as follows: Options: -h, --help show this help message and exit -d DISCOUNT, --discount=DISCOUNT Discount on future (default 0.9) -r R, --livingReward=R Reward for living for a time step (default 0.0) -n P, --noise=P How often action results in unintended direction (default 0.2) -e E, --epsilon=E Chance of taking a random action in q-learning (default 0.3) -l P, --learningRate=P TD learning rate (default 0.5) -i K, --iterations=K Number of rounds of value iteration (default 10) -k K, --episodes=K Number of epsiodes of the MDP to run (default 1) -g G, --grid=G Grid to use (case sensitive; options are BookGrid, BridgeGrid, CliffGrid, MazeGrid, default BookGrid) -w X, --windowSize=X Request a window width of X pixels *per grid cell* (default 150) -a A, --agent=A Agent type (options are ’random’, ’value’ and ’q’, default random) -t, --text Use text-only ASCII display -p, --pause Pause GUI after each time step when running the MDP -q, --quiet Skip display of any learning episodes -s S, --speed=S Speed of animation, S > 1.0 is faster, 0.0 < S < 1.0 is slower (default 1.0) -m, --manual Manually control agent -v, --valueSteps Display each step of value iteration As we have covered in the lectures, the problem we are solving in GridWorld is the one of optimizing the total utility while moving through the map. The difficulty in this problem is that you have noise while moving, and the actions that you choose don’t need to necessarily take you to the same place. In the GridWorld, this manifests in the way that you have a fixed chance of 1 ´ n to take the action that you chose, and a chance of n to move randomly. As usual, the states are defined by coordinates (x,y), while the transitions are defined by the sides of the world towards which the agent is moving (south, east, north, west). The lab assignment consists of four subtasks, and the correct implementation of each one of them carries the same weight. To ease the grading, each of the subtasks carries five points, with a total sum of 20. Your final points will be scaled and the actual maximum for the third lab assignment is 6.25. 2 Artificial Intelligence – Programming Assignment 3 Problem 1: Value iteration (5 points) In the file valueIterationAgents.py fill in the code missing to implement value iteration. A reminder from the lectures - value iteration is an algorithm that tries to calculate the utility of every state of the map by using the recursive Bellman equation: ř V0 psq “ 10 Vk`1 psq Ð maxa s1 T ps, a, s qrRps, a, s1 q ` γVk ps1 qs As a parameter to the constructor of your ValueIterationAgent you receive an argument mdp, which is the definition of the underlying Markov Decision Process. In the case of GridWorld, this is the implementation of the abstract class mdp.MarkovDecisionProcess which is located in gridworld.py. The methods that describe the Markov Decision Process are: mdp.getStates(): Returns the list of all possible states as a list of tuples (int, int) mdp.getPossibleActions(state): Returns the list of strings representing all possible actions for a given state mdp.getTransitionStatesAndProbs(state, action): Returns a list of pairs(tuples) consisting of (nextState: tuple(int, int), transitionProbability: float) mdp.getReward(state, action, nextState): Returns the float precision reward for a given state, action, nextState triple mdp.isTerminal(state): Checks if a state is final (returns True or False). Final states don’t have any transitions (so you should handle the case where the list of transitions is empty)! The methods you need to complete are: init , computeQValueFromValues, computeActionFromValues. SInce you have pre-defined values of the transition and reward functions, most of your computation should be done in the initialization. While solving use the already initialized class util.Counter, which is an extension of the Python’s default dictionary with default values of all keys set to zero - so you don’t need to initialize the value for each state. Note 1.1 Make sure that while updating the values you use a helper structure where you will store the intermediate result - in the stap k ` 1 of value iteration, the values Vk`1 depend only on the values from the previous step Vk - make sure you don’t use the values that you have calculated in step k ` 1. In order to test your implementation, the result of running the following command: python gridworld.py -a value -i 5 should look exactly like this: 3 Artificial Intelligence – Programming Assignment 3 Figure 1: Value iteration example Problem 2: Q-learning (5 points) In the file qlearningAgents.py, fill in the code for the Q-learning algorithm. A reminder from the classes, the Q-learning algorithm works on the running average principle, and the update formula looks as follows: Q0 ps, aq “ 0 Qk`1 ps, aq Ð Qk ps, aq ` pαqrRps, a, s1 q ` γ maxa1 Qk ps1 , a1 q ´ Qk ps, aqs Where α is the learning rate, while γ is the decaying factor. The methods you need to fill in in the QLearningAgent class are: init , getQValue, computeValueFromQValues, computeActionFromQValues, getAction and update. The QlearningAgent class has variables that it inherits from the ReinforcementAgent class, which even though you can’t see explicitly, exist. The variables that were inherited, and you will need in the scope of the assignment are: self.epsilon - exploration probability for the -greedy approach (Only in assignment 3.), self.alpha - learning rate, and self.discount - the decaying factor. All the falues are in float precision Another useful function which is not explicitely seen is self.getLegalActions(state), which returns the list of possible actions for a given state. Note 2.1 Take care of the fact that if you visited just one Q-state of a state, and it has a negative value, the optimal move can be one of the moves that you have not explored yet (due to the implicit default value of zero). Take care to explicitly initialize the moves so you can handle these cases. In case of ties in the Q-values of multiple Q-states, you should solve the ties by taking a random action from the ones with the maximum values. For this, the method random.choice() is useful, as it takes a list of elements as an argument and returns a random one with uniform probability. Note 2.2 Take care of the fact that when accessing Q-values, you use the methodgetQValue, since it is used later on in approximate Q-learning. Note 2.3 You can use tuples as keys to your dictionaries and Counters! 4 Artificial Intelligence – Programming Assignment 3 Problem 3: -greedy (5 bodova) Modify your Q-learning algorithm by implementing the -greedy approach. As a reminder, the is a small probability of taking a random action while learning the Q-values. The probability is available as a class variable self.epsilon, while the method util.flipCoin(p), might also prove useful, as it returns True with probability p, and False with probability 1 ´ p. Without any additional changes to your code after adding the -greedy approach, you can test your implementation by running the crawler and playing around with the parameters: python crawler.py Problem 4: Approximate Q-learning (5 bodova) Implement approximate Q-learning in the file qlearningAgents.py in the class ApproximateAgent. As a reminder, an approximate Q-value is the form: ř Qps, aq “ ni“1 fi ps, aqwi Where W is the weight vector, while f ps, aq is the feature vector for a Q-state (s,a). Every weight wi from the vector is mapped to one feature fi . The functions that will calculate the features are already defined and implemented in the file featureExtractors.py, and the class variable self.featExtractor which is available in your ApproximateAgent has a method getFeatures(state, action) which returns a Counter with feature values for a given Q-state. At the beginning of the process, you should initialize the weight to zero (hint: use the Counter), and then update them by using the formula from the classes: wi Ð wi ` α ¨ dif f erence ¨ fi ps, aq dif f erence “ pr ` γ maxa1 Qps1 , a1 qq ´ Qps, aq If your implementation works, the learned pacman controller should without any problems solve the following map: python pacman.py -p ApproximateQAgent -a extractor=SimpleExtractor -x 50 -n 60 -l mediumGrid While with a bit longer training time, he should have no more problems with a larger one: python pacman.py -p ApproximateQAgent -a extractor=SimpleExtractor -x 50 -n 60 -l mediumClassic 5
Source Exif Data:
File Type : PDF File Type Extension : pdf MIME Type : application/pdf PDF Version : 1.5 Linearized : No Page Count : 5 Page Mode : UseOutlines Author : Title : Subject : Creator : LaTeX with hyperref package Producer : pdfTeX-1.40.16 Create Date : 2018:05:14 15:49:01+02:00 Modify Date : 2018:05:14 15:49:01+02:00 Trapped : False PTEX Fullbanner : This is pdfTeX, Version 3.14159265-2.6-1.40.16 (TeX Live 2015/Debian) kpathsea version 6.2.1EXIF Metadata provided by EXIF.tools