联系方式

  • QQ:99515681
  • 邮箱:99515681@qq.com
  • 工作时间:8:00-23:00
  • 微信:codinghelp

您当前位置:首页 >> Python编程Python编程

日期:2020-12-11 11:22

Assignment 2

ECS7002P - Artificial Intelligence in Games

November 9, 2020

In this assignment, you will implement a variety of reinforcement learning algorithms to find policies for the

frozen lake environment. Please read this entire document before you start working on the assignment.

1 Environment

The frozen lake environment has two main variants: the small frozen lake (Fig. 1) and the big frozen lake (Fig.

2). In both cases, each tile in a square grid corresponds to a state. There is also an additional absorbing state,

which will be introduced soon. There are four types of tiles: start (grey), frozen lake (light blue), hole (dark

blue), and goal (white). The agent has four actions, which correspond to moving one tile up, left, down, or

right. However, with probability 0.1, the environment ignores the desired direction and the agent slips (moves

one tile in a random direction). An action that would cause the agent to move outside the grid leaves the state

unchanged.

Figure 1: Small frozen lake

Figure 2: Big frozen lake

The agent receives reward 1 upon taking an action at the goal. In every other case, the agent receives zero

reward. Note that the agent does not receive a reward upon moving into the goal (nor a negative reward upon

moving into a hole). Upon taking an action at the goal or in a hole, the agent moves into the absorbing state.

Every action taken at the absorbing state leads to the absorbing state, which also does not provide rewards.

Assume a discount factor of γ = 0.9.

1

For the purposes of model-free reinforcement learning (or interactive testing), the agent is able to interact

with the frozen lake for a number of time steps that is equal to the number of tiles.

Your first task is to implement the frozen lake environment. Using either Python or Java, try to mimic the

Python interface presented in Listing 1.

Listing 1: Frozen lake environment.

The class EnvironmentModel represents a model of an environment. The constructor of this class receives

a number of states, a number of actions, and a seed that controls the pseudorandom number generator. Its

subclasses must implement two methods: p and r. The method p returns the probability of transitioning from

state to next state given action. The method r returns the expected reward in having transitioned from state to

next state given action. The method draw receives a pair of state and action and returns a state drawn according

to p together with the corresponding expected reward. Note that states and actions are represented by integers

starting at zero. We highly recommend that you follow the same convention, since this will facilitate immensely

the implementation of reinforcement learning algorithms. You can use a Python dictionary (or equivalent data

structure) to map (from and to) integers to a more convenient representation when necessary. Note that, in

general, agents may receive rewards drawn probabilistically by an environment, which is not supported in this

simplified implementation.

The class Environment represents an interactive environment and inherits from EnvironmentModel. The

constructor of this class receives a number of states, a number of actions, a maximum number of steps for

interaction, a probability distribution over initial states, and a seed that controls the pseudorandom number

generator. Its subclasses must implement two methods: p and r, which were already explained above. This

class has two new methods: reset and step. The method reset restarts the interaction between the agent and

the environment by setting the number of time steps to zero and drawing a state according to the probability

distribution over initial states. This state is stored by the class. The method step receives an action and returns

a next state drawn according to p, the corresponding expected reward, and a flag variable. The new state is

stored by the class. This method also keeps track of how many steps have been taken. Once the number of steps

matches or exceeds the pre-defined maximum number of steps, the flag variable indicates that the interaction

should end.

The class FrozenLake represents the frozen lake environment. Your task is to implement the methods p and

r for this class. The constructor of this class receives a matrix that represents a lake, a probability that the

agent will slip at any given time step, a maximum number of steps for interaction, and a seed that controls

the pseudorandom number generator. This class overrides the method step to indicate that the interaction

should also end when the absorbing state is reached. The method render is capable of rendering the state of

the environment or a pair of policy and value function.

The function play can be used to test your implementation of the environment before you try the next tasks.

4

2 Tabular model-based reinforcement learning

Your next task is to implement policy evaluation, policy improvement, policy iteration, and value iteration.

You may follow the interface suggested in Listing 2.

Listing 2: Tabular model-based algorithms.

The function policy evaluation receives an environment model, a deterministic policy, a discount factor, a

tolerance parameter, and a maximum number of iterations. A deterministic policy may be represented by an

array that contains the action prescribed for each state.

The function policy improvement receives an environment model, the value function for a policy to be

improved, and a discount factor.

The function policy iteration receives an environment model, a discount factor, a tolerance parameter, a

maximum number of iterations, and (optionally) the initial policy.

The function value iteration receives an environment model, a discount factor, a tolerance parameter, a

maximum number of iterations, and (optionally) the initial value function.

3 Tabular model-free reinforcement learning

Your next task is to implement Sarsa control and Q-learning control. You may follow the interface suggested in

Listing 3. We recommend that you use the small frozen lake to test your implementation, since these algorithms

may need many episodes to find an optimal policy for the big frozen lake.

Listing 3: Tabular model-free algorithms.

The function sarsa receives an environment, a maximum number of episodes, an initial learning rate, a

discount factor, an initial exploration factor, and an (optional) seed that controls the pseudorandom number

generator. Note that the learning rate and exploration factor decrease linearly as the number of episodes

increases (for instance, eta[i] contains the learning rate for episode i).

The function q learning receives an environment, a maximum number of episodes, an initial learning rate,

a discount factor, an initial exploration factor, and an (optional) seed that controls the pseudorandom number

generator. Note that the learning rate and exploration factor decrease linearly as the number of episodes

increases (for instance, eta[i] contains the learning rate for episode i).

Important: The -greedy policy based on Q should break ties randomly between actions that maximize Q

for a given state. This plays a large role in encouraging exploration.

4 Non-tabular model-free reinforcement learning

In this task, you will treat the frozen lake environment as if it required linear action-value function approximation.

Your task is to implement Sarsa control and Q-learning control using linear function approximation.

In the process, you will learn that tabular model-free reinforcement learning is a special case of non-tabular

model-free reinforcement learning. You may follow the interface suggested in Listing 4.

Listing 4: Non-tabular model-free algorithms.

The class LinearWrapper implements a wrapper that behaves similarly to an environment that is given to

its constructor. However, the methods reset and step return a feature matrix when they would typically return

a state s. Each row a of this feature matrix contains the feature vector φ(s, a) that represents the pair of action

and state (s, a). The method encode state is responsible for representing a state by such a feature matrix.

More concretely, each possible pair of state and action is represented by a different vector where all elements

except one are zero. Therefore, the feature matrix has |S||A| columns. The method decode policy receives a

parameter vector θ obtained by a non-tabular reinforcement learning algorithm and returns the corresponding

greedy policy together with its value function estimate.

The function linear sarsa receives an environment (wrapped by LinearWrapper ), a maximum number of

episodes, an initial learning rate, a discount factor, an initial exploration factor, and an (optional) seed that

controls the pseudorandom number generator. Note that the learning rate and exploration factor decay linearly

as the number of episodes grows (for instance, eta[i] contains the learning rate for episode i).

The function linear q learning receives an environment (wrapped by LinearWrapper ), a maximum number

of episodes, an initial learning rate, a discount factor, an initial exploration factor, and an (optional) seed that

controls the pseudorandom number generator. Note that the learning rate and exploration factor decay linearly

as the number of episodes grows (for instance, eta[i] contains the learning rate for episode i).

The Q-learning control algorithm for linear function approximation is presented in Algorithm 1. Note that

this algorithm uses a slightly different convention for naming variables and omits some details for the sake of

simplicity (such as learning rate/exploration factor decay).

Algorithm 1 Q-learning control algorithm for linear function approximation

Input: feature vector φ(s, a) for all state-action pairs (s, a), number of episodes N, learning rate α, probability

of choosing random action ,

Important: The -greedy policy based on Q should break ties randomly between actions that maximize Q

(Algorithm 1, Line 9). This plays a large role in encouraging exploration.

5 Main function

Your final implementation task is to write a program that uses all the algorithms that you have implemented for

this assignment. Your main function should behave analogously to the function presented in Listing 5. Using

the small frozen lake as a benchmark, find and render an optimal policy using policy iteration, value iteration,

Sarsa control, Q-learning control, linear Sarsa control, and linear Q-learning. For grading purposes, if your

main function does not call one of these algorithms, we will assume that it is not implemented correctly.

Listing 5: Main function.

6 Submission instructions

This assignment corresponds to 40% of the final grade for this module. You will work in groups of 3 students. The

deadline for submitting this assignment is December 11th, 2020. Penalties for late submissions will be applied

in accordance with the School policy. The submission cut-off date is 7 days after the deadline. Submissions

should be made through QM+. Submissions by e-mail will be ignored. Please always check whether the files

were uploaded correctly to QM+. Cases of extenuating circumstances have to go through the proper procedure

in accordance with the School policy. Only cases approved by the School in due time will be considered.

You will find the group selection page in QM+. You must be part of a group in QM+ before submitting

your assignment. Plagiarism leads to irreversible non-negotiable failure in the module. If you are unsure about

what constitutes plagiarism, please ask.

This assignment requires a group submission and an individual submission, which are detailed in the next

sections.

6.1 Group submission

The group submission must be a single zip file. This file must contain a single folder named group[group id].

This folder must contain a report and a folder named code.

The code folder must contain a file named README.txt, which explains how to run your main function (see

Section 5). Based solely on the correctness and clarity of your code, you will receive the following number of

points for accomplishing each of the following tasks:

1. Implementing the frozen lake environment [10/100]

2. Implementing policy iteration [10/100]

3. Implementing value iteration [10/100]

4. Implementing Sarsa control [10/100]

5. Implementing Q-learning [10/100]

6. Implementing Sarsa control using linear function approximation [10/100]

7. Implementing Q-learning control using linear function approximation [10/100]

The report must be a single pdf file. Other formats are not acceptable. The report must be excellently

organized and identified with your names, student numbers, and module identifier. You will receive the following

number of points for answering each of the following questions:

1. Explain how your code for this assignment is organized. Did you make implementation decisions that

deviate significantly from what we suggested? [10/100]

2. How many iterations did policy iteration require to find an optimal policy for the big frozen lake? How

many iterations did value iteration require? Which algorithm was faster? [10/100]

3. How many episodes did Sarsa control require to find an optimal policy for the small frozen lake? How

many episodes did Q-learning control require? Hint: you may use policy evaluation to compare the value

of each policy obtained by these algorithms to the value of an optimal policy [10/100]

4. In linear action-value function approximation, how can each element of the parameter vector θ be interpreted

when each possible pair of state s and action a is represented by a different feature vector φ(s, a)

where all elements except one are zero? Explain why the tabular model-free reinforcement learning algorithms

that you implemented are a special case of the non-tabular model-free reinforcement learning

algorithms that you implemented. [Bonus 5/100]

5. Try to find an optimal policy for the big frozen lake by tweaking the parameters for Sarsa control and

Q-learning control (maximum number of episodes, learning rate, and exploration factor). You must use

policy evaluation to confirm that the resulting policy is optimal. Even if you fail, describe your experience.

[Bonus 5/100]

10

6.2 Individual submission

Each student must submit a text that describes in no more than 300 words the role that each member of their

group had in the assignment. Your grade may be penalized if the individual submissions reveal that you have

not contributed enough to the assignment. The individual submission must be a single pdf file. Other formats

are not acceptable.

11


版权所有:留学生编程辅导网 2020 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。 站长地图

python代写
微信客服:codinghelp