联系方式

  • QQ:99515681
  • 邮箱:99515681@qq.com
  • 工作时间:8:00-23:00
  • 微信:codinghelp

您当前位置:首页 >> Java编程Java编程

日期:2020-10-26 10:48

Department of Computing

Reinforcement Learning

Assessed coursework 1

Version 1.00

To be returned as online submission.

The final document should be submitted in PDF format, preferably typeset, ideally in Latex.

Your answers should be yours, i.e., written by you, in your own words, showing your own

understanding. Written answers should be clear, complete and concise. Figures should be

clearly readable, labelled and visible. Poorly produced answers, irrelevant text not addressing

the point and unclear text may loose points. As a rule of thumb if the work is not presented

in such a way that the markers can easily verify your understanding, it will be difficult to get

marks for it. If you need to code your answers to produce results, please paste the completed

and annotated source code in the appendix of your submission. Note, that the coursework will

be graded on the written report, submitted code is treated as ancillary documentation of your

authorship, it not as a substitute for a written answer.

Your coursework submission should not be longer than 7 single sided A4 pages with at least 2

centimetre margins all around and 12pt font. Again, seven pages is a maximum length, shorter

courseworks are fine, the code appendix does not count towards page limit, other appendices

are not allowed. Coursework over the page limit may incur a penalty. We expect not more than

4 figures including explanatory captions per page.

You are encouraged to discuss the general coursework questions with other students, but

your answers should be yours, i.e., written by you, in your own words, showing your own

understanding. You should produce your own code to compute your specific answers. You

are welcome to reuse code you developed in the lab assignments and interactive computer

labs. If you have questions about the coursework please make use of the labs or Piazza, but

note that GTAs cannot provide you with answers that directly solve the coursework.

Marks are shown next to each question. Note, that the marks are only indicative.

All coursework is to be submitted on CATE by the specified end date.Please ensure that

you are familiar with the straightforward CATE submission process well before the coursework

deadline. Unfortunately, emailed or printed submissions cannot be accepted.

Question 1: Understanding of MDPs (20 points)

This question is personalised by your College ID (CID) number.

Consider the following observed traces for a Markov Decision Process with three states S = {s0,s1,s2,s3},

γ = 1, and immediate rewards specified after every state. There is only one action possible (”no

choice”), which we therefore omit from the trace. The traces are composed of state s(t) at time step t

and immediate rewards r(t) that are collected here upon departing from state s(t). We are going to

”observe” a state-reward trace computed from your CID as follows:

1

? Take your CID number and omit any leading 0s. Then we call the left-most digit C ID(1), the

subsequent digit C ID(2) etc.

? Going from the left most to the right most digit of your CID compute a sequence of states by

moving from t = 1, 2, . . .

s(t) = (CID(t) + 2) mod 4 (1)

and the sequence of corresponding rewards is

r(t) = CID(t) mod 4 (2)

Should you CID number without 0s be just 6 digits long (because you have been here for many years,

like Aldo), then please prepend a 1 to your CID.

For example, if your CID is 012345678, then the trace of states and rewards is

τ = s3 1 s0 2 s1 3 s2 0 s3 1 s0 2 s1 3 s2 0

We can omit considering the last reward (0 in the example above) as the trace finishes on the last state

s2, but we do not presume whether the last state is a terminal state or a transient state. The trace is

just the data that we have observed, we do not know more.

a. Write out your personalised trace (please use exactly the same format as in the example above).

(1 pts)

b. Draw a minimal MDP graph consistent with the data (do not add anything that is not in the

data). Please make sure to draw any self-connections between states (these are typically omitted,

but it will be beneficial for your learning experience to draw these in). Briefly explain your

rationale for the graph you drew. (6 pts)

c. Let us assume we do not know the transition matrix nor the reward matrix of this MDP, just as

is the case of a naive reinforcement learning agent. (13 pts)

1. Give an expression of the transition matrix you can infer from the data or the graph you

have drawn.

2. What can you infer about the structure of the reward function and the reward matrix?

Briefly give an explanation.

3. What can you compute about the value of state s(t = 0)? Name any method you used and

justify your choice.

Question 2: Understanding of Grid Worlds (80 points)

This question is personalised by your College ID (CID) number, specifically the last 3 digits (which

we call x, y, z).

Consider the following grid world. There are 11 states, corresponding to locations on a grid. This

Grid World has two terminal states. The first terminal state is the reward state as reaching it yields

+10 reward and ends the episode. The reward state is the state sj

, j = ((z + 1) mod 3) + 1, where

z is the last digit of your CID. The second terminal state is the penalty state s11, which yields -100

reward.

Possible actions in this grid world are N, E, S and W (North, East, South, West), which correspond

to moving in the four cardinal directions of the compass. The effects of actions are not deterministic

2

and only succeed in moving in the desired direction with probability p, in which case the action

leads to the remaining 3 cardinal directions with equal probability. After the movement direction is

determined, and if a wall blocks the agent’s path, then the agent will stay where it is, otherwise, it

will move. The agent receives a reward of ?1 for every transition (i.e. a movement cost), except those

movements ending in the terminal state (where you collect the reward for arriving at them).

Throughout this question we set p = 0.25 + 0.5 ×

(x+1)

10 and γ = 0.2 + 0.5 ×

y

10 , where x is the antepenultimate

digit of your CID and y is the penultimate digit of your CID.

a. State your personalised reward state, p, and γ (1 pts)

b. Dynamic Programming (with full world knowledge) (24 pts)

1. Compute the optimal value function and the optimal policy using Dynamic Programming.

Briefly state how you solved the problem, including any parameters that you set or assumptions

you made.

2. Report your optimal value function by ”writing in” the values into the grid world (handwritten

scan is ok).

3. Report your optimal policy function by ”drawing in” arrows reflecting your optimal action

for each grid world state (again, hand-drawn scan is ok).

4. Briefly discuss how the value of your γ and p have influenced the optimal value function

and optimal policy in your personal Grid World. In particular, you may investigate the

effect of having a value of p < 0.25, p = 0.25 or p > 0.25, and similarly γ < 0.5 or γ > 0.5.

Note: For those that want to do it by hand and scan: a grid world is reproduced on the last page, you can

print this out and write it an and photograph. If you do not use computer typeset answers, please write

and draw clearly.

3

c. Monte Carlo RL (15 pts)

1. Estimate the optimal value function using Monte Carlo (MC) reinforcement learning. Briefly

state how you solved the problem, including any parameters that you set or assumptions

you made.

2. Report your optimal value function and policy.

3. Plot the learning curve of your agent (number of episodes against reward). In general, a

single run will not be sufficient to estimate variability in the learning, thus run your agent

a ”sufficient” number of times. Briefly state what a sufficient number is and how you

chose it, and then plot the mean plus/minus the standard deviation of the reward.

4. How does varying the exploration parameter e and the learning rate α of your algorithm

impact your learning curves? Briefly explain what you find and discuss and relate it where

possible to the theory you learned.

d. Temporal Difference RL (15 pts)

1. Estimate the optimal value function using Temporal Difference (TD) reinforcement learning.

Briefly state how you solved the problem, including any parameters that you set or

assumptions you made.

2. Report your optimal value function and policy.

3. Plot the learning curve of your agent (number of episodes against reward). In general a

single run will not be sufficient to estimate variability in the learning, thus run your agent

a ”sufficient” number of times. Briefly state what a sufficient number is and how you

chose it, and then plot the mean plus/minus the standard deviation of the reward.

4. How does varying the exploration parameter e and the learning rate α of your algorithm

impact your learning curves? Briefly explain what you find and discuss and relate it where

possible to the theory you learned.

e. Comparison of learners (20 pts)

1. Compare the optimal value function estimation error (the root mean square error between

the optimal value function vector computed through Dynamic Programming with your

current estimate from reinforcement learning). On an episode by episode basis, plot estimation

error against episodes. Compare these estimation error plots between MC and TD

implementations. What differences do you see? In your answer relate your reasoning also

to the theory of the two approaches that you learned.

2. Train both your MC and TD learners. On an episode by episode basis, plot the value

function estimation error of the episode against the reward for that episode. Explain what

this plot characterises. Based on this plot also explain how important it is to have a good

value function estimate to obtain a good reward. What conclusions can you draw? In your

answer relate your reasoning also to the theory of the two approaches that you learned.

3. Is MC systematically better than TD or vice versa at learning the policy quickly and correctly?

Does changing the learning parameters or other factors in the algorithms alter this

performance relationships. Discuss and provide any evidence you simulated.

Figure 1: Optimal value function. Values for each state rounded to 1 decimal place. Arrows indicate

optimal action direction for each state (deterministic policy), multiple arrows from one state indicate

equiprobable choice between indicated directions (stochastic policy).

5


版权所有:留学生编程辅导网 2018 All Rights Reserved 联系方式:QQ:99515681 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。