0CCS0CSE programming assignment help: R, Java, Python - Java programming help

Date: 2021-05-08 10:12

Introduction to CS & Engineering (0CCS0CSE)

Assignment 23: Episode

1 Value Function

Implementing Eq. 1 can cause confusion because $V(S_t)$ appears on both sides of the equation and, in Python, $V$ is stored as a dictionary. This document explains lines 23 to 25 of Algorithm 1.

$$V(S_t) = V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right] \tag{1}$$
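As a quick check of Eq. 1, one update can be worked by hand. The numbers here are illustrative assumptions, not values from the assignment: $\alpha = 0.1$, $\gamma = 0.9$, $R_{t+1} = 0$, a current estimate $V(S_t) = 0.5$, and $V(S_{t+1}) = 0.8$.

```latex
\begin{aligned}
V(S_t) &= 0.5 + 0.1 \left[ 0 + 0.9 \times 0.8 - 0.5 \right] \\
       &= 0.5 + 0.1 \times 0.22 \\
       &= 0.522
\end{aligned}
```

Only $V(S_t)$ changes. $V(S_{t+1})$ is read but keeps its value of 0.8.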

Although lines 23 and 24 of Algorithm 1 appear to update the valueFunction dictionary, they do not: they only retrieve values from it. Introducing two new variables, v_st1 and v_st0, in place of $V(S_{t+1})$ and $V(S_t)$ makes it clear that only line 25 changes the dictionary.

v_st1 ⇐ GetValueOf(board)
v_st0 ⇐ GetValueOf(previousState)
V(S_t) ⇐ v_st0 + session.learningRate × (reward + (session.discountRate × v_st1) − v_st0)
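In Python, the same read-then-write pattern might look like the sketch below. It is a minimal illustration under stated assumptions: plain string keys stand in for TicTacToe boards, and the learning rate, discount rate, reward, and stored values are made-up numbers rather than values from the assignment.

```python
# Hypothetical value table: string keys stand in for real board states.
value_function = {"state_t": 0.5, "state_t1": 0.8}

learning_rate = 0.1   # plays the role of session.learningRate (alpha)
discount_rate = 0.9   # plays the role of session.discountRate (gamma)
reward = 0.0          # assumed reward for this move

# Two reads: neither line changes the dictionary.
v_st1 = value_function["state_t1"]
v_st0 = value_function["state_t"]

# One write: only the entry for the previous state is updated.
value_function["state_t"] = v_st0 + learning_rate * (
    reward + discount_rate * v_st1 - v_st0
)

print(value_function)
```

Keeping the two reads in local variables makes it obvious that the right-hand side uses the old value of the previous state, and that the entry for the new state is left untouched.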

Furthermore, GetValueOf(...) is a multistep process:

(1) Get the key from the board.
(2) Check whether the key is in valueFunction:
    i. if the key is in valueFunction, return the value associated with the key in the dictionary, e.g., return self.valueFunction[key];
    ii. if the key is not in valueFunction, add the key to the dictionary, initialise its value to zero, and return 0.

It would be best to add a new method, getValueOf(self, board), which does all of this. In lines 23 and 24 of Algorithm 1, both board and previousState are TicTacToe objects.
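One possible shape for such a method is sketched below. The handout does not say how a key is derived from a board, so the Board class, its cells attribute, and the getKey helper are hypothetical stand-ins. Only getValueOf(self, board) and self.valueFunction come from the text above.

```python
class Board:
    """Hypothetical stand-in for the TicTacToe class: nine cells, ' ' when empty."""

    def __init__(self, cells=None):
        self.cells = list(cells) if cells is not None else [" "] * 9


class Agent:
    """Hypothetical agent that owns the state-value dictionary."""

    def __init__(self):
        self.valueFunction = {}

    def getKey(self, board):
        # Assumption: the key is the nine cells joined into one string.
        return "".join(board.cells)

    def getValueOf(self, board):
        # (1) Get the key from the board.
        key = self.getKey(board)
        # (2) ii. Unseen state: add it with an initial value of zero.
        if key not in self.valueFunction:
            self.valueFunction[key] = 0.0
        # (2) i. The key now exists, so return its stored value.
        return self.valueFunction[key]


agent = Agent()
board = Board("X   O    ")
first = agent.getValueOf(board)    # unseen state: stored, then returned
agent.valueFunction[agent.getKey(board)] = 0.25
second = agent.getValueOf(board)   # known state: stored value returned
```

The two branches could also be collapsed into a single call to dict.setdefault. The explicit form is kept here because it mirrors the steps listed above.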

Algorithm 1: This method plays a single tic-tac-toe game and updates the state-value table after every move made by the RL agent.

1: procedure episode(board, opponent, session)
3:     result ⇐ True
4:     turn ⇐ 0
5:     previousState ⇐ CopyBoard()
7:     while not board.isGameOver() and result do
8:         if turn > 1 then
9:             turn ⇐ 0
10:         end if
11:
12:         agentMoved ⇐ False
13:
14:         if turn is 0 and session.agentFirst or turn is 1 and not session.agentFirst then
15:             result ⇐ makeTrainingMove(board, session.epsilon)
16:             agentMoved ⇐ True
17:         else
18:             result ⇐ opponent.makeMove(board)
19:         end if
20:
21:         if agentMoved then
22:             reward ⇐ getReward(board)
23:             V(S_{t+1}) ⇐ GetValueOf(board)
24:             V(S_t) ⇐ GetValueOf(previousState)
25:             V(S_t) ⇐ V(S_t) + session.learningRate × (reward + (session.discountRate × V(S_{t+1})) − V(S_t))
26:             previousState ⇐ CopyBoard()
27:
28:         end if
29:
30:         turn ⇐ turn + 1
31:     end while
32:
33:     reward ⇐ getReward(board)
34:     V(S_{t+1}) ⇐ GetValueOf(board)
35:     V(S_{t+1}) ⇐ V(S_{t+1}) + session.learningRate × reward
36: end procedure
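To see the control flow of Algorithm 1 run end to end, the procedure can be sketched in Python against a toy game. Everything except the structure of episode is an assumption made for illustration: ToyBoard ends the game after three moves, ToyAgent plays both sides, the reward is 1.0 once the game is over, CopyBoard() is taken to mean a deep copy of the current board, and the session values are arbitrary.

```python
import copy
from types import SimpleNamespace


class ToyBoard:
    """Hypothetical stand-in for TicTacToe: the game ends after three moves."""

    def __init__(self):
        self.moves = 0

    def isGameOver(self):
        return self.moves >= 3


class ToyAgent:
    """Hypothetical player, used here as both RL agent and opponent."""

    def __init__(self):
        self.valueFunction = {}

    def getKey(self, board):
        return board.moves  # assumption: the move count identifies the state

    def getValueOf(self, board):
        # Lookup that registers unseen states with value 0.0.
        return self.valueFunction.setdefault(self.getKey(board), 0.0)

    def getReward(self, board):
        return 1.0 if board.isGameOver() else 0.0  # assumed reward scheme

    def makeTrainingMove(self, board, epsilon):
        board.moves += 1  # epsilon is unused in this toy game
        return True

    def makeMove(self, board):
        board.moves += 1
        return True

    def episode(self, board, opponent, session):
        result = True
        turn = 0
        previous_state = copy.deepcopy(board)  # assumed meaning of CopyBoard()

        while not board.isGameOver() and result:
            if turn > 1:
                turn = 0
            agent_moved = False

            if (turn == 0 and session.agentFirst) or (turn == 1 and not session.agentFirst):
                result = self.makeTrainingMove(board, session.epsilon)
                agent_moved = True
            else:
                result = opponent.makeMove(board)

            if agent_moved:
                reward = self.getReward(board)
                v_st1 = self.getValueOf(board)           # line 23: lookup
                v_st0 = self.getValueOf(previous_state)  # line 24: lookup
                # Line 25: the only assignment to the table inside the loop.
                self.valueFunction[self.getKey(previous_state)] = v_st0 + session.learningRate * (
                    reward + session.discountRate * v_st1 - v_st0
                )
                previous_state = copy.deepcopy(board)

            turn += 1

        # Lines 33-35: final adjustment of the last state's value.
        reward = self.getReward(board)
        v_st1 = self.getValueOf(board)
        self.valueFunction[self.getKey(board)] = v_st1 + session.learningRate * reward


session = SimpleNamespace(agentFirst=True, epsilon=0.1, learningRate=0.5, discountRate=1.0)
agent = ToyAgent()
agent.episode(ToyBoard(), ToyAgent(), session)
print(agent.valueFunction)
```

The parentheses around the two halves of the turn test are optional in Python, since and binds more tightly than or, but they make the intent of line 14 easier to read.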
