Date: 2021-05-08 10:12

Introduction to CS & Engineering (0CCS0CSE)

Assignment 23: Episode

1 Value Function

Implementing Eq. 1 can cause confusion because $V(S)$ appears on both sides of the equation and, in Python, $V(S)$ is a dictionary. This document explains lines 23 to 25 of Algorithm 1.

$$V(S_t) = V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right] \tag{1}$$

Although lines 23 and 24 of Algorithm 1 appear to update the valueFunction dictionary, they do not: they only retrieve information from it. Introducing two new variables, v_st1 and v_st0, in place of $V(S_{t+1})$ and $V(S_t)$ makes it clear that only line 25 changes the dictionary.

v_st1 ← GetValueOf(board)
v_st0 ← GetValueOf(previousState)
V(S_t) ← v_st0 + session.learningRate × (reward + (session.discountRate × v_st1) − v_st0)
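To make the read-then-write order concrete, here is a minimal Python sketch of the three assignments above. The plain dictionary `value_function`, its string keys, and the numeric values chosen for the learning rate, discount rate, and reward are illustrative assumptions, not values taken from the assignment.

```python
# Illustrative values only; in the assignment these come from the session object.
learning_rate = 0.1   # alpha in Eq. 1
discount_rate = 0.9   # gamma in Eq. 1
reward = 1.0          # R_{t+1}

# Hypothetical string keys standing in for real board positions.
value_function = {"previous": 0.2, "current": 0.5}

# Lines 23 and 24: read only, the dictionary is not modified here.
v_st1 = value_function["current"]    # V(S_{t+1})
v_st0 = value_function["previous"]   # V(S_t)

# Line 25: the only statement that writes to the dictionary.
value_function["previous"] = v_st0 + learning_rate * (
    reward + discount_rate * v_st1 - v_st0
)

print(value_function["previous"])
```

With these numbers the updated entry is approximately 0.325, since 0.2 + 0.1 × (1 + 0.9 × 0.5 − 0.2) = 0.325, and the entry for the current state is left unchanged.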

Furthermore, GetValueOf(...) is a multistep process: (1) get the key from the board; (2) check whether the key is in valueFunction. Either (i) the key is in valueFunction, in which case return the value associated with it, e.g., return self.valueFunction[key]; or (ii) the key is not in valueFunction, in which case add the key to the dictionary, initialise its value to zero, and return 0. It would be best to add a new method, getValueOf(self, board), which does all of this. In lines 23 and 24 of Algorithm 1, both board and previousState are TicTacToe objects.
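The lookup steps described above might be sketched as follows. This is only one possible shape for getValueOf: the `Agent` and `FakeBoard` classes and the `getKey()` method are hypothetical stand-ins, since the assignment does not specify how a TicTacToe object produces its key.

```python
class Agent:
    """Minimal stand-in for the RL agent; only the value table is modelled."""

    def __init__(self):
        self.valueFunction = {}  # maps a board key to its estimated value

    def getValueOf(self, board):
        # Step 1: get the key from the board.
        # ASSUMPTION: the board exposes getKey(), returning a hashable key.
        key = board.getKey()
        # Step 2 (i): known state, return the stored value.
        if key in self.valueFunction:
            return self.valueFunction[key]
        # Step 2 (ii): unseen state, store zero and return it.
        self.valueFunction[key] = 0.0
        return 0.0


class FakeBoard:
    """Hypothetical stand-in for a TicTacToe object, used only for this demo."""

    def __init__(self, cells):
        self.cells = cells

    def getKey(self):
        return "".join(self.cells)


agent = Agent()
board = FakeBoard(list("X...O...."))
first = agent.getValueOf(board)              # unseen state: inserted with value 0
agent.valueFunction[board.getKey()] = 0.25   # pretend a later update changed it
second = agent.getValueOf(board)             # known state: stored value returned
```

The explicit if/else mirrors the two cases in the text; `self.valueFunction.setdefault(key, 0.0)` would do the same in a single call.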


Algorithm 1: This method executes a single tic-tac-toe game and updates the state-value table after every move played by the RL agent.

1: procedure episode(board, opponent, session)
2:
3:     result ← True
4:     turn ← 0
5:     previousState ← CopyBoard()
6:
7:     while not board.isGameOver() and result do
8:         if turn > 1 then
9:             turn ← 0
10:         end if
11:
12:         agentMoved ← False
13:
14:         if (turn is 0 and session.agentFirst) or (turn is 1 and not session.agentFirst) then
15:             result ← makeTrainingMove(board, session.epsilon)
16:             agentMoved ← True
17:         else
18:             result ← opponent.makeMove(board)
19:         end if
20:
21:         if agentMoved then
22:             reward ← getReward(board)
23:             V(S_{t+1}) ← GetValueOf(board)
24:             V(S_t) ← GetValueOf(previousState)
25:             V(S_t) ← V(S_t) + session.learningRate × (reward + (session.discountRate × V(S_{t+1})) − V(S_t))
26:             previousState ← CopyBoard()
27:
28:         end if
29:
30:         turn ← turn + 1
31:     end while
32:
33:     reward ← getReward(board)
34:     V(S_{t+1}) ← GetValueOf(board)
35:     V(S_{t+1}) ← V(S_{t+1}) + session.learningRate × reward
36: end procedure
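Algorithm 1 could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the required solution: `episode` is written as a free function taking an `agent` argument, `copy.deepcopy` stands in for CopyBoard(), and the `ToyBoard`, `ToyOpponent`, and `ToyAgent` classes (including `getKey()` and `play()`) are hypothetical stand-ins that end the game after three moves so the sketch can run on its own.

```python
import copy
from types import SimpleNamespace


def episode(agent, board, opponent, session):
    """Sketch of Algorithm 1; all collaborator method names are assumptions."""
    result = True
    turn = 0
    previous_state = copy.deepcopy(board)  # stand-in for CopyBoard()

    while not board.isGameOver() and result:
        if turn > 1:
            turn = 0
        agent_moved = False
        if (turn == 0 and session.agentFirst) or (turn == 1 and not session.agentFirst):
            result = agent.makeTrainingMove(board, session.epsilon)
            agent_moved = True
        else:
            result = opponent.makeMove(board)

        if agent_moved:
            reward = agent.getReward(board)
            v_st1 = agent.getValueOf(board)            # line 23: read only
            v_st0 = agent.getValueOf(previous_state)   # line 24: read only
            # Line 25: the only write to the dictionary inside the loop.
            agent.valueFunction[previous_state.getKey()] = v_st0 + session.learningRate * (
                reward + session.discountRate * v_st1 - v_st0
            )
            previous_state = copy.deepcopy(board)
        turn += 1

    # Lines 33 to 35: final update for the terminal state.
    reward = agent.getReward(board)
    v_final = agent.getValueOf(board)
    agent.valueFunction[board.getKey()] = v_final + session.learningRate * reward


class ToyBoard:
    """Hypothetical stand-in for TicTacToe: the game ends after three moves."""

    def __init__(self):
        self.moves = []

    def isGameOver(self):
        return len(self.moves) >= 3

    def getKey(self):
        return "".join(self.moves)

    def play(self, mark):
        self.moves.append(mark)
        return True


class ToyOpponent:
    def makeMove(self, board):
        return board.play("O")


class ToyAgent:
    def __init__(self):
        self.valueFunction = {}

    def getValueOf(self, board):
        # Unseen keys are inserted with value 0.0, then returned.
        return self.valueFunction.setdefault(board.getKey(), 0.0)

    def makeTrainingMove(self, board, epsilon):
        return board.play("X")  # epsilon is ignored in this toy agent

    def getReward(self, board):
        return 1.0 if board.isGameOver() else 0.0


session = SimpleNamespace(agentFirst=True, epsilon=0.1, learningRate=0.5, discountRate=0.9)
agent = ToyAgent()
episode(agent, ToyBoard(), ToyOpponent(), session)
print(agent.valueFunction)
```

Only states seen immediately before or after an agent move receive an entry in this sketch; the position after the opponent's move is never looked up, so it never enters the dictionary.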
