

Date: 2021-03-18 11:24

CSE 525 Programming Assignment 1

Due March 20th 11:59:59

The goal of this assignment is to implement three RL algorithms listed as follows.

● Monte Carlo (with function approximation)

● Fitted Q iteration
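As a rough illustration of how the regression targets of these two methods differ, the sketch below computes Monte Carlo returns and one-step fitted Q iteration targets in plain Python. The function names, the discount factor, and the data layout are assumptions made for this example; they are not taken from the starter code.

```python
def discounted_returns(rewards, gamma=0.99):
    """Monte Carlo targets for one episode: G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk the episode backwards so each return reuses the one after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def fitted_q_targets(rewards, next_q_values, dones, gamma=0.99):
    """Fitted Q iteration targets: y = r + gamma * max_a' Q(s', a').

    next_q_values[i] holds the current Q estimates for every discrete
    action in the next state of transition i (hypothetical layout).
    Terminal transitions do not bootstrap.
    """
    targets = []
    for reward, q_next, done in zip(rewards, next_q_values, dones):
        bootstrap = 0.0 if done else gamma * max(q_next)
        targets.append(reward + bootstrap)
    return targets
```

In both cases the network is then fit to the targets by regression. The Monte Carlo targets need complete episodes, while the fitted Q targets only need individual transitions.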


You will use one MuJoCo environment (InvertedPendulumMuJoCoEnv-v0) and one Atari environment (Pong-v0) to compare the RL algorithms. Feel free to use any of the extensions and tricks we discussed in class for reliable learning. For the off-policy RL methods, use epsilon-greedy as the behavior policy.
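A minimal sketch of epsilon-greedy action selection over a list of Q-value estimates is given here for reference. The function name and the `rng` parameter are assumptions made for this example; any equivalent implementation is fine.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a discrete action index from a list of Q-value estimates.

    With probability epsilon, explore with a uniformly random action;
    otherwise exploit the action with the highest estimated value.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Setting `epsilon` to 1.0 yields a uniformly random policy and setting it to 0.0 yields a purely greedy one. A common choice is to decay `epsilon` over the course of training.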

What you need to submit:

(1) A notebook file that contains your network definitions, training processes, evaluation results, and any necessary comments on your code.

(2) A report that contains the core code of the algorithms and network designs, an analysis of your results, and a comparison between the algorithms.


In this assignment, we recommend using Colab, OpenAI Gym, OpenAI Gym[Atari], PyBullet, and PyBulletGym (an OpenAI Gym[MuJoCo] implementation based on PyBullet).

So, before getting started, please be prepared for the smooth running of the required


The aforementioned packages provide simulated environments that interact with our agents and return observations, rewards, and other important information at every step. This time, we picked one discrete environment from Atari, “Pong-v0”, and one continuous environment from MuJoCo, “InvertedPendulumMuJoCoEnv-v0”. Note that the actions in “Pong” are discrete while the actions in “InvertedPendulum” are continuous. As you know, the three algorithms cannot handle continuous actions, so you must first discretize the action space of the “InvertedPendulum” environment.
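One simple way to discretize a one-dimensional continuous action is to spread a fixed number of values evenly across the allowed range and let the agent choose an index. The sketch below assumes a single action dimension. The bounds used in the usage note are placeholders, and the helper names are hypothetical.

```python
def make_action_table(low, high, n_bins):
    """Return n_bins evenly spaced action values covering [low, high]."""
    if n_bins < 2:
        raise ValueError("n_bins must be at least 2")
    step = (high - low) / (n_bins - 1)
    return [low + i * step for i in range(n_bins)]


def to_continuous(action_index, table):
    """Map a discrete action index to a one-element continuous action."""
    return [table[action_index]]
```

For example, `make_action_table(-1.0, 1.0, 5)` gives five candidate values from -1.0 to 1.0. In practice, read the real bounds from the environment's `action_space.low` and `action_space.high` instead of hard-coding them. More bins allow finer control but enlarge the network's output layer.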

For the Atari game “Pong” environment, we encourage you to preprocess the image input to make it easier for the network to learn.
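A possible preprocessing step, sketched with NumPy, is shown here. It assumes the raw Pong observation is a 210x160 RGB `uint8` array. The crop rows, the downsampling stride, and the function name are illustrative choices, not requirements of the assignment.

```python
import numpy as np


def preprocess_frame(frame, top=34, bottom=194, stride=2):
    """Crop, grayscale, downsample, and rescale one Pong frame.

    frame: array of shape (210, 160, 3), dtype uint8 (assumed layout).
    Returns a float32 array of shape (80, 80) with values in [0, 1].
    """
    cropped = frame[top:bottom]           # drop scoreboard and bottom border
    gray = cropped.mean(axis=2)           # average the RGB channels
    small = gray[::stride, ::stride]      # keep every `stride`-th pixel
    return (small / 255.0).astype(np.float32)
```

A single frame does not show which way the ball is moving, so it is common to feed the network either the difference between two consecutive preprocessed frames or a short stack of recent frames.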


1) Network design for the two environments. (20 points in total, 10 points each)

2) Training process for the three algorithms. There should be 6 training processes in total, covering 2 environments and three algorithms. (30 points in total, 5 points each; provide enough comments to explain your code.)

3) Evaluation results of your 6 training programs. These should include plots of cumulative reward against training episodes, the average return over ten runs of your final policy, and any other plots that you find helpful to explain your design’s performance. (30 points in total, 5 points each.)

4) Analysis of the performance of the three algorithms in each environment. Analyze your plots and numbers for each algorithm and compare the three algorithms in each environment. (15 points in total)

5) Comparison between an epsilon-greedy and a random behavior policy. For this experiment, use “InvertedPendulum” as your environment and fitted Q iteration as your RL algorithm. Give plots of cumulative reward by episode and the average return on test runs of your learned policy, and analyze the performance of the different behavior policies. (5 points in total)
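The average-return evaluation mentioned in the grading items could be organized along the following lines. This sketch assumes the classic Gym interface, in which `reset()` returns an observation and `step()` returns `(observation, reward, done, info)`. `CountdownEnv` is a hypothetical stand-in environment included only so the example runs on its own.

```python
def evaluate(env, policy, n_episodes=10):
    """Run the policy for n_episodes and return (mean return, all returns)."""
    returns = []
    for _ in range(n_episodes):
        obs = env.reset()
        done = False
        total = 0.0
        while not done:
            obs, reward, done, _info = env.step(policy(obs))
            total += reward
        returns.append(total)
    return sum(returns) / len(returns), returns


class CountdownEnv:
    """Hypothetical toy environment: reward 1 per step, ends after `horizon` steps."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= self.horizon, {}


mean_return, all_returns = evaluate(CountdownEnv(horizon=5), policy=lambda obs: 0)
```

During evaluation the policy should act greedily (no exploration), so that the reported number reflects the learned policy and not the behavior policy.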

To start with:

We prepared simple starter code to help you understand what you should code and where to put your analysis. You don’t have to follow the format strictly; write your code in whatever way you are comfortable with.

Before turning in:

1. Check your notebook file and make sure that no errors occur when the instructors run “restart and run all”. Also, make sure the format of your report is correct.

2. Rename your notebook file to firstname_lastname_SBUID.ipynb and your report to firstname_lastname_SBUID_report.pdf. Zip these two files into an archive named firstname_lastname_SBUID.zip and upload it to Blackboard.

After turning in:

1. Any format errors or fail-to-run errors might result in a penalty.

2. Late submissions might result in a penalty of 10% per day, up to a maximum of 50%.

