5071 · Machine Learning · Simple Numerical Problem · short
Recover Bootstrapped Target From a Q-Learning Update 6
Problem
A tabular Q-learning step starts from old Q=0.2, uses learning rate alpha=1, reward 0.5, and discount gamma=0.9. After the update the Q-value becomes 2.9. What max_a' Q(s',a') must the learner have used?
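One way to approach this is to invert the standard tabular update Q_new = Q_old + alpha * (r + gamma * max_a' Q(s',a') - Q_old). With alpha=1 the old value cancels, leaving 2.9 = 0.5 + 0.9 * max_a' Q(s',a'), so the bootstrapped maximum is 2.4 / 0.9, about 2.667. The sketch below checks that arithmetic; the function name `recover_bootstrap_max` and its parameter names are illustrative and not part of the problem.

```python
def recover_bootstrap_max(q_old, q_new, alpha, reward, gamma):
    """Invert Q_new = Q_old + alpha * (reward + gamma * max_q_next - Q_old).

    Hypothetical helper for illustration; assumes the standard tabular
    Q-learning update and nonzero alpha and gamma.
    """
    if alpha == 0 or gamma == 0:
        raise ValueError("alpha and gamma must be nonzero to recover max_q_next")
    # TD target the update must have used: reward + gamma * max_a' Q(s', a')
    td_target = q_old + (q_new - q_old) / alpha
    # Strip the reward and undo the discount to isolate the bootstrapped max
    return (td_target - reward) / gamma


max_q_next = recover_bootstrap_max(
    q_old=0.2, q_new=2.9, alpha=1.0, reward=0.5, gamma=0.9
)
print(round(max_q_next, 4))  # 2.6667
```

Because alpha=1, the result does not depend on the old value 0.2; with a smaller learning rate the old value would matter.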
Your answer