5072机器学习简单数值题short

Recover Bootstrapped Target From a Q-Learning Update 7

题目

A tabular Q-learning step starts from old Q=1.1, uses learning rate alpha=0.5, reward 0.2, and discount gamma=0.8. After the update the Q-value becomes 1.6. What max_a' Q(s',a') must the learner have used?

解题计时

0:00

提交作答时记录，用于后续平均用时统计。

你的答案

数值

支持题库 schema 允许的整数、小数、分数或四则表达式。