INTERVIEW PREP

Math and Non-Coding Interview Questions

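Covers math, probability, statistics, brain teasers, machine learning, and finance. This page is for filtering questions and opening individual ones; programming questions live in a separate LeetCode-style coding lab.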

Questions: 4169
Domains: 8
Current filter: 17


Non-Coding Interview Questions

Showing 17 of 17 matching questions

5066. Infer Self-Transition Probability From a Bellman Value 1
Under a fixed policy, state s yields immediate reward 1 each step. With probability p it returns to s next step; otherwise the episode ends. If the discount factor is 0.9 and the state value is reported as V(s)=2.5, what p is implied?
Machine Learning | Easy | Numeric | Not attempted

5067. Infer Self-Transition Probability From a Bellman Value 2
Under a fixed policy, state s yields immediate reward 0.5 each step. With probability p it returns to s next step; otherwise the episode ends. If the discount factor is 0.95 and the state value is reported as V(s)=2, what p is implied?
Machine Learning | Easy | Numeric | Not attempted

5068. Infer Self-Transition Probability From a Bellman Value 3
Under a fixed policy, state s yields immediate reward 2 each step. With probability p it returns to s next step; otherwise the episode ends. If the discount factor is 0.8 and the state value is reported as V(s)=4, what p is implied?
Machine Learning | Easy | Numeric | Not attempted

5069. Infer Self-Transition Probability From a Bellman Value 4
Under a fixed policy, state s yields immediate reward 1.2 each step. With probability p it returns to s next step; otherwise the episode ends. If the discount factor is 0.85 and the state value is reported as V(s)=2.4, what p is implied?
Machine Learning | Easy | Numeric | Not attempted

5071. Recover Bootstrapped Target From a Q-Learning Update 6
A tabular Q-learning step starts from old Q=0.2, uses learning rate alpha=1, reward 0.5, and discount gamma=0.9. After the update the Q-value becomes 2.9. What max_{a'} Q(s',a') must the learner have used?
Machine Learning | Easy | Numeric | Not attempted

5072. Recover Bootstrapped Target From a Q-Learning Update 7
A tabular Q-learning step starts from old Q=1.1, uses learning rate alpha=0.5, reward 0.2, and discount gamma=0.8. After the update the Q-value becomes 1.6. What max_{a'} Q(s',a') must the learner have used?
Machine Learning | Easy | Numeric | Not attempted

5073. Recover Bootstrapped Target From a Q-Learning Update 8
A tabular Q-learning step starts from old Q=-0.4, uses learning rate alpha=0.25, reward 1, and discount gamma=0.95. After the update the Q-value becomes 1.2. What max_{a'} Q(s',a') must the learner have used?
Machine Learning | Easy | Numeric | Not attempted

5074. Recover Bootstrapped Target From a Q-Learning Update 9
A tabular Q-learning step starts from old Q=0.7, uses learning rate alpha=0.4, reward 0.3, and discount gamma=0.9. After the update the Q-value becomes 2. What max_{a'} Q(s',a') must the learner have used?
Machine Learning | Easy | Numeric | Not attempted

5075. Recover Bootstrapped Target From a Q-Learning Update 10
A tabular Q-learning step starts from old Q=0, uses learning rate alpha=0.5, reward 0.1, and discount gamma=0.99. After the update the Q-value becomes 3. What max_{a'} Q(s',a') must the learner have used?
Machine Learning | Easy | Numeric | Not attempted

5076. Choose the Greedy Backup Action 11
In one state, action 1 gives immediate reward 0.6 and then moves to states of value 3 with probability 0.4 and 1 otherwise. Action 2 gives immediate reward 0.9 and then moves to states of value 0.2 with probability 0.1 and 2 otherwise. If gamma=0.9, which action is greedy and what backup value does it produce?
Machine Learning | Medium | Numeric | Not attempted

5079. Choose the Greedy Backup Action 14
In one state, action 1 gives immediate reward 0.8 and then moves to states of value 6 with probability 0.2 and 1 otherwise. Action 2 gives immediate reward 0.5 and then moves to states of value 2 with probability 0.5 and 3 otherwise. If gamma=0.75, which action is greedy and what backup value does it produce?
Machine Learning | Medium | Numeric | Not attempted

5081. Recover Epsilon From a Logged Action Probability 16
An epsilon-greedy policy has 5 available actions and exactly one greedy action. A log file says the greedy action was chosen with probability 0.84. What epsilon does that imply?
Machine Learning | Easy | Numeric | Not attempted

5086. RL Training Diagnostic 21
Why can bootstrapping help value estimates even before an episode terminates?
Machine Learning | Hard | Essay | Not attempted

5087. RL Training Diagnostic 22
Why does an RL agent usually need explicit exploration even if its current greedy action already looks good?
Machine Learning | Hard | Essay | Not attempted

5088. Discount Factor Intuition
Why does increasing the discount factor often make value estimates more sensitive to long-run model misspecification?
Machine Learning | Hard | Essay | Not attempted

5089. RL Training Diagnostic 23
Why can off-policy learning become fragile when function approximation, bootstrapping, and distribution shift all interact?
Machine Learning | Hard | Essay | Not attempted

5090. RL In Trading Caution
Why should a quant be careful when mapping a toy MDP intuition directly into live trading?
Machine Learning | Hard | Essay | Not attempted
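
The numeric questions in this list fall into four families, each of which inverts or evaluates a standard RL formula. The sketch below is one possible way to set them up, not an official solution from the question bank. All function names are hypothetical, and it rests on assumptions the questions do not spell out: the terminal state has value 0, the Q-learning update is Q_new = Q_old + alpha * (r + gamma * max_{a'} Q(s',a') - Q_old), and the epsilon-greedy policy explores uniformly over all actions, the greedy one included.

```python
import math


def implied_self_transition_prob(reward, gamma, value):
    """Solve V = r + gamma * p * V for p (assumes terminal value 0)."""
    return (value - reward) / (gamma * value)


def implied_bootstrap_max(q_old, q_new, alpha, reward, gamma):
    """Solve Q_new = Q_old + alpha * (r + gamma * M - Q_old) for M."""
    td_target = q_old + (q_new - q_old) / alpha  # r + gamma * M
    return (td_target - reward) / gamma


def greedy_backup(actions, gamma):
    """Return (1-based action index, backup value) of the greedy action.

    actions: list of (reward, [(prob, next_state_value), ...]).
    """
    backups = [
        r + gamma * sum(p * v for p, v in outcomes)
        for r, outcomes in actions
    ]
    best = max(range(len(backups)), key=backups.__getitem__)
    return best + 1, backups[best]


def implied_epsilon(p_greedy, n_actions):
    """Solve p_greedy = 1 - eps + eps / n for eps.

    Assumes exploration is uniform over all n actions.
    """
    return (1.0 - p_greedy) * n_actions / (n_actions - 1)


# Inputs taken from questions 5066, 5071, 5076, and 5081.
p = implied_self_transition_prob(reward=1.0, gamma=0.9, value=2.5)
m = implied_bootstrap_max(q_old=0.2, q_new=2.9, alpha=1.0, reward=0.5, gamma=0.9)
action, backup = greedy_backup(
    [
        (0.6, [(0.4, 3.0), (0.6, 1.0)]),  # action 1
        (0.9, [(0.1, 0.2), (0.9, 2.0)]),  # action 2
    ],
    gamma=0.9,
)
eps = implied_epsilon(p_greedy=0.84, n_actions=5)
```

If exploration is instead assumed to pick only among the non-greedy actions, the greedy probability is simply 1 - eps, so the epsilon inversion changes; it is worth stating which convention is in use before giving an answer.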