INTERVIEW PREP

Math and Non-Coding Interview Questions

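Covers math, probability, statistics, brain teasers, machine learning, and finance. This page is for filtering questions and opening individual ones; programming questions use a separate LeetCode-style coding lab.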


Questions: 4169
Domains: 8
Current filter: 17


Non-Coding Interview Questions

Showing 17 of 17 matching questions


Each entry lists: ID, title, question, then domain | difficulty | type | progress | access.

5066  Infer Self-Transition Probability From a Bellman Value 1
Under a fixed policy, state s yields immediate reward 1 each step. With probability p it returns to s next step; otherwise the episode ends. If the discount factor is 0.9 and the state value is reported as V(s)=2.5, what p is implied?
Machine Learning | Easy | Numeric | Not attempted | Interview subscription

5067  Infer Self-Transition Probability From a Bellman Value 2
Under a fixed policy, state s yields immediate reward 0.5 each step. With probability p it returns to s next step; otherwise the episode ends. If the discount factor is 0.95 and the state value is reported as V(s)=2, what p is implied?
Machine Learning | Easy | Numeric | Not attempted | Interview subscription

5068  Infer Self-Transition Probability From a Bellman Value 3
Under a fixed policy, state s yields immediate reward 2 each step. With probability p it returns to s next step; otherwise the episode ends. If the discount factor is 0.8 and the state value is reported as V(s)=4, what p is implied?
Machine Learning | Easy | Numeric | Not attempted | Interview subscription

5069  Infer Self-Transition Probability From a Bellman Value 4
Under a fixed policy, state s yields immediate reward 1.2 each step. With probability p it returns to s next step; otherwise the episode ends. If the discount factor is 0.85 and the state value is reported as V(s)=2.4, what p is implied?
Machine Learning | Easy | Numeric | Not attempted | Interview subscription

5071  Recover Bootstrapped Target From a Q-Learning Update 6
A tabular Q-learning step starts from old Q=0.2, uses learning rate alpha=1, reward 0.5, and discount gamma=0.9. After the update the Q-value becomes 2.9. What max_{a'} Q(s',a') must the learner have used?
Machine Learning | Easy | Numeric | Not attempted | Interview subscription

5072  Recover Bootstrapped Target From a Q-Learning Update 7
A tabular Q-learning step starts from old Q=1.1, uses learning rate alpha=0.5, reward 0.2, and discount gamma=0.8. After the update the Q-value becomes 1.6. What max_{a'} Q(s',a') must the learner have used?
Machine Learning | Easy | Numeric | Not attempted | Interview subscription

5073  Recover Bootstrapped Target From a Q-Learning Update 8
A tabular Q-learning step starts from old Q=-0.4, uses learning rate alpha=0.25, reward 1, and discount gamma=0.95. After the update the Q-value becomes 1.2. What max_{a'} Q(s',a') must the learner have used?
Machine Learning | Easy | Numeric | Not attempted | Interview subscription

5074  Recover Bootstrapped Target From a Q-Learning Update 9
A tabular Q-learning step starts from old Q=0.7, uses learning rate alpha=0.4, reward 0.3, and discount gamma=0.9. After the update the Q-value becomes 2. What max_{a'} Q(s',a') must the learner have used?
Machine Learning | Easy | Numeric | Not attempted | Interview subscription

5075  Recover Bootstrapped Target From a Q-Learning Update 10
A tabular Q-learning step starts from old Q=0, uses learning rate alpha=0.5, reward 0.1, and discount gamma=0.99. After the update the Q-value becomes 3. What max_{a'} Q(s',a') must the learner have used?
Machine Learning | Easy | Numeric | Not attempted | Interview subscription

5076  Choose the Greedy Backup Action 11
In one state, action 1 gives immediate reward 0.6 and then moves to states of value 3 with probability 0.4 and 1 otherwise. Action 2 gives immediate reward 0.9 and then moves to states of value 0.2 with probability 0.1 and 2 otherwise. If gamma=0.9, which action is greedy and what backup value does it produce?
Machine Learning | Medium | Numeric | Not attempted | Interview subscription

5079  Choose the Greedy Backup Action 14
In one state, action 1 gives immediate reward 0.8 and then moves to states of value 6 with probability 0.2 and 1 otherwise. Action 2 gives immediate reward 0.5 and then moves to states of value 2 with probability 0.5 and 3 otherwise. If gamma=0.75, which action is greedy and what backup value does it produce?
Machine Learning | Medium | Numeric | Not attempted | Interview subscription

5081  Recover Epsilon From a Logged Action Probability 16
An epsilon-greedy policy has 5 available actions and exactly one greedy action. A log file says the greedy action was chosen with probability 0.84. What epsilon does that imply?
Machine Learning | Easy | Numeric | Not attempted | Interview subscription

5086  RL Training Diagnostic 21
Why can bootstrapping help value estimates even before an episode terminates?
Machine Learning | Hard | Essay | Not attempted | Interview subscription

5087  RL Training Diagnostic 22
Why does an RL agent usually need explicit exploration even if its current greedy action already looks good?
Machine Learning | Hard | Essay | Not attempted | Interview subscription

5088  Discount Factor Intuition
Why does increasing the discount factor often make value estimates more sensitive to long-run model misspecification?
Machine Learning | Hard | Essay | Not attempted | Interview subscription

5089  RL Training Diagnostic 23
Why can off-policy learning become fragile when function approximation, bootstrapping, and distribution shift all interact?
Machine Learning | Hard | Essay | Not attempted | Interview subscription

5090  RL In Trading Caution
Why should a quant be careful when mapping a toy MDP intuition directly into live trading?
Machine Learning | Hard | Essay | Not attempted | Interview subscription
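
The numeric questions in this list each come down to inverting or evaluating one standard RL formula. The following is a minimal Python sketch of those four calculations, not the site's reference solutions. All function names are hypothetical, and the conventions are assumptions: the terminal state has value 0 and the reward is paid on every step (so V = r + gamma * p * V); the Q-learning update is Q_new = Q_old + alpha * (r + gamma * max_{a'} Q(s',a') - Q_old); and the epsilon-greedy random branch picks uniformly among all actions, including the greedy one. Under a different epsilon-greedy convention (random branch excludes the greedy action) the last formula would change.

```python
def implied_self_transition_prob(reward, gamma, value):
    """Solve V = r + gamma * p * V for p (assumes terminal value 0)."""
    return (value - reward) / (gamma * value)


def implied_bootstrap_max(q_old, q_new, alpha, reward, gamma):
    """Invert Q_new = Q_old + alpha * (r + gamma * m - Q_old) for m."""
    # First recover the TD target r + gamma * m, then strip off r and gamma.
    td_target = q_old + (q_new - q_old) / alpha
    return (td_target - reward) / gamma


def greedy_backup(actions, gamma):
    """Return (best_action, backup_value).

    `actions` maps an action label to (reward, [(prob, next_value), ...]).
    """
    backups = {
        name: reward + gamma * sum(p * v for p, v in outcomes)
        for name, (reward, outcomes) in actions.items()
    }
    best = max(backups, key=backups.get)
    return best, backups[best]


def implied_epsilon(greedy_prob, n_actions):
    """Invert P(greedy) = 1 - eps + eps / n (assumed convention) for eps."""
    return (1 - greedy_prob) * n_actions / (n_actions - 1)
```

As a usage example, `implied_self_transition_prob(1, 0.9, 2.5)` applies the first formula to the numbers in question 5066, and `greedy_backup({1: (0.6, [(0.4, 3), (0.6, 1)]), 2: (0.9, [(0.1, 0.2), (0.9, 2)])}, 0.9)` compares the two one-step backups from question 5076.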