INTERVIEW PREP

Math and Non-Coding Interview Questions

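Covers math, probability, statistics, brain teasers, machine learning, and finance. This page is for filtering questions and opening individual ones; coding problems use a separate LeetCode-style coding lab.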


Questions: 4169
Domains: 8
Current filter: 4169

Page 78 of 209

Non-Coding Interview Questions

Showing 20 of 4169 matching questions


| ID | Question | Domain | Difficulty | Type | Progress | Access |
|---|---|---|---|---|---|---|
| 2628 | Why Residual Connections Help Train Deep Nets (#20): Why do residual connections often make very deep networks easier to optimize? | Machine learning | Medium | essay | Not attempted | Free |
| 2629 | EMA From Zero Initialization (#6): Let $m_t = \beta m_{t-1} + (1-\beta) x_t$ with $m_0 = 0$. Derive $m_t$ as an explicit weighted sum of $x_1, \dots, x_t$. | Machine learning | Medium | derivation | Not attempted | Free |
| 2630 | Why BatchNorm Can Break Under Distribution Shift (#21): Why can a network that trains well with BatchNorm behave strangely at inference when the deployment distribution shifts? | Machine learning | Hard | essay | Not attempted | Free |
| 2631 | Shared-Parameter Gradient Adds Across Paths (#7): A parameter $w$ is used in two separate branches whose losses contribute $L_1(w)$ and $L_2(w)$. What is $d(L_1 + L_2)/dw$? | Machine learning | Easy | derivation | Not attempted | Free |
| 2632 | Warmup Learning Rate Numerically (#17): A linear warmup goes from 0 to 0.001 over 10 steps. What learning rate is used at step $t = 3$ of the warmup? | Machine learning | Easy | numerical | Not attempted | Free |
| 2633 | Layer-Norm Shift Invariance (#8): Ignoring learned affine parameters, why does adding the same constant $a$ to every coordinate of a vector leave layer-normalized activations unchanged? | Machine learning | Medium | derivation | Not attempted | Free |
| 2634 | Batch-Average Gradient (#9): If the minibatch loss is the average $L = \frac{1}{B} \sum_{i=1}^{B} L_i$, derive $dL/dw$ in terms of the per-example gradients. | Machine learning | Hard | derivation | Not attempted | Free |
| 2635 | Why Warmup Helps Large-Batch Training (#22): Why is learning-rate warmup often helpful when training with very large batches? | Machine learning | Hard | essay | Not attempted | Free |
| 2636 | Decoupled Weight Decay Numerically (#18): A scalar parameter has value $w_t = 2$, gradient $g_t = 0.5$, learning rate $\eta = 0.1$, and decoupled weight decay $\lambda = 0.05$. What is $w_{t+1}$? | Machine learning | Easy | numerical | Not attempted | Free |
| 2637 | ReLU Local Derivative (#10): For $\mathrm{ReLU}(z) = \max(0, z)$, what derivative does backprop use when $z > 0$ and when $z < 0$? | Machine learning | Medium | derivation | Not attempted | Free |
| 2638 | Residual Gradient Numerically (#19): A scalar residual block has $y = x + f(x)$ with $f(x) = 3x^2$. What is $dy/dx$ at $x = 1$? | Machine learning | Medium | numerical | Not attempted | Free |
| 2639 | Steady-State Momentum Under a Constant Gradient (#11): If $v_t = \beta v_{t-1} + g$ with constant gradient $g$ and $\lvert\beta\rvert < 1$, what constant value does $v_t$ converge to? | Machine learning | Hard | derivation | Not attempted | Free |
| 2640 | Cosine Decay Schedule (#12): A learning rate decays from $\eta_{\max}$ to $\eta_{\min}$ over $T$ steps using cosine annealing. What is $\eta_t$ at step $t$? | Machine learning | Hard | derivation | Not attempted | Free |
| 2641 | Why Clipping Helps Exploding but Not Vanishing Gradients (#23): Why is gradient clipping a natural remedy for exploding gradients but not for vanishing gradients? | Machine learning | Easy | essay | Not attempted | Free |
| 2642 | BatchNorm Running Mean Update (#13): A BatchNorm layer updates its running mean by $\mu_{\text{new}} = m\,\mu_{\text{old}} + (1-m)\,\mu_{\text{batch}}$. What does this formula mean operationally? | Machine learning | Easy | derivation | Not attempted | Free |
| 2643 | Clipping Plus Weight Decay on a Vector (#25): A parameter vector is $w_t = (3, 4)$. Its gradient is $g = (6, 8)$, whose norm is 10. Apply global-norm clipping with threshold 5, then a decoupled weight-decay step with learning rate $\eta = 0.1$ and $\lambda = 0.1$. What is the new parameter vector? | Machine learning | Medium | numerical | Not attempted | Interview subscription |
| 2644 | Why LayerNorm Is Attractive in Sequence and Online Settings (#24): Why is LayerNorm often preferred over BatchNorm in sequence models or online inference settings? | Machine learning | Medium | essay | Not attempted | Interview subscription |
| 2645 | Why Global-Norm Clipping Preserves Direction (#14): Why does global-norm clipping change the magnitude of a gradient vector but not its direction whenever clipping is active? | Machine learning | Hard | derivation | Not attempted | Interview subscription |
| 2646 | Model-Fit Count in a Nested CV Search: A team runs 5 outer folds. Inside each outer-training split, it evaluates 6 hyperparameter settings by 4-fold CV, then refits the chosen model once on the full outer-training split. How many total model fits are performed? | Machine learning | Easy | numerical | Not attempted | Free |
| 2647 | Why Grouped CV Beats Row-Wise CV for Repeated Entities: Why is row-wise cross-validation inappropriate when each entity appears many times and the model can recognize entity-specific signatures? | Machine learning | Medium | essay | Not attempted | Free |