INTERVIEW PREP

Math and Non-Coding Interview Questions

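Covers math, probability, statistics, brain teasers, machine learning, and finance. This page is for filtering and opening individual questions; coding problems use a separate LeetCode-style coding lab.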

Questions: 4169
Domains: 8
Current filter: 4169


Non-coding interview questions

Showing 20 of 4169 matching questions

2628. Why Residual Connections Help Train Deep Nets [20]
    Why do residual connections often make very deep networks easier to optimize?
    Machine learning | Medium | Essay | Not attempted | Free

2629. EMA From Zero Initialization [6]
    Let $m_t = \beta m_{t-1} + (1-\beta) x_t$ with $m_0 = 0$. Derive $m_t$ as an explicit weighted sum of $x_1, \dots, x_t$.
    Machine learning | Medium | Derivation | Not attempted | Free

2630. Why BatchNorm Can Break Under Distribution Shift [21]
    Why can a network that trains well with BatchNorm behave strangely at inference when the deployment distribution shifts?
    Machine learning | Hard | Essay | Not attempted | Free

2631. Shared-Parameter Gradient Adds Across Paths [7]
    A parameter $w$ is used in two separate branches whose losses contribute $L_1(w)$ and $L_2(w)$. What is $d(L_1 + L_2)/dw$?
    Machine learning | Easy | Derivation | Not attempted | Free

2632. Warmup Learning Rate Numerically [17]
    A linear warmup goes from 0 to 0.001 over 10 steps. What learning rate is used at step $t = 3$ of the warmup?
    Machine learning | Easy | Numerical | Not attempted | Free

2633. Layer-Norm Shift Invariance [8]
    Ignoring learned affine parameters, why does adding the same constant $a$ to every coordinate of a vector leave layer-normalized activations unchanged?
    Machine learning | Medium | Derivation | Not attempted | Free

2634. Batch-Average Gradient [9]
    If the minibatch loss is the average $L = \frac{1}{B} \sum_{i=1}^{B} L_i$, derive $dL/dw$ in terms of the per-example gradients.
    Machine learning | Hard | Derivation | Not attempted | Free

2635. Why Warmup Helps Large-Batch Training [22]
    Why is learning-rate warmup often helpful when training with very large batches?
    Machine learning | Hard | Essay | Not attempted | Free

2636. Decoupled Weight Decay Numerically [18]
    A scalar parameter has value $w_t = 2$, gradient $g_t = 0.5$, learning rate $\eta = 0.1$, and decoupled weight decay $\lambda = 0.05$. What is $w_{t+1}$?
    Machine learning | Easy | Numerical | Not attempted | Free

2637. ReLU Local Derivative [10]
    For $\mathrm{ReLU}(z) = \max(0, z)$, what derivative does backprop use when $z > 0$ and when $z < 0$?
    Machine learning | Medium | Derivation | Not attempted | Free

2638. Residual Gradient Numerically [19]
    A scalar residual block has $y = x + f(x)$ with $f(x) = 3x^2$. What is $dy/dx$ at $x = 1$?
    Machine learning | Medium | Numerical | Not attempted | Free

2639. Steady-State Momentum Under a Constant Gradient [11]
    If $v_t = \beta v_{t-1} + g$ with constant gradient $g$ and $|\beta| < 1$, what constant value does $v_t$ converge to?
    Machine learning | Hard | Derivation | Not attempted | Free

2640. Cosine Decay Schedule [12]
    A learning rate decays from $\eta_{\max}$ to $\eta_{\min}$ over $T$ steps using cosine annealing. What is $\eta_t$ at step $t$?
    Machine learning | Hard | Derivation | Not attempted | Free

2641. Why Clipping Helps Exploding but Not Vanishing Gradients [23]
    Why is gradient clipping a natural remedy for exploding gradients but not for vanishing gradients?
    Machine learning | Easy | Essay | Not attempted | Free

2642. BatchNorm Running Mean Update [13]
    A BatchNorm layer updates its running mean by $\mu_{\text{new}} = m\,\mu_{\text{old}} + (1-m)\,\mu_{\text{batch}}$. What does this formula mean operationally?
    Machine learning | Easy | Derivation | Not attempted | Free

2643. Clipping Plus Weight Decay on a Vector [25]
    A parameter vector is $w_t = (3, 4)$. Its gradient is $g = (6, 8)$, whose norm is 10. Apply global-norm clipping with threshold 5, then a decoupled weight-decay step with learning rate $\eta = 0.1$ and $\lambda = 0.1$. What is the new parameter vector?
    Machine learning | Medium | Numerical | Not attempted | Interview subscription

2644. Why LayerNorm Is Attractive in Sequence and Online Settings [24]
    Why is LayerNorm often preferred over BatchNorm in sequence models or online inference settings?
    Machine learning | Medium | Essay | Not attempted | Interview subscription

2645. Why Global-Norm Clipping Preserves Direction [14]
    Why does global-norm clipping change the magnitude of a gradient vector but not its direction whenever clipping is active?
    Machine learning | Hard | Derivation | Not attempted | Interview subscription

2646. Model-Fit Count in a Nested CV Search
    A team runs 5 outer folds. Inside each outer-training split, it evaluates 6 hyperparameter settings by 4-fold CV, then refits the chosen model once on the full outer-training split. How many total model fits are performed?
    Machine learning | Easy | Numerical | Not attempted | Free

2647. Why Grouped CV Beats Row-Wise CV for Repeated Entities
    Why is row-wise cross-validation inappropriate when each entity appears many times and the model can recognize entity-specific signatures?
    Machine learning | Medium | Essay | Not attempted | Free
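The numerical questions in this list (2632, 2636, 2638, 2643, 2646) can be sanity-checked with a few lines of Python. This is a minimal sketch and not the site's reference solution. All function names are hypothetical, and the conventions are assumptions: the warmup is taken as `peak_lr * step / warmup_steps`, and the decoupled weight-decay update is taken as a plain SGD step with the decay term scaled by the learning rate, `w - lr * g - lr * wd * w`. Other conventions (for example, decay not scaled by the learning rate, or warmup indexed from step 1) would give different numbers.

```python
import math


def linear_warmup_lr(step, warmup_steps, peak_lr):
    """Assumed convention: lr grows linearly from 0 and reaches peak_lr at warmup_steps."""
    return peak_lr * min(step, warmup_steps) / warmup_steps


def clip_by_global_norm(grad, max_norm):
    """Rescale grad so its Euclidean norm is at most max_norm (direction unchanged)."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def decoupled_decay_step(w, grad, lr, weight_decay):
    """Assumed update: w <- w - lr * grad - lr * weight_decay * w (decay scaled by lr)."""
    return [wi - lr * gi - lr * weight_decay * wi for wi, gi in zip(w, grad)]


def residual_slope(x, h=1e-6):
    """Central-difference estimate of dy/dx for y = x + 3x^2 (analytically 1 + 6x)."""
    y = lambda v: v + 3 * v ** 2
    return (y(x + h) - y(x - h)) / (2 * h)


def nested_cv_fits(outer_folds, n_settings, inner_folds):
    """Per outer fold: every setting on every inner fold, plus one final refit."""
    return outer_folds * (n_settings * inner_folds + 1)


if __name__ == "__main__":
    # 2632: warmup learning rate at step 3 of 10
    print(round(linear_warmup_lr(3, 10, 0.001), 6))
    # 2636: scalar decoupled weight-decay step
    print(round(decoupled_decay_step([2.0], [0.5], 0.1, 0.05)[0], 6))
    # 2638: slope of the residual block at x = 1
    print(round(residual_slope(1.0), 6))
    # 2643: clip to norm 5, then take the decoupled decay step
    clipped = clip_by_global_norm([6.0, 8.0], 5.0)
    print([round(v, 6) for v in decoupled_decay_step([3.0, 4.0], clipped, 0.1, 0.1)])
    # 2646: total fits in the nested CV search
    print(nested_cv_fits(5, 6, 4))
```

Under these assumptions the checks give 0.0003 for 2632, 1.94 for 2636, 7 for 2638, (2.67, 3.56) for 2643, and 125 for 2646. If a question intends a different update convention, only the affected helper needs to change.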