第 8 / 21 页
非代码面试题
显示 20 / 420 道匹配题目
答题状态:未尝试未正确已正确
ID题目领域难度题型进度权限
4293Weight Decay Shrinkage 3A parameter has current value w = 2.0 and gradient g = 0.3. Using a decoupled weight-decay update w new = (1 - eta*lambda) w - eta*g with eta = 0.1 and lambda = 0.05, what is the updated weight after one step?机器学习简单数值题未尝试面试订阅4294Weight Decay Shrinkage 4A layer weight vector is w = (3, 4), so its norm is 5. You enforce max-norm regularization with cap c = 4 by rescaling only when the norm exceeds c. What vector is stored after clipping?机器学习简单数值题未尝试面试订阅4295Weight Decay Shrinkage 5An optimizer uses the proximal L1 shrinkage step sign(w)*max(|w| - tau, 0). If the pre-step weight is w = 0.7 and tau = 0.2, what weight remains after shrinkage?机器学习简单数值题未尝试面试订阅4296Dropout Noise Level 1Keep eta = 0.1, gradient g = 0.3, and current weight w = 2.0. In the decoupled update w new = (1 - eta*lambda)w - eta*g, lambda rises from 0.05 to 0.10. By how much does the updated weight decrease relative to the old lambda case?机器学习中等数值题未尝试面试订阅4297Dropout Noise Level 2A unit has activation 2.0 before standard dropout, meaning dropped units become 0 and kept units stay at 2.0. If keep probability falls from 0.8 to 0.5, what happens to the expected post-dropout activation?机器学习中等数值题未尝试面试订阅4298Dropout Noise Level 3A 5-class model uses label smoothing with epsilon distributed uniformly across all classes. If epsilon rises from 0.1 to 0.3, by how much does the true-class target change?机器学习中等数值题未尝试面试订阅4300Dropout Noise Level 5A proximal L1 step uses sign(w)*max(|w| - tau, 0). If the pre-step weight is 0.6, what output do you get when tau rises from 0.2 to 0.5?机器学习中等数值题未尝试面试订阅4316Attention Score CountA Transformer layer processes L=256 tokens with H=8 heads. Ignoring the value dimension, how many raw attention score entries are formed across all heads?机器学习简单数值题未尝试面试订阅4317Stacked CNN Receptive FieldA 1D CNN stacks 6 causal layers with kernel size 3, stride 1, and no dilation. What is the receptive field in tokens?机器学习简单数值题未尝试面试订阅4318Dilated CNN HorizonA causal CNN uses 4 layers with kernel size 3 and dilations 1, 2, 4, and 8. What dependency horizon can one output token directly aggregate?机器学习简单数值题未尝试面试订阅4319Sequential Depth ComparisonFor a length-512 sequence, how many sequential processing steps must a vanilla RNN execute, and how many sequential token-wise steps does a standard full-sequence Transformer need at inference once the whole block is available?机器学习简单数值题未尝试面试订阅4320Attention Memory FootprintA full-attention model uses L=1024 tokens and stores one attention score matrix per head in float16. Roughly how much memory does one head's score matrix use?机器学习简单数值题未尝试面试订阅4326Length-Doubling Cost ShockA local CNN with window size 7 scales like 7L interactions, while a Transformer attention block scales like L 2 score pairs. If L doubles from 256 to 512, by what factor does each interaction count grow, and which architecture hits the sharper scaling wall?机器学习中等essay未尝试面试订阅4327CNN Depth For Longer HorizonA stride-1 CNN uses kernel size 3 and no dilation. To cover a dependency horizon of 9 steps you need 4 layers. If the required horizon rises to 41 steps, how many layers are needed, and what does that imply about the architecture pressure?机器学习中等essay未尝试面试订阅4328Small-Data Regime ShiftSuppose the task stays strongly local and translation-equivariant, but your labeled dataset shrinks by a factor of 10. Which architecture becomes more attractive, and why does the shift in data regime matter?机器学习中等essay未尝试面试订阅4329Latency Budget RelaxationA task was originally fully online, making recurrence or causal convolution preferable. If the deployment changes to offline batch scoring with the whole sequence available, which architecture family gains the most from that relaxation?机器学习中等essay未尝试面试订阅4330From Local To Global Task StructureA prediction problem used to depend on short motifs, but after a product change the label now depends on matching information from the first and last quarter of each sequence. Which architecture family should move up the ranking?机器学习中等essay未尝试面试订阅4341Confusion-Matrix Metrics 1At a fixed threshold, prevalence is 20%, TPR is 80%, and FPR is 10%. What precision does that imply?机器学习简单数值题未尝试面试订阅4342Confusion-Matrix Metrics 2A fraud model keeps TPR = 0.90 and FPR = 0.03 when deployed into a market where prevalence falls from 10% to 2%. What precision should you now expect at the same threshold?机器学习简单数值题未尝试面试订阅4343Confusion-Matrix Metrics 3Predicted probabilities are [0.8, 0.6, 0.3, 0.1] and labels are [1, 0, 1, 0]. What is the Brier score?机器学习简单数值题未尝试面试订阅