Site-wide Search · 锐望实验室

All (4546) · Courses (299) · Modules (72) · Problems (4169) · Help (6) · Saved problem lists (0)

30 results found

Module 2.5.2 · Mathematical and Statistical Skills · Optimization

Iterative and Regularization Methods

optimization · gradient-descent · line-search · convergence · iterative-methods · newton-method · quasi-newton · bfgs

Problem 2634 · Machine Learning

Batch-Average Gradient 9

If the minibatch loss is the average L = (1/B) sum_{i=1}^B L_i, derive dL/dw in terms of the per-example gradients.

Problem 2642 · Machine Learning

BatchNorm Running Mean Update 13

A BatchNorm layer updates its running mean by mu_new = m mu_old + (1-m) mu_batch. What does this formula mean operationally?
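
The update in this problem is an exponential moving average, which a short sketch can make concrete. The function name `update_running_mean` and the sample values below are illustrative assumptions, not part of the problem statement; `m` is taken to be the weight on the old estimate, exactly as in the formula above.

```python
def update_running_mean(mu_old: float, mu_batch: float, m: float) -> float:
    """Exponential moving average: keep a fraction m of the old estimate."""
    return m * mu_old + (1.0 - m) * mu_batch


# Hypothetical values chosen only for illustration.
mu = 0.0
for mu_batch in [1.0, 1.0, 1.0]:
    mu = update_running_mean(mu, mu_batch, m=0.9)

print(mu)  # the running mean drifts slowly toward the batch mean
```

With m = 0.9 each batch moves the running mean one tenth of the way toward the batch mean. Some libraries (PyTorch's BatchNorm, for example) define their momentum argument as the weight on the new batch statistic, so the roles of m and 1 - m are swapped there.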

Problem 2643 · Machine Learning

Clipping Plus Weight Decay on a Vector 25

A parameter vector is w_t=(3,4). Its gradient is g=(6,8), whose norm is 10. Apply global-norm clipping with threshold 5, then a decoupled weight-decay step with learning rate eta=0.1 and lambda=0.1. What is the new parameter vector?
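
The two steps in this problem can be sketched in a few lines. The sketch assumes the AdamW-style convention for decoupled weight decay applied to plain SGD, w <- w - eta * g - eta * lambda * w; other conventions (for example, decay not multiplied by the learning rate) give different numbers. The helper names are hypothetical.

```python
import math


def clip_by_global_norm(g, threshold):
    """Rescale g so its Euclidean norm does not exceed threshold."""
    norm = math.sqrt(sum(x * x for x in g))
    scale = min(1.0, threshold / norm) if norm > 0 else 1.0
    return [scale * x for x in g]


def decoupled_step(w, g, eta, lam):
    """Assumed convention: w <- w - eta * g - eta * lam * w."""
    return [wi - eta * gi - eta * lam * wi for wi, gi in zip(w, g)]


w_t = [3.0, 4.0]
g = [6.0, 8.0]

g_clipped = clip_by_global_norm(g, threshold=5.0)
w_next = decoupled_step(w_t, g_clipped, eta=0.1, lam=0.1)

print(g_clipped, w_next)
```

Under this assumed convention the clipped gradient has norm 5 and the decay term shrinks the weights independently of the gradient.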

Problem 2640 · Machine Learning

Cosine Decay Schedule 12

A learning rate decays from eta_max to eta_min over T steps using cosine annealing. What is eta_t at step t?
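
A minimal sketch of one common form of this schedule, assuming the step index t runs from 0 to T and no restarts are used. The function name `cosine_lr` and the sample values are illustrative assumptions.

```python
import math


def cosine_lr(t: int, T: int, eta_max: float, eta_min: float) -> float:
    """Cosine annealing: eta_max at t = 0, eta_min at t = T."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t / T))


# Hypothetical schedule: decay from 1.0 to 0.1 over 100 steps.
for t in (0, 50, 100):
    print(t, cosine_lr(t, T=100, eta_max=1.0, eta_min=0.1))
```

The schedule starts flat, falls fastest around the midpoint, and flattens again near eta_min.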

Problem 2636 · Machine Learning

Decoupled Weight Decay Numerically 18

A scalar parameter has value w_t=2, gradient g_t=0.5, learning rate eta=0.1, and decoupled weight decay lambda=0.05. What is w_{t+1}?

Problem 2626 · Machine Learning

Global-Norm Clipping Numerically 16

A gradient vector is g=(6,8), whose norm is 10. If the clip threshold is 5, what clipped gradient is produced?

Problem 2523 · Machine Learning

Gradient of Logistic Negative Log-Likelihood 3

For one observation (x,y) with y in {0,1} and score z = w^T x, what is the gradient of the negative log-likelihood with respect to w?

Problem 2522 · Machine Learning

Intercept From the Positive Rate 2

In an intercept-only logistic model, if the fitted probability is p_hat, what intercept b solves sigma(b)=p_hat?
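
Because sigma is the logistic function, inverting it gives the log-odds of p_hat. The sketch below checks this numerically; the function names and the value p_hat = 0.2 are illustrative assumptions.

```python
import math


def sigmoid(b: float) -> float:
    return 1.0 / (1.0 + math.exp(-b))


def intercept_from_rate(p_hat: float) -> float:
    """Log-odds (logit) of p_hat, valid for 0 < p_hat < 1."""
    return math.log(p_hat / (1.0 - p_hat))


b = intercept_from_rate(0.2)   # hypothetical positive rate of 20%
print(b, sigmoid(b))           # sigmoid(b) recovers p_hat
```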

Problem 2633 · Machine Learning

Layer-Norm Shift Invariance 8

Ignoring learned affine parameters, why does adding the same constant a to every coordinate of a vector leave layer-normalized activations unchanged?

Problem 2624 · Machine Learning

Momentum as an Unrolled Geometric Sum 3

If momentum obeys v_t = beta v_{t-1} + g_t, derive v_t in terms of v_0 and the past gradients g_1,...,g_t.
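
Unrolling the recursion suggests the candidate form v_t = beta^t v_0 + sum_{k=1}^t beta^(t-k) g_k. The sketch below compares that expression with the recursion on made-up gradients; the function names and the inputs are illustrative assumptions.

```python
def momentum_recursive(v0, grads, beta):
    """Apply v <- beta * v + g once per gradient."""
    v = v0
    for g in grads:
        v = beta * v + g
    return v


def momentum_unrolled(v0, grads, beta):
    """Candidate closed form: beta^t v0 + sum_k beta^(t-k) g_k, k = 1..t."""
    t = len(grads)
    return beta**t * v0 + sum(
        beta ** (t - k) * g for k, g in enumerate(grads, start=1)
    )


grads = [2.0, -1.0, 0.5, 3.0]   # hypothetical gradients g_1..g_4
print(momentum_recursive(0.5, grads, beta=0.9))
print(momentum_unrolled(0.5, grads, beta=0.9))
```

The two printed values agree up to floating-point rounding, and older gradients carry geometrically smaller weights.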

Problem 2623 · Machine Learning

One Momentum Update 15

Suppose momentum uses v_t = beta v_{t-1} + g_t with beta=0.9, previous velocity v_{t-1}=0.5, and current gradient g_t=2. What is v_t?

Problem 2596 · Machine Learning

Optimal Leaf Update Under Squared Loss 1

In gradient boosting for squared error, a terminal region R is assigned one constant update gamma. Derive the gamma that minimizes sum_{i in R} (r_i-gamma)^2, where r_i are the current residuals.

Problem 2637 · Machine Learning

ReLU Local Derivative 10

For ReLU(z)=max(0,z), what derivative does backprop use when z>0 and when z<0?

Problem 2608 · Machine Learning

Residual After Two Shrunken Updates 24

A point currently has residual 6. Two boosting rounds hit its region with leaf updates 1.5 and 0.8, using learning rate eta=0.2 in both rounds. What residual remains after the two rounds?
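
The arithmetic can be sketched as follows, assuming each round adds eta times the leaf value to the prediction, so the residual (target minus prediction) drops by the same amount. The variable names are illustrative.

```python
eta = 0.2
residual = 6.0
leaf_updates = [1.5, 0.8]

# Each shrunken update moves the prediction by eta * leaf value,
# which reduces the residual by that amount (assumed convention).
for gamma in leaf_updates:
    residual -= eta * gamma

print(residual)
```

Under that assumption the two rounds remove 0.30 and 0.16 from the residual, leaving about 5.54.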

Problem 2638 · Machine Learning

Residual Gradient Numerically 19

A scalar residual block has y=x+f(x) with f(x)=3x^2. What is dy/dx at x=1?
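
The skip path contributes 1 to the derivative and the branch contributes f'(x) = 6x. A small sketch can confirm this against a central finite difference; the function names and the step size h are illustrative assumptions.

```python
def y(x: float) -> float:
    """Residual block: identity path plus f(x) = 3x^2."""
    return x + 3.0 * x**2


def dy_dx(x: float) -> float:
    """Analytic derivative: 1 from the skip path, 6x from f."""
    return 1.0 + 6.0 * x


h = 1e-6
numeric = (y(1.0 + h) - y(1.0 - h)) / (2.0 * h)   # central difference at x = 1

print(dy_dx(1.0), numeric)
```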

Problem 2639 · Machine Learning

Steady-State Momentum Under a Constant Gradient 11

If v_t = beta v_{t-1} + g with constant gradient g and |beta|<1, what constant value does v_t converge to?

Problem 2597 · Machine Learning

Weighted Region Update 2

If observations in a boosting region R carry positive weights w_i, derive the constant update gamma that minimizes sum_{i in R} w_i (r_i-gamma)^2.
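
Setting the derivative of the weighted sum to zero gives the weighted mean of the residuals, gamma = sum(w_i r_i) / sum(w_i). The sketch below checks numerically that nearby values do not achieve a lower loss; the function names and the data are illustrative assumptions.

```python
def weighted_leaf_update(residuals, weights):
    """Weighted mean of the residuals in the region."""
    return sum(w * r for w, r in zip(weights, residuals)) / sum(weights)


def weighted_loss(gamma, residuals, weights):
    return sum(w * (r - gamma) ** 2 for w, r in zip(weights, residuals))


residuals = [1.0, 2.0, 4.0]   # hypothetical residuals r_i
weights = [1.0, 1.0, 2.0]     # hypothetical positive weights w_i

gamma = weighted_leaf_update(residuals, weights)
print(gamma, weighted_loss(gamma, residuals, weights))
```

With all weights equal, the same function reduces to the plain mean of the residuals.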

Problem 2630 · Machine Learning

Why BatchNorm Can Break Under Distribution Shift 21

Why can a network that trains well with BatchNorm behave strangely at inference when the deployment distribution shifts?

Problem 2524 · Machine Learning

Why No Closed Form in Logistic Regression 5

Why does logistic regression usually require iterative optimization rather than a normal-equation-style closed form?

Problem 2628 · Machine Learning

Why Residual Connections Help Train Deep Nets 20

Why do residual connections often make very deep networks easier to optimize?

Problem 2526 · Machine Learning

Why Separable Data Pushes Coefficients Outward 7

Why do logistic-regression coefficients tend to diverge on perfectly linearly separable data if no regularization is used?

Problem 2515 · Machine Learning

Why Small Lambda Means Weak Regularization 20

Why does a very small lambda leave the regularized solution close to OLS?

Problem 2635 · Machine Learning

Why Warmup Helps Large-Batch Training 22

Why is learning-rate warmup often helpful when training with very large batches?

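Course · Iterative and Regularization Methods · Optimization

Gradient Descent and Line Search

It is 2 p.m. on a Friday, and you are on the factor research team of a private fund in Shanghai when a 12,000 × 600 design matrix lands on your desk: the cross-sectional exposures of 600 candidate alpha factors on CSI 300 constituents, at daily frequency over 18 months. The portfolio manager wants a set of coefficients before you leave for the day so that they can go into the backtest tomorrow morning. You write down the closed-form solution of ordinary least squares (OLS), beta = np.linalg.solve(X.T...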

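Course · Iterative and Regularization Methods · Optimization

Regularized Least Squares: Ridge Regression and Lasso

A multi-factor researcher at a private fund in Shenzhen has 60 trading days of cross-sectional returns on CSI 300 constituents, plus a "factor zoo" list: momentum, value, quality, low volatility, and more than 70 alternative and fundamental factors, for a total of [formula] candidate predictors and [formula] observations, a textbook ill-conditioned [formula] design matrix. She applies ordinary least squares (OLS) from the previous module directly, and the solution...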

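Course · Iterative and Regularization Methods · Optimization

Newton's Method and Quasi-Newton Methods

Fifteen minutes after Monday's open, implied volatility (IV) on CSI 300 ETF options (300ETF options on SSE) has shifted up by 3 vol points across the board. You are quoting a set of near-month at-the-money 50ETF and 300ETF calls in a private fund's market-making account, and the pricing model needs to invert each contract's market quote into an IV. The previous lesson ran the same problem with gradient descent: on some deep out-of-the-money (o...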

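Course · Iterative and Regularization Methods · Optimization

Stochastic and Mini-Batch Optimization Methods

Hook: when one full gradient takes four hours. A researcher at a Shanghai private fund with more than ten billion yuan under management is preparing to move a multi-factor neural-network α signal, built on CSI 300 constituents, into production. The training set is a daily panel covering the past 5 years: about 1.8 million sample rows × 300 constituents × 80 features. The tools from the previous two lessons are ruled out one by one: the Hessian matrix ([formula]) does not fit in GPU memory, and a single L-BFGS direction computation still requires a full pass over the data. Falling back to the plainest...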

Problem 2541 · Machine Learning

One Gradient Step on a Single Logistic Observation 22

For one observation with x = 2, y = 1, current weight w = 0, and learning rate eta = 0.4, what is one gradient-descent update on the negative log-likelihood?
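
A minimal sketch of the update, assuming the scalar model z = w * x with no intercept and the standard logistic gradient (sigma(z) - y) * x. The function names are illustrative.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def nll_gradient(w: float, x: float, y: float) -> float:
    """Gradient of the logistic negative log-likelihood for one observation."""
    return (sigmoid(w * x) - y) * x


x, y = 2.0, 1.0
w, eta = 0.0, 0.4

grad = nll_gradient(w, x, y)   # sigma(0) = 0.5, so grad = (0.5 - 1) * 2
w_new = w - eta * grad

print(grad, w_new)
```

The gradient is negative because the model under-predicts the positive label, so the step increases w.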

Problem 2492 · Machine Learning

Why Feature Scaling Helps Gradient Descent More Than Closed Form 22

Why is feature scaling often crucial for gradient-descent training of OLS even though the closed-form solution itself is scale-equivariant?
