Site-wide Search - 锐望实验室

For one observation with x = 2, y = 1, current weight w = 0, and learning rate eta = 0.4, what is one gradient-descent update on the negative log-likelihood?

Problem 2534 · Machine Learning

One Gradient Step on a Tiny Logistic Problem

A one-feature logistic model without intercept uses beta = 0 initially, learning rate 0.2, data x = [-1, 0, 1], and labels y = [0, 0, 1]. What is beta after one gradient step on the negative log-likelihood?
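
A minimal sketch of the step described above, assuming the negative log-likelihood is summed (not averaged) over the three observations; with an averaged loss the step would be one third as large. The function names `sigmoid` and `logistic_gd_step` are illustrative, not taken from the problem.

```python
import math

def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_gd_step(beta, xs, ys, lr):
    """One gradient-descent step on the summed NLL of a no-intercept logistic model."""
    # d/d(beta) of the summed NLL is sum_i (sigmoid(beta * x_i) - y_i) * x_i
    grad = sum((sigmoid(beta * x) - y) * x for x, y in zip(xs, ys))
    return beta - lr * grad

# Values from the problem statement: beta = 0, learning rate 0.2
new_beta = logistic_gd_step(0.0, [-1, 0, 1], [0, 0, 1], 0.2)
print(new_beta)
```

At beta = 0 every predicted probability is 0.5, so the summed gradient is (0.5)(-1) + (0.5)(0) + (-0.5)(1) = -1 and the step moves beta to 0.2 under this convention.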

Problem 2433 · Machine Learning

Pinball Loss Subgradient at the Kink 9

For pinball loss rho_tau(r)=tau r if r>=0 and (tau-1)r if r<0, what is the subgradient set at r=0?

Problem 2641 · Machine Learning

Why Clipping Helps Exploding but Not Vanishing Gradients 23

Why is gradient clipping a natural remedy for exploding gradients but not for vanishing gradients?

Problem 2492 · Machine Learning

Why Feature Scaling Helps Gradient Descent More Than Closed Form 22

Why is feature scaling often crucial for gradient-descent training of OLS even though the closed-form solution itself is scale-equivariant?

Problem 2485 · Machine Learning

Why Gradient Descent and Closed Form Agree 15

Why do exact gradient descent convergence and the normal-equation solution agree for OLS?

Problem 2431 · Machine Learning

Pseudo-Huber Gradient 8

For pseudo-Huber loss ell(r)=delta^2(sqrt(1+(r/delta)^2)-1), derive d ell / d r.

Problem 2643 · Machine Learning

Clipping Plus Weight Decay on a Vector 25

A parameter vector is w_t=(3,4). Its gradient is g=(6,8), whose norm is 10. Apply global-norm clipping with threshold 5, then a decoupled weight-decay step with learning rate eta=0.1 and lambda=0.1. What is the new parameter vector?
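
A hedged sketch of the two operations named above. It assumes the common convention in which the decay term is scaled by the learning rate, w_{t+1} = w_t - eta * g_clipped - eta * lambda * w_t, and a plain gradient step with no momentum or adaptive scaling; other conventions for decoupled decay give different numbers. The helper names are illustrative.

```python
import math

def clip_by_global_norm(g, threshold):
    """Rescale g so its Euclidean norm is at most threshold."""
    norm = math.sqrt(sum(gi * gi for gi in g))
    scale = min(1.0, threshold / norm) if norm > 0 else 1.0
    return [gi * scale for gi in g]

def decoupled_decay_step(w, g, lr, lam):
    """Plain gradient step plus decoupled decay (assumed form: w - lr*g - lr*lam*w)."""
    return [wi - lr * gi - lr * lam * wi for wi, gi in zip(w, g)]

# Values from the problem statement
clipped = clip_by_global_norm([6.0, 8.0], 5.0)
new_w = decoupled_decay_step([3.0, 4.0], clipped, 0.1, 0.1)
print(clipped, new_w)
```

Under these assumptions the gradient is halved to (3, 4), and the update multiplies w_t by 1 - 0.1 - 0.01 = 0.89, giving approximately (2.67, 3.56).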

Problem 2636 · Machine Learning

Decoupled Weight Decay Numerically 18

A scalar parameter has value w_t=2, gradient g_t=0.5, learning rate eta=0.1, and decoupled weight decay lambda=0.05. What is w_{t+1}?

Problem 2625 · Machine Learning

Decoupled Weight Decay Update 4

Under decoupled weight decay with learning rate eta, decay lambda, parameters w_t, and gradient g_t, derive w_{t+1}.

Problem 2626 · Machine Learning

Global-Norm Clipping Numerically 16

A gradient vector is g=(6,8), whose norm is 10. If the clip threshold is 5, what clipped gradient is produced?

Problem 2624 · Machine Learning

Momentum as an Unrolled Geometric Sum 3

If momentum obeys v_t = beta v_{t-1} + g_t, derive v_t in terms of v_0 and the past gradients g_1,...,g_t.

Problem 2623 · Machine Learning

One Momentum Update 15

Suppose momentum uses v_t = beta v_{t-1} + g_t with beta=0.9, previous velocity v_{t-1}=0.5, and current gradient g_t=2. What is v_t?
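
The recursion above can be checked directly. This is a small sketch using exactly the update rule stated in the problem; the function name `momentum_update` is illustrative.

```python
def momentum_update(v_prev, g, beta):
    """Velocity update v_t = beta * v_{t-1} + g_t, as stated in the problem."""
    return beta * v_prev + g

# Values from the problem statement: beta = 0.9, v_{t-1} = 0.5, g_t = 2
v_t = momentum_update(0.5, 2.0, 0.9)
print(v_t)
```

With these inputs the previous velocity contributes 0.9 * 0.5 = 0.45, so the new velocity is about 2.45.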

Problem 2596 · Machine Learning

Optimal Leaf Update Under Squared Loss 1

In gradient boosting for squared error, a terminal region R is assigned one constant update gamma. Derive the gamma that minimizes sum_{i in R} (r_i-gamma)^2, where r_i are the current residuals.

Module 2.5.2 · Mathematical and Statistical Skills · Optimization

Iterative Methods and Regularization Methods

optimization · gradient-descent · line-search · convergence · iterative-methods · newton-method · quasi-newton · bfgs

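Lesson · Calculus for Optimization · Linear Algebra and Calculus

The Chain Rule and Jacobian Matrices

Hook: Post-mortem of a failed gradient check. At a team review meeting at a quantitative private fund in Shanghai, an engineer laid out a PnL time-series chart: in the backtest of a factor neural network built on CSI 300 constituents, the gradient check values began, from the third layer onward, to deviate systematically from the analytic gradient by a constant factor. The daily-bar strategy on the CFFEX front-month contract had been rock-steady in July, then suddenly went off course after one release; the root cause was traced to a single transpose written the wrong way round ...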

Module 2.4.2 · Mathematical and Statistical Skills · Linear Algebra and Calculus

Calculus for Optimization

calculus · gradient · directional-derivative · optimization · chain-rule · jacobian · backpropagation · taylor-expansion

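Lesson · Calculus for Optimization · Linear Algebra and Calculus

Gradients and Directional Derivatives

Hook: A one-line warning from the risk desk. It is 2:40 p.m. on a Wednesday, mid-session. You are on the multi-factor portfolio management desk of a private fund in Shanghai and have just received a one-line warning from risk control: "Under the CSI 300 constituent universe, the portfolio variance surface for the current weights [formula] is steepest in the direction of the weight of one large-cap consumer stock: adding one percentage point to the position raises portfolio variance by roughly 0.6 bp²." That single sentence hides just one mathematical object: the gradient of the function [formula] at the current point [formula] (gradi...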

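Lesson · Iterative Methods and Regularization Methods · Optimization

Stochastic and Mini-Batch Optimization Methods

Hook: When one full gradient takes four hours. A researcher at a Shanghai private fund with more than ten billion yuan under management is preparing to move a multi-factor neural-network α signal, built on CSI 300 constituents, into production. The training set is a daily-frequency panel covering the past 5 years: about 1.8 million sample rows × 300 constituents × 80 features. The tools from the previous two lessons are ruled out one by one: the Hessian matrix ([formula]) does not fit in GPU memory, and a single L-BFGS direction computation also requires a pass over the entire batch of data. Falling back to the most basic...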

Problem 2642 · Machine Learning

BatchNorm Running Mean Update 13

A BatchNorm layer updates its running mean by mu_new = m mu_old + (1-m) mu_batch. What does this formula mean operationally?

Problem 2640 · Machine Learning

Cosine Decay Schedule 12

A learning rate decays from eta_max to eta_min over T steps using cosine annealing. What is eta_t at step t?

Problem 1960 · Mathematics

Derive the Optimizer for an Exponential Asymmetry Objective 15

For A>0 and B>0, derive the unique minimizer of J(x)=A e^x + B e^{-2x}.

Problem 3241 · Mathematics

Directional Derivative Along a Unit Diagonal

Let $f(x,y)=x^2+2y^2$. Compute the directional derivative of $f$ at $(1,1)$ in the direction of the vector $(1,1)$.

Problem 3245 · Mathematics

Directional Derivative of a Log Level

Let $f(x,y)=\ln(x+y)$. Compute the directional derivative at $(1,2)$ in the direction of $(1,0)$.
