Gradient of Logistic Negative Log-Likelihood 3
For one observation (x,y) with y in {0,1} and score z = w^T x, what is the gradient of the negative log-likelihood with respect to w?
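As a way to check a candidate answer, the following is a minimal sketch, assuming the standard single-observation loss L(w) = -[y log sigma(z) + (1 - y) log(1 - sigma(z))] with sigma(z) = 1/(1 + e^{-z}). Under that assumption the gradient works out to (sigma(z) - y) x, and the code compares that expression against a central finite difference. The function names (`sigmoid`, `nll`, `nll_grad`, `numeric_grad`) and the sample values of w, x, and y are illustrative choices, not part of the question.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def nll(w, x, y):
    # Negative log-likelihood of one observation (x, y), y in {0, 1}.
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def nll_grad(w, x, y):
    # Candidate closed form: (sigma(z) - y) * x.
    z = sum(wi * xi for wi, xi in zip(w, x))
    return [(sigmoid(z) - y) * xi for xi in x]


def numeric_grad(w, x, y, eps=1e-6):
    # Central finite difference, one coordinate at a time.
    grads = []
    for j in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[j] += eps
        w_minus[j] -= eps
        grads.append((nll(w_plus, x, y) - nll(w_minus, x, y)) / (2 * eps))
    return grads


# Illustrative values (assumed): z = w^T x = 0, so sigma(z) = 0.5.
w = [0.5, -1.0]
x = [2.0, 1.0]
y = 1

analytic = nll_grad(w, x, y)
numeric = numeric_grad(w, x, y)
print(analytic)
print(numeric)
```

With these values z = 0 and sigma(z) = 0.5, so the closed form gives (0.5 - 1) x = (-1.0, -0.5), and the finite-difference estimate should agree to within rounding error.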
If the minibatch loss is the average L = (1/B) sum_{i=1}^B L_i, derive dL/dw in terms of the per-example gradients.
If v_t = beta v_{t-1} + g with constant gradient g and |beta|<1, what constant value does v_t converge to?
A scalar residual block outputs y = x + f(x). Derive dy/dx.
A scalar residual block has y=x+f(x) with f(x)=3x^2. What is dy/dx at x=1?
A parameter w is used in two separate branches whose losses contribute L_1(w) and L_2(w). What is d(L_1+L_2)/dw?
For one observation with x = 2, y = 1, current weight w = 0, and learning rate eta = 0.4, what is one gradient-descent update on the negative log-likelihood?
A one-feature logistic model without intercept uses beta = 0 initially, learning rate 0.2, data x = [-1, 0, 1], and labels y = [0, 0, 1]. What is beta after one gradient step on the negative log-likelihood?
For pinball loss rho_tau(r)=tau r if r>=0 and (tau-1)r if r<0, what is the subgradient set at r=0?
Why is gradient clipping a natural remedy for exploding gradients but not for vanishing gradients?
Why is feature scaling often crucial for gradient-descent training of OLS even though the closed-form solution itself is scale-equivariant?
Why do exact gradient descent convergence and the normal-equation solution agree for OLS?
For pseudo-Huber loss ell(r)=delta^2(sqrt(1+(r/delta)^2)-1), derive d ell / d r.
A parameter vector is w_t=(3,4). Its gradient is g=(6,8), whose norm is 10. Apply global-norm clipping with threshold 5, then a decoupled weight-decay step with learning rate eta=0.1 and lambda=0.1. What is the new parameter vector?
A scalar parameter has value w_t=2, gradient g_t=0.5, learning rate eta=0.1, and decoupled weight decay lambda=0.05. What is w_{t+1}?
Under decoupled weight decay with learning rate eta, decay lambda, parameters w_t, and gradient g_t, derive w_{t+1}.
A gradient vector is g=(6,8), whose norm is 10. If the clip threshold is 5, what clipped gradient is produced?
If momentum obeys v_t = beta v_{t-1} + g_t, derive v_t in terms of v_0 and the past gradients g_1,...,g_t.
Suppose momentum uses v_t = beta v_{t-1} + g_t with beta=0.9, previous velocity v_{t-1}=0.5, and current gradient g_t=2. What is v_t?
In gradient boosting for squared error, a terminal region R is assigned one constant update gamma. Derive the gamma that minimizes sum_{i in R} (r_i-gamma)^2, where r_i are the current residuals.
A BatchNorm layer updates its running mean by mu_new = m mu_old + (1-m) mu_batch. What does this formula mean operationally?
A learning rate decays from eta_max to eta_min over T steps using cosine annealing. What is eta_t at step t?
For A>0 and B>0, derive the unique minimizer of J(x)=A e^x + B e^{-2x}.
Let $f(x,y)=x^2+2y^2$. Compute the directional derivative of $f$ at $(1,1)$ in the direction of the vector $(1,1)$.
Let $f(x,y)=\ln(x+y)$. Compute the directional derivative at $(1,2)$ in the direction of $(1,0)$.
Let $f(x,y)=xy$. Compute the directional derivative at $(2,3)$ in the direction of $(0,1)$.
Let $f(x,y)=2x^2+y^2$. Compute the directional derivative at $(1,-1)$ in the direction of $(3,4)$.
Let $f(x,y)=e^{x-y}$. Compute the directional derivative at the origin in the direction of $(4,3)$.
Let m_t = beta m_{t-1} + (1-beta) x_t with m_0=0. Derive m_t as an explicit weighted sum of x_1,...,x_t.
If the ridge optimum in R^p is beta_hat_lambda, what radius t makes it also solve the constrained problem min RSS(beta) subject to ||beta||_2 <= t?