I want to use the gradient boosting algorithm with the exponential loss function, and I am struggling to understand how the Newton-Raphson step is used to update the predictions. In Python's scikit-learn GradientBoostingClassifier the leaf update is the following:
# y_ holds the class labels remapped to {-1, +1}; pred is the current raw prediction
numerator = np.sum(y_ * sample_weight * np.exp(-y_ * pred))
denominator = np.sum(sample_weight * np.exp(-y_ * pred))
# prevents overflow and division by zero
if abs(denominator) < 1e-150:
    tree.value[leaf, 0, 0] = 0.0
else:
    tree.value[leaf, 0, 0] = numerator / denominator
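For reference, here is a minimal, self-contained sketch of that leaf update on toy data (my own simplification: the leaf/tree bookkeeping is dropped, the labels y_ are already in {-1, +1}, and the variable names mirror the excerpt above):

import numpy as np

# toy leaf: labels in {-1, +1}, current raw predictions, uniform weights
y_ = np.array([1.0, -1.0, 1.0, 1.0])
pred = np.array([0.2, -0.1, 0.4, 0.0])
sample_weight = np.ones_like(y_)

numerator = np.sum(y_ * sample_weight * np.exp(-y_ * pred))
denominator = np.sum(sample_weight * np.exp(-y_ * pred))

# same guard as above: avoid division by (near) zero
leaf_value = 0.0 if abs(denominator) < 1e-150 else numerator / denominator
print(leaf_value)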
The numerator is the sum of the negative partial first derivatives of the exponential loss function (ignoring sample_weight for readability):
\begin{align}
L(pred) &= \sum_i e^{-y_i \, pred_i} \\
numerator &= \sum_i \left( -\frac{\partial}{\partial pred_i} L(pred) \right) = \sum_i y_i \, e^{-y_i \, pred_i}
\end{align}
The denominator is the sum of the partial second derivatives of the exponential loss function:
\begin{align}
denominator = \sum_i \frac{\partial^2}{\partial pred_i^2} L(pred) = \sum_i y_i^2 \, e^{-y_i \, pred_i} = \sum_i e^{-y_i \, pred_i}, \quad \text{since } y_i^2 = 1
\end{align}
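To double-check these derivatives, here is a quick finite-difference sketch (my own verification code, not part of sklearn; the step size eps and the tolerances are arbitrary choices):

import numpy as np

y = np.array([1.0, -1.0, 1.0])
pred = np.array([0.3, -0.2, 0.1])
eps = 1e-4

def loss(p):
    return np.sum(np.exp(-y * p))

grad = -y * np.exp(-y * pred)   # dL/dpred_i
hess = np.exp(-y * pred)        # d^2 L / dpred_i^2  (y_i^2 = 1)

for i in range(len(pred)):
    e = np.zeros_like(pred)
    e[i] = eps
    num_grad = (loss(pred + e) - loss(pred - e)) / (2 * eps)
    num_hess = (loss(pred + e) - 2 * loss(pred) + loss(pred - e)) / eps ** 2
    assert np.isclose(grad[i], num_grad, atol=1e-6)
    assert np.isclose(hess[i], num_hess, atol=1e-5)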
But according to the Newton-Raphson algorithm, the update of pred should be:
\begin{align}
pred \leftarrow pred - \left[ Hessian(L(pred)) \right]^{-1} Gradient(L(pred))
\end{align}
where the Hessian is the diagonal matrix with the partial second derivatives on its main diagonal (it is diagonal here because each pred_i appears in exactly one term of the sum).
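Concretely, with a diagonal Hessian, multiplying by the inverse Hessian is just an elementwise division, so the per-sample Newton-Raphson step I would expect looks like this (again my own sketch of the textbook update, not sklearn code):

import numpy as np

y = np.array([1.0, -1.0, 1.0])
pred = np.array([0.3, -0.2, 0.1])

grad = -y * np.exp(-y * pred)   # gradient of L, one entry per sample
hess = np.exp(-y * pred)        # diagonal of the Hessian (y_i^2 = 1)

# inverse(Hessian) * Gradient reduces to an elementwise ratio
pred = pred - grad / hess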
Why does scikit-learn sum over the gradient and over the Hessian diagonal, and then take the ratio of the two sums as the update of the predictions?
For the Newton-Raphson algorithm I follow this reference: https://www.stat.washington.edu/adobra/classes/536/Files/week1/newtonfull.pdf