
I would like to ask first whether the second-order gradient descent method is the same as the Gauss-Newton method.

There is also something I didn't understand: I read that with Newton's method the step we take in each iteration is along a quadratic curve in $\mathbb{R}^n$ rather than along a straight line, as in steepest descent. Can anyone explain this statement more clearly?

Many thanks.

1 Answer


The Gauss-Newton method is an approximation of Newton's method for specially structured problems of the form

$$ \underset{\mathbf{x}}{\operatorname{argmin}}\;\mathbf{r}(\mathbf{x})^T\mathbf{r}(\mathbf{x}) $$

In other words, it finds a solution $\mathbf{x}$ that minimizes the squared norm of a nonlinear function $||\mathbf{r}(\mathbf{x})||_2^2$.
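For concreteness, here is a minimal sketch (not part of the original answer) of such a residual, using a made-up exponential model $a\,e^{bt}$ and made-up data points:

```python
import numpy as np

# Toy example: fit the model a * exp(b * t) to data points (t_i, y_i).
t = np.array([0.0, 1.0, 2.0, 3.0])      # sample inputs (made up for illustration)
y = np.array([1.0, 2.7, 7.4, 20.1])     # roughly exp(t), as if measured with a little noise

def r(x):
    """Residual vector r(x) with parameters x = [a, b]."""
    a, b = x
    return a * np.exp(b * t) - y

x0 = np.array([1.2, 0.9])                # some initial guess
print("||r(x0)||^2 =", r(x0) @ r(x0))    # the squared norm being minimized
```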

If you look at the update steps of gradient descent and Gauss-Newton applied to the equivalent problem $\frac{1}{2}\mathbf{r}(\mathbf{x})^T\mathbf{r}(\mathbf{x})$, the relationship becomes clear (here $\mathbf{J}_r$ denotes the Jacobian of $\mathbf{r}$ evaluated at $\mathbf{x}_n$; a small numeric sketch follows the two updates):

Gradient descent

$$ \begin{align} \mathbf{x}_{n+1} &= \mathbf{x}_n - \mu \nabla\left(\frac{1}{2}\mathbf{r}(\mathbf{x}_n)^T\mathbf{r}(\mathbf{x}_n)\right) \\ &= \mathbf{x}_n - \mu\mathbf{J}_r^T\mathbf{r}(\mathbf{x}_n) \end{align} $$

Gauss-Newton

$$ \begin{align} \mathbf{x}_{n+1} = \mathbf{x}_n - (\mathbf{J}_r^T\mathbf{J}_r)^{-1}\mathbf{J}_r^T\mathbf{r}(\mathbf{x}_n) \end{align} $$
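To make the comparison concrete, here is a minimal sketch of one update of each method, reusing the toy exponential-fit example from above; the step size $\mu$ and the current iterate are arbitrary assumptions:

```python
import numpy as np

# Same toy data as in the sketch above: fit a * exp(b * t) to (t_i, y_i).
t = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.7, 7.4, 20.1])

def r(x):                                   # residual vector, x = [a, b]
    a, b = x
    return a * np.exp(b * t) - y

def J(x):                                   # Jacobian of r (rows: residuals, columns: parameters)
    a, b = x
    return np.column_stack([np.exp(b * t), a * t * np.exp(b * t)])

x_n = np.array([1.2, 0.9])                  # assumed current iterate
mu = 1e-3                                   # assumed step size for gradient descent

# Gradient descent: x_{n+1} = x_n - mu * J^T r
x_gd = x_n - mu * J(x_n).T @ r(x_n)

# Gauss-Newton: solve (J^T J) d = J^T r instead of forming the inverse explicitly
Jn = J(x_n)
x_gn = x_n - np.linalg.solve(Jn.T @ Jn, Jn.T @ r(x_n))

print("gradient descent update:", x_gd)
print("Gauss-Newton update:    ", x_gn)
```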

The structure of the problem makes it possible to approximate the Hessian used in Newton's method as $\mathbf{H} \approx \mathbf{J}_r^T\mathbf{J}_r$. As you said, in every step the method jumps to the minimum of the second-order Taylor approximation around $\mathbf{x}_n$.
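To see where that approximation comes from (this step is implicit in the answer), write $f(\mathbf{x}) = \frac{1}{2}\mathbf{r}(\mathbf{x})^T\mathbf{r}(\mathbf{x})$ and differentiate twice:

$$ \begin{align} \nabla f(\mathbf{x}) &= \mathbf{J}_r^T\mathbf{r}(\mathbf{x}) \\ \nabla^2 f(\mathbf{x}) &= \mathbf{J}_r^T\mathbf{J}_r + \sum_i r_i(\mathbf{x})\,\nabla^2 r_i(\mathbf{x}) \end{align} $$

Gauss-Newton simply drops the second term, which is small when the residuals are small or when $\mathbf{r}$ is nearly linear near the solution.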

The qualitative behavior in the neighborhood of a solution is that the approximate second-order (curvature) information allows convergence along a more direct, less "zigzaggy" path, and the method typically converges faster than gradient descent. Imagine how the region that is approximated by a quadratic (the one you "jump across" in an iteration) becomes smaller and smaller as you approach the solution; in turn, for a sufficiently smooth function, that approximation becomes more and more accurate.

However, if the initial guess is far away from a solution, the (approximated) Hessian can become ill-conditioned. The resulting correction vector is then no longer guaranteed to point in a general direction of descent (if the angle between it and the steepest-descent direction exceeds 90°, the method actually diverges).
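One cheap safeguard, sketched below as a common practice rather than something prescribed by the answer, is to check the sign of the directional derivative before accepting a step; the Jacobian and residual values are hypothetical numbers at some iterate $\mathbf{x}_n$:

```python
import numpy as np

def is_descent_direction(J, r_val, step):
    """True if moving along `step` initially decreases (1/2) r^T r at this iterate."""
    gradient = J.T @ r_val            # gradient of (1/2) r^T r
    return gradient @ step < 0.0      # angle to steepest descent is below 90 degrees

# hypothetical values at some iterate x_n
J_n = np.array([[1.0, 0.0],
                [2.0, 1.0],
                [0.0, 3.0]])
r_n = np.array([0.5, -1.0, 2.0])
gn_step = -np.linalg.solve(J_n.T @ J_n, J_n.T @ r_n)   # Gauss-Newton correction
print(is_descent_direction(J_n, r_n, gn_step))          # True: this particular step goes downhill
```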

pgorczak