I intend to give some glimpses, like the one I gave here.
Let us consider the minimization problem
$$0=g(\textbf{a})=\min_{\textbf{x}\in A}{g(\textbf{x})},\qquad {g(\textbf{x})}=\frac{1}{2}\|{\bf f}({\bf x})\|^2,$$ for a continuously differentiable function $\textbf{f}:A\to \mathbb{R}^p$, where $A$ is an open subset of $\mathbb{R}^m$ containing $\textbf{a}$. Now, if you have a differentiable curve $\textbf{u}:(a,b)\to A$, you can apply the chain rule to obtain
$$\frac{d\, g({\bf u}(t))}{dt}= \left\langle {\bf u}'(t), \nabla g({\bf u}(t))\right\rangle= \left\langle {\bf u}'(t),[J{\bf f}({\bf u}(t))]^*{\bf f}({\bf u}(t))\right\rangle=\left\langle J{\bf f}({\bf u}(t)){\bf u}'(t),{\bf f}({\bf u}(t))\right\rangle,$$ in which $\langle \cdot,\cdot\rangle$ denotes the inner product.
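For instance, here is a quick numerical sanity check of the identity $\nabla g({\bf x})=[J{\bf f}({\bf x})]^*{\bf f}({\bf x})$, comparing it against central finite differences of $g$; the map ${\bf f}$ below is just a toy example of my own.

```python
import numpy as np

def f(x):
    # a made-up smooth test map f: R^2 -> R^3
    return np.array([x[0]**2 - x[1], np.sin(x[0]) + x[1]**3, x[0]*x[1] - 1.0])

def jac_f(x):
    # Jacobian of the test map, computed by hand
    return np.array([[2*x[0], -1.0],
                     [np.cos(x[0]), 3*x[1]**2],
                     [x[1], x[0]]])

def g(x):
    # g(x) = 0.5 * ||f(x)||^2
    return 0.5 * np.dot(f(x), f(x))

x = np.array([0.7, -0.3])
grad_analytic = jac_f(x).T @ f(x)   # [Jf(x)]^* f(x)

# central finite differences of g
eps = 1e-6
grad_fd = np.array([(g(x + eps*e) - g(x - eps*e)) / (2*eps) for e in np.eye(2)])

print(grad_analytic)
print(grad_fd)   # the two vectors should agree to many decimal places
```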
A natural choice for ${\bf u}(t)$ is given by the initial value problem (IVP) $$\left\{\begin{array}{rcl}{\bf u}'(t)&=&-\alpha \nabla g({\bf u}(t))\\ {\bf u}(0)&=&{\bf u}_0\end{array}\right.,$$
where $[J{\bf f}({\bf u}(t))]^* {\bf f}({\bf u}(t))=\nabla g({\bf u}(t))$, and $\alpha>0$.
If you use Euler's method to solve this IVP numerically, you obtain the gradient descent method. This method, with step size $h_j$, takes the form
$${\bf u}_{j+1}=\phi({\bf u}_j),$$ with
$$\phi({\bf u})={\bf u}-h_j\alpha\left[J{\bf f}({\bf u})\right]^*{\bf f}({\bf u}),$$ as a fixed point iteration to solve $${\bf f}({\bf a})={\bf 0},\qquad \phi({\bf a})={\bf a}.$$ It converges when $$\|\phi'({\bf a})\|=\max_{1\leq i\leq m}|1-h_j\alpha s_i^2|<1,$$ provided you have a good choice of ${\bf u}_0$, where the $s_i$ are the singular values of $J{\bf f}({\bf a})$.
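Here is a minimal sketch of this iteration in Python, with a toy map ${\bf f}$ and a fixed step $h_j\alpha$ of my own choosing; it also evaluates the contraction factor $\max_i|1-h_j\alpha s_i^2|$ at the computed limit.

```python
import numpy as np

def f(x):
    # toy test map f: R^2 -> R^2 with a root at (1, 1)
    return np.array([x[0]**2 - x[1], x[0] + x[1] - 2.0])

def jac_f(x):
    return np.array([[2*x[0], -1.0],
                     [1.0, 1.0]])

def g(x):
    return 0.5 * np.dot(f(x), f(x))

alpha, h = 1.0, 0.1            # assumed values; h*alpha must satisfy the contraction condition
u = np.array([2.0, 0.0])       # initial guess u_0, assumed "good enough"

for j in range(200):
    # u_{j+1} = u_j - h*alpha * [Jf(u_j)]^* f(u_j)
    u = u - h * alpha * (jac_f(u).T @ f(u))

print(u, g(u))                 # u approaches the root (1, 1) and g(u) goes to 0

# contraction factor max_i |1 - h*alpha*s_i^2| at the computed limit
s = np.linalg.svd(jac_f(u), compute_uv=False)
print(np.max(np.abs(1.0 - h * alpha * s**2)))   # about 0.83 here, i.e. < 1
```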
The following inequality holds:
$$\frac{d\, g({\bf u}(t))}{dt}= -\alpha\|\nabla g({\bf u}(t))\|^2\leq -2\alpha \sigma_{min}(t)^2g({\bf u}(t))\leq 0,$$ using the estimate
$$\|\nabla g({\bf u}(t))\|^2=\|[J{\bf f}({\bf u}(t))]^*{\bf f}({\bf u}(t))\|^2\geq \sigma_{min}(t)^2\|{\bf f}({\bf u}(t))\|^2,$$ where $\sigma_{min}(t)$ is the smallest singular value of $J{\bf f}({\bf u}(t))$. Integrating this differential inequality (Grönwall's lemma) gives $$g({\bf u}(t))\leq g({\bf u}(0))e^{-2\alpha \lambda(t)},\qquad \lambda(t)=\int_0^t\sigma_{min}(s)^2\,ds.$$
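To see this estimate in action, the following sketch integrates the gradient flow with small Euler steps (same kind of toy ${\bf f}$ as above, my choice) and compares $g({\bf u}(t))$ with the bound $g({\bf u}(0))e^{-2\alpha\lambda(t)}$.

```python
import numpy as np

def f(x):
    return np.array([x[0]**2 - x[1], x[0] + x[1] - 2.0])

def jac_f(x):
    return np.array([[2*x[0], -1.0],
                     [1.0, 1.0]])

def g(x):
    return 0.5 * np.dot(f(x), f(x))

alpha, dt, n = 1.0, 1e-3, 3000        # integrate up to t = n*dt = 3
u = np.array([2.0, 0.0])
g0, lam = g(u), 0.0

for k in range(n):
    J = jac_f(u)
    sigma_min = np.linalg.svd(J, compute_uv=False).min()
    lam += sigma_min**2 * dt          # Riemann sum for lambda(t)
    u = u - dt * alpha * (J.T @ f(u)) # Euler step of u' = -alpha*grad g(u)

print(g(u), g0 * np.exp(-2 * alpha * lam))   # g(u(t)) stays below the bound, up to discretization error
```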
Another choice is the curve satisfying the initial value problem (IVP) $$\left\{\begin{array}{rrl}J{\bf f}({\bf u}(t)){\bf u}'(t)&=&-\alpha {\bf f}({\bf u}(t))\\ {\bf u}(0)&=&{\bf u}_0\end{array}\right.,$$ for some $\alpha>0$. You find that $$\frac{d\, g({\bf u}(t))}{dt}= -2\alpha g({\bf u}(t))$$ or $$g({\bf u}(t))=g({\bf u}(0))e^{-2\alpha t}.$$
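Again as a sanity check (toy ${\bf f}$ of my own), integrating this IVP with small Euler steps reproduces the decay $g({\bf u}(t))=g({\bf u}(0))e^{-2\alpha t}$ up to discretization error.

```python
import numpy as np

def f(x):
    return np.array([x[0]**2 - x[1], x[0] + x[1] - 2.0])

def jac_f(x):
    return np.array([[2*x[0], -1.0],
                     [1.0, 1.0]])

def g(x):
    return 0.5 * np.dot(f(x), f(x))

alpha, dt, n = 1.0, 1e-4, 20000       # integrate up to t = n*dt = 2
u = np.array([2.0, 0.0])
g0 = g(u)

for k in range(n):
    # u' = -alpha * [Jf(u)]^{-1} f(u), one small Euler step
    u = u + dt * np.linalg.solve(jac_f(u), -alpha * f(u))

t = n * dt
print(g(u), g0 * np.exp(-2 * alpha * t))   # the two values nearly coincide
```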
In both cases it follows that $g({\bf u}(t))$ is a nonincreasing function. This also means that, if $g({\bf u}(t))> 0$, then $g({\bf u}(t+h))<g({\bf u}(t))$ for $0<h<h_t$, for some sufficiently small $h_t>0$. See the Picard–Lindelöf theorem and Lyapunov stability theory.
If $m=p$ and $J{\bf f}({\bf x})$ has a bounded inverse for all $\textbf{x}\in A$, the previous IVP becomes
$$\left\{\begin{array}{lll}{\bf u}'(t)&=&-\alpha \left[J{\bf f}({\bf u}(t))\right]^{-1}{\bf f}({\bf u}(t))\\ {\bf u}(0)&=&{\bf u}_0\end{array}\right..$$
We can use Euler's method $$\left\{\begin{array}{rll}J{\bf f}({\bf u}_j) {\bf w}_j&=&-\alpha_j {\bf f}({\bf u}_j)\\ {\bf u}_{j+1}&=&{\bf u}_j+{\bf w}_j\end{array}\right.,$$ to solve the previous IVP numerically, where ${\bf u}_0={\bf u}(0)$, $t_{j+1}=t_j+h_j$, $h_j>0$, ${\bf u}(t_{j+1}) \approx {\bf u}_{j+1}$ and $\alpha_j=\alpha h_j$.
This method, with step size $h_j$, takes the form
$${\bf u}_{j+1}=\psi({\bf u}_j),$$ with
$$\psi({\bf u})={\bf u}-h_j\alpha\left[J{\bf f}({\bf u})\right]^{-1}{\bf f}({\bf u}),$$ as a fixed point iteration to solve $${\bf f}({\bf a})={\bf 0},\qquad \psi({\bf a})={\bf a}.$$ It converges when $$\|\psi'({\bf a})\|=|1-h_j\alpha|<1,$$ provided you have a good choice of ${\bf u}_0$.
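Here is a minimal sketch of this iteration with a fixed $\alpha_j$ (the toy map ${\bf f}$ and the value of $\alpha_j$ are my own assumptions); each step solves the linear system $J{\bf f}({\bf u}_j){\bf w}_j=-\alpha_j{\bf f}({\bf u}_j)$.

```python
import numpy as np

def f(x):
    # toy test map with a root at (1, 1)
    return np.array([x[0]**2 - x[1], x[0] + x[1] - 2.0])

def jac_f(x):
    return np.array([[2*x[0], -1.0],
                     [1.0, 1.0]])

def g(x):
    return 0.5 * np.dot(f(x), f(x))

u = np.array([2.0, 0.0])    # u_0
alpha_j = 0.5               # alpha_j = alpha*h_j, kept fixed here; |1 - alpha_j| = 0.5 < 1

for j in range(50):
    w = np.linalg.solve(jac_f(u), -alpha_j * f(u))   # Jf(u_j) w_j = -alpha_j f(u_j)
    u = u + w                                        # u_{j+1} = u_j + w_j
    # g(u) decreases at every step here, since alpha_j is well chosen

print(u, g(u))              # u approaches the root (1, 1)
```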
We call $\alpha_j$ the tuning parameter, as in the gradient descent method, and you should choose it carefully so that $g({\bf u}_{j+1})<g({\bf u}_j)$. Otherwise you can get a "bad" approximation ${\bf u}(t_{j+1}) \approx {\bf u}_{j+1}$ for which $g({\bf u}_{j+1})>g({\bf u}_j)$.
But when things work well, $\alpha_j$ can be taken equal to $1$ for $j$ large enough, which recovers the classical Newton iteration.
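One simple way to pick $\alpha_j$ (my own choice of safeguard, not the only one) is to start each step with $\alpha_j=1$ and halve it until $g({\bf u}_{j+1})<g({\bf u}_j)$; near the root the full step is accepted and you are back to the classical Newton iteration.

```python
import numpy as np

def f(x):
    return np.array([x[0]**2 - x[1], x[0] + x[1] - 2.0])

def jac_f(x):
    return np.array([[2*x[0], -1.0],
                     [1.0, 1.0]])

def g(x):
    return 0.5 * np.dot(f(x), f(x))

u = np.array([10.0, -5.0])                 # a deliberately poor u_0
for j in range(30):
    d = np.linalg.solve(jac_f(u), -f(u))   # full Newton direction
    alpha_j = 1.0
    while g(u + alpha_j * d) >= g(u) and alpha_j > 1e-8:
        alpha_j *= 0.5                     # backtrack until g(u_{j+1}) < g(u_j)
    u = u + alpha_j * d
    if g(u) < 1e-16:
        break

print(u, g(u), alpha_j)                    # alpha_j settles at 1 once u is close to the root (1, 1)
```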
Remark: You can rewrite this text using $$g(\textbf{x})=\frac{1}{2}\|\nabla f(\textbf{x})\|^2$$ instead, when you are working with the implication
$$ \min_{\textbf{x}\in A}{f(\textbf{x})}=f(\textbf{a})\Longrightarrow \nabla f(\textbf{a})=\textbf{0}.$$
The Hessian matrix of $f$ plays the role of $J{\bf f}$ in this case.
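As a sketch of this remark (the scalar test function $f$ below is my own choice), the same damped iteration applied to $\nabla f({\bf x})={\bf 0}$ uses the Hessian in place of $J{\bf f}$.

```python
import numpy as np

def grad_f(x):
    # gradient of f(x) = cosh(x0) + cosh(x1) + 0.5*(x0 - x1)^2
    return np.array([np.sinh(x[0]) + (x[0] - x[1]),
                     np.sinh(x[1]) - (x[0] - x[1])])

def hess_f(x):
    # Hessian of the same f (symmetric positive definite everywhere)
    return np.array([[np.cosh(x[0]) + 1.0, -1.0],
                     [-1.0, np.cosh(x[1]) + 1.0]])

def g(x):
    # g(x) = 0.5 * ||grad f(x)||^2
    return 0.5 * np.dot(grad_f(x), grad_f(x))

u = np.array([2.0, -1.0])
for j in range(50):
    d = np.linalg.solve(hess_f(u), -grad_f(u))   # the Hessian plays the role of Jf
    alpha_j = 1.0
    while g(u + alpha_j * d) >= g(u) and alpha_j > 1e-8:
        alpha_j *= 0.5                           # keep g decreasing
    u = u + alpha_j * d

print(u, g(u))   # u approaches the minimizer (0, 0), where grad f vanishes
```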