18

I'm studying Newton's method, and I understand the one-dimensional case perfectly, but the multidimensional version raises some questions...

On Wikipedia, Newton's method in higher dimensions is defined as:

$$\textbf{x}_{n+1} = \textbf{x}_n - [Hf(\textbf{x}_n)]^{-1}\nabla f(\textbf{x}_n), \;\;\; n \geq 0.$$

where $\textbf{x}_n$ is the $p$-dimensional vector at the $n$th iteration, $[Hf(\textbf{x}_n)]^{-1}$ is the inverse of the Hessian matrix of the function $f(\textbf{x})$ at $\textbf{x}_n$, and $\nabla f(\textbf{x}_n)$ is the gradient of $f(\textbf{x})$ at $\textbf{x}_n$. That is:

$$\left( \begin{array}{c} x_1^{(n+1)} \\ x_2^{(n+1)} \\ \vdots \\ x_p^{(n+1)} \end{array} \right) = \left( \begin{array}{c} x_1^{(n)} \\ x_2^{(n)} \\ \vdots \\ x_p^{(n)} \end{array} \right) - \left( \begin{array}{cccc} \frac{\partial^2 f}{\partial x_1^2}(\textbf{x}_n) & \frac{\partial^2 f}{\partial x_2 \partial x_1}(\textbf{x}_n) & \dots & \frac{\partial^2 f}{\partial x_p \partial x_1}(\textbf{x}_n)\\ \frac{\partial^2 f}{\partial x_1 \partial x_2}(\textbf{x}_n) & \frac{\partial^2 f}{\partial x_2^2}(\textbf{x}_n) & \dots & \frac{\partial^2 f}{\partial x_p \partial x_2}(\textbf{x}_n)\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial^2 f}{\partial x_1 \partial x_p}(\textbf{x}_n) & \frac{\partial^2 f}{\partial x_2 \partial x_p}(\textbf{x}_n) & \dots & \frac{\partial^2 f}{\partial x_p^2}(\textbf{x}_n) \end{array} \right)^{-1}\left( \begin{array}{c} \frac{\partial f}{\partial x_1}(\textbf{x}_n) \\ \frac{\partial f}{\partial x_2}(\textbf{x}_n) \\ \vdots \\ \frac{\partial f}{\partial x_p}(\textbf{x}_n) \end{array} \right)$$

Now my question is: "What is the intuition behind this formula?" It somehow resembles the gradient descent algorithm, but the inverse of the Hessian seems to come out of a magician's hat :S Could somebody give me the same kind of proof as is given here for the one-dimensional case:

Why does Newton's method work?

Why the Hessian? Why its inverse?! :) Intuition of the formula?

Thank you for any help :) P.S. Here is the page where I got the formula above:

http://en.wikipedia.org/wiki/Newton%27s_method_in_optimization#Higher_dimensions

Note also that in my notation the superscript on the $x_i$s doesn't denote an exponent; it's just an iteration label...

jjepsuomi
  • 8,979

5 Answers

16

I'll assume we're trying to minimize a twice continuously differentiable function $f$ defined on $\mathbb R^p$.

We wish to find $x$ such that $\nabla f(x) = 0$.

Given $x_n$, we would ideally like to find $\Delta x$ such that $\nabla f(x_n + \Delta x) = 0$. Rather than satisfying this requirement exactly (which would probably be too difficult), we instead use the approximation \begin{equation*} \nabla f(x_n + \Delta x) \approx \nabla f(x_n) + Hf(x_n) \Delta x. \end{equation*} Setting the right hand side equal to $0$ gives us \begin{equation*} \Delta x = -Hf(x_n)^{-1} \nabla f(x_n). \end{equation*} We can hope that $x_{n+1} = x_n + \Delta x$ will be an improvement on $x_n$.
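This step can be sketched numerically. Below is a minimal illustration (my own example, not from the answer): for a quadratic $f(x) = \frac12 x^T A x - b^T x$ the linearization of the gradient is exact, so a single Newton step lands on the minimizer $A^{-1}b$. Note that in practice one solves the linear system $Hf(x_n)\,\Delta x = -\nabla f(x_n)$ rather than forming the inverse.

```python
import numpy as np

# Quadratic test function f(x) = 0.5 x^T A x - b^T x, with A symmetric
# positive definite; its gradient is A x - b and its Hessian is A.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

def grad(x):
    return A @ x - b

def hess(x):
    return A

x = np.array([10.0, -7.0])        # arbitrary starting point
# Newton step: solve Hf(x) dx = -grad f(x) instead of inverting the Hessian
dx = np.linalg.solve(hess(x), -grad(x))
x_new = x + dx

# For a quadratic f, one Newton step lands exactly on the minimizer A^{-1} b
print(x_new)                       # the minimizer, about [0.0909, 0.6364]
print(np.linalg.norm(grad(x_new)))  # essentially 0
```

For a non-quadratic $f$ the approximation is only local, which is why the answer says we can only "hope" that $x_{n+1}$ improves on $x_n$.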

littleO
  • 54,048
  • 3
    Note that Newton's method in optimization simply solves $\nabla f(x) = 0$, using Newton's method for solving nonlinear equations. – littleO Aug 02 '13 at 10:49
  • What if the Hessian isn't invertible, can you use an approximation like SVD? – IntegrateThis Jan 31 '22 at 08:22
  • "Newton's method in optimization simply solves $\nabla f(x) = 0$, using Newton's method for solving nonlinear equations". I completely disagree, as your answer ignores linesearch/trust region globalization mechanism, the desire to make $f$ decrease from one iteration to the next, and the fact that $f$ may be nonconvex. – Dominique Sep 28 '22 at 21:43
9

Here is another interpretation of Newton's method that I rather like.

Newton's method takes the known information of the function at a given point (value, gradient and Hessian), makes a quadratic approximation of that function, and minimizes that approximation.

More specifically, suppose $x_n$ is given, $g = \nabla f(x_n)$, and $H = \nabla^2 f(x_n)$. The quadratic approximation of $f$ at $x_n$ is a quadratic function $h(x)$ such that $h(x_n) = f(x_n)$, $\nabla h(x_n) = g$ and $\nabla^2 h(x_n) = H$. It turns out that $$ h(x) = \frac 12(x - x_n)^T H (x - x_n) + g^T (x - x_n) + f(x_n). $$ This function has a unique global minimum if and only if $H$ is positive definite. This is the requirement for Newton's method to work. Assuming $H$ is positive definite, the minimum of $h$ is achieved at $x^*$ such that $\nabla h(x^*) = 0$. Since $$ \nabla h(x) = H(x - x_n) + g, $$ we get $x^* = x_n - H^{-1}g$.
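A quick numerical check of this interpretation (with a test function of my own choosing, $f(x_1, x_2) = x_1^4 + x_2^2$): the minimizer of the quadratic model $h$ is exactly the Newton iterate, and the gradient of $h$ vanishes there.

```python
import numpy as np

# Non-quadratic test function f(x1, x2) = x1^4 + x2^2,
# with gradient and Hessian written out by hand.
def grad(x):
    return np.array([4 * x[0]**3, 2 * x[1]])

def hess(x):
    return np.array([[12 * x[0]**2, 0.0],
                     [0.0, 2.0]])

x_n = np.array([1.0, 1.0])
g, H = grad(x_n), hess(x_n)

# Quadratic model h(x) = 0.5 (x - x_n)^T H (x - x_n) + g^T (x - x_n) + f(x_n);
# its gradient is H (x - x_n) + g, so its minimizer is the Newton iterate.
x_star = x_n - np.linalg.solve(H, g)

grad_h = H @ (x_star - x_n) + g   # gradient of the model at x_star
print(x_star)                      # [2/3, 0] for this example
print(grad_h)                      # [0, 0]: x_star minimizes the model
```

Here $H$ is positive definite, so the model really does have a unique global minimum, as the answer requires.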

Tunococ
  • 10,433
3

The Hessian is simply the higher-dimensional generalization of the second derivative, and multiplying by its inverse is the non-commutative generalization of dividing by $f''$ in the one-dimensional case. I don't know if it's reasonable to try a geometric argument along the lines of your link for the double generalization to optimization and to $n$ dimensions: better to just look up an actual proof.

It might be less confusing if you look at the higher-dimensional Newton's method for roots, before that for optimization. That's the "nonlinear systems of equations" section on Wikipedia. Hubbard and Hubbard's book on linear algebra, multivariable calculus, and differential forms has the best treatment of multivariate Newton's method I know.
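To make the root-finding version concrete, here is a small sketch (my own example: finding the intersection of the unit circle with the line $x = y$). The iteration is $x_{n+1} = x_n - J_F(x_n)^{-1} F(x_n)$, the exact analogue of the optimization formula with $\nabla f$ replaced by a general map $F$ and the Hessian by the Jacobian.

```python
import numpy as np

# Newton's method for a root of F(x) = 0, F: R^2 -> R^2 (example:
# intersect the unit circle x1^2 + x2^2 = 1 with the line x1 = x2).
def F(x):
    return np.array([x[0]**2 + x[1]**2 - 1.0, x[0] - x[1]])

def J(x):
    return np.array([[2 * x[0], 2 * x[1]],
                     [1.0, -1.0]])

x = np.array([1.0, 0.5])                 # starting guess
for _ in range(20):
    dx = np.linalg.solve(J(x), -F(x))    # same structure as the optimization case
    x = x + dx
    if np.linalg.norm(F(x)) < 1e-12:
        break

print(x)   # converges to (1/sqrt(2), 1/sqrt(2))
```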

Kevin Carlson
  • 54,515
2

You can see Newton's method as a descent direction search method (please see details in Why is Newton's method faster than gradient descent?, which I answered before).

To see this, you can consider the minimization problem $$0=g(\textbf{a})=\min_{\textbf{x}\in A}{g(\textbf{x})},\qquad {g(\textbf{x})}=\frac{1}{2}\|{\bf f}({\bf x})\|^2,$$ for some continuously differentiable function $\textbf{f}:A\to \mathbb{R}^p$, where $A$ is an open set of $\mathbb{R}^m$ containing $\textbf{a}$.

If you have some differentiable curve $\textbf{u}:(a,b)\to A$, you can apply the chain rule to obtain $$\frac{d\, g({\bf u}(t))}{dt}= \left\langle {\bf u}'(t), \nabla g({\bf u}(t))\right\rangle=\left\langle J{\bf f}({\bf u}(t)){\bf u}'(t),{\bf f}({\bf u}(t))\right\rangle,$$ in which $\langle \cdot,\cdot\rangle$ denotes the inner product.

If you choose the curve satisfying the initial value problem (IVP) $$\left\{\begin{array}{rrl}J{\bf f}({\bf u}(t)){\bf u}'(t)&=&-\alpha {\bf f}({\bf u}(t))\\ {\bf u}(0)&=&{\bf u}_0\end{array}\right.,$$ for some $\alpha>0$, you find that $$\frac{d\, g({\bf u}(t))}{dt}= -2\alpha g({\bf u}(t))$$ or $$g({\bf u}(t))=g({\bf u}(0))e^{-2\alpha t}.$$

Newton's method is just the Euler method applied to this IVP, with step size $\Delta t = 1/\alpha$ (so that the factor $\Delta t \, \alpha$ in the Euler update equals $1$).

Remark: If ${\bf f}(x)=\nabla h(x)$, you recover the expression in your question as a particular case. You can find more if you search for "(x_{n+1} = x_n - [Hf(x_n)]^{-1}\nabla f(x_n))" on SearchOnMath.
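This correspondence can be checked numerically; below is a sketch with a system of my own choosing, ${\bf f}(u) = (u_1^2 - 2,\; u_1 u_2 - 1)$, whose Jacobian is nonsingular near the root. An Euler step on the IVP with $\Delta t\,\alpha = 1$ is exactly a Newton step, and the merit function $g$ collapses toward $0$.

```python
import numpy as np

# Euler discretization of the IVP  J f(u) u' = -alpha f(u).
# Test system: f(u) = [u1^2 - 2, u1*u2 - 1], root at (sqrt(2), 1/sqrt(2)).
def f(u):
    return np.array([u[0]**2 - 2.0, u[0] * u[1] - 1.0])

def Jf(u):
    return np.array([[2 * u[0], 0.0],
                     [u[1], u[0]]])

def g(u):
    return 0.5 * np.dot(f(u), f(u))   # the merit function from the answer

alpha, dt = 1.0, 1.0   # dt * alpha = 1: each Euler step is a Newton step
u = np.array([2.0, 2.0])
for _ in range(10):
    # Euler step u <- u + dt * u', where Jf(u) u' = -alpha f(u)
    u = u + dt * np.linalg.solve(Jf(u), -alpha * f(u))

print(u)      # converges to [sqrt(2), 1/sqrt(2)]
print(g(u))   # essentially 0: the merit function has decayed
```

Taking $\Delta t\,\alpha < 1$ instead gives a damped Newton iteration, which is one way to see where linesearch-style globalization comes from.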

0

I wonder if part of your confusion comes from the fact that one-dimensional Newton is usually presented as a method for finding the zero of a function $f(x)$, whereas your higher-dimensional formula is about finding the minimum of a function $f(x)$. So you first need to observe that your formula is about finding the zero of $\nabla f$.

In other words: you only need a first derivative in both cases, but since you are getting your formula in an optimization context, the function you are finding the zero of is already a derivative, hence the need for a second derivative.

Mjoseph
  • 1,039