15

I have been searching for an intuitive explanation of the conjugate gradient method (as it relates to gradient descent) for at least two years without luck.

I even find articles like "An Introduction to the Conjugate Gradient Method Without the Agonizing Pain" hard to understand.

Intuitively, what does this method do (e.g. geometrically) and why does it outperform gradient descent?

Josh
  • 467
  • 2
    I took the course, though I don't remember much of it; here Michael explains the method and why it outperforms gradient descent: https://www.youtube.com/watch?v=hZVK_PGE0_I. The key idea is the expanding manifold property; see starting around minute 40 of the first lecture, continued in Lecture 11. – dEmigOd Jul 30 '17 at 20:39
  • 2
    That might be because you're looking at an abbreviated version of Shewchuk's "...Agonizing Pain" article. The original is full of diagrams and gives a lot of geometrical intuition. –  Jul 30 '17 at 21:50

3 Answers

13

Just giving an intuitive overview. Michael Zubilevsky's lectures go through it in more technical detail.

Consider the quadratic objective $f(x) = x^T Q x + b^T x$, with $Q$ symmetric positive definite. Geometrically, CG tries to "unwarp" some of the geometry induced by the quadratic form $Q$.

To build intuition, consider $Q \in \mathbb{R}^{2\times 2}$. If $Q = I$, then the contours of $f$ are circles. If $Q$ is diagonal, then the contours of $f$ are ellipses aligned with the $x$- and $y$-axes. If $Q$ also has off-diagonal components, then there is "correlation" between the $x$ and $y$ directions, so the contours are ellipses whose principal axes are combinations of the $x$- and $y$-axes.
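
As a small numerical sketch (assuming NumPy; the specific matrices are only illustrative), the principal axes of the contour ellipses of $x^T Q x$ are the eigenvectors of $Q$, so an off-diagonal $Q$ gives tilted axes:

```python
import numpy as np

# The principal axes of the contours of x^T Q x are the eigenvectors of Q.
Q_identity = np.eye(2)                       # circular contours
Q_diagonal = np.diag([1.0, 9.0])             # axis-aligned ellipses
Q_coupled  = np.array([[3.0, 2.0],
                       [2.0, 3.0]])          # off-diagonal "correlation"

for name, Q in [("identity", Q_identity),
                ("diagonal", Q_diagonal),
                ("coupled",  Q_coupled)]:
    eigvals, eigvecs = np.linalg.eigh(Q)
    print(name, "principal axes (columns):")
    print(np.round(eigvecs, 3))
# For the coupled Q the axes come out as (1, 1)/sqrt(2) and (1, -1)/sqrt(2),
# i.e. ellipses tilted away from the coordinate axes.
```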

Now consider performing coordinate descent in the $Q=I$ or $Q$-diagonal case. We can reach the optimum in at most $2$ steps by treating the problem as two $1$-dimensional minimization problems, one over each coordinate. We have this property because the $x$- and $y$-axes are orthogonal and they "align" with the geometry of $Q$.
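
A minimal sketch of that two-step behaviour (assuming NumPy and the common convention $f(x) = \tfrac{1}{2}x^T Q x - b^T x$, so $\nabla f(x) = Qx - b$; the numbers are arbitrary):

```python
import numpy as np

# Exact coordinate descent on a diagonal Q: one 1-D minimization per
# coordinate reaches the exact minimizer in at most 2 steps in 2-D.
Q = np.diag([1.0, 9.0])
b = np.array([2.0, -3.0])
x = np.array([5.0, 5.0])                 # arbitrary starting point

for i in range(2):
    e = np.zeros(2)
    e[i] = 1.0                           # i-th coordinate direction
    grad = Q @ x - b
    alpha = -(e @ grad) / (e @ Q @ e)    # exact line search along e
    x = x + alpha * e

print(x)                                 # (2, -1/3), up to floating point
print(np.linalg.solve(Q, b))             # the exact minimizer, same values
```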

In the case where $Q$ contains off-diagonal components, CG finds directions that are orthogonal in $Q$'s geometry (i.e., $Q$-conjugate) so that we can perform the same kind of coordinate descent along them. We could use Gram-Schmidt to construct such directions, and if we always take the current gradient as the vector to orthogonalize against the previous directions, the update rule simplifies a great deal.
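
A minimal sketch of that construction (assuming NumPy and $f(x) = \tfrac{1}{2}x^T Q x - b^T x$; this is the "naive" conjugate-directions view, not the simplified CG recurrence):

```python
import numpy as np

# Build search directions by Gram-Schmidt in the Q-inner-product, taking the
# current residual (negative gradient) as each new candidate direction.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
Q = A @ A.T + 4 * np.eye(4)              # a random symmetric positive definite Q
b = rng.standard_normal(4)

x = np.zeros(4)
directions = []
for _ in range(4):
    r = b - Q @ x                        # residual = -grad f(x)
    d = r.copy()
    for p in directions:                 # Q-orthogonalize against earlier directions
        d = d - (p @ Q @ r) / (p @ Q @ p) * p
    alpha = (d @ r) / (d @ Q @ d)        # exact line search along d
    x = x + alpha * d
    directions.append(d)

D = np.column_stack(directions)
print(np.round(D.T @ Q @ D, 6))          # (near-)diagonal: d_i^T Q d_j = 0 for i != j
print(np.allclose(x, np.linalg.solve(Q, b)))   # True: exact minimizer in n = 4 steps
```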

It outperforms gradient descent because the conjugacy of the search directions ensures that we never have to repeat a step along a direction we've already searched in. Thus we get rid of the "zig-zagging" of GD.
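
To make the comparison concrete, here is a rough sketch (assuming NumPy, $f(x) = \tfrac{1}{2}x^T Q x - b^T x$, and an ill-conditioned diagonal $Q$ chosen only to exaggerate the zig-zag):

```python
import numpy as np

Q = np.diag([1.0, 100.0])                # condition number 100
b = np.array([1.0, 1.0])
x_star = np.linalg.solve(Q, b)

# Gradient descent with exact line search: zig-zags toward the minimum.
x = np.array([10.0, 1.0])
steps = 0
while np.linalg.norm(Q @ x - b) > 1e-8:
    r = b - Q @ x                        # steepest-descent direction
    alpha = (r @ r) / (r @ Q @ r)        # exact line search
    x = x + alpha * r
    steps += 1
print("gradient descent steps:", steps)  # dozens of steps, far more than CG's 2

# Conjugate gradient: at most n = 2 steps on a 2-D quadratic (exact arithmetic).
x = np.array([10.0, 1.0])
r = b - Q @ x
d = r.copy()
for _ in range(2):
    alpha = (r @ r) / (d @ Q @ d)
    x = x + alpha * d
    r_new = r - alpha * (Q @ d)
    beta = (r_new @ r_new) / (r @ r)
    d = r_new + beta * d
    r = r_new
print("CG error after 2 steps:", np.linalg.norm(x - x_star))   # ~machine precision
```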

[EDIT: some additional notes here]

jjjjjj
  • 2,779
  • Is it correct to say that, since the (negative) normals of an ellipse generally don't point directly at its center, CG uses local information to redirect those (negative) normals toward the center? – somebody4 Dec 03 '20 at 04:40
  • If that's true, then each CG step is equivalent to: 1) rescale the (local) n-ellipse into an n-circle, 2) perform steepest descent. – somebody4 Dec 03 '20 at 04:45
  • I'll have to think a little and get back to you, but my first reaction is that what you're describing is like Newton's method – jjjjjj Dec 03 '20 at 04:46
  • Sorry, maybe I am wrong, but the ellipse thing kind of reminds me of the SVD... – somebody4 Dec 03 '20 at 05:07
  • @somebody4: I'm only thinking in the convex quadratic function case for now, but the thing is that Newton can give you information to "rescale" the ellipse to a circle all at once, but CG does so in a step-wise fashion (search "expanding manifold property" for more on Krylov / spectral analysis) – jjjjjj Dec 04 '20 at 06:16
3

Check the full version of

Shewchuk (1994), An Introduction to the Conjugate Gradient Method Without the Agonizing Pain

That PDF is a 64-page document with 40+ figures, full of geometric insight. The version you got is just a 17-page abridgment of the full document, without the figures.

bluemaster
  • 4,537
0

Gradient descent with an optimal step produces consecutive search directions that are orthogonal, which causes a sort of "zig-zagging" on the way to the minimum. Here "optimal step" means $f(x_k+\alpha d_k)$ is minimized over $\alpha$: $\frac{d}{d\alpha} f(x_k+\alpha d_k)=0 \Rightarrow \nabla f(x_{k+1})^T \nabla f(x_k)=0$ (i.e., $d_{k+1}^{T} d_k=0$).
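
For concreteness, a small numerical check of that orthogonality (assuming NumPy and $f(x) = \tfrac{1}{2}x^T Q x - b^T x$, so $d_k = -\nabla f(x_k) = b - Qx_k$; the matrix is arbitrary):

```python
import numpy as np

Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])
x = np.array([4.0, 4.0])

prev_d = None
for _ in range(5):
    d = b - Q @ x                        # d_k = -grad f(x_k)
    if prev_d is not None:
        print(round(d @ prev_d, 10))     # 0.0: consecutive directions are orthogonal
    alpha = (d @ d) / (d @ Q @ d)        # optimal (exact line search) step
    x = x + alpha * d
    prev_d = d
```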

To overcome this problem, the conjugate gradient method makes the directions $Q$-conjugate rather than orthogonal by requiring $d_{k+1}^{T} Q d_k = 0$. The CG update can then be written as $x_{k+1}=x_k-\alpha_k \nabla f(x_k) + (\alpha_k \beta_{k-1}/\alpha_{k-1})(x_k-x_{k-1})$. Note that this is the same as gradient descent plus a momentum term, which accelerates the algorithm; a numerical check of this identity is sketched below.
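
A small numerical check of that rewriting (assuming NumPy and the convention $f(x) = \tfrac{1}{2}x^T Q x - b^T x$, so $\nabla f(x) = Qx - b$; the $3\times 3$ system is arbitrary): run standard CG and verify that each iterate after the first also satisfies the gradient-plus-momentum form above.

```python
import numpy as np

Q = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])

x = np.zeros(3)
r = b - Q @ x                            # r_k = -grad f(x_k)
d = r.copy()
x_prev = alpha_prev = beta_prev = None
for _ in range(3):
    alpha = (r @ r) / (d @ Q @ d)
    x_new = x + alpha * d                # standard CG update
    if x_prev is not None:
        # x_{k+1} = x_k - alpha_k grad f(x_k) + (alpha_k beta_{k-1}/alpha_{k-1}) (x_k - x_{k-1})
        momentum_form = (x - alpha * (Q @ x - b)
                         + (alpha * beta_prev / alpha_prev) * (x - x_prev))
        print(np.allclose(x_new, momentum_form))   # True
    r_new = r - alpha * (Q @ d)
    beta = (r_new @ r_new) / (r @ r)
    d = r_new + beta * d
    x_prev, x, r = x, x_new, r_new
    alpha_prev, beta_prev = alpha, beta
```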

himoury
  • 11
  • As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center. – Community Oct 31 '23 at 23:13