
I am currently working through an old textbook, Practical Optimization by Gill, Murray and Wright (c. 1982), in which the authors make some derivations that seem correct but that I am unable to duplicate. In the equations below, $g$ and $p$ are both vectors: $g$ is the gradient at the current iterate of an optimization procedure, and $p$ is the search direction we wish to solve for.

The authors inform the reader that the minimizer (over $p$) of

$\Large\frac{g'p}{||p||}$ is $p=-g$.

Alternatively, we might form a different norm for $p$ by considering a symmetric positive definite matrix $C$ in which case the minimizer of

$\Large\frac{g'p}{(p'Cp)^{1/2}}$ is $p=-C^{-1}g$.

I am having trouble deriving these statements.

My question is: does it take more than taking a derivative and setting it equal to zero to prove this? If so, how do I derive these formulas?

This question is similar to

Gradient descent with constraints

but not identical: that question deals with solutions that are unit length, not with step directions that are unit length.

Setjmp
  • 141

3 Answers


This is a simple application of Cauchy-Schwarz.

$(g^Tp)^2 \leq \|g\|^2 \|p\|^2.$

So, $-\|g\| \leq \frac{g^T p}{\|p\|} \leq \|g\| $ and the lower bound is attained when $ p = -g $.

Similarly, $ (g^T p)^2 = \langle C^{-1/2}g , C^{1/2} p\rangle^2 \leq (g^T C^{-1} g) \times (p^T C p) $, so $ -\sqrt{g^T C^{-1} g} \leq \frac{g^T p}{(p^T C p)^{1/2}} $, and the lower bound is attained when $ p = -C^{-1} g $.
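(Not part of the original answer: a quick NumPy sanity check of the second bound, using an arbitrary symmetric positive definite $C$ and a random $g$, confirms that $p = -C^{-1}g$ attains $-\sqrt{g^T C^{-1} g}$ while random directions never go below it.)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
g = rng.standard_normal(n)
M = rng.standard_normal((n, n))
C = M @ M.T + n * np.eye(n)                  # symmetric positive definite

phi = lambda p: g @ p / np.sqrt(p @ C @ p)   # the objective g'p / (p'Cp)^{1/2}

p_star = -np.linalg.solve(C, g)              # claimed minimizer p = -C^{-1} g
bound = -np.sqrt(g @ np.linalg.solve(C, g))  # lower bound -sqrt(g' C^{-1} g)

assert np.isclose(phi(p_star), bound)        # bound is attained at p_star
for _ in range(1000):                        # random directions never beat it
    p = rng.standard_normal(n)
    assert phi(p) >= bound - 1e-12
```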


I prefer to look at the problem (which ends up being equivalent) of solving $\min \{ \langle g, h \rangle | \|h\| \leq 1 \}$. The norm is the standard $2$-norm. The reason I prefer this formulation is that it is closer in form to the original problem from which the search direction problem is derived. Also, it has a trust region flavor, which I personally find more appealing than a search direction/step length approach.

For the first problem, the Cauchy–Bunyakovsky–Schwarz inequality gives $|\langle g, h \rangle| \leq \|g\| \|h\|$, hence it is easy to see that the minimum is $\min \{ \langle g, h \rangle \,|\, \|h\| \leq 1 \} = - \|g\|$, and the minimizer is (it is unique with the $2$-norm) $h = -\frac{1}{\|g\|} g$. You could also derive this by noting that (if $g\neq 0$), the constraint $\|h\| \leq 1$ must be active, and use Lagrange multipliers to obtain the same result.

For the second, let $A^TA = C$ be the Cholesky decomposition of $C$. The problem is now $\min \{ \langle g, h \rangle \,|\, \|Ah\| \leq 1 \} = \min \{ \langle g, h \rangle \,|\, \|\delta\| \leq 1, \, h= A^{-1}\delta \} $. This is equivalent to $\min \{ \langle g, A^{-1}\delta \rangle \,|\, \|\delta\| \leq 1 \} = \min \{ \langle (A^{-1})^T g, \delta \rangle \,|\, \|\delta\| \leq 1 \}$, where $h = A^{-1}\delta$, and the first problem shows that the minimum is $-\|(A^{-1})^T g\|$ and the minimizer is $\delta = -\frac{1}{\|(A^{-1})^T g\|} (A^{-1})^Tg$, or in terms of the original problem $h = A^{-1}\delta = -\frac{1}{\|(A^{-1})^T g\|} A^{-1}(A^{-1})^Tg$. Since $A^TA = C$, we have $\|(A^{-1})^T g\| = \sqrt{\langle (A^{-1})^T g, (A^{-1})^T g \rangle} = \sqrt{\langle g, C^{-1} g \rangle}$, and $h = -\frac{1}{\sqrt{\langle g, C^{-1} g \rangle}} C^{-1} g$.
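The closed form above is easy to check numerically. The sketch below is my own addition: it uses NumPy's Cholesky factor $C = LL^T$ (so $A = L^T$ gives $A^TA = C$), verifies that the constraint is active at the minimizer, that the minimum value is $-\sqrt{\langle g, C^{-1} g\rangle}$, and that random feasible points never do better.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
g = rng.standard_normal(n)
M = rng.standard_normal((n, n))
C = M @ M.T + n * np.eye(n)           # symmetric positive definite

L = np.linalg.cholesky(C)             # C = L L^T, so take A = L^T (A^T A = C)
A = L.T

Cinv_g = np.linalg.solve(C, g)
h_star = -Cinv_g / np.sqrt(g @ Cinv_g)    # claimed minimizer
min_val = -np.sqrt(g @ Cinv_g)            # claimed minimum value

assert np.isclose(np.linalg.norm(A @ h_star), 1.0)  # constraint is active
assert np.isclose(g @ h_star, min_val)
# sample feasible points: ||delta|| <= 1, h = A^{-1} delta
for _ in range(1000):
    d = rng.standard_normal(n)
    d /= max(1.0, np.linalg.norm(d))
    h = np.linalg.solve(A, d)
    assert g @ h >= min_val - 1e-12
```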

Daniel Fischer
  • 211,575
copper.hat
  • 178,207

$ \def\a{\alpha} \def\g{\gamma} \def\b{\beta} \def\o{{\tt1}} \def\t{\theta} \def\l{\lambda} \def\C{C^{-1}} \def\LR#1{\left(#1\right)} \def\op#1{\operatorname{#1}} \def\vecc#1{\op{vec}\LR{#1}} \def\diag#1{\op{diag}\LR{#1}} \def\Diag#1{\op{Diag}\LR{#1}} \def\trace#1{\op{Tr}\LR{#1}} \def\frob#1{\left\| #1 \right\|_F} \def\qiq{\quad\implies\quad} \def\p{\partial} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\c#1{\color{red}{#1}} \def\CLR#1{\c{\LR{#1}}} \def\fracLR#1#2{\LR{\frac{#1}{#2}}} \def\gradLR#1#2{\LR{\grad{#1}{#2}}} $The first function is a special case of the second, so we only need to examine the function $$\eqalign{ \phi(p) &= \frac{g^Tp}{\sqrt{p^TCp}} \\ }$$ Note that if $p$ is replaced by $\l p$ for any $\l>0$, the value of the function is unchanged: $\;\phi(\l p)=\phi(p)$

Consider the numerator and denominator as separate functions $\LR{{\rm i.e.}\;\;\phi = \large{\frac\a\b}}$ $$\eqalign{ \a &= g^Tp &\qiq &d\a = g^Tdp \\ \b^2 &= p^TCp &\qiq &2\b\:d\b = 2\LR{Cp}^T\:dp \qiq d\b= \fracLR{Cp}{\b}^Tdp }$$ Then use the quotient rule to calculate the desired gradient $$\eqalign{ d\phi &= \fracLR{\b\:d\a - \a\:d\b}{\b^2} \\ &= \fracLR{\o}{\b}g^Tdp - \fracLR{\a}{\b^3}\LR{Cp}^Tdp \\ \grad{\phi}{p} &= \fracLR{\o}{\b}g - \fracLR{\a}{\b^3} {Cp} \\ }$$ Setting this gradient to zero determines the optimal $p,\,$ up to a scale factor $$\eqalign{ g &= \fracLR{\a}{\b^2} {Cp} \qiq p = \l \C g \\ }$$ We don't care about the magnitude of $\l$ but we do care about its sign
since we require $p$ to be a descent direction $$g^Tp=\a\lt0 \qiq \l=-\o$$
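(My own sketch, not part of the answer above: a finite-difference check in NumPy confirms the gradient formula $\frac{\partial \phi}{\partial p} = \frac{g}{\beta} - \frac{\alpha}{\beta^3}\,Cp$, and that this gradient vanishes at $p = -C^{-1}g$.)

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
g = rng.standard_normal(n)
M = rng.standard_normal((n, n))
C = M @ M.T + n * np.eye(n)            # symmetric positive definite

def phi(p):
    return g @ p / np.sqrt(p @ C @ p)

def grad_phi(p):
    a, b = g @ p, np.sqrt(p @ C @ p)   # alpha = g'p, beta = sqrt(p'Cp)
    return g / b - (a / b**3) * (C @ p)

# central finite differences agree with the analytic gradient
p = rng.standard_normal(n)
eps = 1e-6
fd = np.array([(phi(p + eps * e) - phi(p - eps * e)) / (2 * eps)
               for e in np.eye(n)])
assert np.allclose(fd, grad_phi(p), atol=1e-6)

# the stationary point p = -C^{-1} g zeroes the gradient
p_star = -np.linalg.solve(C, g)
assert np.allclose(grad_phi(p_star), 0.0, atol=1e-10)
```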

greg
  • 40,033