We are trying to minimize an unconstrained, convex function $f(x)$. The general iterative scheme for gradient descent is $$x^{t+1} = x^t - \alpha^t \nabla f(x^t).$$ It is conventional wisdom that gradient descent converges to the minimum if the $\alpha^t$ form a fixed, decreasing sequence of step sizes satisfying the so-called Robbins–Monro conditions: $\alpha^t > 0$, $\sum_t \alpha^t = \infty$, and $\sum_t (\alpha^t)^2 < \infty$ (for example, $\alpha^t = 1/t$).
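For concreteness, here is a minimal sketch of the scheme I mean (plain Python/NumPy; the function name and the quadratic test problem are my own illustrative choices), using the canonical step sizes $\alpha^t = 1/(t+1)$, which satisfy all three conditions:

```python
import numpy as np

def gradient_descent_rm(grad, x0, n_iters=10_000):
    """Iterate x_{t+1} = x_t - alpha_t * grad(x_t) with alpha_t = 1/(t+1)."""
    x = np.asarray(x0, dtype=float)
    for t in range(n_iters):
        alpha = 1.0 / (t + 1)  # sum(alpha_t) diverges, sum(alpha_t^2) converges
        x = x - alpha * grad(x)
    return x

# Toy example (my choice): f(x) = 0.5 x'Ax - b'x with A symmetric positive
# definite, so grad f(x) = Ax - b and the unique minimizer is A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x_hat = gradient_descent_rm(lambda x: A @ x - b, x0=[5.0, 5.0])
print(x_hat, np.linalg.solve(A, b))
```

On this toy problem the iterates do approach $A^{-1}b$, but of course one example demonstrates nothing in general, which is exactly my point.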
I have read/heard this so many times that I have begun to regard it as fact. However, it occurs to me that I have never seen a proof, nor read any discussion of what additional conditions on $f$ are required for the statement to hold. In fact, the original Robbins and Monro paper from 1951 has nothing to do with gradient descent! It is about univariate root-finding and the possibility of replacing the original function with a random "experiment" whose expected value is the same. The only commonality is the intuitive understanding that these conditions on $\alpha^t$ give you enough "gas" to traverse all of $\mathbb{R}^n$ while slowing down enough that you do not overshoot the minimum. That intuition is not a proof.
My questions are:
1. Who first applied the Robbins–Monro conditions to convex minimization? (Note that I am asking about the step-size conditions in a deterministic setting, not about the Robbins–Monro stochastic approximation algorithm or stochastic gradient descent.)
2. Under what conditions on $f$ (e.g. strict convexity) is gradient descent with decreasing step sizes guaranteed to converge?
3. Does the same scheme work as a root-finding algorithm, i.e. $x^{t+1} = x^t - \alpha^t g(x^t)$, when $g$ is not the gradient of any function $f$ but just an arbitrary map from $\mathbb{R}^n$ to $\mathbb{R}^n$ with a positive (semi)definite Jacobian? (A sketch of the iteration I mean follows this list.)
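To make the third question concrete, here is the same iteration applied to an assumed toy map $g(x) = Ax - b$ whose (constant) Jacobian $A$ is positive definite but not symmetric, so $g$ is not the gradient of any scalar function:

```python
import numpy as np

# A has positive definite symmetric part but is not symmetric itself,
# so g(x) = Ax - b is not the gradient of any scalar function f.
A = np.array([[2.0, 1.0],
              [-1.0, 2.0]])
b = np.array([1.0, 1.0])
g = lambda x: A @ x - b

x = np.zeros(2)
for t in range(100_000):
    x = x - (1.0 / (t + 1)) * g(x)  # same Robbins-Monro step sizes

print(x, np.linalg.solve(A, b))  # do the iterates approach the root A^{-1} b?
```

Empirically this seems to find the root $A^{-1}b$ on this example, but again I have no proof that it must.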
There are many questions on this site about various flavors of gradient descent, but most of them concern either the line-search methods discussed by Boyd and Vandenberghe in Convex Optimization or the use of a sufficiently small fixed step size $\alpha$. I can't find one that addresses this particular point. This answer addresses my second question by claiming, without citation or proof, that strict convexity and Lipschitz continuity of the gradient are sufficient for gradient descent with a decreasing sequence of step sizes to converge.