We are trying to minimize an unconstrained, convex function $f(x)$. The general iterative scheme for gradient descent is $$x^{t+1} = x^t - \alpha^t \nabla f(x^t).$$ It is conventional wisdom that gradient descent converges to the minimum if the $\alpha^t$ form a fixed, decreasing sequence of step sizes satisfying the so-called Robbins–Monro conditions: $\alpha^t > 0$, $\sum_t \alpha^t = \infty$, and $\sum_t (\alpha^t)^2 < \infty$ (for example, $\alpha^t = 1/t$).
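For concreteness, here is a minimal sketch of the scheme I mean (plain Python/NumPy; the function name and the quadratic test problem are my own illustrative choices), using the canonical step sizes $\alpha^t = 1/(t+1)$, which satisfy all three conditions:

```python
import numpy as np

def gradient_descent_rm(grad, x0, n_iters=10_000):
    """Iterate x_{t+1} = x_t - alpha_t * grad(x_t) with alpha_t = 1/(t+1)."""
    x = np.asarray(x0, dtype=float)
    for t in range(n_iters):
        alpha = 1.0 / (t + 1)  # sum(alpha_t) diverges, sum(alpha_t^2) converges
        x = x - alpha * grad(x)
    return x

# Toy example (my choice): f(x) = 0.5 x'Ax - b'x with A symmetric positive
# definite, so grad f(x) = Ax - b and the unique minimizer is A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x_hat = gradient_descent_rm(lambda x: A @ x - b, x0=[5.0, 5.0])
print(x_hat, np.linalg.solve(A, b))
```

On this toy problem the iterates do approach $A^{-1}b$, but of course one example demonstrates nothing in general, which is exactly my point.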
I have read/heard this so many times that I have begun to regard it as fact. However, it occurs to me that I have never seen a proof, nor read any discussion of what additional conditions on $f$ are required for the statement to hold. In fact, the original Robbins and Monro paper from 1951 has nothing to do with gradient descent! It is about univariate root-finding and the possibility of replacing the original function with a random "experiment" whose expected value is the same. The only commonality is the intuitive understanding that these conditions on $\alpha^t$ give you enough "gas" to traverse all of $\mathbb{R}^n$ while slowing down enough that you do not overshoot the minimum. That intuition is not a proof.
My questions are:
1. Who first applied the Robbins–Monro conditions to convex minimization? (Note that I am asking about the step-size conditions in a deterministic setting, not about the Robbins–Monro stochastic approximation algorithm or stochastic gradient descent.)
2. Under what conditions on $f$ (e.g. strict convexity) is gradient descent with decreasing step sizes guaranteed to converge?
3. Does the same scheme work as a root-finding algorithm, i.e. $x^{t+1} = x^t - \alpha^t g(x^t)$, when $g$ is not the gradient of any function $f$ but just an arbitrary map from $\mathbb{R}^n$ to $\mathbb{R}^n$ with a positive (semi)definite Jacobian? (A sketch of the iteration I mean follows this list.)
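To make the third question concrete, here is the same iteration applied to an assumed toy map $g(x) = Ax - b$ whose (constant) Jacobian $A$ is positive definite but not symmetric, so $g$ is not the gradient of any scalar function:

```python
import numpy as np

# A has positive definite symmetric part but is not symmetric itself,
# so g(x) = Ax - b is not the gradient of any scalar function f.
A = np.array([[2.0, 1.0],
              [-1.0, 2.0]])
b = np.array([1.0, 1.0])
g = lambda x: A @ x - b

x = np.zeros(2)
for t in range(100_000):
    x = x - (1.0 / (t + 1)) * g(x)  # same Robbins-Monro step sizes

print(x, np.linalg.solve(A, b))  # do the iterates approach the root A^{-1} b?
```

Empirically this seems to find the root $A^{-1}b$ on this example, but again I have no proof that it must.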
There are many questions on this site about various flavors of gradient descent, but most of them concern either the line-search methods discussed by Boyd and Vandenberghe in Convex Optimization or the use of a sufficiently small fixed step size $\alpha$. I can't find one that addresses this particular point. This answer addresses my second question by claiming, without citation or proof, that strict convexity and Lipschitz continuity of the gradient are sufficient for gradient descent with a decreasing sequence of step sizes to converge.