There is a "folklore" result that gradient descent on a smooth non-convex function takes $O(\frac{n}{\epsilon^2})$ steps to reach a point whose gradient norm is below $\epsilon$, and that SGD takes $O(\frac{1}{\epsilon^4})$ steps.
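For context, my understanding is that the $O(\frac{1}{\epsilon^2})$ iteration count comes from the standard descent-lemma argument (assuming $f$ is $L$-smooth and bounded below by $f^*$; the factor of $n$ would then count the per-iteration cost of a full gradient over a finite sum of $n$ terms). A sketch of what I believe the argument is: the descent lemma gives

$$f(x_{t+1}) \le f(x_t) - \eta \|\nabla f(x_t)\|^2 + \frac{L\eta^2}{2}\|\nabla f(x_t)\|^2,$$

so with step size $\eta = \frac{1}{L}$,

$$f(x_{t+1}) \le f(x_t) - \frac{1}{2L}\|\nabla f(x_t)\|^2,$$

and summing over $t = 0, \dots, T-1$ and telescoping yields

$$\min_{t < T} \|\nabla f(x_t)\|^2 \le \frac{2L\,(f(x_0) - f^*)}{T},$$

i.e. $T = O(\frac{1}{\epsilon^2})$ iterations suffice to make some gradient norm at most $\epsilon$.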
- Can someone share a reference where this is proven?
I am aware of recent references in which these rates have been improved, but I am unable to locate a pedagogic presentation of these older results.