2

I'm using gradient descent $$x_i=x_{i-1}-\gamma\nabla f(x_{i-1})\tag1$$ to minimize a function $f$. Now, I've observed that I got stuck in a local minimum when $x_0$ was chosen in an unlucky way.

Is there any mechanism to detect that we've got stuck in a local minimum and escape it in some way?

I've come up with a simple idea, which actually works, but I guess it is quite inefficient. What I'm doing is replacing $(1)$ by $$x_i=x_{i-1}-\gamma_i\frac{\nabla f(x_{i-1})}{\left\|\nabla f(x_{i-1})\right\|}\tag2,$$ where $\gamma_i=\gamma_{i-1}/2$ and $\gamma_0=1$. Now, after every iteration $i$, I choose randomly a $\tilde\gamma$ in $[0,1)$, compute $$\tilde x_i=x_i-\tilde\gamma\frac{\nabla f(x_i)}{\left\|\nabla f(x_i)\right\|}\tag3$$ and if $f(\tilde x_i)<f(x_i)$, then I set $\gamma_i=\tilde\gamma$ and $x_i=\tilde x_i$.

I'm sure there is a smarter way.
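For concreteness, here is a minimal sketch of this scheme (the quadratic test function and the helper name `descend` are only illustrations, not part of the actual problem):

```python
import numpy as np

def descend(f, grad, x0, n_iter=50):
    """Normalized-gradient descent with a halving step size and a random
    trial step after each iteration, as in (2) and (3)."""
    rng = np.random.default_rng(0)
    x, gamma = np.asarray(x0, float), 1.0
    for _ in range(n_iter):
        g = grad(x)
        norm = np.linalg.norm(g)
        if norm < 1e-12:                  # (near-)stationary point: stop
            break
        x = x - gamma * g / norm          # deterministic step (2)
        gamma /= 2                        # gamma_i = gamma_{i-1} / 2
        g = grad(x)
        norm = np.linalg.norm(g)
        if norm < 1e-12:
            break
        t = rng.random()                  # random trial step (3)
        x_trial = x - t * g / norm
        if f(x_trial) < f(x):             # accept only if it decreases f
            x, gamma = x_trial, t
    return x

# quadratic test problem with minimum at (1, 2)
f = lambda x: float(np.sum((x - np.array([1.0, 2.0]))**2))
grad = lambda x: 2 * (x - np.array([1.0, 2.0]))
x_min = descend(f, grad, np.zeros(2))
```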

0xbadf00d
    Pick a "box". Sample this box. Use each sample point to "seed" the gradient descent. You then obtain various gradient descent trajectories. Pick the lowest local minimum. – Rodrigo de Azevedo Apr 14 '23 at 09:04
  • @RodrigodeAzevedo You mean uniformly sampling on $[0,1)^d$? I don't understand what you then want to do with these samples. – 0xbadf00d Apr 14 '23 at 09:51
  • Sample over whatever box makes sense to you. Use each sample to "seed" the gradient descent iteration, i.e., use each sample as an initial state $x_0$ and then iterate – Rodrigo de Azevedo Apr 14 '23 at 09:58
  • @RodrigodeAzevedo So, in each iteration, I forget about the current state, sample $y_1,\ldots,y_k$ from my domain (which is $[0,1)^d$ actually) and pick the lowest among $y_i-t\nabla f(y_i)$ as the new current state? I guess I've got something wrong. And do you want the step size to be fixed? – 0xbadf00d Apr 14 '23 at 10:02
  • @RodrigodeAzevedo Or do you only want to choose the initial state in that way? (In that case, again: Fixed step size or somehow adaptive?) – 0xbadf00d Apr 14 '23 at 10:04
  • Choose the initial state $x_0$ via random sampling and then use the gradient descent iteration. Alternatively, use gradient descent to find a local minimum and then use a strong random "shock" to kick you out of that valley and into another valley. You need to leave one basin of attraction and enter another. If you randomly sample over a region and always converge to the same local minimum, that is nice. If the objective is convex, the local minimum is also the global minimum. – Rodrigo de Azevedo Apr 14 '23 at 10:40

2 Answers

3

There are plenty of solutions to this very problem being developed, because gradient descent methods are used to optimize neural networks.

Depending on the dimension of your state space $x$, there are smarter or more straightforward ways to do it.

If the dimension of $x$ is low, you can do a brute-force search on a grid surrounding the point you are getting stuck around. How big and dense the grid has to be depends on the regularity of the function $f$.
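A minimal sketch of such a brute-force probe (the double-well test function and the helper name `grid_probe` are illustrative assumptions, not from the question):

```python
import numpy as np
from itertools import product

def grid_probe(f, center, radius=0.5, points_per_axis=11):
    """Brute-force probe of a grid around a suspected local minimum.
    Returns the grid point with the lowest value of f; if it beats the
    center, the descent can be restarted from there."""
    center = np.asarray(center, float)
    axes = [np.linspace(c - radius, c + radius, points_per_axis) for c in center]
    return min((np.array(p) for p in product(*axes)), key=f)

# example: shallow local minimum near 0, deeper minimum near 1.5
f = lambda x: float(x[0]**2 * (x[0] - 1.5)**2 - 0.1 * x[0])
best = grid_probe(f, center=np.zeros(1), radius=2.0, points_per_axis=41)
```

Note the exponential cost: the number of grid points is `points_per_axis**d`, which is exactly why this only works in low dimension.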

If the dimension is greater than 1, you probably also want to escape saddle points; to detect them, analyse the eigenvalues of the Hessian matrix of $f$ near the point where you are stuck.
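For illustration, a sketch of this eigenvalue check using a central-difference Hessian (the step size `h` and tolerance `tol` are arbitrary choices):

```python
import numpy as np

def numerical_hessian(f, x, h=1e-5):
    """Central-difference approximation of the Hessian of f at x."""
    x = np.asarray(x, float)
    d = x.size
    E = np.eye(d) * h
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            H[i, j] = (f(x + E[i] + E[j]) - f(x + E[i] - E[j])
                       - f(x - E[i] + E[j]) + f(x - E[i] - E[j])) / (4 * h * h)
    return H

def classify_stationary_point(f, x, tol=1e-4):
    """Classify a stationary point by the signs of the Hessian eigenvalues."""
    eig = np.linalg.eigvalsh(numerical_hessian(f, x))
    if np.all(eig > tol):
        return "local minimum"
    if np.all(eig < -tol):
        return "local maximum"
    return "saddle or degenerate"

# x^2 - y^2 has a saddle at the origin
saddle_f = lambda x: x[0]**2 - x[1]**2
verdict = classify_stationary_point(saddle_f, np.zeros(2))
```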

That being said, if the dimensionality of your problem is low, there are theoretically sound optimization techniques of order higher than one, which come with convergence guarantees and work better than plain gradient descent.

If the dimensionality of your problem is high, you cannot afford to compute the Hessian, whose size grows quadratically with the dimension. In this case you want to restrict yourself to first-order optimization techniques. The most popular choices here are SGD, Adam and RAdam. If you are concerned about saddle points, you can add noise at each iteration, and you can do so smartly, as presented in the work of Michael I. Jordan's group.
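A minimal sketch of the noise-injection idea, in the spirit of perturbed gradient descent (the thresholds, noise radius and test function are illustrative assumptions):

```python
import numpy as np

def perturbed_gd(grad, x0, step=0.1, noise_radius=0.1, g_thresh=1e-3,
                 n_iter=1000, seed=0):
    """Gradient descent that injects a random perturbation whenever the
    gradient is nearly zero, so it can roll off saddle points."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, float)
    for _ in range(n_iter):
        g = grad(x)
        if np.linalg.norm(g) < g_thresh:
            # near a stationary point: kick with uniform noise in a small box
            x = x + rng.uniform(-noise_radius, noise_radius, x.shape)
        else:
            x = x - step * g
    return x

# f(x, y) = x^2 - y^2 + y^4: saddle at the origin, minima at (0, ±1/sqrt(2));
# plain gradient descent started on the x-axis converges to the saddle.
grad = lambda x: np.array([2 * x[0], -2 * x[1] + 4 * x[1]**3])
x = perturbed_gd(grad, np.array([1.0, 0.0]))
```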

As a rule of thumb: if you are looking for a good-enough solution and the dimensionality of $x$ is absurdly high, go and read up on some Deep Learning literature. If you want theoretical guarantees for your optimization algorithm and the dimensionality of $x$ is still pretty high, read the physics literature. If you have no margin of error and the dimension of $x$ is low, read the mathematics literature.

The topic is vast and I have barely scratched the surface here. I hope you can now research the solution best suited to your problem yourself, because, stated as is, the question has no single good answer.

  • The function I'm minimizing is of the form $[0,1)^d\ni x\mapsto\int_{[0,1)^d}|\varphi_x-g|^p$, where $g$ is a given function and $\varphi_x$ is a Gaussian kernel centered at $x$. Do you think the best I can do is to uniformly sample $k$ candidates from $[0,1)^d$, look for which sample my function has the smallest value, start with that minimum sample as an initial guess and then use a fixed step size gradient descent? Or should I do something else and/or use an adaptive step size? Before, I've tried to normalize the gradient and use step size $1/2^i$ in iteration $i$. Is that good? – 0xbadf00d Apr 23 '23 at 15:54
  • If $g$ can potentially be approximated by a Gaussian-shaped function, I would start by picking $x$ to be the center of mass of $g$. I don't know the wider context, but it looks like you could read some statistics references. The book "Machine Learning: A Probabilistic Perspective" has some chapters about Gaussian mixtures and similar problems. – Jan Olszewski Apr 24 '23 at 14:00
  • Moreover, I'd give up on the idea that normalizing the gradient and scheduling the step size manually is a good way to optimize things. Note that when approaching a local minimum the norm of the gradient decreases anyway, so gradient descent with a fixed step size is somewhat self-tuning. That being said, in Deep Learning scenarios, since people are usually trying to find any local minimum, and there are many because of the ridiculous overparameterization of neural networks, it is common practice to use custom learning-rate schedulers to speed up the learning process. – Jan Olszewski Apr 24 '23 at 14:04
0

Although I'm not very qualified in this, I believe I have enough experience with numerical optimization to make the following statement, which you might find helpful.

Is there any mechanism to detect that we've got stuck in a local minimum and escape it in some way?

Firstly, there are criteria, such as oscillation of the iterates $x_{i}$, for detecting that you are stuck.

More importantly, however, when you are stuck, before "jumping" to the next valley you should make sure that the valley you are currently in contains only a local minimum rather than the global minimum.

Currently, there is no known algorithm that guarantees the current valley contains the global minimum.

That being said, if you know the domain of $x_{i}$, then, as Rodrigo de Azevedo mentioned above, you can sample a grid within the domain to seed the initial guesses and see which initial guess converges to the lowest value.

But I disagree with Rodrigo de Azevedo about the grid being uniform. In fact, you can make the samples non-uniform. This is akin to an adaptive particle-swarm method.
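This seeding idea can be sketched as follows, here with uniform sampling over $[0,1)^2$ for simplicity (the toy objective and all parameter values are illustrative assumptions):

```python
import numpy as np

def multistart_descent(f, grad, n_starts=20, n_iter=500, step=0.002, seed=0):
    """Multi-start gradient descent: sample initial points uniformly from
    the unit box [0, 1)^2, run plain gradient descent from each, and keep
    the run that ends at the lowest value of f."""
    rng = np.random.default_rng(seed)
    best_x, best_val = None, np.inf
    for _ in range(n_starts):
        x = rng.random(2)                 # random seed point in [0, 1)^2
        for _ in range(n_iter):
            x = x - step * grad(x)        # plain iteration (1)
        if f(x) < best_val:
            best_x, best_val = x, f(x)
    return best_x, best_val

# toy objective with many basins; global minimum at x = (0.75, 0.75)
def f(x):
    return float(np.sum(np.sin(4 * np.pi * x)**2) + np.sum((x - 0.75)**2))

def grad(x):
    return (8 * np.pi * np.sin(4 * np.pi * x) * np.cos(4 * np.pi * x)
            + 2 * (x - 0.75))

best_x, best_val = multistart_descent(f, grad)
```

Non-uniform sampling would simply replace `rng.random(2)` with draws from whatever distribution concentrates the seeds where low values are suspected.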

A more efficient/smarter way to proceed when stuck:

Since Jan has already mentioned the computation of the Hessian, I will give an alternative.

Instead of randomly choosing a different step size $\gamma$, it is often better to choose a different step direction, i.e. not the steepest descent.

If you have a record of the values of all the previous guesses, you can go back to the most recent uphill point, set the direction to steepest ascent (the opposite) instead, and use that as the seed of your next iteration.

This is often better than choosing a different step size, especially when the valley you are currently stuck in does not contain the global minimum.
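A sketch of this history-based ascent kick on a one-dimensional double well (the step sizes, tolerances, kick length and test function are all illustrative assumptions, and the heuristic is sensitive to them):

```python
import numpy as np

def descend_with_ascent_kick(f, grad, x0, step=0.1, kick=1.0,
                             n_iter=200, tol=1e-6):
    """Gradient descent that remembers its trajectory; when it stalls at a
    stationary point, it returns to the most recent clearly-uphill point,
    takes a steepest-ascent step of length `kick` from there, and restarts
    the descent from that seed. Returns the best iterate seen."""
    x = np.asarray(x0, float)
    best_x, best_val = x.copy(), f(x)
    history = [x.copy()]
    for _ in range(n_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:                   # stalled
            # most recent recorded point that is clearly uphill of x
            u = next(p for p in reversed(history) if f(p) > f(x) + tol)
            gu = grad(u)
            x = u + kick * gu / np.linalg.norm(gu)    # steepest-ascent kick
            history = [x.copy()]
        else:
            x = x - step * g                          # ordinary descent step
            history.append(x.copy())
        if f(x) < best_val:
            best_x, best_val = x.copy(), f(x)
    return best_x

# double well: local minimum near x = 0.66, global minimum near x = -0.75
f = lambda x: float(x[0]**4 - x[0]**2 + 0.2 * x[0])
grad = lambda x: np.array([4 * x[0]**3 - 2 * x[0] + 0.2])
best = descend_with_ascent_kick(f, grad, np.array([0.3]))
```

Started at $x_0=0.3$, the plain descent converges to the shallow minimum near $0.66$; the ascent kick from the approach side then throws the iterate over the barrier into the deeper valley.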