
The question came to me while I was reading a paper on an optimization algorithm. It is an iterative method: in each step we need the gradient of $f:\mathbb{R}^d\rightarrow \mathbb{R}$ and take a gradient descent step. Because little information about $f$ is available, the author cited a so-called one-point estimate of the gradient:

$$\nabla f(x)\approx \mathbf{E}[(f(x + \delta u) - f(x))u]\,d/\delta$$ where $d$ is the dimension of $x$, $\delta$ is a small radius, and $u$ is a uniformly random unit vector.

I tried to obtain this approximation in the following way: $$f(x+\delta u)-f(x)\approx \nabla f(x)^T \delta u=\delta u^T \nabla f(x).$$ Then multiply both sides by $u$ and take expectations: $$\mathbf{E}[(f(x + \delta u) - f(x))u]\approx \delta\, \mathbf{E}[u u^T \nabla f(x)].$$

Can we go further from here? I'm not familiar with expectations of vectors and matrices, but my intuition tells me it may have something to do with eigenvalues (though that may be completely irrelevant).

I need help understanding the derivation of this formula.

Furthermore, I would very much appreciate it if somebody could explain how to use this formula in practice. The author of the one-point estimate of the gradient was considering the case where $f$ is a black-box function, i.e., we have no information about $f$ except its value at $x$. So I wonder: if we only know the value of $f$ at $x$, how can we use the information of $f(x+\delta u)$ to estimate $\nabla f(x)$?

Thank you very much!

PPP
  • Did you read Section 2 of the paper? – angryavian Jan 14 '24 at 07:16
  • @angryavian Yes, I did. It proves that $\nabla \mathbf{E}[f(x + \delta u)] = \mathbf{E}[f(x + \delta u) u]\,d/\delta$, but that is not a proof of the formula in my question. I know it is reasonable to say that $f(x)$ can be approximated by $\mathbf{E}[f(x+\delta u)]$, but what is the intuition for the formula here, which is stated before the lemma in Section 2 without proof? – PPP Jan 14 '24 at 12:30

1 Answer


If $u\sim\text{Uniform}(\mathbf S^{d-1})$ where $\mathbf S^{d-1}$ denotes the unit sphere of $\mathbb R^d$, then we know that $u$ has the same distribution as $\frac{X}{\|X\|_2}$ where $X\sim\mathcal N(0,\mathbf I_d)$, which implies that $\mathbb E[u] = 0 $ and $$\text{Var}[u] := \mathbb E\left[(u - \mathbb E[u])(u-\mathbb E[u])^T\right] = \mathbb E[uu^T]=\frac1d\mathbf I_d. \tag1$$

See here for a proof of this last equality, which is essentially a consequence of the many symmetries of this distribution.
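For completeness, here is a short sketch of that symmetry argument (added here for the reader; it is not part of the question or the paper). Flipping the sign of a single coordinate of $u$ leaves its distribution unchanged, which forces the off-diagonal entries of $\mathbb E[uu^T]$ to vanish, and the diagonal entries are equal by exchangeability of the coordinates:

$$\mathbb E[u_iu_j]=0\ \ (i\neq j),\qquad \sum_{i=1}^d\mathbb E[u_i^2]=\mathbb E\big[\|u\|_2^2\big]=1\ \implies\ \mathbb E[u_i^2]=\frac1d.$$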

If we plug $(1)$ into your initial estimate, we find $$\begin{align}\mathbb{E}[(f(x + \delta u) - f(x))u]&\approx \delta\, \mathbb{E}\left[u u^T \nabla f(x)\right]\\ &=\delta\, \mathbb{E}\left[u u^T \right]\nabla f(x)\\ &=\frac\delta d \nabla f(x) \end{align}$$ where we used in the second line the fact that $\nabla f(x)$ is not random. The desired estimate follows from multiplying both sides by $d/\delta$.
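As a quick sanity check of this chain of approximations, here is a small Monte Carlo experiment (a sketch in Python/NumPy; the quadratic test function, the point $x$, and the sample size are illustrative choices, not taken from the paper). Averaging $(f(x+\delta u)-f(x))u\,d/\delta$ over many random unit vectors should recover $\nabla f(x)$ up to sampling noise:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative smooth test function and its exact gradient.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x

d, delta = 2, 1e-3
x = np.array([1.0, -0.5])

# Draw many uniform unit vectors u = X / ||X|| with X ~ N(0, I_d).
n = 100_000
U = rng.standard_normal((n, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)

# Monte Carlo average of (f(x + delta*u) - f(x)) * u, scaled by d / delta.
diffs = np.array([f(x + delta * u) - f(x) for u in U])
estimate = (d / delta) * (diffs[:, None] * U).mean(axis=0)

print(estimate)   # close to grad_f(x) = [2.5, 0.0], up to Monte Carlo noise
print(grad_f(x))
```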

Then you ask

Furthermore, I would appreciate very much if somebody could explain [...] if we only know the value of $f$ at $x$, how can we use the information of $f(x+\delta u)$ to estimate $\nabla f(x)$?

The idea is that by "one-point estimate of the gradient", we don't mean that the only information we have is $f(x)$, but rather that we want to build an estimator $\hat g$ of the gradient $g:=\nabla f(x)$ which requires only one evaluation of $f$ to be computed (because $f$ is a black-box function which is costly to evaluate many times). The above discussion tells us that such an estimator $\hat g$ is given by $$\hat g := \frac d\delta \cdot f(x+\delta u)u $$ where $u$ is one realization of the uniform distribution on the sphere. We see that $\hat g$ meets our requirements: it requires evaluating $f$ only once, at $x+\delta u$, and it satisfies $\mathbb E[\hat g]\approx g$ (note that $\mathbb E[f(x)u] = f(x)\,\mathbb E[u] = 0$, so dropping the $f(x)$ term does not change the expectation).
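To make the practical side concrete, here is a minimal sketch of how such a one-point estimator might be plugged into a zeroth-order gradient descent loop (Python/NumPy; the objective, step size, radius, and iteration count are illustrative assumptions, not taken from the paper). Because a single-sample estimate is very noisy, the step size is kept small and many iterations are used:

```python
import numpy as np

rng = np.random.default_rng(1)

def one_point_gradient(f, x, delta, rng):
    """Estimate grad f(x) from a single evaluation of f:
    g_hat = (d / delta) * f(x + delta * u) * u, with u uniform on the unit sphere."""
    d = x.size
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)              # uniform random direction
    return (d / delta) * f(x + delta * u) * u

# Illustrative black-box objective: only function values are available.
f = lambda x: np.sum((x - 1.0) ** 2)

x = np.zeros(2)
eta, delta = 1e-3, 0.1                  # illustrative step size and radius
for _ in range(10_000):
    g_hat = one_point_gradient(f, x, delta, rng)
    x = x - eta * g_hat                 # gradient step with the noisy estimate

print(x)   # drifts toward the minimizer [1, 1]
```

In practice one would also average several one-point estimates per step, or decrease $\eta$ and $\delta$ over time, to control the variance; the loop above is only meant to illustrate the mechanism.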