This question came to me while reading a paper on an optimization algorithm. It is an iterative method: in each step we need the gradient of $f:\mathbb{R}^d\rightarrow \mathbb{R}$ in order to perform a gradient descent step. Because of the lack of information about the function $f$, the author cited a so-called one-point estimate of the gradient:
$$\nabla f(x)\approx \frac{d}{\delta}\,\mathbf{E}[(f(x + \delta u) - f(x))\,u],$$ where $d$ is the dimension of $x$, $\delta$ is a small radius, and $u$ is a uniformly random unit vector.
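To make sure I am reading the formula correctly, I wrote a small numerical check (my own sketch in Python, with a toy quadratic standing in for $f$ and a Monte Carlo average over many random unit vectors $u$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy smooth function standing in for the black-box f (my own choice).
A = rng.standard_normal((5, 5))
A = A @ A.T                      # symmetric, so the gradient of f is A @ x
def f(x):
    return 0.5 * x @ A @ x

d = 5                            # dimension of x
delta = 1e-3                     # small radius
x = rng.standard_normal(d)

# Monte Carlo average of (f(x + delta*u) - f(x)) * u * d / delta
# over uniformly random unit vectors u.
n_samples = 200_000
est = np.zeros(d)
for _ in range(n_samples):
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)       # uniform direction on the unit sphere
    est += (f(x + delta * u) - f(x)) * u
est *= d / (delta * n_samples)

true_grad = A @ x                # exact gradient of the toy quadratic
print(np.linalg.norm(est - true_grad) / np.linalg.norm(true_grad))
```

The average appears to get close to the true gradient as the number of samples grows, so the formula seems plausible, but I don't see how to derive it.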
I tried to obtain this approximation as follows: $$f(x+\delta u)-f(x)\approx \nabla f(x)^T (\delta u)=\delta\, u^T \nabla f(x).$$ Then, multiplying both sides by $u$ and taking expectations, $$\mathbf{E}[(f(x + \delta u) - f(x))\,u]\approx \delta\, \mathbf{E}[u u^T \nabla f(x)].$$
Can we go further from here? I'm not familiar with expectations of vectors or matrices, but my intuition says it may have something to do with eigenvalues (or maybe that's completely irrelevant).
I would like help understanding the derivation of this formula.
Furthermore, I would very much appreciate it if somebody could explain how to use this formula in practice. The author of the one-point estimate was considering the case where $f$ is a black-box function, i.e., we have no information about $f$ except the value of $f$ at $x$. So I wonder: if we only know the value of $f$ at $x$, how can we use the information of $f(x+\delta u)$ to estimate $\nabla f(x)$?
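For concreteness, here is my current guess of how the estimate would be plugged into the iterative method (a single random direction per iteration; the function, the radius $\delta$, and the step size are placeholders of my own choosing). Please correct me if this is not what the author intends:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):                        # placeholder for the black-box function
    return np.sum((x - 1.0) ** 2)

d = 10
x = np.zeros(d)
delta = 0.01                     # small radius (my guess)
eta = 0.05                       # step size (my guess)

for _ in range(1000):
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)                            # uniform unit vector
    g = (f(x + delta * u) - f(x)) * u * d / delta     # one-sample gradient estimate
    x -= eta * g                                      # gradient descent step

print(f(x))   # should end up much smaller than the starting value f(0) = 10
```

In particular, I am not sure whether a single sample of $u$ per iteration is enough, or whether one must average over many directions as in my check above.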
Thank you very much!