Gradient descent is usually motivated by a first-order Taylor approximation. If we have a function $f:\mathbb{R}^p\rightarrow\mathbb{R}$ and we start at a point $x\in \mathbb{R}^p$, then we can look at the first-order Taylor approximation \begin{align} f(x+\Delta x)\approx f(x)+\langle\nabla f(x),\Delta x \rangle_{l^2}. \end{align} We want the update $\Delta x$ to point in the direction of $-\nabla f(x)$ in order to minimize $\langle\nabla f(x),\Delta x \rangle_{l^2}$.

However, could we use a different inner product? For instance, suppose we have an SPD matrix $A\in \mathbb{R}^{p\times p}$ and we use the inner product $\langle x,y\rangle_A=x^T A y$. Then we could Taylor-approximate \begin{align} f(x+\Delta x)\approx f(x)+\langle \nabla f(x),\Delta x\rangle_A, \end{align} which would give the gradient descent updates \begin{align} x_{n+1}=x_n-\eta A\nabla f(x_n), \end{align} where $\eta$ is the learning rate.

Is this type of gradient descent an actual procedure? If so, what is it called? If not, what is 'wrong' with it? I'm asking this because this paper 'seems' to be doing an infinite-dimensional/functional version of this procedure.
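For concreteness, here is a minimal numerical sketch of the update above. The quadratic objective, the SPD matrix $A$, and the step size are illustrative choices of my own, not taken from the paper.

```python
import numpy as np

# Toy quadratic objective f(x) = 0.5 x^T Q x - b^T x, so grad f(x) = Q x - b
# and the exact minimizer is x* = Q^{-1} b.  Q, b, A and eta are illustrative choices.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])    # SPD
b = np.array([1.0, -1.0])
f      = lambda x: 0.5 * x @ Q @ x - b @ x
grad_f = lambda x: Q @ x - b

A   = np.array([[2.0, 0.5], [0.5, 1.0]])  # an arbitrary SPD matrix
eta = 0.1                                 # learning rate

x = np.array([5.0, 5.0])
for _ in range(200):
    x = x - eta * A @ grad_f(x)           # x_{n+1} = x_n - eta * A * grad f(x_n)

print(x, np.linalg.solve(Q, b))           # iterate vs. exact minimizer Q^{-1} b = [0.6, -0.8]
print(f(x))                               # roughly the minimum value f(x*) = -0.7
```

In this example the iterates still converge to the minimizer of $f$; the matrix $A$ only changes the direction and relative scaling of each step.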
- If $A$ is SPD you can write it as $A=P^T P$ for some matrix $P$, which is now essentially the same problem in a different coordinate system (a numerical check of this appears after these comments). – obareey Mar 28 '22 at 11:00
- Please see my answer to this thread. – José C Ferreira Mar 28 '22 at 12:27
- I just quickly looked at the article mentioned, but I think they use some version of the representer theorem. – José C Ferreira Mar 28 '22 at 12:31
- @Rodrigo de Azevedo I think it doesn't matter since $A=A^T$. @obareey, can you elaborate on this? I'm not super familiar with solving the same problem in a different coordinate system. – mlstudent Mar 28 '22 at 14:37
- Someone pointed out that when you let $A$ be the inverse of the Hessian you actually get a special case of Newton's method for finding roots of the gradient, which is interesting. – mlstudent Mar 30 '22 at 19:13
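The change-of-coordinates comment above can be checked numerically. In the sketch below (same toy objective and preconditioner as before, both my own illustrative choices), the SPD matrix is factored as $A = LL^T$ via Cholesky, and plain gradient descent on the reparametrized function $g(y)=f(Ly)$, mapped back through $x=Ly$, reproduces exactly the iterates $x_{n+1}=x_n-\eta A\nabla f(x_n)$.

```python
import numpy as np

# Same toy setup as in the question's sketch (illustrative choices).
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad_f = lambda x: Q @ x - b              # gradient of f(x) = 0.5 x^T Q x - b^T x

A = np.array([[2.0, 0.5], [0.5, 1.0]])    # arbitrary SPD matrix
L = np.linalg.cholesky(A)                 # A = L L^T

eta = 0.1
x = np.array([5.0, 5.0])                  # iterates of x_{n+1} = x_n - eta * A * grad f(x_n)
y = np.linalg.solve(L, x)                 # same starting point in y-coordinates, where x = L y

for _ in range(50):
    x = x - eta * A @ grad_f(x)           # "matrix" gradient descent on f
    y = y - eta * L.T @ grad_f(L @ y)     # plain gradient descent on g(y) = f(L y)

print(np.allclose(x, L @ y))              # True: the two iterate sequences coincide
```

So, with $A = LL^T$, the 'matrix' update on $f$ is ordinary gradient descent after the linear change of variables $x = Ly$.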
1 Answer
While I don't know if this 'matrix gradient descent' has a formal name, we can note that it has nice properties in the gradient flow case. If \begin{align*} \frac{d x(t)}{d t}&=-A\nabla f(x(t)), \end{align*} then \begin{align*} \frac{d f(x(t))}{d t}&=\nabla f(x(t))^T\frac{d x(t)}{d t}\\ &=-\nabla f(x(t))^T A \nabla f(x(t)), \end{align*} where the first equality is the chain rule. So if $A$ is SPD, then $\frac{d f(x(t))}{d t}\le 0$, with equality iff $\nabla f(x(t))=0$, i.e. iff $x(t)$ is a critical point. Thus, along the 'matrix' gradient flow, $f$ is non-increasing and strictly decreasing away from critical points, so the flow can only come to rest at a critical point of $f$ (and, under mild extra assumptions such as bounded sublevel sets, it converges to one).
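As a quick numerical sanity check of this computation, the sketch below integrates the flow with a crude forward-Euler scheme on an illustrative quadratic $f$ and SPD $A$ (my own choices, not from the question): $f(x(t))$ should be non-increasing, and its finite-difference time derivative should match $-\nabla f(x(t))^T A\nabla f(x(t))$ up to the $O(\Delta t)$ discretization error.

```python
import numpy as np

# Illustrative quadratic objective and SPD matrix A (not from the question).
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
f      = lambda x: 0.5 * x @ Q @ x - b @ x
grad_f = lambda x: Q @ x - b
A = np.array([[2.0, 0.5], [0.5, 1.0]])

dt = 1e-3                                   # forward-Euler step for dx/dt = -A grad f(x)
x  = np.array([5.0, 5.0])
records = []
for _ in range(5000):
    g = grad_f(x)
    records.append((f(x), -g @ A @ g))      # f(x(t)) and the predicted df/dt
    x = x + dt * (-A @ g)

records = np.array(records)
fd_derivative = np.diff(records[:, 0]) / dt           # finite-difference df/dt

print(bool(np.all(np.diff(records[:, 0]) <= 0.0)))    # True: f(x(t)) is non-increasing
print(np.allclose(fd_derivative, records[:-1, 1],
                  rtol=1e-2, atol=1e-8))              # True: matches -grad^T A grad up to O(dt)
```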