
Gradient descent is usually motivated by a first-order Taylor approximation. If we have a function $f:\mathbb{R}^p\rightarrow\mathbb{R}$ and we start at a point $x\in \mathbb{R}^p$, then we can look at the first-order Taylor approximation \begin{align} f(x+\Delta x)\approx f(x)+\langle\nabla f(x),\Delta x \rangle_{l^2}. \end{align} We want the update $\Delta x$ to point in the same direction as $-\nabla f(x)$ in order to minimize $\langle\nabla f(x),\Delta x \rangle_{l^2}$ over directions of fixed length.

However, could we use a different inner product? For instance, suppose we have an SPD matrix $A\in \mathbb{R}^{p\times p}$ and we use the inner product $\langle x,y\rangle_A=x^T A y$. Then we could write the Taylor approximation as \begin{align} f(x+\Delta x)\approx f(x)+\langle \nabla f(x),\Delta x\rangle_A, \end{align} and the corresponding gradient descent update would be \begin{align} x_{n+1}=x_n-\eta A\nabla f(x_n), \end{align} where $\eta$ is the learning rate.

Is this type of gradient descent an actual procedure? If so, what is it called? If not, what is 'wrong' with it? I'm asking this because this paper 'seems' to be doing an infinite dimensional/functional version of this procedure.
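To make the proposed update concrete, here is a minimal sketch I put together in plain NumPy (the quadratic objective, the SPD matrix $A$, and the step size rule are my own arbitrary choices, not from the paper):

```python
# Minimal sketch of the proposed update x_{n+1} = x_n - eta * A * grad f(x_n)
# on a toy quadratic f(x) = 0.5 * x^T Q x, with an arbitrary SPD matrix A.
import numpy as np

rng = np.random.default_rng(0)

# Toy objective: f(x) = 0.5 x^T Q x with Q SPD, so grad f(x) = Q x and the minimizer is x = 0.
M = rng.standard_normal((3, 3))
Q = M @ M.T + 3.0 * np.eye(3)
f = lambda x: 0.5 * x @ Q @ x
grad_f = lambda x: Q @ x

# Arbitrary SPD matrix A defining the inner product <x, y>_A = x^T A y.
B = rng.standard_normal((3, 3))
A = B @ B.T + np.eye(3)

# Step size chosen small enough for the preconditioned iteration to contract:
# eta * lambda_max(A Q) <= eta * ||A||_2 * ||Q||_2 <= 1 < 2.
eta = 1.0 / (np.linalg.norm(A, 2) * np.linalg.norm(Q, 2))

x = rng.standard_normal(3)
for _ in range(500):
    x = x - eta * A @ grad_f(x)   # the 'matrix' gradient step

print(f(x))  # close to 0: the iteration still finds the minimizer
```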

mlstudent

1 Answer


While I don't know if this 'matrix gradient descent' has a formal name, we can note that it has nice properties in the gradient flow case. If \begin{align*} \frac{\partial x(t)}{\partial t}&=-A\nabla f(x(t)), \end{align*} then, applying the chain rule, \begin{align*} \frac{\partial f(x(t))}{\partial t}&=\nabla f(x(t))^T\frac{\partial x(t)}{\partial t}\\ &=-\nabla f(x(t))^T A \nabla f(x(t))\le 0, \end{align*} since $A$ is SPD. So $f$ is non-increasing along the flow, and $\frac{\partial f(x(t))}{\partial t}=0$ iff $\nabla f(x(t))=0$, i.e. iff $x(t)$ is a critical point. Thus the 'matrix' gradient flow decreases $f$ monotonically and can only come to rest at critical points; as with standard gradient flow, under mild extra assumptions (e.g. compact sublevel sets) its limit points are critical points of $f$.
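As a numerical sanity check of the argument above, here is a minimal sketch that discretizes the flow with forward Euler and verifies that $f$ is non-increasing along the trajectory (the test function, the SPD matrix $A$, and the step size are my own arbitrary choices):

```python
# Forward-Euler discretization of the flow dx/dt = -A grad f(x), checking that
# f(x(t)) is non-increasing, as the Lyapunov argument above predicts.
import numpy as np

rng = np.random.default_rng(1)

# A non-quadratic test objective (2-D Rosenbrock) and its gradient.
def f(x):
    return (1.0 - x[0])**2 + 100.0 * (x[1] - x[0]**2)**2

def grad_f(x):
    return np.array([
        -2.0 * (1.0 - x[0]) - 400.0 * x[0] * (x[1] - x[0]**2),
        200.0 * (x[1] - x[0]**2),
    ])

# Arbitrary SPD matrix A, rescaled so its largest eigenvalue is 1.
B = rng.standard_normal((2, 2))
A = B @ B.T + np.eye(2)
A = A / np.linalg.norm(A, 2)

x = np.array([-1.5, 2.0])
dt = 1e-4                      # small step so Euler roughly tracks the continuous flow
values = [f(x)]
for _ in range(50_000):
    x = x - dt * A @ grad_f(x)
    values.append(f(x))

print(np.all(np.diff(values) <= 1e-12))  # True: f never increases (up to rounding)
print(values[0], "->", values[-1])       # f has decreased monotonically from its start value
```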

mlstudent