
Suppose $f$ is a real-valued function of the matrix $X \in \mathbb{R}^{m \times n}$. We typically define the gradient of $f$, written $\nabla f(X)$, as the $m \times n$ matrix whose $(i,j)$th entry is the corresponding partial derivative of $f$, so that $$(\nabla f(X))_{ij} = \frac{\partial f}{\partial x_{ij}}(X).$$
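For a concrete numerical reading of this definition, here is a minimal sketch in Python; the particular choice $f(X)=\tfrac12\lVert X\rVert_F^2$ is only an illustrative example, not anything canonical:

```python
import numpy as np

# Illustrative f: f(X) = 0.5 * ||X||_F^2, whose gradient is the m x n matrix
# with (i,j) entry df/dx_ij = x_ij, i.e. grad f(X) = X.
def f(X):
    return 0.5 * np.sum(X ** 2)

def grad_f(X):
    return X

# Finite-difference check of a single partial derivative, df/dx_{0,1}.
m, n = 3, 4
X = np.random.randn(m, n)
eps = 1e-6
E = np.zeros((m, n)); E[0, 1] = 1.0          # basis matrix E_{0,1}
fd = (f(X + eps * E) - f(X - eps * E)) / (2 * eps)
print(np.isclose(fd, grad_f(X)[0, 1]))       # True up to rounding
```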

So, how do you define the Hessian of $f$? Will this be some higher-dimensional tensor?

If I "flatten" $X$ into a vector (or more formally, fix a basis of $\mathbb{R}^{m \times n}$ and express $X$ using coordinates in that basis), then the gradient of $f$ becomes a vector again (written in the coordinates for that basis.) Then, I think we could just write the Hessian as a matrix, as usual for maps $f:\mathbb{R}^n \to \mathbb{R}$. Is this the correct approach?

  • Yes, that’s fine, but this is all very arbitrary, so it’s a literal nightmare when trying to compare different sources. See Taylor Series coordinate free form and Differentiation definition for spaces other than $\Bbb{R}^n$ for the relevant definitions and theorem. – peek-a-boo Jun 08 '23 at 22:05
  • @peek-a-boo Thank you for these. Would you be able to recommend some books to learn these theorems from? As an aside, would you recommend learning tensor calculus to clear up some of this stuff? – TheProofIsTrivium Jun 08 '23 at 22:20
  • tensor calculus usually means tensor analysis on manifolds, so no you don’t need that (not for this anyway, but differential geometry is a cool subject, which comes after this). What you need is a sufficiently decent/general linear algebra text and a multivariable calculus text. Loomis and Sternberg was one of my favorites when learning this material (chapters 1 and 2 cover more than enough linear algebra, chapter 3 covers a lot of differential calculus). Henri Cartan’s differential calculus, as the name suggests, covers differential calculus, though this may be slightly harder to digest. – peek-a-boo Jun 08 '23 at 22:23
  • Awesome, thank you very much – TheProofIsTrivium Jun 08 '23 at 22:34

1 Answer


This doesn't address the question directly but is more like a long comment.

The main problem with the full Hessian is that it contains $(mn)^2$ second partial derivatives, which quickly becomes unmanageable when $m \times n$ is itself large.

Often, especially for large optimization problems, you might prefer to use the directional Hessian, which is more tractable for large matrices and is supported by standard solvers (scipy.optimize.minimize, for example). If that's what you're looking for, then the directional Hessian (in a direction $P\in\mathbb R^{m\times n}$) is defined as: $$\nabla^2_P(f)=\nabla(P:\nabla f)$$

By sampling a few directions $P$, a well-written algorithm can then get a good idea of the full Hessian without having to compute it entirely.
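For what it's worth, here is a rough sketch of how this looks as a Hessian-vector product in practice. The quadratic $f$ and the forward-difference approximation of $\nabla^2_P$ below are only illustrative choices; the `hessp` argument of `scipy.optimize.minimize` is the standard way to pass such a product to the solver:

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative example: f(X) = 0.5 * ||A X - B||_F^2.
m, n = 3, 4
A = np.random.randn(5, m)
B = np.random.randn(5, n)

def f(X):
    return 0.5 * np.sum((A @ X - B) ** 2)

def grad(X):
    return A.T @ (A @ X - B)                     # m x n gradient matrix

def hessp(x, p, t=1e-6):
    """Directional Hessian (Hessian-vector product) in the direction P,
    approximated by a forward difference of the gradient."""
    X, P = x.reshape(m, n), p.reshape(m, n)
    return ((grad(X + t * P) - grad(X)) / t).ravel()

res = minimize(lambda x: f(x.reshape(m, n)),
               np.zeros(m * n),
               jac=lambda x: grad(x.reshape(m, n)).ravel(),
               hessp=hessp,
               method="trust-ncg")
print(res.success, f(res.x.reshape(m, n)))
```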

PC1
  • Thank you. Is this related to Hessian-vector products? – TheProofIsTrivium Jun 08 '23 at 22:59
  • I fixed a mistake in the equation, the product is the Frobenius inner product: $A:B=\mathrm{Tr}(A^TB)$. If you have a way to express $\nabla f$ then often you can also express $\nabla^2_P$. – PC1 Jun 09 '23 at 00:04
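As a quick sanity check of that identity (purely illustrative), the Frobenius inner product can be computed either via the trace or as an elementwise sum:

```python
import numpy as np

A, B = np.random.randn(3, 4), np.random.randn(3, 4)
# A : B = Tr(A^T B) = sum_ij A_ij * B_ij
print(np.isclose(np.trace(A.T @ B), np.sum(A * B)))   # True
```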