
Let $f: \mathbb{R}^{m\times n} \rightarrow \mathbb{R}^m$ be given by $f(W)=Wx$ for a fixed $x \in \mathbb{R}^n$.

Question 1: How is the Jacobian matrix defined for a vector-valued function whose variable is a matrix?

Question 2: Using the answer to the above, how can one generalize the Jacobian formula for a composition of functions whose variables are matrices? By generalization I mean the following: for $g(x)=h(u(x))$, where $x \in \mathbb{R}^n$, $u: \mathbb{R}^n \rightarrow \mathbb{R}^l$, and $h: \mathbb{R}^l \rightarrow \mathbb{R}^m$, the chain rule reads

$$ J_x(g) = J_u(h)\, J_x(u), $$ where $J_x(g)$ is the Jacobian of $g$ with respect to $x$.

The above simply says that the Jacobian of a composition of functions is the product of the Jacobians. How can one mimic this when the variable is a matrix?

My thoughts: For the given function $f(W)=Wx$ one can write the following:

$$ f(W)=Wx = \begin{bmatrix} W_{1\bullet}x\\ \vdots\\ W_{m\bullet}x \end{bmatrix} $$ where $W_{m\bullet}$ is the $m$-th row of $W$. Now the Jacobian could be either $\begin{bmatrix} x^T\\ \vdots\\ x^T \end{bmatrix} \in \mathbb{R}^{m \times n}$ or $\begin{bmatrix} x& \dots& x \end{bmatrix} \in \mathbb{R}^{n \times m}$.

  • the basic theory and definitions for differential calculus on normed spaces are the same in any number of dimensions; see this answer of mine. In particular, the concepts of the derivative as a linear map, the standard sum rule, chain rule, etc. all carry over virtually unchanged. The thing is, if you want to interpret the derivative (which by definition is a linear transformation) in terms of a matrix, then you need to choose ordered bases for the domain and target. This is unnecessary (and confusing) for many questions with matrices as domain. – peek-a-boo Jan 31 '21 at 08:16
  • However, when your domain is itself a space of matrices, there is no "standard basis", and even if you pick a basis, there's often room to reorder it, and different sources use different orderings, which can make certain formulas look different. So once again, the fundamental concept is that of "derivatives as linear approximations", not the "Jacobian matrix", so unless for some reason you really have to work with the matrix representation of the derivative, I would avoid it. – peek-a-boo Jan 31 '21 at 08:20
  • @peek-a-boo: I am fine with the definition of the derivative on normed spaces. By "is it defined" I meant to ask whether we can handle the case where matrices are involved. My question is a problem where a matrix has to be involved. –  Jan 31 '21 at 08:21
  • then like I said, introduce a basis for the space of matrices (your domain) and the usual ordered basis for the target space, and calculate the matrix representation of the derivative relative to these bases. The result will be an $m \times (mn)$ matrix. And if you want the Hessian then... good luck expressing that as a matrix. But anyway, the $f$ in your first line is a linear transformation, so it is its own derivative everywhere. – peek-a-boo Jan 31 '21 at 08:24
  • @peek-a-boo: My initial goal is to calculate the Hessian of $y=f_2(Vf_1(Wx+b)+c)$, where $V$ and $W$ are matrices and $b$ and $c$ are vectors. I want to do it compactly using the concept of the Jacobian. The given composition may go deeper. –  Jan 31 '21 at 08:25
  • @peek-a-boo: $f_1$ is a vector-valued function whose number of rows is the same as that of $W$, and $f_2$ is also a vector-valued function whose number of rows is the same as that of $V$. What basis or strategy would you suggest? –  Jan 31 '21 at 08:29
  • If $b$ and $c$ are vectors, then surely $Wx$ and $Vf_1$ are also vectors. Where is the need to use matrices as input? – syockit Jan 31 '21 at 08:41
  • @syockit: Because I was thinking of this problem as having $\theta_1 = [W, b]$ and $\theta_2 = [V, c]$, and I wanted to have the Jacobian of $f_1$ in terms of $\theta_1$ and of $f_2$ in terms of $\theta_2$. –  Jan 31 '21 at 09:10

1 Answer


As peek-a-boo commented, the concept doesn't change whether the input is a matrix or a vector. If the notation is confusing, I suggest introducing a mapping that "flattens" the matrix into a vector.
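For concreteness, here is a minimal NumPy sketch of that flattening idea (the row-major flattening convention and all variable names are my own assumptions, not part of the original answer): once $W$ is flattened, $f(W)=Wx$ becomes an ordinary map $\mathbb{R}^{mn}\to\mathbb{R}^m$ with an ordinary $m\times mn$ Jacobian.

```python
import numpy as np

m, n = 3, 4
rng = np.random.default_rng(0)
x = rng.standard_normal(n)

def f_flat(w):
    """f(W) = W x, with W passed as a row-major flattened vector of length m*n."""
    return w.reshape(m, n) @ x

# Central-difference Jacobian of the flattened map: an ordinary m x (m*n) matrix.
w0 = rng.standard_normal(m * n)
eps = 1e-6
J = np.empty((m, m * n))
for j in range(m * n):
    e = np.zeros(m * n)
    e[j] = eps
    J[:, j] = (f_flat(w0 + e) - f_flat(w0 - e)) / (2 * eps)

# With row-major flattening this is exactly kron(I_m, x^T): one block of x^T per row of f.
print(np.allclose(J, np.kron(np.eye(m), x)))  # True
```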

For $f: \mathbb{R}^n \to \mathbb{R}^m$, its Jacobian is $J(f)\in\mathbb{R}^{m\times n}$. So if $f: \mathbb{R}^{m\times n} \to \mathbb{R}^m$, it should follow that $J(f) \in \mathbb{R}^{m\times m\times n}$. What was missing from your attempted solution? You only differentiated against $W_i$ for each row $i$, while for a Jacobian you need to differentiate against all inputs: e.g. $W_1 x$ needs to be differentiated against $W_2, W_3,\dots$ as well.

For your function $f(W)=Wx$, the Jacobian is $$ J_W(f) = \frac{\partial}{\partial W}f(W)=\frac{\partial}{\partial W}Wx = I_m\otimes x $$ where $I_m$ is the $m\times m$ identity matrix, $\otimes$ is the Kronecker product, and $I_m\otimes x$ is to be read as an $m\times m\times n$ array rather than an ordinary matrix. If you have to express this in matrix notation, then I guess it will look like this:

$$ J_W(f) =\left[\begin{matrix} \left[\begin{matrix}\frac{\partial}{\partial W_1}W_1\cdot x \\ \frac{\partial}{\partial W_2}W_1\cdot x \\ \vdots\end{matrix}\right]\\ \left[\begin{matrix}\frac{\partial}{\partial W_1}W_2\cdot x \\ \frac{\partial}{\partial W_2}W_2\cdot x \\ \vdots\end{matrix}\right] \\ \vdots \end{matrix}\right] =\left[\begin{matrix} \left[\begin{matrix}x^T \\ \mathbf{0}^T \\ \vdots\end{matrix}\right]\\ \left[\begin{matrix}\mathbf{0}^T \\ x^T \\ \vdots\end{matrix}\right] \\ \vdots \end{matrix}\right] $$ This is going to be very cumbersome to write, so it's probably easier if you express it in terms of indices instead: $$ [J_W(f)]_{ijk} = \begin{cases}x_k & (i=j)\\0 & (i\neq j)\end{cases} $$ Or, with the Kronecker delta, $$ [J_W(f)]_{ijk} = \delta_{ij}x_k $$
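If it helps, here is a quick numerical sanity check of that index formula (a sketch of my own in NumPy, using central differences; not part of the original derivation):

```python
import numpy as np

m, n = 3, 4
rng = np.random.default_rng(0)
x = rng.standard_normal(n)
W0 = rng.standard_normal((m, n))

# Numerical Jacobian as an m x m x n tensor: J[i, j, k] = d(Wx)_i / dW_{jk}.
eps = 1e-6
J = np.empty((m, m, n))
for j in range(m):
    for k in range(n):
        E = np.zeros((m, n))
        E[j, k] = eps
        J[:, j, k] = ((W0 + E) @ x - (W0 - E) @ x) / (2 * eps)

# Compare with the index formula [J]_{ijk} = delta_{ij} x_k.
expected = np.einsum('ij,k->ijk', np.eye(m), x)
print(np.allclose(J, expected))  # True
```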

Regarding your second question $g = h\circ u$: you got it close, except that the second factor should have $x$ as its subscript. $$ J_x(g) = J_u(h)\, J_x(u) $$ Just as we earlier had $J(f)\in\mathbb{R}^{m\times n}$ for $f:\mathbb{R}^n \to \mathbb{R}^m$, the dimensions for $u$ and $h$ can be determined similarly. Let's say $u$ is a mapping from a matrix to a matrix, e.g. $u:\mathbb{R}^{m\times n}\to\mathbb{R}^{k\times l}$. Then $J(u)\in\mathbb{R}^{m\times n\times k\times l}$ (writing the input indices first this time), with components $$ [J_X(u)]_{abcd} = \frac{\partial}{\partial X_{ab}}u_{cd}(X) $$ When writing $J(h)$ you probably want to use a placeholder matrix $Y$ to represent the input of $h$ (which again is assumed to be a matrix-to-matrix mapping, but feel free to change it), so that $$ [J_Y(h)]_{cdpq} = \frac{\partial}{\partial Y_{cd}}h_{pq}(Y)\Big|_{Y=u(X)} $$ and finally, summing over the repeated indices $c$ and $d$, $$ \begin{aligned} [J_X(g)]_{abpq} &= [J_Y(h)]_{cdpq}\,[J_X(u)]_{abcd} \\ &= \frac{\partial}{\partial Y_{cd}}h_{pq}(Y)\Big|_{Y=u(X)}\,\frac{\partial}{\partial X_{ab}}u_{cd}(X) \end{aligned} $$
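The contraction over $c,d$ can be checked numerically as well. The sketch below is my own illustration (not from the answer): it picks two simple linear matrix-to-matrix maps, $u(X)=XA$ and $h(Y)=BY$, builds their Jacobian tensors by central differences, and carries out the contraction with `np.einsum`.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, l, p = 2, 3, 4, 3
A = rng.standard_normal((n, l))   # u(X) = X A maps R^{m x n} -> R^{m x l}
B = rng.standard_normal((p, m))   # h(Y) = B Y maps R^{m x l} -> R^{p x l}

u = lambda X: X @ A
h = lambda Y: B @ Y
g = lambda X: h(u(X))

def num_jacobian(func, X0, eps=1e-6):
    """Central-difference Jacobian tensor with input indices first:
    J[a, b, ...] = d func(X)[...] / d X[a, b], evaluated at X = X0."""
    out_shape = func(X0).shape
    J = np.empty(X0.shape + out_shape)
    for idx in np.ndindex(*X0.shape):
        E = np.zeros_like(X0)
        E[idx] = eps
        J[idx] = (func(X0 + E) - func(X0 - E)) / (2 * eps)
    return J

X0 = rng.standard_normal((m, n))
Ju = num_jacobian(u, X0)       # indices (a, b, c, d)
Jh = num_jacobian(h, u(X0))    # indices (c, d, p, q)

# [J_X(g)]_{abpq} = sum_{c,d} [J_Y(h)]_{cdpq} [J_X(u)]_{abcd}
Jg = np.einsum('cdpq,abcd->abpq', Jh, Ju)
print(np.allclose(Jg, num_jacobian(g, X0)))  # True
```

Since both maps here are linear, the finite differences are exact up to rounding, so the check comes out clean.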

syockit