
In the derivation of the backpropagation algorithm in Neural Network Design by Hagan et al., we consider the derivative of the scalar-valued sample loss function $\hat{F}$ with respect to a vector of "sensitivities" $\mathbf{n}^{m}$ in layer $m$ of a fully connected neural network. We find a recurrence that allows us to express $\partial \hat{F} / \partial \mathbf{n}^{m}$ in terms of $\partial \hat{F} / \partial \mathbf{n}^{m+1}$ (the sensitivities of layer $m+1$).

The authors mention an application of "the chain rule in matrix form" to obtain the following:$${\partial \hat{F} \over \partial \mathbf{n}^m} = \left({\partial \mathbf{n}^{m+1} \over \partial \mathbf{n}^{m}}\right)^{T} {\partial \hat{F} \over \partial \mathbf{n}^{m+1}}$$

Earlier, the authors present the Jacobian $\partial \mathbf{n}^{m+1} / \partial \mathbf{n}^m$ using the "numerator layout": $$J = {\partial \mathbf{n}^{m+1} \over \partial \mathbf{n}^m} = \left[\begin{matrix}{\partial n_1^{m+1} \over \partial n_1^m} & \cdots & {\partial n_1^{m+1} \over \partial n_{S^m}^m} \\ \vdots & \ddots & \vdots \\ {\partial n_{S^{m+1}}^{m+1} \over \partial n_1^m} & \cdots & {\partial n_{S^{m+1}}^{m+1} \over \partial n_{S^m}^m} \end{matrix}\right]$$ Here, $n_i^m$ denotes the $i$th component of $\mathbf{n}^m$, and $S^i$ is the dimension of the $i$th layer.

Given this Jacobian, why does a transpose and a left-multiplication appear in the recurrence expression for $\partial \hat{F} / \partial \mathbf{n}^{m}$? Why is the expression not equivalent to $D_{\mathbf{n}^{m+1}} \hat{F} \cdot J$?
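For concreteness, here is a minimal numerical sketch (my own illustration, not the book's network): it assumes a toy layer map $\mathbf{n}^{m+1} = W f(\mathbf{n}^m) + \mathbf{b}$ with $f=\tanh$ and arbitrary small sizes, and checks that $J^{T}\,\partial\hat{F}/\partial\mathbf{n}^{m+1}$ and $D_{\mathbf{n}^{m+1}}\hat{F}\cdot J$ contain the same numbers, laid out as a column versus a row.

```python
import numpy as np

# Toy layer map n^{m+1} = W f(n^m) + b with f = tanh (an assumed example,
# not the book's specific network), used to compare the two chain-rule layouts.
rng = np.random.default_rng(0)
S_m, S_mp1 = 3, 4                      # layer widths S^m and S^{m+1}
W = rng.standard_normal((S_mp1, S_m))  # hypothetical weights
b = rng.standard_normal(S_mp1)
n_m = rng.standard_normal(S_m)

a_m = np.tanh(n_m)                     # f(n^m)
n_mp1 = W @ a_m + b                    # forward map (shown for completeness)

# Jacobian dn^{m+1}/dn^m in numerator layout: shape (S^{m+1}, S^m),
# entry (i, j) = W_ij * f'(n_j^m)
J = W * (1 - a_m**2)

# A stand-in gradient of F w.r.t. n^{m+1}, e.g. handed down from the layer above
g_mp1 = rng.standard_normal(S_mp1)

grad_col = J.T @ g_mp1                 # column (gradient) convention: J^T times column
grad_row = g_mp1 @ J                   # row convention: D_{n^{m+1}}F times J

print(np.allclose(grad_col, grad_row)) # True: same numbers, transposed layouts
```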

aas
  • https://math.stackexchange.com/questions/4902028/in-linear-algebra-why-do-we-freely-transpose-some-matrices/4902035#4902035 A justification/example I made about why a transpose had to appear in a calculation – random0620 May 23 '24 at 06:05
  • Generally, with this kind of thing, if you write out the definition of the derivative you'll see why it has to appear – random0620 May 23 '24 at 06:06
  • As in, you're recommending that I re-derive the general multivariable chain rule via $\varepsilon$-$\delta$ arguments? – aas May 23 '24 at 06:18

1 Answer


Like the notation, $\def\qty#1{\left( #1 \right)}$ terminology is also a bit chaotic in machine learning. I assume that "sensitivities" refers to the partial derivatives of the loss function w.r.t. the network parameters. For example, assume we have calculated $\frac{\partial {\cal L}}{\partial Z_i^L}$, the sensitivity of the loss function ${\cal L}$ w.r.t. the $i$th output of the $L$th layer, $Z_i^L$. We know that
$$ Z_i^L=\sigma\qty{\sum_j W_{ij}^L Z_j^{L-1}+b_i^{L}} =\sigma\qty{Y_i^L} $$
and so
$$ dZ_i^L=\sigma'\qty{Y_i^L} \sum_j W_{ij}^L\, dZ_j^{L-1}. $$
We have already calculated that
$$ d {\cal L}=\sum_i\frac{\partial {\cal L}}{\partial Z_i^L}\, dZ_i^L, $$
which is a standard application of the chain rule. Therefore,
\begin{align} d {\cal L} & =\sum_i\frac{\partial {\cal L}}{\partial Z_i^L} \sigma'\qty{Y_i^L}\sum_j W_{ij}^L\, dZ_j^{L-1} \\ &= \sum_j \qty{\sum_i\frac{\partial {\cal L}}{\partial Z_i^L} \sigma'\qty{Y_i^L}W_{ij}^L} dZ_j^{L-1}. \end{align}
Applying the chain rule again, we have
\begin{equation} \frac{\partial {\cal L}}{\partial Z_j^{L-1}} = \sum_i\frac{\partial {\cal L}}{\partial Z_i^L} \sigma'\qty{Y_i^L}W_{ij}^L. \tag{1} \end{equation}

The quantities $\frac{\partial {\cal L}}{\partial Z_j^{L-1}}$ and $\frac{\partial {\cal L}}{\partial Z_i^L}$ are components of vectors, and $A_{ij}=\sigma'\qty{Y_i^L}W_{ij}^L$ defines a matrix. The product ${\bf y}={\bf A}{\bf x}$ is written componentwise as
$$ y_j = \sum_i A_{ji}x_i, $$
but note that in $(1)$ we have
$$ \frac{\partial {\cal L}}{\partial Z_j^{L-1}} = \sum_i A_{ij} \frac{\partial {\cal L}}{\partial Z_i^L}, $$
where the sum runs over the first index of $A$. This works as the product of a matrix with a vector only if we transpose:
$$ \frac{\partial {\cal L}}{\partial Z_j^{L-1}} = \sum_i \qty{A^T}_{ji} \frac{\partial {\cal L}}{\partial Z_i^L}. $$
Since
\begin{align} dZ_i^L & =\sigma'\qty{Y_i^L} \sum_j W_{ij}^L\, dZ_j^{L-1} \\ & = \sum_j \frac{\partial Z_i^L}{\partial Z_j^{L-1}}\, dZ_j^{L-1}, \end{align}
it follows that
$$ \frac{\partial Z_i^L}{\partial Z_j^{L-1}} = \sigma'\qty{Y_i^L} W_{ij}^L = A_{ij}, $$
and so
$$ \frac{\partial {\cal L}}{\partial Z_j^{L-1}} = \sum_i \frac{\partial Z_i^L}{\partial Z_j^{L-1}}\, \frac{\partial {\cal L}}{\partial Z_i^L}, $$
or, using matrix notation,
$$ \frac{\partial {\cal L}}{\partial {\bf Z}^{L-1}} = \qty{\frac{\partial {\bf Z}^L}{\partial {\bf Z}^{L-1}}}^T \frac{\partial {\cal L}}{\partial {\bf Z}^L}. $$
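As a sanity check, here is a minimal numerical sketch of $(1)$ and the final matrix identity. The logistic $\sigma$, the squared-error loss, and the layer sizes are illustrative assumptions, not part of the derivation above.

```python
import numpy as np

# Numerical check of equation (1): dL/dZ^{L-1}_j = sum_i dL/dZ^L_i * sigma'(Y^L_i) * W_ij,
# assuming a logistic sigma and a squared-error loss (illustrative choices).
sigma  = lambda y: 1.0 / (1.0 + np.exp(-y))
dsigma = lambda y: sigma(y) * (1.0 - sigma(y))

rng = np.random.default_rng(1)
W, b = rng.standard_normal((4, 3)), rng.standard_normal(4)   # layer L
target = rng.standard_normal(4)

def forward(z_prev):
    y = W @ z_prev + b                   # Y^L
    z = sigma(y)                         # Z^L
    loss = 0.5 * np.sum((z - target) ** 2)
    return y, z, loss

z_prev = rng.standard_normal(3)          # Z^{L-1}
y, z, _ = forward(z_prev)

dL_dz = z - target                       # dL/dZ^L for the squared-error loss
# Equation (1), i.e. the transposed-Jacobian product A^T dL/dZ^L with A_ij = sigma'(Y_i) W_ij
dL_dz_prev = (dL_dz * dsigma(y)) @ W

# Central finite-difference check of each component of dL/dZ^{L-1}
eps = 1e-6
fd = np.array([
    (forward(z_prev + eps * e)[2] - forward(z_prev - eps * e)[2]) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(dL_dz_prev, fd, atol=1e-5))   # True
```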

Ted Black
  • Thanks for the detailed derivation! I'm trying to quickly adapt your notation to the slightly different quantities I'm computing. That said, I am more comfortable with vector quantities than detailed index manipulation. Is it fair to summarize this by saying that, given that the chain rule for total derivatives gives $d(f \circ g)_a = df_{g(a)} \cdot dg_a$, the gradient, being the transpose, would give $\nabla (f \circ g)(x) = J_g(x)^{T}\, \nabla f(g(x))$, where $J_g(x)$ is the Jacobian of $g$ at $x$? – aas May 23 '24 at 19:01
  • Which quantities are you computing? – Ted Black May 23 '24 at 22:16
  • These particular authors express the recurrence in terms of $\mathbf{n}^L = W^{L} \cdot f^{L-1}(\mathbf{n}^{L-1}) + \mathbf{b}^{L}$. It's the same recurrence with some terms rearranged. – aas May 23 '24 at 22:28
  • Ultimately, I think this answer addresses my confusion exactly. – aas May 23 '24 at 22:31
  • The "sensitivities" in this case are the terms $\partial \mathcal{L} / \partial \mathbf{n}_i^{L}$, which I mistyped in the body of the question. – aas May 23 '24 at 22:35
  • This does not make sense; the partial derivatives are w.r.t. ${\bf z}^L=f^L({\bf n}^L)$, not ${\bf n}^L$. $\frac{\partial \hat{F}}{\partial {\bf z}^L}$ is also required to calculate $\frac{\partial \hat{F}}{\partial {\bf W}^L}$ and $\frac{\partial \hat{F}}{\partial {\bf b}^L}$. – Ted Black May 23 '24 at 22:41
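Picking up the last comment: a minimal sketch of how the parameter gradients follow once the layer's sensitivity is known, using the standard results $\partial \hat{F}/\partial W^L = \mathbf{s}^L (\mathbf{a}^{L-1})^T$ and $\partial \hat{F}/\partial \mathbf{b}^L = \mathbf{s}^L$ with $\mathbf{s}^L = \partial \hat{F}/\partial \mathbf{n}^L$. The $\tanh$ transfer function, sizes, and values are placeholders, not the book's example.

```python
import numpy as np

# Parameter gradients from the sensitivity of layer L. All sizes/values are
# placeholders; f = tanh is an assumed transfer function for illustration.
rng = np.random.default_rng(2)
S_L, S_Lm1 = 4, 3
a_prev = rng.standard_normal(S_Lm1)      # a^{L-1} = f^{L-1}(n^{L-1})
n_L = rng.standard_normal(S_L)           # net input of layer L
dF_dz = rng.standard_normal(S_L)         # dF/dz^L, with z^L = f^L(n^L)

s_L = dF_dz * (1 - np.tanh(n_L) ** 2)    # sensitivity dF/dn^L = dF/dz^L * f'(n^L)
dF_dW = np.outer(s_L, a_prev)            # dF/dW^L_{ij} = s^L_i * a^{L-1}_j
dF_db = s_L                              # dF/db^L_i   = s^L_i
print(dF_dW.shape, dF_db.shape)          # (4, 3) (4,)
```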