
I have been trying for the past few days to find a solution to this very simple problem, yet every time I seem to hit a wall. (PS: please excuse the poor notations, it's my first time using this website).

Let's say we have $C = AB$, $A$ and $B$ matrices so that $C$ is defined.

What is $\frac{\partial C}{\partial A}$ and $\frac{\partial C}{\partial B}$, i.e., the derivatives of $C$ w.r.t $A$ and $B$ respectively?

Let us assume size $A = m \times n$, size $B = n \times p$, size $C = m \times p$

Writing $C_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$ for $i = 1, \dots, m$ and $j = 1, \dots, p$,

we get $\frac{\partial C_{ij}}{\partial a_{ls}} = \frac{\partial (\sum_{k=1}^{n} a_{ik} b_{kj})}{\partial a_{ls}} = b_{sj}$ for $i = l$, and $0$ otherwise.
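(For concreteness, a quick numerical check of this index formula; the sizes and index choices below are arbitrary, picked just for the test.)

```python
import numpy as np

# Arbitrary sizes m, n, p for the check.
m, n, p = 2, 3, 4
rng = np.random.default_rng(0)
A, B = rng.normal(size=(m, n)), rng.normal(size=(n, p))

eps = 1e-6
i, j, l, s = 1, 2, 1, 0        # arbitrary indices with i = l

# Finite-difference approximation of dC_ij / dA_ls.
A_pert = A.copy()
A_pert[l, s] += eps
fd = ((A_pert @ B)[i, j] - (A @ B)[i, j]) / eps

print(fd, B[s, j])             # the two values agree (up to O(eps))
```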

Then shouldn't $\frac{\partial C}{\partial A}$ be the rate of change for each $C_{ij}$ w.r.t each element of $A$?

So that $\frac{\partial C}{\partial A}$ is a matrix $D$ of size $C$ where each $D_{ij}$ is a matrix of size $A$.

For instance $C = AB$, given $A$ of size $1 \times 3$ and $B$ of size $3 \times 2$.

$A = \begin{pmatrix}a_1 & a_2 & a_3 \end{pmatrix}$

$B = \begin{pmatrix}b_{11} & b_{12} \\ b_{21} & b_{22} \\ b_{31} & b_{32}\end{pmatrix}$

$C = \begin{pmatrix} a_1 b_{11} + a_2 b_{21} + a_3 b_{31} & a_1 b_{12} + a_2 b_{22} + a_3 b_{32} \end{pmatrix}$

Then $\frac{\partial C_1}{\partial B} = \frac{\partial (a_1 b_{11} + a_2 b_{21} + a_3 b_{31})}{\partial B} = \begin{pmatrix} a_1 & 0 & a_2 & 0 & a_3 & 0 \end{pmatrix}$ and $\frac{\partial C_2}{\partial B} = \frac{\partial (a_1 b_{12} + a_2 b_{22} + a_3 b_{32})}{\partial B} = \begin{pmatrix} 0 & a_1 & 0 & a_2 & 0 & a_3 \end{pmatrix}$

So that $\frac{\partial C}{\partial B} = \begin{pmatrix} a_1 & 0 & a_2 & 0 & a_3 & 0 \\ 0 & a_1 & 0 & a_2 & 0 & a_3 \end{pmatrix}$, is of size $2 \times 6$.

By the same procedure, $\frac{\partial C}{\partial A} = B^T$.

However, in the machine learning equations, it is given that $\frac{\partial C}{\partial B} = A^T$, which does not match the $2 \times 6$ matrix computed above.
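For concreteness, here is a small numpy sketch of this example (my own check; the numeric values stand in for $a_1, a_2, a_3$ and the entries of $B$). It builds the full four-index array $\frac{\partial C_{ij}}{\partial B_{kl}}$ by finite differences and flattens it, reproducing the $2 \times 6$ matrix above:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0]])          # 1 x 3, stands in for (a1, a2, a3)
B = np.arange(1.0, 7.0).reshape(3, 2)    # 3 x 2, arbitrary entries
C = A @ B                                # 1 x 2

# Full Jacobian dC_{ij} / dB_{kl}: a 4-index array of shape (1, 2, 3, 2).
eps = 1e-6
J = np.zeros(C.shape + B.shape)
for k in range(B.shape[0]):
    for l in range(B.shape[1]):
        Bp = B.copy()
        Bp[k, l] += eps
        J[:, :, k, l] = (A @ Bp - C) / eps

print(J.shape)            # (1, 2, 3, 2)
print(J.reshape(2, 6))    # row 1 ~ (a1, 0, a2, 0, a3, 0), row 2 ~ (0, a1, 0, a2, 0, a3)
```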

To be more precise, the context of this query is for machine learning and backpropagation.

We have $\frac{\partial L}{\partial W} = \frac{\partial L}{\partial \hat{Y}} \frac{\partial \hat{Y}}{\partial \text{out}} \frac{\partial \text{out}}{\partial W}$, where $L$ is a scalar function (the loss), $\hat{Y}$ is a $1 \times 2$ matrix (the activated output), $\text{out}$ is a $1 \times 2$ vector (pre-activation), and $W$ is a $3 \times 2$ matrix (the weights).

If you could also explain why we actually write:

$\frac{\partial L}{\partial W} = \frac{\partial \text{out}}{\partial W} \frac{\partial \hat{Y}}{\partial \text{out}} \frac{\partial L}{\partial \hat{Y}}$, instead of the forward notation used earlier $\frac{\partial L}{\partial W} = \frac{\partial L}{\partial \hat{Y}} \frac{\partial \hat{Y}}{\partial \text{out}} \frac{\partial \text{out}}{\partial W}$

Is there an explanation beyond "because it fits for the matrix product, so we rearrange the terms in the chain rule"?

Asking Chat-GPT these questions yields different answers every time.

Thank you for your time.

Edit: the formatting was done using Chat-GPT. I also checked the linked question Not understanding derivative of a matrix-matrix product, and there the result for $AX$ is a Kronecker product, which also seems not to be the right size.

The part where I usually get stuck is computing every $D_{ij}$ for $D = \frac{\partial C}{\partial B}$.

Other EDIT: here is more context for the chain rule derivation above.

$L(Y, \hat{Y}) = \frac{1}{2m} \sum_{i=1}^{m}(Y^{(i)} - \hat{Y}^{(i)})^2$ where $\hat{Y}$ is the predicted output

$ \hat{Y} = \sigma_{o}(out) $

$ out = h_{2} W $

where $\sigma_{o}$ is the activation function for the out layer,

$h_{2}$ is of size $1 \times 3$, $W$ of size $3 \times 2$, $out$ of size $1 \times 2$ and $\hat{Y}$ of size $1 \times 2$

Here the activation function $\sigma_{o}(X)$ is applied element-wise to $X$, a matrix.
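For concreteness, here is a minimal forward-pass sketch under these sizes (the sigmoid, the random numbers and the variable names are my own placeholders, just to make the shapes visible):

```python
import numpy as np

rng = np.random.default_rng(0)
h2 = rng.normal(size=(1, 3))            # 1 x 3 hidden activation
W  = rng.normal(size=(3, 2))            # 3 x 2 weights
Y  = rng.normal(size=(1, 2))            # 1 x 2 target

sigma_o = lambda X: 1.0 / (1.0 + np.exp(-X))   # element-wise activation (assumed sigmoid)

out   = h2 @ W                          # 1 x 2 pre-activation
Y_hat = sigma_o(out)                    # 1 x 2 activated output
L = 0.5 * np.mean((Y - Y_hat) ** 2)     # scalar: 1/(2m) sum over the m = 2 outputs here

print(out.shape, Y_hat.shape, float(L))  # (1, 2) (1, 2) <scalar>
```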

  • Hello, and welcome to MSE! I know it's your first question, but this question will not be well-received unless you format your math using MathJax (especially since it contains a lot of math). For some basic information about writing mathematics at this site see, e.g., here, here, here and here. – Christian E. Ramirez Mar 11 '23 at 04:17
  • The formatting should be slightly better now! – hau louis Mar 11 '23 at 04:42
  • Your dimensions are already off, as with the example you give $\frac{\partial C}{\partial B}$ should be a $m\times p \times n \times p$ rank $4$ tensor. In the example you computed, you found the derivative to be a $2\times 6$ matrix but should really be a $1\times 2 \times 3 \times 2$ object. Although usually, an index of range $1$ is ignored and $A$ (and by extension, $C$) would be treated like vectors. That still means $\frac{\partial C}{\partial B}$ should be a rank $3$, $2 \times 3 \times 2$ tensor – Ninad Munshi Mar 11 '23 at 05:41
  • Right now, your question makes no sense since the dimensions seem wrong: what is the (matrix) relation that links out, W and Y? – Steph Mar 11 '23 at 08:05
  • @NinadMunshi Right, thank you for pointing this out. I was confused by that as well. Does that mean that $\frac{\partial C}{\partial B} = \begin{pmatrix} \begin{pmatrix}a_1 & 0 \\ a_2 & 0 \\ a_3 & 0\end{pmatrix} , \begin{pmatrix} 0 & a_1 \\ 0 & a_2 \\ 0 & a_3\end{pmatrix} \end{pmatrix}$? If so, how does that correlate with $A^T$? – hau louis Mar 11 '23 at 14:29
  • I edited the post to add the relations with $L, \hat{Y} , out, W$, though my main question is more about $\frac{\partial C}{\partial A}$ and $\frac{\partial C}{\partial B}$ – hau louis Mar 11 '23 at 14:53
  • ok. but that is not how backprop proceeds. What you have to do is to backprop the sensitivity from one layer to the previous one. I suggest that you start to compute the gradient of the loss function wrt $\hat{\mathbf{y}}$ and pursue until $\mathbf{W}$... Also I think it is a good habit to consider vectors to be organized columnwise. – Steph Mar 11 '23 at 15:16
  • @Steph thank you for answering. "I suggest that you start to compute the gradient of the loss function wrt $\hat{y}$ and pursue until $W$." Isn't that exactly what we are doing with the chain rule $\frac{\partial L}{\partial W} = \frac{\partial L}{\partial \hat{Y}} \frac{\partial \hat{Y}}{\partial \text{out}} \frac{\partial \text{out}}{\partial W}$? We first compute $\frac{\partial L}{\partial \hat{Y}}$ and continue on until we reach $\frac{\partial \text{out}}{\partial W}$.

    Then why is the order reversed in the actual equations?

    – hau louis Mar 11 '23 at 15:36
  • I would agree that organizing the vectors to follow standard notation, a.k.a. column-vector notation, would be a better habit. However, the equations listed above do take into account the fact that the vectors are row-vectors. The result is the same, independently of whether we use column-vector or row-vector notation. – hau louis Mar 11 '23 at 15:40
  • "Asking Chat-GPT these questions yields different answers every time." Shocking that nevertheless so many people use it ! It is also surprising that Chat GPT can sometimes actually generate correct answers , probably because of pure coincidence ! – Peter Mar 24 '24 at 14:32

2 Answers

Let $V≔ℝ^{m×n}⊕ℝ^{n×k}$ and $W≔ℝ^{m×k}$. By definition, the derivative of the function

$$f:V ⟶ W,(A, B) ⟼ A⋅B$$

at the point $(A^*, B^*)∈V$ is the unique linear function

$$ ∂f(A^*, B^*):V ⟶ W,(∆A, ∆B) ⟼ ∂f(A^*, B^*)(∆A, ∆B)$$

with the property

$$ \lim_{‖(∆A, ∆B)‖_{V}→0} \frac{‖f(A^* +{∆A}, B^*+{∆B}) - f(A^*, B^*) - ∂f(A^*, B^*)(∆A, ∆B) ‖_{W}}{‖(∆A, ∆B)‖_{V}} = 0$$

We can find $∂f(A^*, B^*)$ simply by dropping higher order terms:

$$\begin{aligned} f(A^* +{∆A}, B^*+{∆B}) - f(A^*, B^*) &= (A^* +{∆A})⋅(B^* +{∆B}) - A^*⋅B^* \\&= A^*⋅B^* + A^*⋅{∆B} + {∆A}⋅B^* + {∆A}⋅{∆B} - A^*⋅B^* \\&= A^*⋅{∆B} + {∆A}⋅B^* + o(‖(∆A, ∆B)‖_V) \end{aligned}$$

Hence, the derivative is the linear function

$$ ∂f(A^*, B^*):V ⟶ W,(∆A, ∆B) ⟼ A^*⋅{∆B} + {∆A}⋅B^*$$
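(A quick numerical illustration of this linear map, with arbitrary sizes and perturbations chosen only for the check:)

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 4, 3, 5
A, B = rng.normal(size=(m, n)), rng.normal(size=(n, k))

t = 1e-6
dA, dB = t * rng.normal(size=(m, n)), t * rng.normal(size=(n, k))   # small perturbations

exact_increment = (A + dA) @ (B + dB) - A @ B
linear_part     = A @ dB + dA @ B        # the derivative applied to (dA, dB)

# The difference is exactly the dropped term dA @ dB, which is O(t^2).
print(np.max(np.abs(exact_increment - linear_part - dA @ dB)))   # ~0 (floating point)
print(np.max(np.abs(dA @ dB)))                                   # ~1e-12, i.e. O(t^2)
```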


Now, a linear function $V→W$ can be naturally encoded as an element of the tensor product $W⊗V^*$, where $V^*$ denotes the dual space.

Thus, we can encode $∂f(A^*, B^*)$ as an element of the tensor product space $$ℝ^{m×k}⊗(ℝ^{m×n}⊕ℝ^{n×k})^* ≅ (ℝ^{m×k}⊗(ℝ^{m×n})^*)⊕(ℝ^{m×k}⊗(ℝ^{n×k})^*)$$

I.e. as a tuple of two 4-dimensional tensors. The components of these tensors are simply given by the partial derivatives:

$$ ∂f(A^*, B^*) = \Big(\frac{∂ f(A^*, B^*)_{α}}{∂ (A^*, B^*)_{β}}\Big)_{α, β} $$

In the specific case at hand, we have

$$\begin{aligned} \frac{∂ (A⋅B)_{k, l}}{∂ A_{m, n}} &= \frac{∂ ∑_{i}A_{ki}⋅B_{il}}{∂ A_{mn}} = δ_{km}B_{nl} &&⟹ \frac{∂f}{∂A} = 𝟙_m ⊗ B^T \\ \frac{∂ (A⋅B)_{k, l}}{∂ B_{r, s}} &= \frac{∂ ∑_{i}A_{ki}⋅B_{il}}{∂ B_{rs}} = δ_{ls}A_{kr} &&⟹ \frac{∂f}{∂B} = A ⊗ 𝟙_k \end{aligned}$$

Depending on convention/definition, the result $\frac{∂f}{∂A} = 𝟙_m ⊗ B$ and $\frac{∂f}{∂B} = A^T ⊗ 𝟙_k$ is also possible.
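(A verification sketch of the two component formulas, building the four-index arrays explicitly; the `einsum` construction and the sizes are my own choices:)

```python
import numpy as np

rng = np.random.default_rng(2)
M, N, K = 2, 3, 4                                # the m, n, k above
A, B = rng.normal(size=(M, N)), rng.normal(size=(N, K))

# d(AB)_{kl} / dA_{mn} = delta_{km} B_{nl}, i.e. "identity ⊗ B^T" as a 4-index array.
dC_dA = np.einsum('km,nl->klmn', np.eye(M), B)   # shape (M, K, M, N)
# d(AB)_{kl} / dB_{rs} = delta_{ls} A_{kr}, i.e. "A ⊗ identity" as a 4-index array.
dC_dB = np.einsum('kr,ls->klrs', A, np.eye(K))   # shape (M, K, N, K)

# Finite-difference check of the dC/dA slice at (m, n) = (1, 2).
eps = 1e-6
Ap = A.copy(); Ap[1, 2] += eps
fd = (Ap @ B - A @ B) / eps                      # shape (M, K)
print(np.max(np.abs(fd - dC_dA[:, :, 1, 2])))    # essentially zero (AB is linear in A)
```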

Hyperplane

Let $\hat{\mathbf{y}}= \sigma(\mathbf{a})$ and $\mathbf{a} = \mathbf{h}_2 \mathbf{W}$ with your notations.

The sensitivities backprop as follows $$ \frac{\partial L}{\partial \mathbf{a}} = \frac{\partial L}{\partial \hat{\mathbf{y}}} \odot \sigma'(\mathbf{a}) $$ where $\odot$ denotes the Hadamard product.

$$ \frac{\partial L}{\partial \mathbf{W}} = \mathbf{h}_2^T \frac{\partial L}{\partial \mathbf{a}} $$
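(A short numpy sketch of these two formulas, checked against a finite difference; the sigmoid and the random data are placeholders of my own, not part of the answer:)

```python
import numpy as np

rng = np.random.default_rng(3)
h2 = rng.normal(size=(1, 3))
W  = rng.normal(size=(3, 2))
Y  = rng.normal(size=(1, 2))

sigma  = lambda X: 1.0 / (1.0 + np.exp(-X))       # assumed activation
dsigma = lambda X: sigma(X) * (1.0 - sigma(X))    # its element-wise derivative

def loss(W):
    return 0.5 * np.mean((Y - sigma(h2 @ W)) ** 2)   # the 1/(2m) squared error, m = 2 here

# Backprop exactly as in the two formulas above.
a      = h2 @ W                      # 1 x 2
y_hat  = sigma(a)                    # 1 x 2
dL_dyh = (y_hat - Y) / Y.shape[1]    # dL / d(y_hat)
dL_da  = dL_dyh * dsigma(a)          # Hadamard product with sigma'(a)
dL_dW  = h2.T @ dL_da                # 3 x 2, same shape as W

# Finite-difference check of one entry of dL/dW.
eps = 1e-6
Wp = W.copy(); Wp[2, 1] += eps
print((loss(Wp) - loss(W)) / eps, dL_dW[2, 1])   # the two values agree
```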

Steph
  • The question here is: why is $\frac{\partial a}{\partial W}$ equal to the transpose of $h_2$? – hau louis Mar 11 '23 at 18:07
  • Note that I did not write it this way, and I think it is wrong to write this since such a Jacobian is a 3rd-order tensor – Steph Mar 12 '23 at 09:11