The following puzzle stems from neural networks:
I have a matrix $\mathbf{Y} = \mathbf{X}\mathbf{W}^{T} + \mathbf{B}$, where $\mathbf{Y} \in \mathbb{R}^{S \times N}$, $\mathbf{B} \in \mathbb{R}^{S \times N}$, $\mathbf{X} \in \mathbb{R}^{S \times M}$ and $\mathbf{W} \in \mathbb{R}^{N \times M}$.
The output of this linear operator usually passes through an element-wise activation and eventually yields a scalar loss, which I write as $L = h(\mathbf{Y})$.
I would like to calculate $\frac{\partial L}{\partial \mathbf{W}}$. We can first derive the result using index notation:
$$ \left[\frac{\partial L}{\partial \mathbf{W}} \right]_{ij} = \frac{\partial L}{\partial {W}_{ij}} = \sum_{m}^{S} \sum_{n}^{N} \frac{\partial L}{\partial Y_{mn}} \frac{\partial Y_{mn}}{\partial W_{ij}} = \sum_{m}^{S} \sum_{n}^{N} \frac{\partial L}{\partial Y_{mn}} \frac{\partial(\mathbf{X}\mathbf{W}^{T} + \mathbf{B})_{mn}}{\partial W_{ij}} \ \ \ \ [1]$$
Note that we can write $(\mathbf{X}\mathbf{W}^{T})_{mn} = \sum_p X_{mp}W_{np}$, and we can ignore $\mathbf{B}$ since its derivative with respect to $\mathbf{W}$ vanishes.
So, focusing on the second fraction and using Kronecker deltas, we have:
$$\frac{\partial(\mathbf{X}\mathbf{W}^{T})_{mn}}{\partial W_{ij}} = \frac{\partial(\sum_p X_{mp}W_{np})}{\partial W_{ij}} = \sum_{p} X_{mp}\delta_{in}\delta_{pj} = X_{mj}\delta_{in}$$
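To convince myself of this step, I checked it numerically with a small NumPy sketch (shapes chosen arbitrarily): the full 4D Jacobian of $\mathbf{X}\mathbf{W}^{T}$ with respect to $\mathbf{W}$, built by finite differences, matches $X_{mj}\delta_{in}$:

```python
import numpy as np

S, M, N = 4, 3, 5
rng = np.random.default_rng(0)
X = rng.standard_normal((S, M))
W = rng.standard_normal((N, M))

# full 4D Jacobian d(X W^T)_{mn} / dW_{ij}, built by finite differences
eps = 1e-6
J = np.zeros((S, N, N, M))
for i in range(N):
    for j in range(M):
        Wp = W.copy()
        Wp[i, j] += eps
        J[:, :, i, j] = (X @ Wp.T - X @ W.T) / eps

# predicted closed form: X_{mj} * delta_{in}
pred = np.einsum('mj,in->mnij', X, np.eye(N))
print(np.allclose(J, pred, atol=1e-4))  # True
```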
Now, back to Equation (1):
$$ \left[\frac{\partial L}{\partial \mathbf{W}} \right]_{ij} = \sum_{m}^{S} \sum_{n}^{N} \frac{\partial L}{\partial Y_{mn}} \frac{\partial(\mathbf{X}\mathbf{W}^{T} + \mathbf{B})_{mn}}{\partial W_{ij}} = \sum_{m}^{S} \sum_{n}^{N} \frac{\partial L}{\partial Y_{mn}} X_{mj}\delta_{in} = \sum_{m}^{S} \frac{\partial L}{\partial Y_{mi}} X_{mj} = \left[\left(\frac{\partial L}{\partial \mathbf{Y}}\right)^{T} \mathbf{X} \right]_{ij}$$
Back in matrix notation, we have:
$$\frac{\partial L}{\partial \mathbf{W}} = \left(\frac{\partial L}{\partial \mathbf{Y}}\right)^{T} \mathbf{X}$$
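As a sanity check of this formula, here is a minimal PyTorch sketch (assuming, just for the check, that $h$ is an element-wise $\tanh$ followed by a sum so that $L$ is a scalar); the autograd gradient matches $\left(\frac{\partial L}{\partial \mathbf{Y}}\right)^{T}\mathbf{X}$:

```python
import torch

S, M, N = 4, 3, 5
X = torch.randn(S, M)
W = torch.randn(N, M, requires_grad=True)
B = torch.randn(S, N)

Y = X @ W.T + B            # shape (S, N)
L = torch.tanh(Y).sum()    # scalar loss: element-wise h followed by a sum
L.backward()

dL_dY = 1 - torch.tanh(Y) ** 2   # element-wise derivative of tanh, shape (S, N)
manual = dL_dY.T @ X             # (dL/dY)^T X, shape (N, M)

print(torch.allclose(W.grad, manual, atol=1e-6))  # True
```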
Firstly, is my logic correct? My main question is: how can I justify the step from index notation back to matrix notation?
Secondly, is there a difference between: $$\left(\frac{\partial L}{\partial \mathbf{Y}}\right)^{T} \mathbf{X}$$ and $$\mathbf{X}^{T}\frac{\partial L}{\partial \mathbf{Y}}$$
Finally, how exactly can I motivate that $\frac{\partial \mathbf{Y}}{\partial \mathbf{W}}$ did not lead to a 4D tensor, and that the end result is just a matrix?
Is there a way to explain that result using index notation? Is it that the solution is indeed a 4D tensor, parts of which (along its third and fourth indices) are filled with zeros?
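To make this last question concrete: the following NumPy sketch (arbitrary shapes, with a random stand-in for $\frac{\partial L}{\partial \mathbf{Y}}$) builds the explicit 4D tensor $\frac{\partial \mathbf{Y}}{\partial \mathbf{W}}$ (which is indeed mostly zeros because of the $\delta_{in}$) and contracts it with $\frac{\partial L}{\partial \mathbf{Y}}$; the contraction collapses it back to the $N \times M$ matrix from above:

```python
import numpy as np

S, M, N = 4, 3, 5
rng = np.random.default_rng(1)
X = rng.standard_normal((S, M))
dL_dY = rng.standard_normal((S, N))   # random stand-in for dL/dY

# explicit 4D Jacobian: dY_{mn}/dW_{ij} = X_{mj} * delta_{in}
dY_dW = np.einsum('mj,in->mnij', X, np.eye(N))   # shape (S, N, N, M), mostly zeros

# contracting over m and n collapses the 4D tensor to an N x M matrix
full = np.einsum('mn,mnij->ij', dL_dY, dY_dW)
compact = dL_dY.T @ X                            # (dL/dY)^T X

print(np.allclose(full, compact))  # True
```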