
The following puzzle stems from neural networks:

I have a matrix $\mathbf{Y} = \mathbf{X}\mathbf{W}^{T} + \mathbf{B}$, where $\mathbf{Y} \in \mathbb{R}^{S \times N}$, $\mathbf{B} \in \mathbb{R}^{S \times N}$, $\mathbf{X} \in \mathbb{R}^{S \times M}$ and $\mathbf{W} \in \mathbb{R}^{N \times M}$.

Finally, the output of this linear operator usually passes through an activation function $h$ that acts element-wise, so that $L = h(\mathbf{Y})$.
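
For concreteness, here is a minimal NumPy sketch of this layer (the particular sizes and the choice of `tanh` for $h$ are just my own illustration):

```python
import numpy as np

S, M, N = 4, 3, 5                  # batch size, input features, output features
rng = np.random.default_rng(0)

X = rng.standard_normal((S, M))    # inputs,   S x M
W = rng.standard_normal((N, M))    # weights,  N x M
B = rng.standard_normal((S, N))    # biases,   S x N

Y = X @ W.T + B                    # affine output, S x N
L = np.tanh(Y)                     # element-wise activation h
print(Y.shape, L.shape)            # (4, 5) (4, 5)
```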

I would like to calculate $\frac{\partial L}{\partial \mathbf{W}}$. We can first compute the result using index notation:

$$ \left[\frac{\partial L}{\partial \mathbf{W}} \right]_{ij} = \frac{\partial L}{\partial {W}_{ij}} = \sum_{m}^{S} \sum_{n}^{N} \frac{\partial L}{\partial Y_{mn}} \frac{\partial Y_{mn}}{\partial W_{ij}} = \sum_{m}^{S} \sum_{n}^{N} \frac{\partial L}{\partial Y_{mn}} \frac{\partial(\mathbf{X}\mathbf{W}^{T} + \mathbf{B})_{mn}}{\partial W_{ij}} \ \ \ \ [1]$$

Note that we can write $(\mathbf{X}\mathbf{W}^{T})_{mn} = \sum_p X_{mp}W_{np}$, and we can ignore $\mathbf{B}$ since its derivative with respect to $\mathbf{W}$ vanishes.

So, focusing on the second fraction and using Kronecker deltas, we have:

$$\frac{\partial(\mathbf{X}\mathbf{W}^{T})_{mn}}{\partial W_{ij}} = \frac{\partial(\sum_p X_{mp}W_{np})}{\partial W_{ij}} = \sum_{p} X_{mp}\delta_{in}\delta_{pj} = X_{mj}\delta_{in}$$

Now, back to Equation (1):

$$ \left[\frac{\partial L}{\partial \mathbf{W}} \right]_{ij} = \sum_{m}^{S} \sum_{n}^{N} \frac{\partial L}{\partial Y_{mn}} \frac{\partial(\mathbf{X}\mathbf{W}^{T} + \mathbf{B})_{mn}}{\partial W_{ij}} = \sum_{m}^{S} \sum_{n}^{N} \frac{\partial L}{\partial Y_{mn}} X_{mj}\delta_{in} = \sum_{m}^{S} \frac{\partial L}{\partial Y_{mi}} X_{mj} = \left[\left(\frac{\partial L}{\partial \mathbf{Y}}\right)^{T} \mathbf{X} \right]_{ij}$$

Back in matrix notation, we have:

$$\frac{\partial L}{\partial \mathbf{W}} = (\frac{\partial L}{\partial \mathbf{Y}})^{T} \mathbf{X}$$
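
As a numerical sanity check (a minimal NumPy sketch, not part of the derivation: I use `tanh` for $h$ and reduce to a scalar loss $L = \sum_{mn} h(Y_{mn})$, so that $\frac{\partial L}{\partial \mathbf{Y}}$ is an ordinary matrix), the formula agrees with finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
S, M, N = 4, 3, 5
X = rng.standard_normal((S, M))
W = rng.standard_normal((N, M))
B = rng.standard_normal((S, N))

def loss(W):
    """Scalar loss L = sum of the element-wise activations h(Y)."""
    return np.sum(np.tanh(X @ W.T + B))

Y = X @ W.T + B
dLdY = 1.0 - np.tanh(Y) ** 2           # dL/dY for h = tanh and L = sum(h(Y))
analytic = dLdY.T @ X                  # (dL/dY)^T X, shape N x M (same as W)

# central finite differences of dL/dW
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(N):
    for j in range(M):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        numeric[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # ~1e-9: the two gradients agree
```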

Firstly, is my logic correct? My main question is: how can I motivate the step from index notation back to matrix notation?

Secondly, is there a difference between $$\left(\frac{\partial L}{\partial \mathbf{Y}}\right)^{T} \mathbf{X}$$ and $$\mathbf{X}^{T}\left(\frac{\partial L}{\partial \mathbf{Y}}\right)?$$

Finally, how exactly can I motivate that $\frac{\partial \mathbf{Y}}{\partial \mathbf{W}}$ did not lead to a 4D tensor, and that the end result was just a matrix?

1 Answer


$ \def\qiq{\quad\implies\quad} \def\L{{\cal L}} \def\LR#1{\left(#1\right)} \def\op#1{\operatorname{#1}} \def\trace#1{\op{Tr}\LR{#1}} \def\frob#1{\left\| #1 \right\|_F} \def\p{\partial} \def\grad#1#2{\frac{\p #1}{\p #2}} $The formula which you derived is correct. Here is another way to calculate it...

Assume that the gradient of a scalar cost function $\L$ with respect to the matrix $Y$ is known.
Expand the differential of $\L\,$ and change the independent variable from $\,Y\to W$ $$\eqalign{ Y &= XW^T + B \qiq &dY = X\:dW^T \\ G &= \grad{\L}{Y} &\{{\rm known\ gradient}\} \\ d\L &= G:dY &\{{\rm differential}\} \\ &= G:\LR{X\,dW^T} \\ &= G^T:\LR{dW\,X^T} \\ &= \LR{G^TX}:dW \\ \grad{\L}{W} &= {G^TX} &\{{\rm new\ gradient}\} \\ }$$ where $(:)$ denotes the Frobenius product, which is a concise notation for the trace $$\eqalign{ A:B &= \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \;=\; \trace{A^TB} \\ A:A &= \frob{A}^2 \qquad \{ {\rm Frobenius\;norm} \}\\ A:B &= B:A \;=\; B^T:A^T \\ \LR{AB}:C &= A:\LR{CB^T} \;=\; B:\LR{A^TC} \\ }$$ The advantage of using differentials is to avoid the need for any awkward fourth-order tensors.
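
The Frobenius-product rules above can be spot-checked numerically; here is a small sketch with random matrices (the helper `frob` is just an illustrative name for the $(:)$ product):

```python
import numpy as np

rng = np.random.default_rng(2)

def frob(P, Q):
    """Frobenius product  P : Q  =  sum_ij P_ij Q_ij  =  Tr(P^T Q)."""
    return np.sum(P * Q)

A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 5))
C = rng.standard_normal((4, 5))

print(np.isclose(frob(A @ B, C), np.trace((A @ B).T @ C)))   # A:B = Tr(A^T B)
print(np.isclose(frob(A @ B, C), frob(C.T, (A @ B).T)))      # A:B = B:A = B^T:A^T
print(np.isclose(frob(A @ B, C), frob(A, C @ B.T)))          # (AB):C = A:(CB^T)
print(np.isclose(frob(A @ B, C), frob(B, A.T @ C)))          # (AB):C = B:(A^T C)
```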

Update

I misread the question. Here is the derivation for the tensor-valued gradient.

Define the following tensors $$\eqalign{ \def\d{\delta} \def\o{{\tt1}} \def\H{{\large\cal H}} \def\F{{\cal F}} &\F_{ijkl} = \d_{il}\,\d_{jk} \\ &\H_{ij\,kl\,mn} = \begin{cases} \o \qquad {\rm if}\;i=k=m\;\;{\rm and}\;\;j=l=n \\ 0 \qquad {\rm otherwise} \end{cases} }$$ and the matrix variables $$\eqalign{ L = h(Y), \qquad L' = h'(Y) \\ }$$ where $h'$ is the ordinary (scalar) derivative of the $h$ function and is applied elementwise.

Expand the differential of the matrix-valued function and change $Y\to W$ once again $$\eqalign{ dL &= L'\odot dY \\ &= L':\H:dY \qquad\qiq \grad LY = L':\H \\ &= L':\H:\LR{X\,dW^T} \\ &= L':\H:\LR{X\cdot\F}:dW \\ \grad LW &= L':\H:\LR{X\cdot\F} \\ \\ \grad{L_{kl}}{W_{pq}} &= L'_{ij}\:\H_{ijklmn}\:{X_{ms}\F_{snpq}} \\ &= h'(Y_{ij})\;\H_{ijklmn}\:{X_{mq}\,\d_{np}} \\ }$$ In the above, $(\cdot)$ is the single-contraction product, $(:)$ is the double-contraction product, $(\odot)$ is the elementwise/Hadamard product, and the index expression employs the Einstein summation convention.
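
Carrying out the contractions explicitly, ${\cal H}$ forces $i=k=m$ and $j=l=n$, while $X_{ms}{\cal F}_{snpq}=X_{mq}\,\delta_{np}$, so the index expression collapses to $\frac{\partial L_{kl}}{\partial W_{pq}} = h'(Y_{kl})\,X_{kq}\,\delta_{lp}$. A minimal NumPy sketch (with `tanh` standing in for $h$) checks this fourth-order tensor against finite differences:

```python
import numpy as np

rng = np.random.default_rng(3)
S, M, N = 3, 2, 4
X = rng.standard_normal((S, M))
W = rng.standard_normal((N, M))
B = rng.standard_normal((S, N))

h  = np.tanh                               # stand-in for the activation h
hp = lambda Y: 1.0 - np.tanh(Y) ** 2       # its element-wise derivative h'

Y = X @ W.T + B
# collapsed index expression: dL_{kl}/dW_{pq} = h'(Y_{kl}) X_{kq} delta_{lp}
analytic = np.einsum('kl,kq,lp->klpq', hp(Y), X, np.eye(N))   # shape (S, N, N, M)

# central finite differences of the matrix L = h(Y) w.r.t. each W_{pq}
eps = 1e-6
numeric = np.zeros((S, N, N, M))
for p in range(N):
    for q in range(M):
        Wp, Wm = W.copy(), W.copy()
        Wp[p, q] += eps
        Wm[p, q] -= eps
        numeric[:, :, p, q] = (h(X @ Wp.T + B) - h(X @ Wm.T + B)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # close to zero
```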

$\sf NB\!:\:$ The Hadamard tensor $\H$ is a sixth-order tensor defined such that $$ A\odot B = A:\H:B \qquad \qquad \qquad \qquad \quad $$ $\qquad$ for any two matrices $\{A,B\}$ which have identical dimensions.
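
A quick sketch verifying the defining identity $A\odot B = A:{\cal H}:B$ with explicit `einsum` contractions (again only an illustration, not part of the derivation):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 3, 4
A = rng.standard_normal((m, n))
B = rng.standard_normal((m, n))

# H_{ijklmn} = 1 iff i=k=m and j=l=n, built from Kronecker deltas
H = np.einsum('ik,im,jl,jn->ijklmn', np.eye(m), np.eye(m), np.eye(n), np.eye(n))

lhs = A * B                                    # Hadamard product A (.) B
rhs = np.einsum('ij,ijklmn,mn->kl', A, H, B)   # double contractions A : H : B
print(np.allclose(lhs, rhs))                   # True
```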

greg
  • Are the two derivations at the end the same thing, i.e. $G^{T}X$ versus $X^{T}G$? To be honest I have lost you in the solution, since I am not that familiar with the Frobenius form. – Jose Ramon Dec 06 '23 at 13:15
  • Not sure how exactly I am avoiding the awkward 4d tensor in that case! – Jose Ramon Dec 06 '23 at 13:16
  • Depending on your preferred layout convention it can be written as either $G^TX$ or $X^TG$. – greg Dec 06 '23 at 14:39
  • $\large\frac{\partial Y}{\partial W}$ is a fourth-order tensor which is required by the chain rule but does not appear in the differential method. – greg Dec 06 '23 at 14:42
  • Ok, that makes sense for the layout. But for the tensor thingy, I am curious how exactly the 4D tensor gets neutralized after the chain rule is applied. The solution implies that the result is a matrix and not a tensor.

    Is there a way to explain that result using index notation? Or do I indeed have a 4D tensor as the solution, parts of which (the third and fourth ranks) are filled with zeros?

    – Jose Ramon Dec 06 '23 at 14:50
  • Oops! My derivation is for a scalar-valued cost function $\cal L$ rather than a matrix-valued activation function $L=h(Y)$. I suspect that you are trying to calculate the tensor $\large\frac{\partial L}{\partial W}$ in a misguided attempt to apply the chain rule in a larger context, and you can avoid the need for this tensor by using differentials. – greg Dec 06 '23 at 15:01
  • Can you elaborate a bit more on this? I know of course that $\frac{\partial L}{\partial W}$ should be a matrix and not a 4D tensor since $L$ is a scalar. But how exactly can I motivate that the 4D tensor is neutralized during the chain rule? – Jose Ramon Dec 06 '23 at 15:23
  • I've updated the post with the tensor result. As you can see, it's not very pretty. How will you use it? I suspect that you will simply plug it into the chain rule to evaluate the gradient of a scalar cost function. – greg Dec 06 '23 at 16:08