Let's say I have a matrix $W$ and a vector $\vec{x}$ as below:

$$ W = \begin{bmatrix} w_{1,1} & w_{1,2} \\ w_{2,1} & w_{2,2} \end{bmatrix} $$ $$ \vec{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} $$ $$ \vec{y} = W\vec{x} = \begin{bmatrix} w_{1,1}x_1 + w_{1,2}x_2 \\ w_{2,1}x_1 + w_{2,2}x_2 \end{bmatrix} $$
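For concreteness, here is a minimal NumPy sketch of this forward computation, with made-up values for $W$ and $\vec{x}$ (the numbers are only illustrative):

```python
import numpy as np

# Made-up values, just to make the formulas concrete.
W = np.array([[1.0, 2.0],
              [3.0, 4.0]])   # [[w11, w12], [w21, w22]]
x = np.array([5.0, 6.0])     # [x1, x2]

y = W @ x   # [w11*x1 + w12*x2, w21*x1 + w22*x2]
print(y)    # [17. 39.]
```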

In back-propagation, we need to calculate $\frac{\partial \vec{y}}{\partial W}$ in order to update $W$.

But according to Wikipedia, there is no consensus on how the derivative of a vector with respect to a matrix should be defined.

How, then, can I compute $\frac{\partial \vec{y}}{\partial W}$?

  • As greg says on your original post, it's best to go through the calculation for the back-propagation derivative without using this derivative directly. Perhaps if you include the full form of the (scalar-valued or vector-valued) function that you want to differentiate, somebody could show you how to proceed without getting mired in order-$3$ tensors. – Ben Grossmann Aug 16 '20 at 09:05
  • Can you point to where you saw that this derivative needs to be calculated? Usually back-propagation is interested in minimizing the loss function, which is a norm (a scalar value). The derivative is taken with respect to this norm function. – Carlos Aug 16 '20 at 15:43

1 Answer


Whatever your notion of $\frac{\partial y}{\partial W}$, part of the data carried by this object is the set of all partial derivatives $\frac{\partial y}{\partial W_{ij}}$, and these derivatives should form all "entries" of $\frac{\partial y}{\partial W}$. In this wiki page, the author(s) use only these partial derivatives and do not make any reference to a "total" derivative $\frac{\partial y}{\partial W}$.

Let $e_1,e_2$ denote the canonical basis of $\Bbb R^2$, i.e. the columns of the $2 \times 2$ identity matrix. We can see that these partial derivatives are given by $$ \frac{\partial y}{\partial W_{ij}} = x_j e_i. $$ To put things in terms of scalar entries, we would say that $ \frac{\partial y_k}{\partial W_{ij}} = \delta_{ik} x_j, $ where $y_k$ denotes the $k$th entry of $y$ and $\delta_{ik}$ denotes a "Kronecker delta".
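As a numerical sanity check, here is a small NumPy sketch (with arbitrary values for $W$ and $x$, not taken from the question) comparing central finite-difference estimates of $\frac{\partial y}{\partial W_{ij}}$ against the formula $x_j e_i$:

```python
import numpy as np

# Arbitrary values; the identity dy_k/dW_ij = delta_ik * x_j should hold for any W, x.
W = np.array([[1.0, 2.0],
              [3.0, 4.0]])
x = np.array([5.0, 6.0])
eps = 1e-6

for i in range(2):
    for j in range(2):
        E = np.zeros((2, 2))
        E[i, j] = 1.0                    # perturb only the (i, j) entry of W
        fd = ((W + eps * E) @ x - (W - eps * E) @ x) / (2 * eps)  # central difference
        analytic = x[j] * np.eye(2)[i]   # the formula x_j * e_i
        assert np.allclose(fd, analytic, atol=1e-6)
```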

Now, in terms of the total/Fréchet derivative, we could say the following. $y(W)$ defines a function from $\Bbb R^{2 \times 2}$ to $\Bbb R^2$, so for any $X \in \Bbb R^{2 \times 2}$, $Dy(X)$ is a linear map from $\Bbb R^{2 \times 2}$ to $\Bbb R^2$; specifically, because $y$ is linear in its matrix argument, we have for any $H \in \Bbb R^{2 \times 2}$ $$ Dy(X)(H) = y(H) = Hx. $$ Although it is not an array of entries, this function $Dy$ is the operator that the array/tensor $\frac{\partial y}{\partial W}$ would represent. We can recover the partial derivatives by evaluating the "directional derivatives" $Dy(X)(E_{ij})$, where $E_{ij} = e_ie_j^T$ is the matrix with a $1$ in the $i,j$ entry and zeros elsewhere. Indeed, we have $$ Dy(X)(E_{ij}) = E_{ij} x = e_i (e_j^Tx) = x_j e_i. $$

The chain rule tells us the following: for any function $g:\mathcal Z \to \Bbb R^{2 \times 2}$, we may compute the total derivative of $y \circ g$ as follows. For any $z \in \mathcal Z$, the derivative (a linear map from $\mathcal Z$ to $\Bbb R^{2}$) is given by $$ D(y \circ g)(z) = Dy(g(z)) \circ Dg(z), $$ where $Dy(g(z))$ is a linear map from $\Bbb R^{2 \times 2}$ to $\Bbb R^2$ and $Dg(z)$ is a linear map from $\mathcal Z$ to $\Bbb R^{2 \times 2}$. More concretely, if $h \in \mathcal Z$, then the directional derivative "along" $h$ is given by $$ D(y \circ g)(z)(h) = [Dy(g(z)) \circ Dg(z)](h) = [Dg(z)(h)] x. $$

Similarly, for any function $p: \Bbb R^2 \to \mathcal Z$, we may compute the total derivative of $p \circ y$ as follows. For any $X \in \Bbb R^{2 \times 2}$, the derivative (a linear map from $\Bbb R^{2 \times 2}$ to $\mathcal Z$) is given by $$ D(p \circ y)(X) = Dp(y(X)) \circ Dy(X), $$ where $Dp(y(X))$ is a linear map from $\Bbb R^2$ to $\mathcal Z$ and $Dy(X)$ is a linear map from $\Bbb R^{2 \times 2}$ to $\Bbb R^2$. More concretely, if $H \in \Bbb R^{2 \times 2}$, then the directional derivative "along" $H$ is given by $$ D(p \circ y)(X)(H) = [Dp(y(X)) \circ Dy(X)](H) = Dp(y(X))(Hx). $$
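To connect this to back-propagation: when $p$ is a scalar loss (so $\mathcal Z = \Bbb R$), the last formula gives $D(p \circ y)(X)(E_{ij}) = \nabla p(y)^T E_{ij} x = (\nabla p(y))_i\, x_j$, so the gradient of the loss with respect to $W$ is simply the outer product $\nabla p(y)\, x^T$, and the order-$3$ tensor never needs to be formed explicitly. Below is a minimal NumPy sketch of this, using a hypothetical squared-error loss $p(y) = \tfrac12\|y - t\|^2$ and made-up values (none of these numbers come from the original post):

```python
import numpy as np

# Hypothetical squared-error loss p(y) = 0.5 * ||y - t||^2 with made-up W, x, t.
W = np.array([[1.0, 2.0],
              [3.0, 4.0]])
x = np.array([5.0, 6.0])
t = np.array([1.0, -1.0])          # arbitrary target vector

y = W @ x
grad_y = y - t                     # gradient of p at y for this particular loss
grad_W = np.outer(grad_y, x)       # outer product: claimed gradient of p(y(W)) w.r.t. W

# Check each entry against a central finite difference of the composed loss.
p = lambda M: 0.5 * np.sum((M @ x - t) ** 2)
eps = 1e-6
fd = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        E = np.zeros((2, 2))
        E[i, j] = 1.0
        fd[i, j] = (p(W + eps * E) - p(W - eps * E)) / (2 * eps)

print(np.allclose(grad_W, fd, atol=1e-4))   # True
```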

Ben Grossmann