Gradients of $ \sum_{i=1}^N \|W_3 g(W_2 f(W_1 x_i) ) - y_i \|_2^2$ w.r.t. $W_1$, $W_2$, and $W_3$?

Question

How can I obtain the gradient, and optionally the Hessian, of \begin{align} L(W_1, W_2, W_3) := \sum_{i=1}^N \| W_3 \ g\left(W_2 \ f\left(W_1 x_i \right) \right) - y_i \|_2^2 \ , \end{align} with respect to $W_1$, $W_2$, and $W_3$?

Here $x_i \in \mathbb{R}^n$, $W_1 \in \mathbb{R}^{m \times n}$, $W_2 \in \mathbb{R}^{p \times m}$, $W_3 \in \mathbb{R}^{q \times p}$, $y_i \in \mathbb{R}^q$, and $f(z) = g(z) = \frac{1}{1 + \exp(-z)}$.

Can the result also be generalized to any differentiable functions $f$ and $g$?

Thank you so much in advance for your help

greg · Accepted Answer · 2019-02-11T05:58:18.783

Define some new vectors $$\eqalign{ p &= W_1x &\implies dp = dW_1\,x \cr f &= \sigma(p) &\implies df = (F-F^2)\,dp \cr r &= W_2f &\implies dr = W_2\,df+dW_2\,f \cr g &= \sigma(r) &\implies dg = (G-G^2)\,dr \cr s &= W_3g-y &\implies ds = W_3\,dg+dW_3\,g \cr }$$ where $F={\rm Diag}(f)$ and $G={\rm Diag}(g)$.

Write the loss function in terms of these new variables. $$\eqalign{ L &= \|s\|^2_F = s:s \cr }$$ where the colon is a convenient product notation for the trace, i.e. $\,A:B = {\rm tr}(A^TB)$

Now calculate the differentials and desired gradients. $$\eqalign{ dL &= 2s:ds \cr &= 2s:(W_3\,dg+dW_3\,g) \cr }$$

Setting $dg=0$ yields our first gradient $$\eqalign{ dL &= 2sg^T:dW_3 \cr \frac{\partial L}{\partial W_3} &= 2sg^T }$$

Now set $dW_3=0$ and continue on towards $W_2$. $$\eqalign{ dL &= 2W_3^Ts:dg \cr &= 2W_3^Ts:(G-G^2)\,dr \cr &= 2(G-G^2)W_3^Ts:(W_2\,df+dW_2\,f) \cr }$$

Setting $df=0$ yields our second gradient $$\eqalign{ dL &= 2(G-G^2)W_3^Tsf^T:dW_2 \cr \frac{\partial L}{\partial W_2} &= 2(G-G^2)W_3^Tsf^T }$$

Now set $dW_2=0$ and continue on towards $W_1$. $$\eqalign{ dL &= 2W_2^T(G-G^2)W_3^Ts:(F-F^2)\,dp \cr &= 2(F-F^2)W_2^T(G-G^2)W_3^Ts:dW_1\,x \cr &= 2(F-F^2)W_2^T(G-G^2)W_3^Tsx^T:dW_1 \cr \frac{\partial L}{\partial W_1} &= 2(F-F^2)W_2^T(G-G^2)W_3^Tsx^T \cr }$$

Actually we've only worked with the $i^{th}$ component of the loss function, i.e. $L_i$.
The full function or gradient is obtained by summing over all $N$ components. $$\eqalign{ L_{total} &= \sum_{i=1}^N L_i \cr \frac{\partial L_{total}}{\partial W_k} &= \sum_{i=1}^N \frac{\partial L_i}{\partial W_k} }$$ NB: In the derivation, $(x, y)$ were treated as single vectors, but in the summation they must be replaced by $(x_i, y_i)$.
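
The three gradient formulas above can be sanity-checked numerically. The following is a minimal NumPy sketch, not part of the original derivation: the function names `loss_and_grads` and `numerical_grad`, the random problem sizes, and the name `p_dim` (used for the dimension $p$ to avoid clashing with the vector $p = W_1x$) are all illustrative assumptions. It also assumes that, for general differentiable elementwise $f$ and $g$, the chain rule lets $(F-F^2)$ and $(G-G^2)$ be replaced by ${\rm Diag}(f'(p))$ and ${\rm Diag}(g'(r))$, so the activations and their derivatives are passed in as arguments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s - s**2  # elementwise f - f^2, i.e. the diagonal of F - F^2

def loss_and_grads(W1, W2, W3, X, Y,
                   act_f=sigmoid, dact_f=sigmoid_prime,
                   act_g=sigmoid, dact_g=sigmoid_prime):
    """Return L_total and its gradients w.r.t. W1, W2, W3.

    X has shape (N, n) with rows x_i; Y has shape (N, q) with rows y_i.
    """
    L = 0.0
    G1, G2, G3 = np.zeros_like(W1), np.zeros_like(W2), np.zeros_like(W3)
    for x, y in zip(X, Y):
        p = W1 @ x
        f = act_f(p)
        r = W2 @ f
        g = act_g(r)
        s = W3 @ g - y
        L += s @ s
        d3 = 2.0 * s                     # 2s
        d2 = dact_g(r) * (W3.T @ d3)     # 2 (G - G^2) W3^T s for the sigmoid
        d1 = dact_f(p) * (W2.T @ d2)     # 2 (F - F^2) W2^T (G - G^2) W3^T s
        G3 += np.outer(d3, g)
        G2 += np.outer(d2, f)
        G1 += np.outer(d1, x)
    return L, G1, G2, G3

def numerical_grad(fun, W, eps=1e-6):
    """Central finite differences of fun() w.r.t. W, perturbing W in place."""
    G = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + eps
        L_plus = fun()
        W[idx] = old - eps
        L_minus = fun()
        W[idx] = old
        G[idx] = (L_plus - L_minus) / (2.0 * eps)
    return G

# Arbitrary small sizes, chosen only for the check.
rng = np.random.default_rng(0)
n, m, p_dim, q, N = 4, 5, 3, 2, 6
W1 = rng.standard_normal((m, n))
W2 = rng.standard_normal((p_dim, m))
W3 = rng.standard_normal((q, p_dim))
X = rng.standard_normal((N, n))
Y = rng.standard_normal((N, q))

L, G1, G2, G3 = loss_and_grads(W1, W2, W3, X, Y)
loss_only = lambda: loss_and_grads(W1, W2, W3, X, Y)[0]
max_err = max(np.max(np.abs(G - numerical_grad(loss_only, W)))
              for G, W in ((G1, W1), (G2, W2), (G3, W3)))
print("largest |analytic - numerical| entry:", max_err)
```

If the formulas are right, the printed discrepancy should be tiny compared with the gradient entries themselves. Passing other activation/derivative pairs, for example `np.tanh` with `lambda z: 1 - np.tanh(z)**2`, exercises the assumed general case in the same way.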

Thank you. I am not able to understand how you got $df = (F - F^2)dp$ and $dg = (G - G^2)dr$? — learning, Feb 11 '19 at 06:36
the derivative of $f(z)$ w.r.t. $z$ is $\frac{\exp(-z)}{(1 + \exp(-z))^2}$, right? — learning, Feb 11 '19 at 06:42
No, the derivative is $\Big(\frac{df}{dz}=f-f^2\Big)$ for the logistic function. This scalar function is applied element-wise to a vector argument, which necessitates the use of elementwise/Hadamard products $df = (f-f\odot f)\odot dz,$ to express the vector result. And finally, the Hadamard product can be replaced by the regular matrix product with a diagonal matrix $df = (F-F^2)\,dz$. — greg, Feb 11 '19 at 07:25
Now I get that. Well, $\frac{\exp(-z)}{(1 + \exp(-z))^2} = f - f^2 \equiv \frac{1}{(1 + \exp(-z))} - \frac{1}{(1 + \exp(-z))^2}$? Or am I really making a mistake? — learning, Feb 11 '19 at 07:46
Oops, I didn't see the tiny "2" in the denominator of your derivative. — greg, Feb 11 '19 at 13:31
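
The two forms of the logistic derivative discussed in the comments agree, since $f - f^2 = f\,(1-f) = \frac{1}{1+\exp(-z)}\cdot\frac{\exp(-z)}{1+\exp(-z)}$. A quick numerical confirmation, as a small NumPy sketch over an arbitrarily chosen grid of $z$ values (the variable names are illustrative):

```python
import numpy as np

# Arbitrary grid of test points; any moderate range works.
z = np.linspace(-10.0, 10.0, 201)
f = 1.0 / (1.0 + np.exp(-z))

lhs = np.exp(-z) / (1.0 + np.exp(-z)) ** 2   # form from the question's comment
rhs = f - f ** 2                             # form used in the answer

print(np.allclose(lhs, rhs))
```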
