In the Deep Learning book (Goodfellow, Bengio and Courville, Chapter 10) the standard RNN is defined by

$$a^{(t)} = b + W h^{(t-1)} + U x^{(t)}, \qquad h^{(t)} = \tanh\!\big(a^{(t)}\big), \qquad o^{(t)} = c + V h^{(t)},$$

and back-propagation through time then gives various gradients, including the one for $W$:

$$\nabla_W L = \sum_t \operatorname{diag}\!\Big(1 - \big(h^{(t)}\big)^2\Big)\,\big(\nabla_{h^{(t)}} L\big)\,{h^{(t-1)}}^{\top}.$$
I understand that:
- $1 - \big(h^{(t)}\big)^2$ comes from the derivative of $\tanh$
- $h^{(t-1)}$ comes from the chain rule (the element-wise version I can derive is written out after this list)
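Written out element-wise (treating the copy of $W$ used at step $t$ separately, as the book does with its $W^{(t)}$ dummy variables, and with $a^{(t)} = b + W h^{(t-1)} + U x^{(t)}$), this is as far as I get:

$$\frac{\partial L}{\partial W_{ij}} = \sum_t \frac{\partial L}{\partial h_i^{(t)}}\,\frac{\partial h_i^{(t)}}{\partial a_i^{(t)}}\,\frac{\partial a_i^{(t)}}{\partial W_{ij}} = \sum_t \frac{\partial L}{\partial h_i^{(t)}}\,\Big(1 - \big(h_i^{(t)}\big)^2\Big)\, h_j^{(t-1)}.$$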
What I don't understand:
- where does the $\operatorname{diag}$ come from?
- where does the transpose on $h^{(t-1)}$ come from?
- why are the factors in this particular order (apart from the fact that this way the dimensions match)? It feels like the gradient of $L$ somehow got sandwiched between the parts of the derivative of $h^{(t)}$. (The numerical check below confirms the formula for me, but not the reasoning.)
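To convince myself the matrix formula is at least correct, I wrote a small NumPy check for a single time step with a toy scalar loss $L = \sum_k o_k$ (the shapes and names below are my own choices, not the book's); it matches a finite-difference gradient, but it does not tell me *why* the factors are arranged this way.

```python
import numpy as np

rng = np.random.default_rng(0)
n_h, n_x, n_o = 4, 3, 2

# Toy parameters and inputs (my own shapes, just for the check)
W = rng.standard_normal((n_h, n_h))
U = rng.standard_normal((n_h, n_x))
V = rng.standard_normal((n_o, n_h))
b = rng.standard_normal(n_h)
c = rng.standard_normal(n_o)
h_prev = rng.standard_normal(n_h)   # h^(t-1)
x = rng.standard_normal(n_x)        # x^(t)

def loss(W_):
    """Toy scalar loss L = sum(o) for a single time step."""
    h = np.tanh(b + W_ @ h_prev + U @ x)
    o = c + V @ h
    return o.sum()

# Analytic gradient following the book's formula for this single step:
# grad_W L = diag(1 - h**2) @ (grad_h L) @ h_prev^T
h = np.tanh(b + W @ h_prev + U @ x)
grad_h = V.T @ np.ones(n_o)                       # dL/dh for L = sum(o)
grad_W = np.diag(1.0 - h**2) @ np.outer(grad_h, h_prev)

# Central finite differences as a reference
num = np.zeros_like(W)
eps = 1e-6
for i in range(n_h):
    for j in range(n_h):
        Wp = W.copy(); Wp[i, j] += eps
        Wm = W.copy(); Wm[i, j] -= eps
        num[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

print(np.max(np.abs(grad_W - num)))  # essentially zero (round-off level), so the formula checks out
```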
Thank you