
I have an issue with the following problem. I am trying to derive the gradients with respect to $x_t, h_{t-1}, W_x, W_h$. Here $x_t$ is an $N\times D$ matrix, $h_t$ is an $N\times H$ matrix, $W_h$ is an $H\times H$ matrix, and $W_x$ is a $D\times H$ matrix.

The function is $h_t=\tanh(h_{t-1}W_h+x_tW_x+b)=\tanh(O_t)$.

I am struggling with the derivative with respect to $W_h$ and $W_x$. The given answer says that the derivative with respect to $W_h$ is $h_{t-1}^T \cdot dh_t \cdot (1-h_t \cdot h_t)$.

Here $dh_t$ means the gradient obtained in the previous step. My guess is that $dh_t \cdot (1-h_t \cdot h_t)$ is an elementwise product, while the multiplication by $h_{t-1}^T$ is an ordinary matrix product. The most confusing part is how $h_{t-1}^T$ ends up as the leftmost factor when the chain rule is applied.
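To make the shapes concrete, here is a minimal NumPy sketch of how I read the claimed formula, together with a finite-difference check against the dummy loss $L=\sum_{i,j}(dh_t)_{ij}(h_t)_{ij}$ (the variable names are my own). The check passes, so my question is really about why the chain rule puts $h_{t-1}^T$ on the left:

```python
import numpy as np

# Shapes from the question: x_t is (N, D), h_{t-1} and h_t are (N, H),
# W_x is (D, H), W_h is (H, H).
N, D, H = 4, 3, 5
rng = np.random.default_rng(0)
x, h_prev = rng.normal(size=(N, D)), rng.normal(size=(N, H))
Wx, Wh, b = rng.normal(size=(D, H)), rng.normal(size=(H, H)), rng.normal(size=H)
dh = rng.normal(size=(N, H))          # upstream gradient dL/dh_t

h = np.tanh(h_prev @ Wh + x @ Wx + b)
dWh = h_prev.T @ (dh * (1 - h * h))   # the claimed formula for dL/dW_h

# Finite-difference check against the dummy loss L = sum(dh * h_t).
eps, num = 1e-6, np.zeros_like(Wh)
for i in range(H):
    for j in range(H):
        Wp, Wm = Wh.copy(), Wh.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        Lp = np.sum(dh * np.tanh(h_prev @ Wp + x @ Wx + b))
        Lm = np.sum(dh * np.tanh(h_prev @ Wm + x @ Wx + b))
        num[i, j] = (Lp - Lm) / (2 * eps)
print(np.max(np.abs(num - dWh)))      # tiny, so the formula itself is right
```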

I would appreciate it if anyone could answer this question.

  • What is a matrix RNN? – PC1 Jun 26 '23 at 05:39
  • @PC1 Sorry for the confusion. This problem is about backpropagation in an RNN; I am just struggling with how to apply the chain rule to matrices. They told me that $dL/dW = x^T \cdot dL/dy$ and $dL/dx = dL/dy \cdot W^T$ for the equation $L=f(y)$, $y=xW$, and I am wondering how to derive them. – Samuel Lee Jun 26 '23 at 06:24

2 Answers


$ \def\l{\lambda} \def\o{{\tt1}} \def\LR#1{\left(#1\right)} \def\op#1{\operatorname{#1}} \def\diag#1{\op{diag}\LR{#1}} \def\Diag#1{\op{Diag}\LR{#1}} \def\trace#1{\op{Tr}\LR{#1}} \def\frob#1{\left\| #1 \right\|_F} \def\qiq{\quad\implies\quad} \def\p{\partial} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\c#1{\color{red}{#1}} \def\CLR#1{\c{\LR{#1}}} \def\fracLR#1#2{\LR{\frac{#1}{#2}}} \def\gradLR#1#2{\LR{\grad{#1}{#2}}} $
First, let's give each variable a distinct name and remove those distracting subscripts (and denote row vectors by $v^T$ rather than $v$):
$$\eqalign{ x &= x_t^T,\quad &h=h_t^T,\quad &y = h_{t-1}^T,\quad &p = O_t^T \\ q &= b^T,\quad &R = W_h^T,\quad &S = W_x^T \\ }$$
Second, given a scalar function and its derivative $\LR{f(z),\:f'=\grad fz}$, the differential is defined as
$$df = f'\,dz$$
These functions can be applied elementwise to a vector argument (like $p$), resulting in
$$h = f(p),\quad h'=f'(p),\qquad\qquad\quad dh = h'\odot dp$$
where $\odot$ denotes the elementwise/Hadamard product.
Such products can be replaced by diagonal matrices $$H = \Diag{h},\quad H' = \Diag{h'},\qquad dh = H'\,dp$$

Fortunately, the derivative of tanh() is quite well known: $$dh = \LR{\o-h\odot h}\odot dp \;=\; \CLR{I-H^2}dp \;\equiv\; \c{M}\,dp$$ We're almost done; we just need to calculate $dp$ and substitute: $$\eqalign{ p &= q + Ry + Sx \\ dp &= \LR{R\:dy+dR\:y} + \LR{S\:dx+dS\:x} \\ dh &= M\LR{R\:dy+dR\:y} + M\LR{S\:dx+dS\:x} \\ &= {MR\:dy+M\:dR\:y} + {MS\:dx+M\:dS\:x} \\ }$$ The vector gradients are easily identified from this expression: $$\eqalign{ \grad hy &= MR,\qquad \grad hx = MS \\ }$$ The gradients with respect to the matrix variables are trickier, but they are $$\eqalign{ \grad hR &= M\star y,\qquad \grad hS = M\star x \\ }$$ where $\star$ denotes a dyadic/tensor product, i.e. $$\eqalign{ \grad hS = M\star x \qiq \grad{h_i}{S_{jk}} = M_{ij}\,x_k \\ }$$ Such quantities are third-order tensors and are a PITA to work with using standard matrix-vector notation.
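As a quick numerical sanity check of the vector gradient $\grad hy = MR$, here is a minimal single-sample sketch in NumPy (the variable names mirror the notation above; the helper step() and the step size eps are illustrative choices):

```python
import numpy as np

# Single-sample sketch in the column-vector notation above:
# p = q + R y + S x,  h = tanh(p),  M = Diag(1 - h*h).
D, H = 3, 5
rng = np.random.default_rng(1)
x, y = rng.normal(size=D), rng.normal(size=H)
q = rng.normal(size=H)
R, S = rng.normal(size=(H, H)), rng.normal(size=(H, D))

def step(y_, x_):
    return np.tanh(q + R @ y_ + S @ x_)

h = step(y, x)
M = np.diag(1 - h * h)

# Jacobian dh/dy by central differences; it should equal M R.
eps = 1e-6
J = np.zeros((H, H))
for j in range(H):
    e = np.zeros(H); e[j] = eps
    J[:, j] = (step(y + e, x) - step(y - e, x)) / (2 * eps)
print(np.max(np.abs(J - M @ R)))   # tiny, so dh/dy = M R
```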

greg

Denote (using a column-vector convention, i.e. the transpose of the row layout in the question) $$ \mathbf{h}_t = \tanh \left( \mathbf{W}_h \mathbf{h}_{t-1} + \mathbf{W}_x \mathbf{x}_t + \mathbf{b} \right) = \tanh \left( \mathbf{o}_t \right) $$

Using the same notation as in greg's answer, let $\phi$ denote a scalar loss and let $A:B=\sum_{ij}A_{ij}B_{ij}$ be the Frobenius inner product. Hold $\mathbf{h}_{t-1}$, $\mathbf{x}_t$, $\mathbf{W}_x$, and $\mathbf{b}$ fixed (only the gradient with respect to $\mathbf{W}_h$ is wanted), so that $d\mathbf{o}_t = \left( d\mathbf{W}_h \right)\mathbf{h}_{t-1}$. Then \begin{eqnarray*} d\phi &=& \frac{\partial \phi}{\partial \mathbf{h}_t} : d\mathbf{h}_t \\ &=& \frac{\partial \phi}{\partial \mathbf{h}_t} : \left[ \left( \mathbf{1}-\mathbf{h}_t\odot \mathbf{h}_t \right) \odot d\mathbf{o}_t \right] \\ &=& \left[ \left( \mathbf{1}-\mathbf{h}_t\odot \mathbf{h}_t \right) \odot \frac{\partial \phi}{\partial \mathbf{h}_t} \right] : \left( d\mathbf{W}_h \right)\mathbf{h}_{t-1} \\ &=& \left[ \left( \mathbf{1}-\mathbf{h}_t\odot \mathbf{h}_t \right) \odot \frac{\partial \phi}{\partial \mathbf{h}_t} \right] \mathbf{h}_{t-1}^T : d\mathbf{W}_h \end{eqnarray*} where the third line uses $A:(\mathbf{u}\odot B)=(\mathbf{u}\odot A):B$ and the last line uses $A:\left(B\,\mathbf{c}\right)=A\,\mathbf{c}^T:B$. The matrix gradient is the factor standing to the left of the Frobenius product in the last line: $$ \frac{\partial \phi}{\partial \mathbf{W}_h} = \left[ \left( \mathbf{1}-\mathbf{h}_t\odot \mathbf{h}_t \right) \odot \frac{\partial \phi}{\partial \mathbf{h}_t} \right] \mathbf{h}_{t-1}^T $$
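A minimal numerical check of this result, as a NumPy sketch in the same column-vector convention, using the hypothetical choice $\phi(\mathbf{h}_t)=\mathbf{g}^T\mathbf{h}_t$ for a fixed $\mathbf{g}$ (so $\partial\phi/\partial\mathbf{h}_t=\mathbf{g}$):

```python
import numpy as np

# Column-vector convention: h_t = tanh(W_h h_prev + W_x x + b).
# phi(h_t) = g . h_t for a fixed g, so dphi/dh_t = g.
D, H = 3, 5
rng = np.random.default_rng(2)
x, h_prev, g = rng.normal(size=D), rng.normal(size=H), rng.normal(size=H)
Wx, Wh, b = rng.normal(size=(H, D)), rng.normal(size=(H, H)), rng.normal(size=H)

def phi(W):
    return g @ np.tanh(W @ h_prev + Wx @ x + b)

h = np.tanh(Wh @ h_prev + Wx @ x + b)
grad = np.outer((1 - h * h) * g, h_prev)      # [(1 - h.h) * dphi/dh_t] h_prev^T

# Entry-by-entry finite-difference check of dphi/dW_h.
eps, num = 1e-6, np.zeros_like(Wh)
for i in range(H):
    for j in range(H):
        Wp, Wm = Wh.copy(), Wh.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num[i, j] = (phi(Wp) - phi(Wm)) / (2 * eps)
print(np.max(np.abs(num - grad)))             # tiny, confirming the formula
```

Transposing this result (and summing over the $N$ samples stacked as rows) recovers the row-convention formula in the question: $\dfrac{\partial \phi}{\partial W_h} = h_{t-1}^T\left(dh_t\odot(1-h_t\odot h_t)\right)$.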

Steph