
Does the gradient of the cross-entropy have a nice matrix expression? Let $\mathbf X$ be a matrix whose row vectors are features, and let $$\mathbf Y_{ij} = \begin{cases} 1 & \text{if the $j$th row vector of $\mathbf X$ has label $i$} \\ 0 & \text{otherwise.} \end{cases} $$ With $\mathbf W$ and $\mathbf b$ denoting our weights and biases, we compute the $i$th row of $\hat {\mathbf Y}$ as $$\hat {\mathbf Y}_i = \sigma(\mathbf X_i \mathbf W+\mathbf b),$$ where $\sigma$ is the softmax function and $\mathbf X_i$ is the $i$th row of $\mathbf X$. Then the cross-entropy is defined as $$ L = \sum_i \mathbf Y_i \cdot \ln (\hat {\mathbf Y}_i), $$ where $\mathbf Y_i$ is the $i$th column of $\mathbf Y$ and the logarithm is applied elementwise.

Define $$(\nabla\mathbf W)_{ij} = \frac{\partial L}{\partial w_{ij}} \quad\text{and}\quad (\nabla \mathbf b)_i = \frac{\partial L}{\partial b_i}. $$ I am looking for nice matrix expressions for $\nabla\mathbf W$ and $\nabla \mathbf b$. This is what I have done so far.
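First, to fix the shapes and conventions, here is a minimal NumPy sketch of the setup above (purely illustrative: `X` is $(N, d)$ with samples as rows, `W` is $(d, K)$, `b` is $(K,)$, and `Y` is $(K, N)$ as defined here, so `Y_hat` comes out $(N, K)$):

```python
import numpy as np

def softmax(Z):
    # Row-wise softmax, with the usual max-shift for numerical stability.
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def forward(X, W, b):
    # X: (N, d) with feature vectors as rows, W: (d, K), b: (K,).
    # Returns Y_hat of shape (N, K); row i is sigma(X_i W + b).
    return softmax(X @ W + b)

def cross_entropy(Y, Y_hat):
    # Y: (K, N) one-hot, Y[i, j] = 1 iff sample j has label i (the
    # transposed convention above). Computes L = sum_i Y_i . ln(Y_hat_i),
    # i.e. the sum of the log-probabilities assigned to the true labels.
    return np.sum(Y.T * np.log(Y_hat))
```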
Fix $n$ and $r$. Then $$\begin{align} \frac{\partial L}{\partial w_{nr}} &= \frac{\partial}{\partial w_{nr}}\sum_i \mathbf Y_i \cdot \ln (\hat {\mathbf Y}_i) \\&= \frac{\partial}{\partial w_{nr}}\sum_i \mathbf Y_i \cdot\left(\mathbf X_i \mathbf W + \mathbf b -\ln(Z_i) \mathbf 1\right) \tag{1}\\&= \sum_i \left[\frac{\partial}{\partial w_{nr}}\left(\mathbf Y_i \cdot\mathbf X_i \mathbf W\right) - \frac { \mathbf X_{in}e^{\mathbf X_i \mathbf W_r+\mathbf b_r}}{Z_i}\right] \tag{2}\\ &=\sum_{i \text{ s.t. $\mathbf X_i$ has label $r$}} \mathbf X_{in} - \sum_i\mathbf X_{in}\hat {\mathbf Y}_{ir} \\ &=\sum_{i } \mathbf X_{in}\mathbf Y_{ri} - \sum_i\mathbf X_{in}\hat {\mathbf Y}_{ir} \tag{3} \\ & = \left[\mathbf X^{\mathsf T}(\mathbf Y^{\mathsf T}-\hat {\mathbf Y})\right]_{nr}. \end{align} $$ In $(1)$, $Z_i = \sum_{j}e^{\mathbf X_i \mathbf W_j+\mathbf b_j}$ from the definition of the softmax function, and $\mathbf 1$ is a row vector of all ones. $(2)$ follows from the chain rule together with the fact that $\mathbf Y_i \cdot \mathbf 1 = 1$, and $(3)$ follows from the fact that $\mathbf Y_{ri} = 1$ iff $\mathbf X_i$ has label $r$.
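Since I could not find this expression stated anywhere, one way to sanity-check it is to compare it against central finite differences of $L$ as defined above. A small sketch (reusing `forward` and `cross_entropy` from the snippet above; `eps` is an arbitrary step size):

```python
def grad_W(X, Y, Y_hat):
    # Candidate closed form derived above: (grad W)_{nr} = [X^T (Y^T - Y_hat)]_{nr}.
    return X.T @ (Y.T - Y_hat)

def grad_W_numeric(X, Y, W, b, eps=1e-6):
    # Central finite differences of L with respect to each entry of W
    # (slow; for checking only).
    G = np.zeros_like(W)
    for n in range(W.shape[0]):
        for r in range(W.shape[1]):
            Wp, Wm = W.copy(), W.copy()
            Wp[n, r] += eps
            Wm[n, r] -= eps
            G[n, r] = (cross_entropy(Y, forward(X, Wp, b))
                       - cross_entropy(Y, forward(X, Wm, b))) / (2 * eps)
    return G
```

If the two agree on random data, the closed form is at least consistent with $L$ as defined above.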
For $(\nabla \mathbf b)_r$ we proceed similarly and obtain:

$$\begin{align} \frac{\partial L}{\partial b_r} &= \sum_i \left[\frac{\partial}{\partial b_{r}}\left(\mathbf Y_i \cdot\mathbf b\right) - \frac { e^{\mathbf X_i \mathbf W_r+\mathbf b_r}}{Z_i}\right] \\ &=\sum_{i \text{ s.t. $\mathbf X_i$ has label $r$}} 1 - \sum_i\hat {\mathbf Y}_{ir} \\ &=\sum_{i} \mathbf Y_{ri} - \sum_i\hat {\mathbf Y}_{ir} \tag{4} \\ & = \left[\mathbf 1(\mathbf Y^{\mathsf T}-\hat {\mathbf Y})\right]_{r}, \end{align} $$ where $(4)$ again uses the fact that $\mathbf Y_{ri} = 1$ iff $\mathbf X_i$ has label $r$.

However, when I apply these expressions in Python and use stochastic gradient descent, the regression does not yield accurate predictions: I get exactly a $90\%$ error rate, which, comically, is no better than random guessing. This has led me to believe that these derivations are erroneous and that the gradient must be computed differently. Unfortunately, no resource I could find lists explicit expressions for this gradient, so I have no way to check where I went wrong. I would like to know whether I made a mistake in the math above or whether the mistake is in my code.
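For reference, here is a stripped-down sketch of the kind of update step I am describing (reusing `forward` and `grad_W` from the sketches above; the learning rate `lr` and the batch handling are placeholders, not my actual training loop):

```python
def grad_b(Y, Y_hat):
    # Candidate closed form derived above: (grad b)_r = [1 (Y^T - Y_hat)]_r,
    # i.e. the column sums of Y^T - Y_hat.
    return (Y.T - Y_hat).sum(axis=0)

def sgd_step(X_batch, Y_batch, W, b, lr=0.1):
    # One update using the candidate gradients. Note the sign convention:
    # stepping along +gradient increases L as defined above (a sum of
    # log-probabilities), which is the same thing as gradient descent on
    # -L, the usual negated cross-entropy.
    Y_hat = forward(X_batch, W, b)
    W = W + lr * grad_W(X_batch, Y_batch, Y_hat)
    b = b + lr * grad_b(Y_batch, Y_hat)
    return W, b
```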
