
Say that we want to extend an arbitrary predictor to multi-class classification. Say that the predictor $h_{\theta}(x)$ outputs a vector with one score per label, so one can read off the "confidence margin" for label $l$ by indexing the parameters as follows: $h_{\theta_l}(x)$. Using this "confidence" one can get the probability of a label as follows:

$$ p( y = l \mid x ; \theta ) = \frac{ e^{ h_{\theta_l}(x) }}{ \sum^L_{l'=1} e^{ h_{\theta_{l'}}(x) }} = \frac{ e^{ h_{\theta_l}(x) } }{z(x;\theta)}$$
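
For concreteness, here is a minimal numerical sketch of this probability model in numpy, assuming (hypothetically) a linear scorer $h_{\theta_l}(x) = {\theta^{(l)}}^T x$; the shift by the max is only for numerical stability and cancels in the ratio:

```python
import numpy as np

def softmax(scores):
    """Turn per-label scores h_{theta_l}(x) into probabilities p(y=l | x; theta)."""
    shifted = scores - np.max(scores)       # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)  # divide by z(x; theta)

def scores_linear(theta, x):
    """Hypothetical linear scorer: h_{theta_l}(x) = theta[l] @ x, with theta of shape (L, d)."""
    return theta @ x
```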

Armed with this model and notation, our goal is to compute the derivative of the cross-entropy loss:

$$ J(x,y; \theta) = - \sum^L_{l=1} \mathbb{1}\{ y =l\} \log p(y=l \mid x; \theta )$$
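
A matching sketch of the loss, continuing the hypothetical numpy setup above; computing $\log p = h_{\theta_y}(x) - \log z(x;\theta)$ via log-sum-exp avoids overflowing the exponentials:

```python
def cross_entropy(theta, x, y):
    """J(x, y; theta) = -log p(y | x; theta) = log z(x; theta) - h_{theta_y}(x)."""
    scores = scores_linear(theta, x)
    log_z = np.max(scores) + np.log(np.sum(np.exp(scores - np.max(scores))))  # log z(x; theta)
    return log_z - scores[y]
```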

with respect to any of the parameters $\theta$ (below, $t_k$ denotes a generic scalar parameter).

I claim that the derivative is as follows:

$$ \frac{\partial J(x,y; \theta)}{\partial t_k} = - \sum^L_{l=1} \frac{\partial h_{\theta_l}(x) }{\partial t_k} [\mathbb{1}\{ y=l\} - p(y=l \mid x; \theta ) ]$$

and this is the derivative I'd like to check with the community. Here is my derivation:

$$ J(x,y; \theta) = - \sum^L_{l=1} \mathbb{1}\{ y =l\} \log p(y=l \mid x; \theta ) = - \sum^L_{l=1} \left[ \mathbb{1}\{ y = l\} h_{\theta_l}(x) - \mathbb{1}\{ y = l\} \log z(x; \theta) \right] $$

$$ = -\sum^L_{l=1} \mathbb{1}\{ y = l\} h_{\theta_l}(x) + \left[ \sum^L_{l=1} \mathbb{1}\{ y = l\} \right] \log z(x; \theta) = -\sum^L_{l=1} \mathbb{1}\{ y = l\} h_{\theta_l}(x) + \log z(x; \theta) $$

where $ \sum^L_{l=1} \mathbb{1}\{ y = l\} = 1$ because the label of any valid data point has to match one and only one of the labels $l$. Now taking the derivative we have:

$$ -\sum^L_{l=1} \mathbb{1}\{ y = l\} \frac{ \partial h_{\theta_l}(x)}{\partial \theta} + \frac{\partial }{\partial \theta }\log z(x; \theta) $$

using the chain rule on $z(x; \theta)$ we get:

$$ -\sum^L_{l=1} \mathbb{1}\{ y = l\} \frac{ \partial h_{\theta_l}(x)}{\partial \theta} + \frac{1}{z(x; \theta)} \sum^L_{l=1} e^{h_{\theta_l}(x)} \frac{ \partial h_{\theta_l}(x)}{\partial \theta} $$

factoring out $\frac{ \partial h_{\theta_l}(x)}{\partial \theta}$ and collecting the two summations into one gives:

$$ -\sum^L_{l=1} \frac{ \partial h_{\theta_l}(x)}{\partial \theta} \left[ \mathbb{1}\{ y = l\} - \frac{e^{h_{\theta_l}(x)}}{z(x; \theta)} \right] $$

but we defined $\frac{e^{h_{\theta_l}(x)}}{z(x; \theta)} = p( y = l \mid x ; \theta ) $

so we get my final answer:

$$ \frac{\partial J(x,y; \theta)}{\partial t_k} = - \sum^L_{l=1} \frac{\partial h_{\theta_l}(x) }{\partial t_k} [\mathbb{1}\{ y=l\} - p(y=l \mid x; \theta ) ]$$
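
One way to read this result: the derivative of $J$ with respect to the score $h_{\theta_l}(x)$ itself is $p(y=l \mid x; \theta) - \mathbb{1}\{y=l\}$, and the chain rule then contracts that vector with $\frac{\partial h_{\theta_l}(x)}{\partial t_k}$ for whatever parametrization is used. A small sketch of that score-space gradient (continuing the hypothetical numpy setup above):

```python
def dJ_dscores(scores, y):
    """dJ/dh_l = p(y=l | x; theta) - 1{y = l}, the gradient w.r.t. the scores."""
    grad = softmax(scores)
    grad[y] -= 1.0   # subtract the one-hot indicator 1{y = l}
    return grad
```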

It's actually quite nice to be able to confirm that this agrees with what Andrew Ng has when the model is just linear, i.e. when $h_{ \theta_l }(x) = {\theta^{(l)}}^T x$. In that case $\frac{\partial h_{\theta_l}(x)}{\partial \theta^{(k)}} = x$ when $l = k$ and $0$ otherwise, so only the $l = k$ term survives:

$$ \frac{\partial J(x,y; \theta)}{\partial \theta^{(k)}} = - x \, [\mathbb{1}\{ y=k\} - p(y=k \mid x; \theta ) ]$$
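
In that linear case the full gradient is just the outer product of the score-space gradient with $x$; a sketch (same hypothetical setup, with `theta` of shape `(L, d)` so that row $k$ holds $\theta^{(k)}$):

```python
def grad_softmax_regression(theta, x, y):
    """Row k is dJ/dtheta^{(k)} = -x * (1{y=k} - p(y=k | x; theta))."""
    scores = scores_linear(theta, x)
    return np.outer(dJ_dscores(scores, y), x)  # shape (L, d)
```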

The only reason I sort of doubt my derivation is that I cannot get code based on it to match my numerical derivatives. I asked a question addressing my code on Stack Overflow here.
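
For what it's worth, a finite-difference check along the lines of what that question attempts (this is a hypothetical toy setup, not the actual code from the Stack Overflow post):

```python
def numerical_grad(theta, x, y, eps=1e-6):
    """Central-difference approximation of dJ/dtheta, one entry at a time."""
    grad = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[idx] += eps
        theta_minus[idx] -= eps
        grad[idx] = (cross_entropy(theta_plus, x, y)
                     - cross_entropy(theta_minus, x, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 3))  # L = 4 labels, d = 3 features
x, y = rng.normal(size=3), 2
# the analytic and numerical gradients should agree to roughly 1e-9 or better
print(np.max(np.abs(grad_softmax_regression(theta, x, y) - numerical_grad(theta, x, y))))
```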

  • After fixing my code, just as I expected, this derivation is correct (I already had code based on the above derivation that worked, so I would have been shocked if the derivation was wrong). – Charlie Parker Oct 19 '15 at 23:23
