
I was looking at the Keras source here, which calculates the cross-entropy loss using:

# scale the predictions so that the class probabilities of each sample sum to 1
output /= tf.reduce_sum(output,
                        reduction_indices=len(output.get_shape()) - 1,
                        keep_dims=True)
# manual computation of crossentropy
epsilon = _to_tensor(_EPSILON, output.dtype.base_dtype)
output = tf.clip_by_value(output, epsilon, 1. - epsilon)
return - tf.reduce_sum(target * tf.log(output),
                       reduction_indices=len(output.get_shape()) - 1)

Here target is the ground-truth data (0 or 1), and output is the output of the neural net.
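To see what that snippet does end-to-end, here is a minimal NumPy sketch of the same steps (the function name and the epsilon value of 1e-7 are my own choices, not taken from the Keras code):

import numpy as np

_EPSILON = 1e-7  # assumed value; Keras reads it from its backend config

def categorical_crossentropy_np(target, output):
    # Normalize so each row sums to 1, as the snippet above does before clipping
    output = output / np.sum(output, axis=-1, keepdims=True)
    # Clip away exact 0 and 1 to avoid log(0)
    output = np.clip(output, _EPSILON, 1.0 - _EPSILON)
    # Only the terms where target is 1 contribute to the sum
    return -np.sum(target * np.log(output), axis=-1)

target = np.array([[1.0, 0.0, 0.0, 0.0]])
output = np.array([[0.7, 0.1, 0.1, 0.1]])
print(categorical_crossentropy_np(target, output))  # ~0.357, i.e. -log(0.7)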

So it looks like the loss is of the form

$$J_{y'} (y) = - \sum_{i} y_{i}' \log (y_i)$$

where $y_i$ is the model output for class $i$, and $y_i'$ is the truth data.

Does this mean the errors for $y_i' = 0$ do not contribute to the loss? Why isn't the formula

$$J_{y'}(y) = - \sum_{i} ({y_i' \log(y_i) + (1-y_i') \log (1-y_i)})$$

used?

Stephen Rauch

1 Answer


Does this mean the errors for $y_i' = 0$ do not contribute to the loss?

That is correct.

However, the weights that connect to the incorrect output neurons will still receive gradients from the error, and those gradients are influenced by the size of each incorrect prediction. That is due to how softmax works:

$$\hat{y}_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

where $z_i$ is the pre-softmax value of each neuron, a.k.a. the logit. Because the denominator sums over every neuron, weights that affect one neuron's pre-transform value affect the post-transform values of all neurons. So during weight updates those weights will still be adjusted to produce lower $z_j$ values for the incorrect neurons.
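To see this concretely, here is a small sketch assuming TensorFlow 2.x eager mode (so it is written differently from the backend code quoted above). The gradient of the loss with respect to the logits comes out as $\hat{y} - y'$, which is nonzero for the incorrect neurons as well:

import tensorflow as tf

logits = tf.Variable([2.0, 1.0, 0.5, 0.1])   # pre-softmax values z_i
target = tf.constant([1.0, 0.0, 0.0, 0.0])   # one-hot truth y'

with tf.GradientTape() as tape:
    y_hat = tf.nn.softmax(logits)                        # post-softmax probabilities
    loss = -tf.reduce_sum(target * tf.math.log(y_hat))   # only the true-class term appears

grads = tape.gradient(loss, logits)
print(grads.numpy())   # equals y_hat - target: nonzero for the incorrect classes too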

Why isn't the formula

$$J_{y'}(y) = - \sum_{i} ({y_i' \log(y_i) + (1-y_i') \log (1-y_i)})$$

used?

When selecting a single class, it is not clear why you would care how the probability estimates are distributed amongst the incorrect classes, or what the benefit would be of driving the incorrect values to be equal. For instance, if $y' = [1, 0, 0, 0]$, then the suggested formula for $J_{y'}(y)$ gives $\approx 0.67$ for $y = [0.7, 0.1, 0.1, 0.1]$ and $\approx 0.72$ for $y = [0.73, 0.26, 0.05, 0.05]$, yet arguably the second result is better.
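If you want to verify those numbers, here is a quick sketch (the helper name is my own):

import numpy as np

def full_binary_form(y_true, y_pred):
    # The formula from the question, summed over all classes
    return -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 0.0, 0.0])
print(full_binary_form(y_true, np.array([0.7, 0.1, 0.1, 0.1])))      # ~0.67
print(full_binary_form(y_true, np.array([0.73, 0.26, 0.05, 0.05])))  # ~0.72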

However, you would use that formula when dealing with non-exclusive (multi-label) classes, where the outputs use a sigmoid rather than a softmax activation.
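As a rough illustration of that multi-label case (the scores and targets here are made up), each output gets its own sigmoid, the probabilities are not forced to sum to 1, and every per-class term carries information:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([2.0, -1.0, 0.5])    # one independent score per class
y_pred = sigmoid(logits)               # not normalized across classes
y_true = np.array([1.0, 0.0, 1.0])     # non-exclusive: two classes are present
loss = -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
print(loss)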

Neil Slater