11

Define a "probability vector" to be a vector $p = (p_1,\ldots, p_K) \in \mathbb R^K$ whose components are nonnegative and which satisfies $\sum_{k=1}^K p_k = 1$. We can think of a probability vector as specifying a probability mass function (PMF) for a random variable with $K$ distinct possible values.

A straightforward and intuitive way to compare two vectors $p$ and $q$ in $\mathbb R^K$ is to compute the quantity $$ d(p,q) = \frac12 \| p - q \|_2^2, $$ which is small when $p$ is close to $q$. However, if $p$ and $q$ are probability vectors, I think it is somehow more natural to compare them using the "cross-entropy loss function" $\ell$ defined by

$$ \ell(p,q) = -\sum_{k=1}^K q_k \log(p_k). $$ (This function is only defined when all components of $p$ are nonzero.)
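
(As a quick illustration with made-up numbers: take $q = (1, 0)$ and compare $p = (0.9, 0.1)$ with $p = (0.99, 0.01)$. The squared-distance comparison gives $d(p,q) = 0.01$ and $0.0001$, while the cross-entropy comparison gives $\ell(p,q) = -\log(0.9) \approx 0.105$ and $-\log(0.99) \approx 0.010$.)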

Question: What is the motivation for using the cross-entropy loss function when comparing probability vectors? Is there a viewpoint that makes it directly obvious that this is the "correct" thing to do?


Some additional background information:

This method of comparing probability vectors is fundamental in machine learning, because we have the following "recipe" for a classification algorithm which classifies objects into one of $K$ distinct classes. Suppose that we are given a list of training examples $x_i \in \mathbb R^n$ and corresponding one-hot encoded label vectors $y_i \in \mathbb R^K$. (So if the $i$th training example belongs to class $k$, then the $k$th component of the vector $y_i$ is $1$ and the other components are $0$.) Let $S: \mathbb R^K \to \mathbb R^K$ be the softmax function defined by $$ S(u) = \begin{bmatrix} \frac{e^{u_1}}{\sum_k e^{u_k}} \\ \vdots \\ \frac{e^{u_K}}{\sum_k e^{u_k}} \end{bmatrix}. $$ The softmax function is useful because it converts a vector in $\mathbb R^K$ into a probability vector. To develop a classification algorithm, we attempt to find a function $f: \mathbb R^n \to \mathbb R^K$ such that for each training example $x_i$ the probability vector $p_i = S(f(x_i))$ is close to $y_i$ in the sense that $\ell(p_i, y_i)$ is small. For example, $f$ might be a neural network with a particular architecture, and the parameter vector $\theta$ which contains the weights of the neural network is chosen to minimize $$ \sum_{i = 1}^N \ell(p_i, y_i), $$ where $N$ is the number of training examples. (Multiclass logistic regression is the especially simple case where $f$ is assumed to be affine: $f(x_i) = A x_i + b$.)
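
For concreteness, here is a minimal NumPy sketch of this recipe in the multiclass logistic regression case. (The data, dimensions, learning rate, and variable names below are made up for illustration; they are not taken from any particular library.)

```python
import numpy as np

def softmax(u):
    # S(u): subtract the row-wise max for numerical stability (this does not change S(u)).
    z = u - u.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(p, y):
    # Average of ell(p_i, y_i) = -sum_k (y_i)_k log((p_i)_k) over the batch.
    eps = 1e-12                              # guard against log(0)
    return -np.sum(y * np.log(p + eps), axis=1).mean()

# Toy data: N examples in R^n with K classes (sizes and labels are arbitrary).
rng = np.random.default_rng(0)
N, n, K = 200, 5, 3
X = rng.normal(size=(N, n))
Y = np.eye(K)[rng.integers(0, K, size=N)]    # one-hot label vectors y_i

# Multiclass logistic regression: f(x) = A x + b, trained by plain gradient descent.
A = np.zeros((n, K))
b = np.zeros(K)
for step in range(500):
    P = softmax(X @ A + b)                   # p_i = S(f(x_i)), one row per example
    G = (P - Y) / N                          # gradient of the averaged loss w.r.t. f(x_i)
    A -= 0.5 * (X.T @ G)
    b -= 0.5 * G.sum(axis=0)
    if step % 100 == 0:
        print(step, cross_entropy(P, Y))
```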

One way to discover the cross-entropy loss function is to go through the steps of using maximum likelihood estimation to estimate the parameter vector $\theta$ which specifies $f$ (assuming that $f$ is restricted to be a member of a certain parameterized family of functions, such as affine functions or neural networks with a particular architecture). The cross-entropy loss function just pops out of the MLE procedure. This is the approach that currently seems the most clear to me. There is also an information theory viewpoint.
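
Concretely, under the setup above, that step goes roughly as follows: treating the labels as independent draws from the model's predicted distributions $p_i = S(f(x_i))$, the likelihood of the observed one-hot labels is $$ L(\theta) = \prod_{i=1}^N \prod_{k=1}^K (p_i)_k^{(y_i)_k}, \qquad\text{so}\qquad -\log L(\theta) = -\sum_{i=1}^N \sum_{k=1}^K (y_i)_k \log\big((p_i)_k\big) = \sum_{i=1}^N \ell(p_i, y_i), $$ and maximizing the likelihood is exactly minimizing the total cross-entropy loss.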

Is there any simple way to recognize that the cross-entropy loss function is a "natural" way to compare probability vectors?

littleO
  • 54,048
  • Perhaps something like: We care about the difference between $p_k$ and $q_k$ on a logarithmic scale, not an absolute scale, so look at $\log p_k-\log q_k$; the expected value of that leads to the KL divergence, and then cross-entropy is a more convenient variation of KL for whatever purpose. – Chris Culter Oct 11 '19 at 20:02
  • You might want to look up https://en.wikipedia.org/wiki/Cross_entropy#Motivation and "A mathematical theory of communication" by Shannon. – mbartczak Oct 22 '19 at 11:03
  • 1
    What does it mean to distinguish two probabilities? One approach is via hypothesis testing - if two distributions $P,Q$ are 'very different', then given samples, you should be able to 'easily' tell if the sample is drawn from $P$ or from $Q$. If this seems reasonable to you - it turns out that cross entropy has deep relationship to the rate at which probability of error decays with samples for such a hypothesis test - this is Sanov's theorem. See, e.g., ch.2 of this monograph of Csiszar and Shields. – stochasticboy321 Oct 23 '19 at 22:33
  • 1
    As a caveat - I find that while the above is quite elegant, it is still not clear to me that it makes it obvious that cross (or relative) entropy is 'the right measure', particularly for ML. At least in part the reasons must also pass through convex optimisation theory - e.g. the KL divergence (which is cross entropy - H(q)) is a Bregman divergence induced by Shannon entropy, and the latter is known to be a good measure of 'size' of a distribution (and maybe this is yet another connection to the MLE idea) - I don't really know much convex analysis though... – stochasticboy321 Oct 23 '19 at 22:45

2 Answers

8

Let me try with the following three-step reasoning process.

To measure probability value difference

Intuitively, what is the best way to measure the difference between two probability values?

The probability that a person's death is related to a car accident is about $\frac{1}{77}$, while the probability of being struck by lightning is about $\frac{1}{700{,}000}$. Their numerical difference (in the L2 sense) is around 1%. Do you consider the two events similarly likely? Most people would consider them very different: the first kind of event is rare but significant and worth paying attention to, while few people worry about the second in their daily lives.

On the other hand, the sun shines about 72% of the time in San Jose and about 66% of the time on the sunny (bay) side of San Francisco. The two sunshine probabilities differ numerically by about 6%. Do you consider that difference significant? For some it might be, but for me both places get plenty of sunshine, and there is little material difference.

The takeaway is that we should measure the difference between individual probability values not by subtraction, but by some quantity related to their ratio $\frac{p_k}{q_k}$.

But there are problems with using the raw ratio as the measurement. One problem is that it can vary wildly, especially for rare events: it is not uncommon to assess a certain probability as 1% one day and 2% the next, and a plain ratio against the probability of some other event would then change by 100% between the two days. For this reason, the log of the ratio, $\log\left(\frac{p_k}{q_k}\right)$, is used to measure the difference between an individual pair of probability values.
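
To put numbers on the two examples above: $\left|\frac{1}{77} - \frac{1}{700{,}000}\right| \approx 0.013$ while $\log\left(\frac{1/77}{1/700{,}000}\right) \approx 9.1$; in contrast, $|0.72 - 0.66| = 0.06$ while $\log\left(\frac{0.72}{0.66}\right) \approx 0.09$. The log-ratio scale ranks the first pair as far more different than the second, matching the intuition above.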

To measure probability distribution difference

The goal of your question is to measure the distance between two probability distributions, not between two individual probability values. A distribution consists of many probability values, so it is natural to first compute the difference at each point and then take a weighted average, with weights given by the probabilities themselves, i.e. $p_k \log\left(\frac{p_k}{q_k}\right)$, as the distance between the two distributions.

This leads to our first formula for measuring differences between distributions: $$ D_{KL}(p \Vert q) = \sum_{k=1}^K p_k \log\left( \frac{p_k}{q_k} \right). $$ This measure, called the KL divergence (it is not a metric), is usually much better suited than the L1/L2 distances, especially in machine learning. I hope that by now you agree the KL divergence is a natural measure of the difference between probability distributions.
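
A minimal sketch of this computation (the function name is my own; the example vectors reuse the sunshine probabilities from above):

```python
import numpy as np

def kl_divergence(p, q):
    # D_KL(p || q) = sum_k p_k * log(p_k / q_k); assumes q_k > 0 wherever p_k > 0.
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                       # terms with p_k = 0 contribute 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.72, 0.28]                       # sunshine vs. no sunshine in San Jose
q = [0.66, 0.34]                       # same, on the bay side of San Francisco

print(kl_divergence(p, q))             # small: the two distributions are close
print(kl_divergence(q, p))             # note the asymmetry: KL is not a metric
```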

Finally the cross-entropy measure

There are two technical facts to be aware of.

First, the KL divergence and the cross-entropy are related by the formula $$ D_{KL}(p \Vert q) = H(p, q) - H(p), $$ where $H(p) = -\sum_{k=1}^K p_k \log(p_k)$ is the Shannon entropy and $H(p,q) = -\sum_{k=1}^K p_k \log(q_k)$ is the cross-entropy.

Second, in ML practice we usually pass the ground-truth label as the $p$ argument and the model's output as the $q$ argument, and in the majority of cases our training algorithms are based on gradient descent. When both of these assumptions hold (which is most of the time), the term $H(p)$ is a constant that does not affect training, and hence can be dropped to save computation. In that case the cross-entropy $H(p,q)$ can be used in place of $D_{KL}(p \Vert q)$.

If those assumptions are violated, you need to abandon the cross-entropy formula and revert to the KL divergence.
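
Here is a small numerical sketch of these two facts, with made-up vectors (a one-hot label $p$ and a model output $q$):

```python
import numpy as np

def entropy(p):
    # H(p) = -sum_k p_k log(p_k), with the convention 0 * log(0) = 0.
    p = np.asarray(p, float)
    mask = p > 0
    return -np.sum(p[mask] * np.log(p[mask]))

def cross_entropy(p, q):
    # H(p, q) = -sum_k p_k log(q_k); p is the label distribution, q the model output.
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

p = np.array([1.0, 0.0, 0.0])    # one-hot ground-truth label
q = np.array([0.7, 0.2, 0.1])    # model output (made-up numbers)

# Fact 1: D_KL(p || q) = H(p, q) - H(p).
kl = cross_entropy(p, q) - entropy(p)

# Fact 2: for a one-hot p, H(p) = 0, so cross-entropy and KL divergence coincide.
print(cross_entropy(p, q))       # -log(0.7) ≈ 0.357
print(kl)                        # same value
```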

I think I can now end my wordy explanation. I hope it helps.

  • Very nice answer. Would you mind me to insert the definition of the Shannon entropy $H(p)$ into your answer? Something like "where $H(p) := \dots$ is the Shannon entropy". – Olivier Roche Feb 06 '20 at 07:09
3

Here is a "maximum likelihood estimation" viewpoint which is simple and clear, and which does not require any knowledge of information theory.

Imagine a $K$-sided die whose faces are labeled with integers from $1$ to $K$. The die is biased so that when we roll it, the probability that the result is $k$ is $p_k$. However, person $Q$ believes that the probability that the result is $k$ is $q_k$ (for $k = 1, \ldots, K$).

We roll the die $N$ times, where $N$ is a large positive integer. Let $y_i$ be the result of the $i$th roll, and let $N_k$ be the number of times that the die lands on face $k$. Person $Q$ would say that the probability of observing this particular sequence of values $y_1, \ldots, y_N$ is $$ L = \prod_{k=1}^K q_k^{N_k}. $$ If $L$ is close to $1$, then person $Q$ is not very surprised by the results of our $N$ observations, so in that sense the probability vector $q = (q_1, \ldots, q_K)$ is consistent with the probability vector $p = (p_1, \ldots, p_K)$.

But note that $$ \frac{\log(L)}{N} = \sum_{k=1}^K \frac{N_k}{N} \log(q_k) \approx \sum_{k=1}^K p_k \log(q_k). $$ The approximation is good when $N$ is large, because $N_k/N \approx p_k$ by the law of large numbers. So the quantity $\sum_{k=1}^K p_k \log(q_k)$, which is just the negative of the cross-entropy $H(p,q)$, measures the consistency of $p$ and $q$: the smaller $H(p,q)$ is, the closer $L$ is to $1$, and the less surprised person $Q$ is by the results of our die rolls.
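
Here is a quick simulation sketch of that approximation (the biased-die probabilities and sample size below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

K = 6
p = np.array([0.05, 0.10, 0.15, 0.20, 0.25, 0.25])  # true (biased) die
q = np.full(K, 1 / 6)                                # person Q believes the die is fair

N = 100_000
rolls = rng.choice(K, size=N, p=p)                   # N rolls of the true die
counts = np.bincount(rolls, minlength=K)             # N_k = number of times face k came up

log_L_over_N = np.sum((counts / N) * np.log(q))      # (1/N) log L = sum_k (N_k / N) log(q_k)
limit = np.sum(p * np.log(q))                        # sum_k p_k log(q_k) = -H(p, q)

print(log_L_over_N, limit)                           # nearly equal for large N
```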

littleO
  • 54,048
  • I'm curious to hear if anyone can phrase the above argument more simply or concisely. – littleO Feb 05 '20 at 06:25
  • How is this a maximum likelihood viewpoint? The derivation of the cross-entropy formula in the maximum likelihood case assumes that the target distribution is one-hot encoded (not really a distribution). – Adam Wilson Jul 13 '21 at 14:05