Define a "probability vector" to be a vector $p = (p_1,\ldots, p_K) \in \mathbb R^K$ whose components are nonnegative and which satisfies $\sum_{k=1}^K p_k = 1$. We can think of a probability vector as specifying a probability mass function (PMF) for a random variable with $K$ distinct possible values.
A straightforward and intuitive way to compare two vectors $p$ and $q$ in $\mathbb R^K$ is to compute the quantity $$ d(p,q) = \frac12 \| p - q \|_2^2, $$ which is small when $p$ is close to $q$. However, if $p$ and $q$ are probability vectors, I think it is somehow more natural to compare them using the "cross-entropy loss function" $\ell$ defined by
$$ \ell(p,q) = -\sum_{k=1}^K q_k \log(p_k). $$ (This function is only defined when all components of $p$ are nonzero.)
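For a concrete comparison, take the (arbitrary, purely illustrative) pair $p = (0.7, 0.2, 0.1)$ and $q = (1, 0, 0)$. Then $$ d(p,q) = \tfrac12\left(0.3^2 + 0.2^2 + 0.1^2\right) = 0.07, \qquad \ell(p,q) = -\big(1\cdot\log 0.7 + 0\cdot\log 0.2 + 0\cdot\log 0.1\big) = -\log 0.7 \approx 0.357. $$ In particular, when $q$ is a one-hot vector, $\ell(p,q)$ is simply $-\log$ of the probability that $p$ assigns to the "correct" outcome; it blows up as that probability tends to $0$, whereas $d(p,q)$ stays bounded.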
Question: What is the motivation for using the cross-entropy loss function when comparing probability vectors? Is there a viewpoint that makes it directly obvious that this is the "correct" thing to do?
Some additional background information:
This method of comparing probability vectors is fundamental in machine learning, because we have the following "recipe" for a classification algorithm which classifies objects into one of $K$ distinct classes. Suppose that we are given a list of training examples $x_i \in \mathbb R^n$ and corresponding one-hot encoded label vectors $y_i \in \mathbb R^K$. (So if the $i$th training example belongs to class $k$, then the $k$th component of the vector $y_i$ is $1$ and the other components are $0$.)

Let $S: \mathbb R^K \to \mathbb R^K$ be the softmax function defined by $$ S(u) = \begin{bmatrix} \frac{e^{u_1}}{\sum_k e^{u_k}} \\ \vdots \\ \frac{e^{u_K}}{\sum_k e^{u_k}} \end{bmatrix}. $$ The softmax function is useful because it converts a vector in $\mathbb R^K$ into a probability vector.

To develop a classification algorithm, we attempt to find a function $f: \mathbb R^n \to \mathbb R^K$ such that for each training example $x_i$ the probability vector $p_i = S(f(x_i))$ is close to $y_i$ in the sense that $\ell(p_i, y_i)$ is small. For example, $f$ might be a neural network with a particular architecture, and the parameter vector $\theta$ which contains the weights of the neural network is chosen to minimize $$ \sum_{i = 1}^N \ell(p_i, y_i), $$ where $N$ is the number of training examples. (Multiclass logistic regression is the especially simple case where $f$ is assumed to be affine: $f(x_i) = A x_i + b$.)
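To make the recipe concrete, here is a minimal Python/NumPy sketch of the quantity being minimized, specialized to the affine (multiclass logistic regression) case. The function names, the toy data, and the particular parameters $A$, $b$ are arbitrary illustrative choices, and no optimization is actually performed:

```python
import numpy as np

def softmax(u):
    """Map a vector u in R^K to a probability vector S(u)."""
    e = np.exp(u - np.max(u))   # shift by max(u) for numerical stability; S(u) is unchanged
    return e / np.sum(e)

def cross_entropy(p, y):
    """ell(p, y) = -sum_k y_k log(p_k), for a prediction p and one-hot label y."""
    return -np.sum(y * np.log(p))

def total_loss(A, b, xs, ys):
    """Sum of ell(S(f(x_i)), y_i) over the training set, with f(x) = A x + b."""
    return sum(cross_entropy(softmax(A @ x + b), y) for x, y in zip(xs, ys))

# Toy data, purely for illustration: N = 5 examples in R^3, K = 2 classes.
rng = np.random.default_rng(0)
xs = [rng.normal(size=3) for _ in range(5)]
ys = [np.eye(2)[rng.integers(2)] for _ in range(5)]   # one-hot label vectors
A, b = rng.normal(size=(2, 3)), np.zeros(2)           # parameters of the affine map f

print(total_loss(A, b, xs, ys))   # the objective a training procedure would minimize over (A, b)
```

Training would then adjust $A$ and $b$ (or, more generally, the parameter vector $\theta$) to make this sum small, for example by gradient descent.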
One way to discover the cross-entropy loss function is to go through the steps of using maximum likelihood estimation to estimate the parameter vector $\theta$ which specifies $f$ (assuming that $f$ is restricted to be a member of a certain parameterized family of functions, such as affine functions or neural networks with a particular architecture). The cross-entropy loss function just pops out of the MLE procedure. This is the approach that currently seems clearest to me. There is also an information-theoretic viewpoint (roughly, $\ell(p,q)$ is the expected code length when outcomes drawn from $q$ are encoded using a code that is optimal for $p$).
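For concreteness, the MLE calculation I have in mind goes roughly as follows (a sketch, assuming the labels are modeled as independent draws from the categorical distributions with PMFs $p_i = S(f(x_i))$). The likelihood of the observed labels is $$ L(\theta) = \prod_{i=1}^N \prod_{k=1}^K \big((p_i)_k\big)^{(y_i)_k}, $$ so the negative log-likelihood is $$ -\log L(\theta) = -\sum_{i=1}^N \sum_{k=1}^K (y_i)_k \log\big((p_i)_k\big) = \sum_{i=1}^N \ell(p_i, y_i), $$ and maximizing the likelihood over $\theta$ is exactly minimizing the total cross-entropy loss.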
Is there any simple way to recognize that the cross-entropy loss function is a "natural" way to compare probability vectors?