21

The Kullback-Leibler Divergence is defined as

$$K(f:g) = \int \left(\log \frac{f(x)}{g(x)} \right) \ dF(x)$$

It is often described as measuring the distance between two distributions $f$ and $g$, although it is not symmetric and hence not a metric. Why would it be better than the Euclidean distance in some situations?
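
As a quick numerical illustration of the definition (a sketch with made-up discrete distributions $p$ and $q$ standing in for $f$ and $g$, so the integral becomes a sum), the KL divergence is asymmetric and weights discrepancies by the probabilities themselves, unlike the Euclidean distance:

```python
import numpy as np

# Two hypothetical discrete distributions on the same three-point support
# (the numbers are made up purely for illustration).
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.6, 0.3])

def kl(a, b):
    """Discrete analogue of K(a:b) = sum_i a_i * log(a_i / b_i)."""
    return float(np.sum(a * np.log(a / b)))

print("K(p:q)  =", kl(p, q))              # ~1.03, penalizes q being small where p is large
print("K(q:p)  =", kl(q, p))              # ~0.79, so KL is not symmetric
print("||p-q|| =", float(np.linalg.norm(p - q)))   # ~0.75, symmetric by construction
```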

Robert
  • 283
  • 1
    Because $K(f\mid g)$ measures the ratio between the (un)likelihood that a $g$ sample is like an $f$ sample, and its typical likelihood as a $g$ sample. – Did Dec 11 '11 at 19:28
  • 4
    There is an interpretation in terms of information theory, see http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence#Motivation. – Jeff Dec 11 '11 at 20:17
  • 1
    One reason is that it’s useful. It is used to prove the fact that among all random variables with variance 1, the Gaussian is the unique one that maximizes Shannon entropy. Euclidean distance cannot be used to do this. – Deane Feb 09 '25 at 06:25
  • Another comment is that KL is a divergence between (probability) measures (namely, $f(x)dF(x)$ and $g(x)dF(x)$ in your original notation) and doesn't depend on a choice of background measure on the underlying space, whereas, if by Euclidean distance you mean the $L^2$ metric $\left(\int_X (f(x)-g(x))^2\, dF(x)\right)^{1/2}$, it depends on a choice of background measure on $X$. So the two are really defined in somewhat different contexts. – Sam Lewallen May 12 '25 at 15:59
  • (I meant to add that in addition to requiring a background measure to be defined it, the $L^2$ metric is properly a distance between functions, not measures). – Sam Lewallen May 12 '25 at 16:10

4 Answers

21

The short answer is that KL divergence has a probabilistic/statistical meaning (and a lot of them, in fact) while the Euclidean distance has none. For example, a given difference $f(x)-g(x)$ has a completely different meaning depending on the absolute sizes of $f(x)$ and $g(x)$.

The WP page on the subject is a must read, naturally. Let me explain only one interpretation of KL divergence. Assume a random i.i.d. sample $\mathfrak X=(x_k)_{1\leqslant k\leqslant n}$ follows the distribution $f$ and a random i.i.d. sample $\mathfrak Y=(y_k)_{1\leqslant k\leqslant n}$ follows the distribution $g$. A way to distinguish $\mathfrak X$ from $\mathfrak Y$ is to ask for the likelihood that $\mathfrak Y$ behaves like $\mathfrak X$, that is, that $\mathfrak Y$ behaves like a typical sample from $f$.

More precisely, one wants to estimate how unlikely $\mathfrak Y$ becomes when one asks that $\mathfrak Y$ behaves like an $f$ sample, compared to its ordinary likelihood as a $g$ sample.

The computation is rather simple and based on the following. Assume $N(x,x+\mathrm dx)$ values from the sample fall in each interval $(x,x+\mathrm dx)$. Then the likelihood of any given sample with these counts scales like
$$ \prod g(x)^{N(x,x+\mathrm dx)}=\exp\left(\sum N(x,x+\mathrm dx)\log g(x)\right). $$
For a typical $f$ sample, $N(x,x+\mathrm dx)\approx nf(x)\mathrm dx$ when $n\to\infty$, for every $x$, hence the likelihood of $\mathfrak Y$ masquerading as an $f$ sample scales like
$$ \ell_n(f\mid g)\approx\exp\left(n\int f(x)\log g(x)\mathrm dx\right). $$
On the other hand, for a typical $g$ sample, $N(x,x+\mathrm dx)\approx ng(x)\mathrm dx$ when $n\to\infty$, for every $x$, hence the likelihood of $\mathfrak Y$ behaving like a typical $g$ sample scales like
$$ \ell_n(g\mid g)\approx\exp\left(n\int g(x)\log g(x)\mathrm dx\right). $$

One last ingredient is the number of samples realizing a given profile of counts: the multinomial factor counting the samples with $f$-typical counts scales like $\mathrm e^{nH(f)}$, where $H(f)=-\int f(x)\log f(x)\mathrm dx$ is the entropy of $f$, and likewise $\mathrm e^{nH(g)}$ for $g$-typical counts. The probability that $\mathfrak Y$ shows $g$-typical counts is therefore of order $\mathrm e^{nH(g)}\ell_n(g\mid g)\approx1$, as was to be expected, while the probability that $\mathfrak Y$ shows $f$-typical counts is of order $\mathrm e^{nH(f)}\ell_n(f\mid g)$. Thus the ratio of these two probabilities decreases exponentially fast when $n\to\infty$, approximately like $\mathrm e^{-nH}$, where $$ H=\int f(x)\log f(x)\mathrm dx-\int f(x)\log g(x)\mathrm dx=K(f\mid g). $$
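
As a concrete check of this exponential decay (a numerical sketch under the simplifying assumption that $f$ and $g$ are Bernoulli with parameters $a$ and $b$, so that the probability of $f$-typical counts under $g$, counting factor included, can be computed exactly):

```python
import math

a, b = 0.6, 0.3          # f = Bernoulli(a), g = Bernoulli(b); illustrative values
kl = a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))  # K(f | g)

for n in (100, 1000, 10000):
    k = round(n * a)     # f-typical number of ones in a sample of size n
    # log P(a g-sample of size n has exactly k ones):
    # binomial counting factor times the likelihood b^k * (1-b)^(n-k)
    log_p = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1) \
            + k * math.log(b) + (n - k) * math.log(1 - b)
    print(n, -log_p / n, "vs K(f|g) =", kl)   # -log P / n  ->  K(f | g)
```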

Did
  • 284,245
  • What do you mean by $N(x,dx)$ here? – WeakLearner Jul 21 '15 at 03:08
  • @dimebucker91 The thing is defined in the post. What is not clear about the definition? – Did Jul 21 '15 at 07:10
  • @Did So you are defining N as the number within that interval? I just thought that N represented some known quantity – WeakLearner Jul 21 '15 at 07:22
  • @Did I guess I was just confused because you use the term $N(x,x+dx)$, is this a typo? Also, I'm confused when you say, the likelihood scales like.., how do you get that expression from the normal way to think about a likelihood as the product of $g(x_i)$'s? – WeakLearner Jul 21 '15 at 13:45
  • @dimebucker91 Were you in fact alluding to a typo? Anyway, I changed $N(x,\mathrm dx)$ into $N(x,x+\mathrm dx)$. – Did Jul 21 '15 at 13:57
  • @Did Could you elaborate on the statement " The likelihood scales like.." How do you get that from the normal definition of the likelihood? – WeakLearner Jul 22 '15 at 13:57
  • @dimebucker91 This approaches the continuous distributions with densities $f$ and $g$ by discrete distributions putting mass $f(x)dx$ and $g(x)dx$ on some regularly spaced values of $x$ each at distance $dx$ from the others. The usual likelihood for distributions $(p_i)$ and $(q_i)$ and numbers $(N_i)$ would be $$\prod_iq_i^{N_i}$$ By the law of large numbers, $N_i\approx np_i$ and we chose $p_i=f(x)dx$ and $q_i=g(x)dx$ at the $i$th point $x$, thus the "true" likelihood is approximately $$\prod_x(g(x)dx)^{nf(x)dx}\propto\prod_xg(x)^{nf(x)dx}$$ and you should see why the rest follows. Do you? – Did Jul 22 '15 at 14:06
  • @Did This definitely makes more sense now. Thank you. Just one last question about how you calculated the rate of decrease for the ratio of likelihoods. Could you point me in the direction or link me to a simple example of this kind of calculation ? – WeakLearner Jul 23 '15 at 03:03
  • @dimebucker91 Simply by forming the ratio of the approximate formulas for $\ell_n(f\mid g)$ and $\ell_n(g\mid g)$ given above. – Did Jul 23 '15 at 09:35
  • @Did so is due to the absolute continuity that we may say that the ratio of the likelihoods: $\exp (n \int ( f(x) \log g(x) - g(x) \log g(x) ) dx)$ is approximately equal to: $\exp (n \int ( f(x) \log f(x) - f(x) \log g(x) ) dx) $ ? – WeakLearner Jul 23 '15 at 13:16
  • @dimebucker91 Ahh, finally... 10 comments to arrive at the question of whether $(f,g)$ should not be $(g,f)$, in the end? Listen, I think I will let you decide the case since, between my post and the WP page, you have everything at hand you need to make up your mind. But do not forget to tell me your conclusion. Done deal? – Did Jul 23 '15 at 13:22
  • @Did I guess I am missing something obvious? Your answer implies that $H(f,g) - H(g) \approx H(f,g) - H(f) \implies H(f) \approx H(g)$ – WeakLearner Jul 24 '15 at 06:23
  • @dimebucker91 No, my answer does not imply that. Did you read my last comment? It could help. – Did Jul 24 '15 at 06:26
6

Kullback-Leibler divergence can be regarded as better in the following sense:

For two probability measures $P$ and $Q$, Pinsker's inequality states that $$ |P-Q|\le [2 KL(P\|Q)]^{\frac{1}{2}},$$ where the left-hand side is the total variation distance (the $\ell_1$-norm $\int|p-q|$ of the difference of the densities). So convergence in the KL sense is stronger than convergence in total variation. The motivation comes from information theory, as Jeff pointed out.
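
A quick numerical sanity check of Pinsker's inequality (a sketch with randomly generated discrete distributions; not part of the original answer):

```python
import numpy as np

rng = np.random.default_rng(0)

for _ in range(5):
    # two random probability vectors on 5 points (Dirichlet draws are strictly positive)
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    l1 = np.abs(p - q).sum()                          # |P - Q|, the L1 / total-variation quantity
    bound = np.sqrt(2 * np.sum(p * np.log(p / q)))    # [2 KL(P || Q)]^(1/2)
    print(f"|P-Q| = {l1:.4f}  <=  sqrt(2 KL) = {bound:.4f}")
```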

Michael K
  • 249
Ashok
  • 1,981
1

Following a calculation similar to the great answer already posted here, the KL divergence is essential to maximum likelihood estimation (MLE) given a set of discrete data points. The parameter value whose predicted distribution minimizes the KL divergence from the observed frequencies is exactly the one that maximizes the likelihood. This is in contrast to the case where the measured quantities have normally distributed errors, for which minimizing the Euclidean distance (least squares) is what is equivalent to performing MLE.

Suppose that you have $N$ i.i.d. events that each yield one of $M$ different results, and you measure result $i$ $n_i$ times. You have access to a theory that tells you the predicted probability of getting result $i$ given some underlying parameter(s) $\theta$, $p(i|\theta)$, and you are trying to construct an estimator $\hat{\theta}$ that maximizes the likelihood $$\mathcal{L}({\theta}|n_1,\cdots,n_M)=\prod_{k=1}^N p(i_k|{\theta})=\prod_{i=1}^M [p(i|{\theta})]^{n_i}.$$ Straightforward calculations provide us with an alternate quantity to maximize: \begin{aligned} \hat{\theta}&=\arg\max_{\theta}\mathcal{L}({\theta}|n_1,\cdots,n_M)\\ &=\arg\max_{\theta}\ln\mathcal{L}({\theta}|n_1,\cdots,n_M)\\ &=\arg\max_{\theta}\sum_{i=1}^M n_i \ln p(i|\theta)\\ &=\arg\max_{\theta}\sum_{i=1}^M q_i \ln p(i|\theta), \end{aligned} where we have defined the measured frequencies as $q_i=n_i/N$. We continue with the manipulations: \begin{aligned} \hat{\theta}&=\arg\max_{\theta}\left[\sum_{i=1}^M q_i \ln p(i|\theta)- \sum_{i=1}^M q_i\ln q_i\right]\\ &=\arg\max_{\theta}\sum_{i=1}^M q_i \ln \frac{p(i|\theta)}{q_i}\\ &=\arg\min_{\theta}\sum_{i=1}^M q_i \ln \frac{q_i}{p(i|\theta)}\\ &=\arg\min_{\theta}K[Q(n_1,\cdots,n_M):P(\theta)]. \end{aligned}

We immediately see that the estimator $\hat{\theta}$ arises from minimizing the KL divergence between the measured frequencies $q_i$ and the theoretical probabilities $p(i|\theta)$ that depend on the underlying parameter(s). This is preferable to forming an estimator by minimizing the Euclidean distance between $Q(n_1,\cdots,n_M)$ and $P(\theta)$ because that estimator will only maximize the likelihood function for normally-distributed errors.
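
To see this equivalence numerically, here is a sketch with a made-up three-outcome model $p(i\mid\theta)$ (the number of successes in two Bernoulli($\theta$) trials) and hypothetical counts; maximizing the log-likelihood and minimizing $K[Q:P(\theta)]$ over a grid of $\theta$ pick out the same value:

```python
import numpy as np

# Hypothetical measured counts n_i for M = 3 outcomes (N = 100 events).
counts = np.array([30.0, 50.0, 20.0])
q = counts / counts.sum()                      # measured frequencies q_i

def model(theta):
    """Toy model p(i | theta): i = number of successes in 2 Bernoulli(theta) trials."""
    return np.array([(1 - theta) ** 2, 2 * theta * (1 - theta), theta ** 2])

thetas = np.linspace(0.01, 0.99, 981)
loglik = np.array([np.sum(counts * np.log(model(t))) for t in thetas])   # ln L(theta | n_1..n_M)
kl     = np.array([np.sum(q * np.log(q / model(t))) for t in thetas])    # K[Q : P(theta)]

print("argmax log-likelihood :", thetas[np.argmax(loglik)])
print("argmin KL divergence  :", thetas[np.argmin(kl)])                  # same theta
```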

1

Ref: "Weighing the odds" by Williams.

[This is similar to the answer here]

Let ${ Y _1, Y _2 , \ldots }$ be an iid sample with true pdf ${ f . }$ Let ${ g }$ be another pdf. By the Strong Law of Large Numbers, with ${ \mathbb{P} _f }$ probability ${ 1 }$ we have

$${ \frac{1}{n} \ln \frac{\text{lhd}(g; Y _1, \ldots, Y _n)}{\text{lhd}(f; Y _1, \ldots, Y _n)} = \frac{1}{n} \sum _{i = 1} ^{n} \ln \frac{g(Y _i)}{f(Y _i)} \longrightarrow \mathbb{E} _f \ln \frac{g(Y)}{f(Y)} . }$$

Note that by Jensen's inequality we have

$${ \mathbb{E} _f \ln \frac{g(Y)}{f(Y)} \leq \ln \mathbb{E} _f \frac{g(Y)}{f(Y)} = 0 . }$$

We define the relative entropy as

$${ \text{App}(f \leftarrow g) := - \mathbb{E} _f \ln \frac{g(Y)}{f(Y)} \geq 0 . }$$

So roughly speaking, the typical likelihood ratio ${ \frac{\text{lhd}(g; Y _1, \ldots, Y _n)}{\text{lhd}(f; Y _1, \ldots, Y _n)} }$ of a sample ${ Y _1, \ldots, Y _n }$ is of the order of magnitude ${ e ^{- n \text{App}(f \leftarrow g) } . }$ The larger the relative entropy ${ \text{App}(f \leftarrow g) , }$ the smaller the typical likelihood ratio of a sample, and the larger the deviation of the pdf ${ g }$ from the true pdf ${ f . }$
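
A quick simulation of this convergence (a sketch, not from the reference, taking ${ f = \mathcal N(0,1) }$ and ${ g = \mathcal N(1,1) }$, for which ${ \text{App}(f \leftarrow g) = K(f\mid g) = \tfrac{1}{2} }$):

```python
import numpy as np

rng = np.random.default_rng(1)

# f = N(0,1) is the true pdf, g = N(1,1) an alternative; App(f <- g) = 1/2 here.
def log_f(y): return -0.5 * y**2 - 0.5 * np.log(2 * np.pi)
def log_g(y): return -0.5 * (y - 1.0)**2 - 0.5 * np.log(2 * np.pi)

for n in (100, 10_000, 1_000_000):
    y = rng.normal(0.0, 1.0, size=n)             # iid sample Y_1..Y_n from f
    avg = np.mean(log_g(y) - log_f(y))           # (1/n) * log of the likelihood ratio
    print(n, avg, "  (should approach -App(f <- g) = -0.5)")
```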