In most introductory textbooks on information theory, the entropy of a discrete random variable (r.v.) is defined as
$$H(X) \triangleq -\sum_x p(x)\log p(x)=-\mathbb E[\log p(X)],$$
where $p$ is the pmf of $X$; while that of a continuous random variable is given by
$$h(X)\triangleq -\int f(x)\log f(x) dx=-\mathbb E[\log f(X)],$$
where $f$ is the pdf of $X$.
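For concreteness, here is a small numerical sketch of the two definitions (my own illustration, assuming NumPy and SciPy are available): the discrete entropy of a toy pmf and the differential entropy of a standard Gaussian, both in nats.

```python
# Illustration of H(X) for a pmf and h(X) for a pdf (everything in nats).
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Discrete case: H(X) = -sum_x p(x) log p(x) for a toy pmf on {0, 1, 2}.
p = np.array([0.5, 0.25, 0.25])
H = -np.sum(p * np.log(p))                       # ~ 1.0397 nats

# Continuous case: h(X) = -int f(x) log f(x) dx for X ~ N(0, 1).
f = norm(0.0, 1.0).pdf
h_numeric = -quad(lambda x: f(x) * np.log(f(x)), -10, 10)[0]
h_closed = 0.5 * np.log(2 * np.pi * np.e)        # known closed form, ~ 1.4189

print(H, h_numeric, h_closed)
```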
Question: Is there a unified definition of entropy for an arbitrary random variable?
My question is motivated by Robert M. Gray's book, "Entropy and Information Theory." In that book, he provides a unified definition of divergence and hence of mutual information:
Given a probability space $(\Omega, \mathcal B, P)$ and another probability measure $M$ on the same space, define the divergence of $P$ with respect to $M$ by
$$D(P\, \Vert\, M)\triangleq \sup_{\mathcal Q} \sum_{Q\in \mathcal Q}P(Q)\log\frac{P(Q)}{M(Q)},$$
where the supremum is over all finite measurable partitions $\mathcal Q$ of $\Omega.$
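To see how the supremum over finite partitions behaves, here is a rough sketch (my own, not code from Gray's book; it assumes SciPy) that refines a partition of the real line and compares the partition sum against the known closed form $D(\mathcal N(0,1)\,\Vert\,\mathcal N(1,1)) = 1/2$ nat.

```python
# Partition-based divergence: refine a finite partition of R and watch the
# sum P(Q) log(P(Q)/M(Q)) approach the usual KL divergence (0.5 nats here).
import numpy as np
from scipy.stats import norm

P = norm(0.0, 1.0)   # measure P
M = norm(1.0, 1.0)   # reference measure M; KL(P || M) = 0.5 nats

def partition_divergence(n_cells, lo=-8.0, hi=8.0):
    """Partition = n_cells equal intervals on [lo, hi] plus the two tails."""
    edges = np.concatenate(([-np.inf], np.linspace(lo, hi, n_cells + 1), [np.inf]))
    pQ = np.diff(P.cdf(edges))
    mQ = np.diff(M.cdf(edges))
    mask = pQ > 0                      # 0 * log(0/m) = 0 by convention
    return np.sum(pQ[mask] * np.log(pQ[mask] / mQ[mask]))

for n in (2, 8, 32, 128, 512):
    print(n, partition_divergence(n))  # increases toward 0.5 as cells shrink
```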
For any two random variables $X$ and $Y$, define $$I(X;Y)\triangleq D(P_{XY}\Vert P_X\times P_Y),$$ where $P_{XY}$ and $P_X\times P_Y$ are the joint distribution and product distribution of $X$ and $Y$, respectively.
As I understand it, the nice thing about this definition of mutual information is that it works for arbitrary random variables, and it reduces to the usual definition of $I(X;Y)$ when $X$ and $Y$ are both discrete or both continuous r.v.'s.
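As a sanity check on that claim, the following sketch (my own illustration, assuming NumPy) quantizes a jointly Gaussian pair with correlation $\rho$ and computes the discrete mutual information of the quantized pair; as the quantizers get finer it approaches the closed form $-\tfrac12\log(1-\rho^2)$.

```python
# Plug-in estimate of I(X;Y) for jointly Gaussian X, Y via finite quantizers.
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8
n = 1_000_000
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)

def plugin_mi(x, y, bins):
    """Quantize X and Y into equal-width cells and compute the discrete MI."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask]))

print(-0.5 * np.log(1 - rho**2))       # ~ 0.5108 nats (closed form)
for b in (4, 16, 64):
    print(b, plugin_mi(x, y, b))       # approaches the closed form
```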
Gray then goes on to define entropy in terms of the mutual information defined above:
$$H(X)\triangleq I(X;X).$$
For a discrete r.v., this also reduces to the regular definition of entropy given at the very beginning of this question. Perfect. However, if $X$ is a continuous r.v., say Gaussian, then I think this definition gives $H(X)=\infty,$ since it implies that
$$H(X)=\sup_q H(q(X)),$$ where the supremum is over all finite quantizers $q$ of $X$, and refining the quantizer drives $H(q(X))$ to infinity. So it appears inconsistent with $h(X)$, the usual finite (differential) entropy, doesn't it? Hence the question.
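Here is a quick numerical illustration of that blow-up (again my own sketch, assuming SciPy): for a uniform quantizer with cell width $\Delta$, $H(q(X)) \approx h(X) - \log\Delta$, which diverges as $\Delta \to 0$.

```python
# Entropy of a uniformly quantized N(0,1): H(q(X)) ~ h(X) - log(delta).
import numpy as np
from scipy.stats import norm

X = norm(0.0, 1.0)
h_X = 0.5 * np.log(2 * np.pi * np.e)        # differential entropy, ~ 1.4189 nats

for delta in (1.0, 0.1, 0.01, 0.001):
    edges = np.arange(-10.0, 10.0 + delta, delta)
    p = np.diff(X.cdf(edges))
    p = p[p > 0]
    H_q = -np.sum(p * np.log(p))            # entropy of the quantized variable
    print(delta, H_q, h_X - np.log(delta))  # agree increasingly well as delta shrinks
```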
How do we reconcile this inconsistency? Does it make more sense to accept that the entropy of a continuous r.v. is infinite, avoid using it, and work only with its mutual information with other random variables? And when we do need an entropy-like quantity, do we fall back on the differential entropy $h(X)$, treated as something defined differently from the entropy $H(X)$?