
The data processing inequality states that if you have a Markov chain of random variables $X \rightarrow Y \rightarrow Z$, then $I(X;Y) \geq I(X;Z)$.

This all makes sense in the discrete case, but in the continuous case, which seems to be where it is actually applied (for instance to neural networks, https://arxiv.org/abs/1703.00810), there appears to be a counterexample:

If I pick $X \sim \mathrm{Unif}(0,0.5)$, $Y=X$, and $Z=c$ where $c$ is some constant,

then $I(X;Y)=I(X;X)=H(X)=-\log(2)$, and $I(X;Z)=0$ since $X$ and $Z$ are clearly independent.
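(For reference, the $-\log(2)$ is the differential entropy of the uniform density $f(x)=2$ on $[0,0.5]$:

$$h(X) = -\int_0^{1/2} f(x)\log f(x)\,\mathrm{d}x = -\int_0^{1/2} 2\log 2\,\mathrm{d}x = -\log 2.)$$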

but $-\log(2) \ngeq 0$. So the data processing inequality is wrong?

Is there any way to resolve this issue?

mathreadler

1 Answer


The line

$$I(X;Y)=I(X;X)=H(X)=-\log(2)$$

is wrong. Which equality is false depends on what you mean by $H(X)$.

If you mean the differential entropy (let's better write $h(X)$ in that case), then the equality $I(X;X)=h(X)$ is false. It's indeed true that $I(X;X)=h(X)-h(X\mid X)$, but $h(X\mid X)$ (which is the differential entropy of a constant, i.e., a Dirac delta density) is not zero but minus infinity. (If you are not convinced of this, compute the differential entropy of a uniform on $[0,a]$ and let $a\to 0$.)
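Spelling out that suggested computation: the uniform density on $[0,a]$ is $f(x)=1/a$, so

$$h(\mathrm{Unif}[0,a]) = -\int_0^a \frac{1}{a}\log\frac{1}{a}\,\mathrm{d}x = \log a \longrightarrow -\infty \quad\text{as } a\to 0^+.$$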

If you mean the true entropy (Shannon entropy), then you can indeed write $I(X;X)=H(X)$, but now $H(X)=+\infty$, because a continuous variable (with support over an interval of positive length) carries an infinite amount of information.
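One way to see this: partition $[0,0.5]$ into $n$ equal bins and let $X_n$ be the bin index, which is a function of $X$. Since $X$ is uniform, each bin has probability $1/n$, so

$$H(X) \;\geq\; H(X_n) \;=\; -\sum_{k=1}^{n}\frac{1}{n}\log\frac{1}{n} \;=\; \log n$$

for every $n$, and hence $H(X)=+\infty$.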

In both cases, $I(X;Y)=+\infty$, so there is no contradiction with the data processing inequality.

The moral is: don't treat the differential entropy as if it were a (Shannon) entropy.

leonbloy
  • Correct me if I'm wrong, but that would mean that in the context of neural networks, if I have a random variable $X$ (representing my input distribution), $Y=f(X)$ for some deterministic map $f$ (the first hidden layer of the neural network), and $Z=g(Y)$ for some deterministic map $g$ (the second layer of my neural network), then $I(X;X)=\infty$, $I(X;Y)=\infty$, $I(X;Z)=\infty$. – puzzleshark Nov 28 '17 at 18:05
  • That would depend on the map. In your example, where $g(Y)=c$, we have $I(Y;Z)=0$ (they are independent). – leonbloy Nov 28 '17 at 18:33
  • I think it should be true for every non-constant measurable function. It should hold whenever the joint distribution $P(f(X),X)$ is not absolutely continuous w.r.t. $P(f(X))P(X)$. – puzzleshark Nov 29 '17 at 00:29