
The data processing inequality states that if you have a Markov chain of random variables $X \rightarrow Y \rightarrow Z$, then $I(X;Y) \geq I(X;Z)$.

This all makes sense in the discrete case, but in the continuous case, which seems to be where it is actually applied (for instance to neural networks, https://arxiv.org/abs/1703.00810), there appears to be a counterexample:

If I pick $X \sim \mathrm{Unif}(0,0.5)$, $Y=X$, and $Z=c$ where $c$ is some constant,

then $I(X;Y)=I(X;X)=H(X)=-\log(2)$, and $I(X;Z)=0$ since $X$ and $Z$ are clearly independent.
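(For reference, the $-\log(2)$ is the differential entropy of the uniform density $f(x)=2$ on $[0,0.5]$:

$$h(X) = -\int_0^{1/2} f(x)\log f(x)\,\mathrm{d}x = -\int_0^{1/2} 2\log 2\,\mathrm{d}x = -\log 2.)$$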

but $-\log(2) \ngeq 0$. So the data processing inequality is wrong?

Is there any way to resolve this issue?

mathreadler

1 Answer


The line

$$I(X;Y)=I(X;X)=H(X)=-\log(2)$$

is wrong. Which equality is false depends on what you mean by $H(X)$.

If you mean the differential entropy (let's better write $h(X)$ in that case), then the equality $I(X;X)=h(X)$ is false. It's indeed true that $I(X;X)=h(X)-h(X\mid X)$, but $h(X\mid X)$ (which is the differential entropy of a constant, i.e., a Dirac delta density) is not zero but minus infinity. (If you are not convinced of this, compute the differential entropy of a uniform on $[0,a]$ and let $a\to 0$.)
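Spelling out that suggested computation: the uniform density on $[0,a]$ is $f(x)=1/a$, so

$$h(\mathrm{Unif}[0,a]) = -\int_0^a \frac{1}{a}\log\frac{1}{a}\,\mathrm{d}x = \log a \longrightarrow -\infty \quad\text{as } a\to 0^+.$$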

If you mean the true entropy (Shannon entropy), then you can indeed write $I(X;X)=H(X)$, but now $H(X)=+\infty$, because a continuous variable (with support over an interval of positive length) carries an infinite amount of information.
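One way to see this: partition $[0,0.5]$ into $n$ equal bins and let $X_n$ be the bin index, which is a function of $X$. Since $X$ is uniform, each bin has probability $1/n$, so

$$H(X) \;\geq\; H(X_n) \;=\; -\sum_{k=1}^{n}\frac{1}{n}\log\frac{1}{n} \;=\; \log n$$

for every $n$, and hence $H(X)=+\infty$.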

In both cases, $I(X;Y)=+\infty$, so there is no contradiction with the data processing inequality.

The moral is: don't treat the differential entropy as if it were a (Shannon) entropy.

leonbloy
  • Correct me if I'm wrong, but that would mean that in the context of neural networks, if I have a random variable $X$ (representing my input distribution), $Y=f(X)$ for some deterministic map $f$ (the first hidden layer of the neural network), and $Z=g(Y)$ for some deterministic map $g$ (the second layer of my neural network), then $I(X;X)=\infty$, $I(X;Y)=\infty$, $I(X;Z)=\infty$. – puzzleshark Nov 28 '17 at 18:05
  • That would depend on the map. In your example, where $g(Y)=c$, we have $I(Y;Z)=0$ (they are independent). – leonbloy Nov 28 '17 at 18:33
  • I think it should be true for every non-constant measurable function. It should hold whenever the joint distribution $P(f(X),X)$ is not absolutely continuous w.r.t. $P(f(X))P(X)$. – puzzleshark Nov 29 '17 at 00:29