
In a paper on transport inequalities by Nathael Gozlan, the following assertion is made:

Let the relative entropy with respect to $\mu \in P(\mathcal X)$ be defined by $$ H(\nu \mid \mu) = \left\{ \begin{array}{@{}ll@{}} \int_\mathcal X \log(\frac{d\nu}{d\mu})d\nu, & \text{if}\ \nu \ll\mu \\ +\infty, & \text{otherwise} \end{array}\right. , \ \nu \in P(\mathcal X) $$

Now, let $\mu_1$ and $\mu_2$ be probability measures defined on $\mathcal X_1$ and $\mathcal X_2$, respectively. For a measure $\nu$ on $\mathcal X_1 \times \mathcal X_2$, write the disintegration of $\nu$ (conditional distribution) with respect to the first coordinate as: $$ d\nu(x_1,x_2) = d\nu_1(x_1)\, d\nu^{x_1}(x_2) $$ Note that the disintegration is essentially a formal way of writing the conditional probability formula $P(X=x_1,Y=x_2) = P(Y=x_2 \mid X =x_1)\,P(X=x_1)$.

Finally, the author asserts that for the product measure $\mu_1 \otimes\mu_2$ (also written $\mu_1 \times \mu_2$; the notation differs but the meaning is the same), one can prove the following equality: $$ H(\nu \mid \mu_1 \otimes \mu_2) = H(\nu_1 \mid \mu_1) + \int_{\mathcal X_1} H(\nu^{x_1}\mid \mu_2)\,d\nu_1(x_1) $$
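As a sanity check, the claimed equality can be verified numerically in the discrete case, where all the measures are finite probability vectors and the integrals become sums. The numbers below are hypothetical, chosen only so that everything is strictly positive and sums to one:

```python
import numpy as np

# Hypothetical discrete example: X1 has 2 points, X2 has 3 points.
nu = np.array([[0.10, 0.25, 0.05],
               [0.20, 0.15, 0.25]])      # joint measure nu on X1 x X2
mu1 = np.array([0.4, 0.6])               # reference measure on X1
mu2 = np.array([0.3, 0.3, 0.4])          # reference measure on X2

def kl(p, q):
    """Relative entropy H(p | q) for discrete measures with p << q."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# Left-hand side: H(nu | mu1 x mu2), with mu1 x mu2 the product measure.
lhs = kl(nu.ravel(), np.outer(mu1, mu2).ravel())

# Right-hand side: H(nu1 | mu1) + integral of H(nu^{x1} | mu2) d nu1(x1).
nu1 = nu.sum(axis=1)                     # marginal of the first coordinate
cond = nu / nu1[:, None]                 # disintegration: nu^{x1}(x2)
rhs = kl(nu1, mu1) + sum(nu1[x] * kl(cond[x], mu2) for x in range(len(nu1)))

print(np.isclose(lhs, rhs))
```

In the discrete setting the identity is immediate from $\nu(x_1,x_2) = \nu_1(x_1)\nu^{x_1}(x_2)$ and $\log(ab) = \log a + \log b$; the question is how to justify the same steps in the abstract setting.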

My question is how to prove this equality above.

Since the definition of a disintegration is not very common, I will give it here to save people the trouble of hunting it down:

Let $(\Omega, \mathcal F)$ and $(E, \mathcal A)$ be two Polish (complete and separable metric) measurable spaces. If $P$ is a probability measure on $(\Omega \times E, \mathcal F \otimes \mathcal A)$ and $P_1$ is the marginal distribution of the first coordinate, then there exists a unique probability kernel $K: \Omega \times \mathcal A \rightarrow [0,1]$ satisfying:

$$ P(A\times B) = \int_A K(\omega, B) P_1(d\omega), \ \forall A \in \mathcal F, \ B \in \mathcal A $$

In this case, we can define $$ P[X_2 \in B \mid X_1 = \omega] := K(\omega,B), $$

where $X_1$ and $X_2$ denote the first and second coordinates, respectively.
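In the discrete case the kernel is just the conditional probability table, and the defining property $P(A\times B) = \int_A K(\omega, B)\, P_1(d\omega)$ reduces to a finite sum. A small sanity check with hypothetical numbers:

```python
import numpy as np

# Hypothetical discrete illustration of the disintegration property
# P(A x B) = sum over omega in A of K(omega, B) * P1(omega).
P = np.array([[0.10, 0.25, 0.05],
              [0.20, 0.15, 0.25]])       # joint law on Omega x E
P1 = P.sum(axis=1)                        # marginal of the first coordinate
K = P / P1[:, None]                       # kernel: row omega is K(omega, .)

A = [0]            # a subset of Omega
B = [1, 2]         # a subset of E
lhs = P[np.ix_(A, B)].sum()                       # P(A x B)
rhs = sum(K[w, B].sum() * P1[w] for w in A)       # integral of K(., B) over A

print(np.isclose(lhs, rhs))
```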

Rushabh Mehta
    This is chain rule of KL divergence, which follows from writing out the logs and noting that log(ab) = log(a) + log(b); see here lemma 3: https://homes.cs.washington.edu/~anuprao/pubs/CSE533Autumn2010/lecture3.pdf – E-A Sep 07 '20 at 19:00
  • Thanks for the comment! I saw that in the lecture notes the proof is done for the case of discrete random variables. Would you know how to prove for this more abstract case? I mean, using disintegration and all. – Davi Barreira Sep 07 '20 at 19:07
  • No worries; uh, is it not exactly the same? I mean instead of conditional probabilities, you can write dv^{x_1}(x_2). I guess if you can perhaps update your question with your attempt at translating that pdf into the continuous realm, I can check for its correctness. – E-A Sep 07 '20 at 19:49
  • cheers, I’ll give it a try – Davi Barreira Sep 07 '20 at 20:26
  • unfortunately it is not as straightforward as it seems. It’s not clear for example if you can use $\log \frac{d\nu_1d\nu_2^{x_1}}{d\mu_1\times\mu_2}= \log \frac{d\nu_1}{d\mu_1\times\mu_2} + \log \frac{d\nu_2^{x_1}}{d\mu_1\times\mu_2}$. – Davi Barreira Sep 07 '20 at 20:57
  • Your denominators need a correction. So, do you want to know why $\frac{dv}{d \mu_1 \times d \mu_2} (x_1, x_2) = \frac{d v_1}{d \mu_1} (x_1) \cdot \frac{d v_2^{x_1}}{d \mu_2} (x_2)$? if so, can you share your definition of a disintegration? – E-A Sep 07 '20 at 21:12
  • Thanks. Yes, that’s the step I don’t know how to prove. I added the definition of disintegration to the question. – Davi Barreira Sep 07 '20 at 22:19

1 Answer


In order to answer this question, I first answered a related question on absolute continuity, and now we are ready to discuss the general chain rule of the relative entropy. As discussed in that answer, the conditional Radon-Nikodym derivative is well-defined for Borel spaces, and thus we get the following.

Since the notation $H(\nu|\mu)$ may be easily confused with the conditional entropy, I'll write $D(\nu\|\mu)$ for the relative entropy instead, and $D(\nu\|\mu|\nu_1)$ for the conditional relative entropy (in lack of a better alternative). Notice that we have \begin{align*} D(\nu\|\mu) &=\int\log\left(\frac{\mathrm d\nu}{\mathrm d\mu}(x,y)\right)\nu(\mathrm dx,\mathrm dy) =\int\int\log\left(\frac{\mathrm d\nu_1}{\mathrm d\mu_1}(x)\frac{\mathrm d\nu_x}{\mathrm d\mu_x}(y)\right)\nu_x(\mathrm dy)\nu_1(\mathrm dx)\\ &=\int\log\left(\frac{\mathrm d\nu_1}{\mathrm d\mu_1}(x)\right)\nu_1(\mathrm dx) +\int\int\log\left(\frac{\mathrm d\nu_x}{\mathrm d\mu_x}(y)\right)\nu_x(\mathrm dy)\nu_1(\mathrm dx)\\ &=D(\nu_1\|\mu_1)+D(\nu\|\mu|\nu_1). \end{align*} The first step, where we replace the derivative by the product of derivatives, was thoroughly discussed in the other answer.

Finally, for the product measure $\mu = \mu_1\otimes\mu_2$ of the question, the disintegration of $\mu$ satisfies $\mu_x = \mu_2$ for every $x$, so the conditional relative entropy $D(\nu\|\mu|\nu_1)$ is exactly the integral $\int_{\mathcal X_1} D(\nu^{x}\|\mu_2)\,\nu_1(\mathrm dx)$ appearing in the asserted equality.
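The crucial step is the pointwise factorization of the Radon-Nikodym derivative, $\frac{\mathrm d\nu}{\mathrm d(\mu_1\times\mu_2)}(x,y) = \frac{\mathrm d\nu_1}{\mathrm d\mu_1}(x)\cdot\frac{\mathrm d\nu_x}{\mathrm d\mu_2}(y)$. In the discrete case this factorization can be checked directly, since all the derivatives are ratios of probability vectors (hypothetical numbers below):

```python
import numpy as np

# Discrete sanity check of the factorization
# d nu / d(mu1 x mu2)(x, y) = d nu1 / d mu1 (x) * d nu^x / d mu2 (y).
nu = np.array([[0.10, 0.25, 0.05],
               [0.20, 0.15, 0.25]])      # joint measure on X1 x X2
mu1 = np.array([0.4, 0.6])
mu2 = np.array([0.3, 0.3, 0.4])

joint_density = nu / np.outer(mu1, mu2)          # d nu / d(mu1 x mu2)
nu1 = nu.sum(axis=1)
marginal_density = nu1 / mu1                     # d nu1 / d mu1
cond_density = (nu / nu1[:, None]) / mu2         # d nu^x / d mu2, row x

factored = marginal_density[:, None] * cond_density
print(np.allclose(joint_density, factored))
```

Taking logarithms of both sides of this factorization and integrating against $\nu$ is precisely the computation in the display above.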

Matija