12

I understand the Kullback-Leibler divergence well enough when it comes to a probability distribution over a single variable. However, I'm currently trying to teach myself variational methods and the use of the KL divergence in conditional probabilities is catching me out. The source I'm working from is here.

Specifically, the author represents the KL divergence as follows:

$$ \operatorname{KL} (Q_{\phi} (Z|X) || P(Z|X)) = \sum_{z \in Z} q_{\phi} (z|x) \log\frac{q_{\phi} (z|x)}{p(z|x)} $$

Where the confusion arises is on the summation across $Z$. Given that $z \in Z$ and $x \in X$, I would have expected (by analogy with conditional entropy) a double sum here of the form:

$$ \operatorname{KL} (Q_{\phi} (Z|X)||P(Z|X)) = \sum_{z \in Z} \sum_{x \in X} q_{\phi} (z|x) \log\frac{q_{\phi} (z|x)}{p(z|x)} $$

Otherwise, it seems to me that KL is only being calculated for one sample from $X$. Am I missing something basic here? And if my intuitions are off, any tips on getting them back on track would be useful––I'm teaching myself this stuff, so I don't have the benefit of formal instruction.

Lodore66
  • 225
  • 1
    Would be useful to know a bit more about what's confusing you/not matching up with expectations. $KL(Q(Z|X) \| P(Z|X))$ is the KL-divergence between two conditional probability distributions; you are conditioning on $X$ & so you don't marginalise it out. You would equally write $KL(\tilde{Q} \| \tilde{P}) = \sum_z \tilde{q}(z) \log \frac{\tilde{q}(z)}{\tilde{p}(z)}$ where any mention of $x$ is thrown away until we actually need it again – Nadiels Nov 28 '18 at 00:32
  • Thanks for the input here. When you say "you are conditioning on X & so you don't marginalise it out"––that's where my confusion is. Let's say $Z = \{z_1, z_2, z_3\}$ and $X = \{x_1, x_2\}$. The conditional distributions $Q(Z|X)$ and $P(Z|X)$ must then––I think!––have as variables the set $Z \times X$. But when the KL doesn't sum across all the values of $X$, it seems to me that it's just conditioning on one value of $X$––say, $x_1$. Hence, it leaves out most of the distribution. – Lodore66 Nov 28 '18 at 16:52
  • I think $X$ is a random variable, and $P(Z|X)$ is a posterior probability distribution. – GoingMyWay Apr 22 '20 at 03:05

2 Answers

17

It depends on whether you are conditioning on a random variable or an event.

Given a random variable $x$,

$$ \operatorname{KL}[p(y \mid x) \,\|\, q(y \mid x)] \doteq \iint p(\bar{x},\bar{y}) \ln\frac{p(\bar{y} \mid \bar{x})}{q(\bar{y} \mid \bar{x})} \mathrm{d}\bar{x} \mathrm{d}\bar{y} \quad\text{or}\quad \sum_{\bar{x}}\sum_{\bar{y}} p(\bar{x},\bar{y}) \ln\frac{p(\bar{y} \mid \bar{x})}{q(\bar{y} \mid \bar{x})}. $$

Given an event $\bar{x}$,

$$ \operatorname{KL}[p(y \mid \bar{x}) \,\|\, q(y \mid \bar{x})] \doteq \int p(\bar{y}|\bar{x}) \ln\frac{p(\bar{y} \mid \bar{x})}{q(\bar{y} \mid \bar{x})} \mathrm{d}\bar{y} \quad\text{or}\quad \sum_{\bar{y}} p(\bar{y}|\bar{x}) \ln\frac{p(\bar{y} \mid \bar{x})}{q(\bar{y} \mid \bar{x})}. $$

Note how conditioning on an event is equivalent to changing the probability distribution over its variable to a point mass. This is what turns the joint into a conditional above,

$$ p'(x,y) \doteq p(y|x)\delta_{\bar{x}}(x)=p(y|\bar{x}). $$

To be more explicit, instead of the KL conditioned on a random variable, you can also use an expectation over events of the KL conditioned on each event,

$$ \operatorname{KL}[p(y \mid x) \,\|\, q(y \mid x)] =\operatorname{E}_{\bar{x}\sim p(x)}\big[ \operatorname{KL}[p(y \mid \bar{x}) \,\|\, q(y \mid \bar{x})] \big]. $$

Mixing up random variables and events is quite common, but it's often easy to tell from context which is meant.
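As a quick numerical sketch (with made-up numbers, purely to illustrate the two forms above), conditioning on the random variable $x$ and averaging the per-event KLs under $p(x)$ give the same value:

```python
import numpy as np

# Hypothetical discrete example: 2 values of x, 3 values of y.
px = np.array([0.4, 0.6])                     # p(x)
p_y_given_x = np.array([[0.2, 0.3, 0.5],      # p(y | x); each row sums to 1
                        [0.6, 0.1, 0.3]])
q_y_given_x = np.array([[0.3, 0.3, 0.4],      # q(y | x)
                        [0.5, 0.2, 0.3]])

# Conditioning on the random variable x: weight by the joint p(x, y).
joint = px[:, None] * p_y_given_x
kl_rv = np.sum(joint * np.log(p_y_given_x / q_y_given_x))

# Conditioning on each event x = x_bar gives one number per x ...
kl_per_event = np.sum(p_y_given_x * np.log(p_y_given_x / q_y_given_x), axis=1)

# ... and averaging those numbers under p(x) recovers the first quantity.
kl_expected = np.dot(px, kl_per_event)

print(np.isclose(kl_rv, kl_expected))  # True
```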

danijar
  • 721
  • 1
    Thanks, this is a source of some really good intuition, especially the way you pinpoint the error as a confusion of a variable and an outcome. – Lodore66 May 17 '20 at 09:32
  • @Lodore66 Glad it's helpful. Feel free to accept the answer so that it shows up on top for other visitors. – danijar May 17 '20 at 18:55
  • @danijar a bit late to the question, but if you're given a data set of realized events $\bar{x}_1, \ldots, \bar{x}_n$, you can compute the KL divergence at each event $\bar{x}_i$, correct? And then how would you compute $E_{\bar{x}_i \sim P(x)}[KL(P(y|\bar{x}_i) \,\|\, q(y|\bar{x}_i))]$ if you do not have access to the distribution $P(x)$? I guess trivially you can assume a uniform distribution and simply take the average .. is there a refinement of that? – chibro2 Apr 30 '24 at 15:16
1

I don't quite see what confuses you. Think about how we compute, for example, a conditional expectation: $E(Z \mid X)=\sum_Z P(Z \mid X)$: that is, we sum only over $Z$, and the result is a function of the conditioning variable $X$. (Put another way, for each value of $X$ we have that $P(Z \mid X=x)$ is a different probability distribution––and hence for each value of $X$ we have different values of the (conditioned to $X=x$) expectation, variance, etc.) The same happens here. The conditioned KL divergence is not a number, but a function of $X$.
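To make this concrete, here is a toy sketch with invented numbers, using the sets $Z = \{z_1, z_2, z_3\}$ and $X = \{x_1, x_2\}$ from the comment thread above: summing over $z$ alone leaves one KL value per $x$, i.e. a function of $X$, not a single number.

```python
import numpy as np

# Hypothetical conditional distributions; row i is the distribution over Z
# given X = x_i, so each row sums to 1.
q_z_given_x = np.array([[0.2, 0.2, 0.6],   # Q(Z | X = x_i)
                        [0.3, 0.3, 0.4]])
p_z_given_x = np.array([[0.1, 0.2, 0.7],   # P(Z | X = x_i)
                        [0.5, 0.4, 0.1]])

# KL(Q(Z|X=x) || P(Z|X=x)): sum over z only, leaving one value per x.
kl_of_x = np.sum(q_z_given_x * np.log(q_z_given_x / p_z_given_x), axis=1)

print(kl_of_x)  # a different divergence for each x in X
```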

leonbloy
  • 66,202
  • 1
    This is exactly what I wasn't getting:

    "each value of $X$ we have that $P(Z \mid X=x)$ is a different probability distribution - and hence for each value of $X$ we have different values of the (conditioned to $X=x$) expectation, variance, etc."

    I thought the KL output a value for each $P(Z \mid X=x)$, not a distribution, and that was why I kept thinking that the divergence calculated a unique value for $P(Z|X)$, not a range of values for each $x \in X$. Thanks so much!

    – Lodore66 Dec 01 '18 at 12:01
  • I think that's reasonably confusing, given how $H(Y|X)$ in information theory is usually taken to be a value and not a random variable. It makes a lot of sense though, but the right hand side of the definition should really use capital $X$ rather than $x$. – Thomas Ahle Apr 07 '20 at 21:36
  • $P(Z|X)$ is a posterior distribution right. I think the op was confused by the expectation of posterior distribution. – GoingMyWay Apr 22 '20 at 03:03
  • This answer is not quite correct. You can condition a KL divergence either on an outcome or on a random variable; both are fine. The example of an expectation also confuses random variables and outcomes and is missing a variable. It should probably be $E[Z|X] = \sum_z P(Z=z \mid X) \cdot z$. – danijar May 16 '20 at 20:27