
In general, the relative entropy is defined for all probability measures $\mu,\nu$ with $\mu\ll\nu$ (absolute continuity) and is given by $D(\mu\|\nu)=\int f\,\mathrm d\nu$, where $f(x)=\Lambda(\frac{\partial\mu}{\partial\nu}(x))$ and $\Lambda(x)=x\ln(x)$; it is customary to set $D(\mu\|\nu)=\infty$ otherwise. Notice that we may equivalently read $D(\mu\|\nu)=D(\nu,\frac{\partial\mu}{\partial\nu})$ as a function of a measure and a density with respect to that measure, i.e. $D(\nu,f)$ is defined whenever $f\ge 0$ and $\int f\,\mathrm d\nu=1$.
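On a finite space this definition is completely concrete. A minimal sketch (the pmfs below are made up for illustration), with $\Lambda(0)=0$ by convention:

```python
import math

def rel_entropy(mu, nu):
    """D(mu || nu) = integral of Lambda(dmu/dnu) dnu for finite pmfs,
    with Lambda(x) = x ln(x); returns infinity when mu is not << nu."""
    d = 0.0
    for m, n in zip(mu, nu):
        if m == 0:
            continue            # Lambda(0) = 0
        if n == 0:
            return math.inf     # mu is not absolutely continuous w.r.t. nu
        d += n * (m / n) * math.log(m / n)
    return d

print(rel_entropy([0.5, 0.5], [0.25, 0.75]))  # ≈ 0.1438
print(rel_entropy([0.5, 0.5], [1.0, 0.0]))    # inf
```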

For the question, I'd rather move to random variables, so let $(X,Y)\in\mathcal X\times\mathcal Y$. If $\mathbb E[R(X)]=1$ for some $R:\mathcal X\rightarrow\mathbb R_{\ge 0}$, we write $D(X,R)=D(\nu,R)$ for the relative entropy, where $\nu$ is the distribution of $X$.

QUESTION: Suppose we have $\mathbb E[R(X,Y)]=1$ with $R\ge 0$. Wouldn't it make sense to define the conditional relative entropy as $$D_{\mathrm{c}}((X,Y),R)=\mathbb E\left[\mathbb E[R(X,Y)|X]\Lambda\left(\frac{R(X,Y)}{\mathbb E[R(X,Y)|X]}\right)\right],$$ using the convention $\frac{0}{0}=0$?
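For a finite joint pmf the proposed quantity is easy to write down. A sketch (the pmf `p` and the density `R` below are arbitrary test values satisfying $\mathbb E[R(X,Y)]=1$):

```python
import math

def cond_rel_entropy(p, R):
    """Proposed D_c((X,Y), R) for a finite joint pmf p[x][y] and a
    nonnegative R[x][y] with E[R(X,Y)] = 1, using the convention 0/0 = 0."""
    d = 0.0
    for x in range(len(p)):
        px = sum(p[x])
        if px == 0:
            continue
        cond = sum(p[x][y] * R[x][y] for y in range(len(p[x]))) / px  # E[R | X=x]
        for y in range(len(p[x])):
            # terms with p = 0 or R = 0 contribute nothing (0/0 = 0, Lambda(0) = 0)
            if p[x][y] > 0 and R[x][y] > 0:
                d += p[x][y] * cond * (R[x][y] / cond) * math.log(R[x][y] / cond)
    return d

p = [[0.25, 0.25], [0.25, 0.25]]   # uniform joint pmf of (X, Y)
R = [[1.6, 0.4], [0.8, 1.2]]       # E[R(X,Y)] = 1
print(cond_rel_entropy(p, R))      # ≈ 0.1064
```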

MOTIVATION: Here's why I think that this makes sense.

  1. This definition is general, and it is consistent with the unconditional case $D((X,Y),R)=\mathbb E[\Lambda(R(X,Y))]$: if $X$ is trivial, then $\mathbb E[R(X,Y)|X]=1$ and the two coincide.
  2. It's non-negative: by the tower property, $$D_{\mathrm{c}}((X,Y),R)=\mathbb E\left[\mathbb E[R(X,Y)|X]\,\mathbb E\left[\Lambda\left(\frac{R(X,Y)}{\mathbb E[R(X,Y)|X]}\right)\middle|X\right]\right],$$ and the inner conditional expectation is at least $\Lambda(1)=0$ by Jensen's inequality, since $\Lambda$ is convex and $\mathbb E\left[\frac{R(X,Y)}{\mathbb E[R(X,Y)|X]}\middle|X\right]=1$.
  3. By the factorization lemma there is $R_X:\mathcal X\rightarrow\mathbb R_{\ge 0}$ with $\mathbb E[R(X,Y)|X]=R_X\circ X$ and $\mathbb E[R_X(X)]=\mathbb E[\mathbb E[R(X,Y)|X]]=1$; this $R_X$ is the Radon–Nikodym derivative for the first coordinate. It gives $$D_{\mathrm{c}}((X,Y),R)=\mathbb E\left[R(X,Y)\ln\left(\frac{R(X,Y)}{R_X(X)}\right)\right] =D((X,Y),R)-D(X,R_X),$$ using the tower property for the second equality, meaning that the chain rule holds.
  4. Since the chain rule holds, this is always the correct definition. There is a closely related problem concerning conditional Radon–Nikodym derivatives: if the conditional derivative exists, it equals $R(x,y)/R_X(x)$ (as for regular conditional distributions). So, if we work on Borel spaces (or equivalently countably generated ones), these definitions coincide with the kernel-based definitions.
  5. It directly generalizes to $f$-divergences. Since these are usually treated in a very general manner in the first place, I think it would be a good fit; however, it doesn't seem to be in use (according to Google). Also, since this definition does not rely on kernels, it does not rely on that theory and thus avoids questions like this one.
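Points 1–3 can be checked numerically on a finite space: with $R_X(x)=\mathbb E[R(X,Y)|X=x]$, the proposed $D_{\mathrm c}$ should be non-negative and equal $D((X,Y),R)-D(X,R_X)$. A throwaway sketch (the pmf and $R$ are arbitrary test values):

```python
import math

# joint pmf p[x][y] of (X, Y) and a nonnegative R with E[R(X,Y)] = 1
p = [[0.1, 0.3], [0.4, 0.2]]
R = [[2.0, 0.5], [1.0, 1.25]]

Lam = lambda t: t * math.log(t) if t > 0 else 0.0   # Lambda(x) = x ln(x), Lambda(0) = 0

px  = [sum(row) for row in p]                                               # marginal of X
R_X = [sum(p[x][y] * R[x][y] for y in range(2)) / px[x] for x in range(2)]  # E[R | X=x]

D_joint = sum(p[x][y] * Lam(R[x][y]) for x in range(2) for y in range(2))   # D((X,Y), R)
D_X     = sum(px[x] * Lam(R_X[x]) for x in range(2))                        # D(X, R_X)
D_c     = sum(p[x][y] * R_X[x] * Lam(R[x][y] / R_X[x])
              for x in range(2) for y in range(2))                          # proposed D_c

print(D_c >= 0, abs(D_c - (D_joint - D_X)) < 1e-12)   # True True
```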

Since the actual question is, well, not of a mathematical nature, let me add this one: can we establish the other basic properties of conditional $f$-divergences? In particular, I was not able to show convexity without additional assumptions. That is, writing $D_{\mathrm{c}}(\mu,\nu)=D_{\mathrm{c}}((X,Y),R)$, where $\mu$ is the distribution of $(X,Y)$ and $\nu$ is the measure given by the Radon–Nikodym derivative $R$ with respect to $\mu$, can we show that $D_{\mathrm{c}}(\mu,\nu)$ is jointly convex in its arguments?
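Not an answer, but the convexity question can at least be probed numerically: on a finite product space with everywhere-positive reference pmf $p$ and $\mathrm dq=R\,\mathrm dp$, the definition reduces to $D_{\mathrm c}=\sum_{x,y} q(x,y)\ln\frac{q(y|x)}{p(y|x)}$, and one can search random pairs for midpoint-convexity violations. A rough experiment, not a proof (grid size, sample count, and seed are ad hoc):

```python
import math, random

random.seed(1)
NX, NY = 2, 3                     # small product space, arbitrary choice

def rand_pmf():
    """Random strictly positive joint pmf on the NX x NY grid."""
    w = [[random.random() + 0.01 for _ in range(NY)] for _ in range(NX)]
    s = sum(map(sum, w))
    return [[v / s for v in row] for row in w]

def D_c(q, p):
    """D_c with dq = R dp on the grid: sum q(x,y) ln(q(y|x)/p(y|x)), 0 ln 0 = 0."""
    d = 0.0
    for x in range(NX):
        qx, px = sum(q[x]), sum(p[x])
        for y in range(NY):
            if q[x][y] > 0:
                d += q[x][y] * math.log((q[x][y] / qx) / (p[x][y] / px))
    return d

def mix(a, b):
    """Equal-weight mixture of two joint pmfs."""
    return [[(u + v) / 2 for u, v in zip(r1, r2)] for r1, r2 in zip(a, b)]

violations = 0
for _ in range(500):
    q1, p1, q2, p2 = rand_pmf(), rand_pmf(), rand_pmf(), rand_pmf()
    lhs = D_c(mix(q1, q2), mix(p1, p2))
    rhs = (D_c(q1, p1) + D_c(q2, p2)) / 2
    if lhs > rhs + 1e-12:
        violations += 1
print("midpoint-convexity violations found:", violations)
```

This only searches for counterexamples; finding none proves nothing, but a single violation would settle the question negatively.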

Matija

0 Answers