
One thing mentioned in the comments of "What's the rationale behind the definitions of min- and max-entropies?" is the fact that the max-entropy quantifies the number of bits needed to compress a given source with zero error, in the single-shot regime. The max-entropy of a random variable $X$ with distribution $P$ (assumed finite and discrete) is here defined as $$H_{\rm max}(X)_P \equiv \log|\{x : \,\, P(x)>0\}| = \log\lvert\operatorname{supp}(P)\rvert.\tag1$$
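To make (1) concrete, here is a minimal Python sketch (my own illustration, not taken from the papers cited below; the function name `max_entropy` and the tolerance `tol` are arbitrary choices) that counts the support of a finite distribution given as a list of probabilities:

```python
import math

def max_entropy(p, tol=1e-12):
    # H_max(X)_P = log |supp(P)|, as in eq. (1); logs taken base 2 (bits).
    support_size = sum(1 for px in p if px > tol)
    return math.log2(support_size)

# Example: P = (1/2, 1/4, 1/4) has support of size 3,
# so H_max = log2(3) ≈ 1.585 bits.
print(max_entropy([0.5, 0.25, 0.25]))
```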

On the one hand, it seems relatively straightforward that this is the case: to "compress the source" with zero error means to find a bijective map sending the possible outcomes of the source, that is, the elements of $\operatorname{supp}(P)$, into some possibly smaller alphabet. But clearly, there cannot be any bijective mapping whose image contains fewer elements than its domain.
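For a concrete instance: if $|\operatorname{supp}(P)| = 5$, any zero-error encoding needs an output alphabet with at least $5$ symbols, so encoding into fixed-length bit strings takes $\lceil\log 5\rceil = 3$ bits, matching $H_{\rm max}(X)_P = \log 5$ up to integer rounding.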

Similarly, if we consider the corresponding smoothed max-entropy, $$H_{\rm max}^\varepsilon(X)_P \equiv \min_{Q: \, \|Q-P\|_1\le \varepsilon} H_{\rm max}(X)_{Q},\tag2$$ what we are doing is considering the size of the support of $P$ when we are allowed to "neglect" probabilities up to some amount $\varepsilon$. So we just remove the lowest-probability events, being careful not to commit an error larger than $\varepsilon$.
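As a sanity check on this intuition, here is a small (and deliberately naive) Python sketch of my own of the "drop the lowest-probability events" procedure just described; it implements the heuristic rather than the exact minimization in (2), and it glosses over the normalization issue raised in the comments below:

```python
import math

def smoothed_max_entropy_drop(p, eps):
    # Drop the lowest-probability outcomes as long as the removed mass
    # stays <= eps, then take log2 of the remaining support size.
    probs = sorted(px for px in p if px > 0)  # ascending
    removed = 0.0
    kept = len(probs)
    for px in probs:
        if removed + px <= eps:
            removed += px
            kept -= 1
        else:
            break
    return math.log2(max(kept, 1))

# P = (0.5, 0.25, 0.13, 0.06, 0.04, 0.02), eps = 0.1: we can drop 0.02
# and 0.04 (total 0.06) but not also 0.06, leaving 4 outcomes -> 2 bits.
print(smoothed_max_entropy_drop([0.5, 0.25, 0.13, 0.06, 0.04, 0.02], 0.1))
```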

At the same time, in (König et al. 2009: The operational meaning of min- and max-entropy) the authors state in Equation (11) that the "minimum length $\ell_{\rm compr}^\varepsilon(X)$ of an encoding from which the value of $X$ can be recovered with probability at least $1-\varepsilon$" can be written as $$\ell_{\rm compr}^\varepsilon(X)= H_{\rm max}^{\varepsilon'}(X) + O(\log(1/\varepsilon)),\tag3$$ for some $\varepsilon'\in[\varepsilon/2,2\varepsilon]$. As the authors remark, this is equivalent to stating that $$\ell_{\rm compr}^\varepsilon(X) \in [H_{\rm max}^{2\varepsilon}(X),H_{\rm max}^{\varepsilon/2}(X)].\tag4$$ In stating this, the authors cite (Renner and Wolf 2004: Smooth Rényi entropy and applications). In that paper, however, I only see the result about source compression mentioned in passing towards the end of the second page.
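To make explicit what a statement like (3) is quantifying, here is a toy Python sketch of the naive distribution-aware scheme I have in mind (my own construction, assuming the encoder knows $P$, which is exactly the point debated in the comments below): keep the $2^m$ most likely outcomes, encode them with $m$ bits, and declare an error otherwise. The smallest $m$ with failure probability at most $\varepsilon$ is the length achieved by this particular scheme, not necessarily the optimal $\ell_{\rm compr}^\varepsilon(X)$:

```python
import math

def naive_compression_length(p, eps):
    # Keep the 2**m most likely outcomes; the value of X is recovered
    # exactly when it lands in that set, so the failure probability is
    # the total mass of the discarded outcomes. Return the smallest m
    # with failure probability <= eps.
    probs = sorted(p, reverse=True)
    for m in range(0, math.ceil(math.log2(len(probs))) + 1):
        kept_mass = sum(probs[: 2 ** m])
        if 1 - kept_mass <= eps:
            return m
    return math.ceil(math.log2(len(probs)))

# Example: P = (0.5, 0.25, 0.13, 0.06, 0.04, 0.02), eps = 0.1:
# the 4 most likely outcomes carry mass 0.94 >= 0.9, so m = 2 bits.
print(naive_compression_length([0.5, 0.25, 0.13, 0.06, 0.04, 0.02], 0.1))
```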

My question is then: why does my naive argument, which leads me to believe that $H_{\rm max}^\varepsilon(X)$ tells us directly how much a source can be compressed within total error $\varepsilon$, fail? The analogous reasoning for the zero-error case, which amounts to $\varepsilon=0$, does seem consistent with (4). However, even though obviously $H_{\rm max}^\varepsilon(X)\in [H_{\rm max}^{2\varepsilon}(X),H_{\rm max}^{\varepsilon/2}(X)]$, and thus my conclusion is technically in agreement with (4), the fact that the result is stated as in (3) and (4) makes me think that my argument is not correct.

glS
  • In equation (2) you are still optimizing over normalized $Q$. So if you remove $\epsilon$ probability from somewhere then you gain $\epsilon$ probability elsewhere. Hence your distance is actually $\|Q-P\|_1 = 2 \epsilon$ (see the numeric check after these comments). – Rammus Sep 06 '22 at 14:24
  • @Rammus ah, good point! It seems people define smooth entropies with $\varepsilon$ rather than $2\varepsilon$ though. So we'd be saying that the amount of compression achievable with total error $\le\varepsilon$ should be given by $\ell_{\rm compr}^\varepsilon(X)=H_{\rm max}^{2\varepsilon}(X)$ then, right? That would explain one side of (3,4). – glS Sep 06 '22 at 14:49
  • Looking at the 2004 paper you cite, perhaps the interval comes from the fact that they assume the encoding is independent of the distribution of $X$. The encoding you're assuming here depends on the distribution. – Rammus Sep 06 '22 at 15:50
  • @Rammus mmh. What would such a distribution-independent encoding look like, though? You mean like a fixed function which maps each symbol into some other symbol in a specific way, using only information about that specific symbol? Because if we allow for arbitrary (bijective) functions, I can just define one that neglects the lowest-probability events up to $\varepsilon$ and acts as the identity on the others. The decoder is then also trivial. Isn't this already a distribution-independent encoding? – glS Sep 06 '22 at 16:27
  • @Rammus upon some reflection, I suppose you meant a coding scheme that only operates on the basis of the individual observed samples, without assuming prior knowledge of the underlying probability distribution. E.g. if the probabilities are $(0.5,0.25,0.25)$, the function would only get one of "1", "2", or "3", and would have to somehow "compress" it. Knowing the distribution, this function would essentially just neglect lower-probability events up to a threshold. But without knowing it, I'm not sure how one would even start building such a compression scheme in the single-shot scenario. – glS Sep 07 '22 at 08:52
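For reference, a quick numeric check of the normalization point raised in the first comment above (again just a sketch of my own, with the example distribution chosen arbitrarily): remove mass $\varepsilon$ from the tail of $P$, renormalize to obtain $Q$, and compute $\|Q-P\|_1$; the distance comes out as $2\varepsilon$ rather than $\varepsilon$.

```python
# Numeric check: zero out the two smallest outcomes of P (removed mass
# eps = 0.06), renormalize to get Q, and compute the l1 distance
# ||Q - P||_1. It comes out as 2*eps = 0.12, not eps.
P = [0.5, 0.25, 0.13, 0.06, 0.04, 0.02]
eps = 0.04 + 0.02                      # mass removed from the tail
Q_unnormalized = [0.5, 0.25, 0.13, 0.06, 0.0, 0.0]
Z = sum(Q_unnormalized)                # = 1 - eps
Q = [q / Z for q in Q_unnormalized]    # renormalized distribution
l1 = sum(abs(q - p) for q, p in zip(Q, P))
print(l1)                              # ~0.12 = 2 * eps
```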

0 Answers