
Suppose $\mathbf{X} = [X_1, \cdots, X_n]^T$ is a binary random vector of dimension $n$ ($\mathbf{X} \in \{ 0,1\}^n$), and in every realization of $\mathbf{X}$ there are exactly $k$ ones (and $n-k$ zeros).

Furthermore, suppose that each $X_i$, which is a Bernoulli random variable, has a different probability of success, say $\mathbb{P}[X_i = 1] = p_i$.

My question is: how do I calculate the entropy of $\mathbf{X}$? Notice that if $p_i$ were equal for all $i = 1, \cdots, n$, there would be $n \choose k$ different combinations, each with equal probability, so we would have $H(\mathbf{X}) = \log_2 ({n \choose k})$ bits. However, since the $p_i$ can differ (and in my actual application are highly non-uniform), these combinations have very different probabilities, so the actual entropy is far less than $\log_2 ({n \choose k})$.
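For a quick numeric sense of that upper bound, here is a minimal sketch in Python; the values of $n$ and $k$ are just illustrative placeholders, not from the application:

```python
# Upper bound on H(X) when all p_i are equal: log2 of the number of k-subsets.
from math import comb, log2

n, k = 20, 5          # illustrative placeholder values
upper_bound_bits = log2(comb(n, k))
print(f"log2(C({n},{k})) = {upper_bound_bits:.3f} bits")
```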

Notice that this is different from the binomial distribution, since all realizations have exactly $k$ non-zeros.

What is actually the distribution of $\mathbf{X}$ then?

Sohrab
  • what do you mean by "suppose that each $X_i$, which is a Bernoulli random variable, is independent". Independent from what? If exactly $k$ of the $X_i$ are ones, then they are not independent! – Cettt Mar 07 '18 at 13:53
  • @Cettt That's a very good point actually! I should edit the post. Do you have any idea what is the maximal degree to which they can be independent? – Sohrab Mar 07 '18 at 13:59
  • 1
    @Sohrab What's a "degree" of independency? Either they are independent… or not. – Gono Mar 07 '18 at 15:18
  • @Gono Like having observed $k$ of them to be all one, the rest $(n-k)$ of them would be surely zero. So something like the notion of rank of a matrix. Suppose I gather all the realizations in one matrix, I want the rank of this matrix to be as high as possible. – Sohrab Mar 07 '18 at 15:24

2 Answers


Ok… so let's try to formulate this a bit more accurately:

Let's consider $n$ independent Bernoulli r.v. $$(X_i)_{1\le i \le n}$$ with $$P[X_i = 1] = p_i \in [0,1]$$

and $$\mathbf{X} = [X_1, \ldots, X_n]^T$$.

Now we have the condition $$\sum_{j=1}^n X_j = k \in \{0,\ldots,n\}$$ and we want to know the distribution of $\mathbf{X}$ under that condition, hence under $\tilde{P}$ where $$\tilde{P}(A) := P\left(A \,\Bigg| \sum_{j=1}^n X_j = k\right)$$

We know that $\mathbf{X}$ takes values in $\{0,1\}^n$, but under $\tilde{P}$ only those $$x = [x_1, \ldots, x_n]^T \in \{0,1\}^n$$ matter which fulfill $\sum\limits_{j=1}^n x_j = k$, and for such $x$ we have: $$\begin{align} \tilde{P}(\mathbf{X} = x) &= \frac{P\left(\mathbf{X} = x, \sum\limits_{j=1}^n X_j = k\right)}{P\left(\sum\limits_{j=1}^n X_j = k\right)} \\ &= \frac{P(\mathbf{X} = x)}{P\left(\sum\limits_{j=1}^n X_j = k\right)}\end{align}$$

Because the $X_i$ were assumed independent under $P$, we have on the one hand that $$P(\mathbf{X} = x) = \prod_{i=1}^n P(X_i = x_i)$$ and on the other hand that $$\sum_{j=1}^n X_j$$ is Poisson binomial distributed, hence $$P\left(\sum\limits_{j=1}^n X_j = k\right) = \begin{cases} \prod\limits_{i=1}^n (1-p_i) & k=0 \\ \frac{1}{k} \sum\limits_{i=1}^k (-1)^{i-1}P\left(\sum\limits_{j=1}^n X_j=k-i\right)\sum\limits_{j=1}^n \left( \frac{p_j}{1-p_j} \right)^i & k>0 \end{cases}$$
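A sketch of how that recursion can be evaluated numerically, assuming every $p_i < 1$ so the ratios $p_i/(1-p_i)$ are finite (the function name is mine):

```python
# The recursion quoted above for the Poisson binomial distribution.
from math import prod

def poisson_binomial_point(p, k):
    """P(sum_j X_j = k) for independent Bernoulli(p_j), via the recursion."""
    t = lambda i: sum((pj / (1.0 - pj)) ** i for pj in p)   # T(i) in the formula
    pmf = [prod(1.0 - pj for pj in p)]                      # P(S = 0)
    for m in range(1, k + 1):
        pmf.append(sum((-1) ** (i - 1) * pmf[m - i] * t(i)
                       for i in range(1, m + 1)) / m)
    return pmf[k]
```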

Now you have everything necessary to calculate the entropy of $\mathbf{X}$…
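For small $n$ this can be done directly by enumerating the ${n \choose k}$ support points; a minimal sketch along those lines (the probabilities in the example are placeholders):

```python
# Exact H(X): enumerate the C(n, k) patterns with exactly k ones, weight each
# by prod p_i^{x_i} (1 - p_i)^{1 - x_i}, normalize, take the Shannon entropy.
# Feasible only for small n.
from itertools import combinations
from math import log2, prod

def conditional_entropy_bits(p, k):
    n = len(p)
    weights = []
    for ones in map(set, combinations(range(n), k)):
        weights.append(prod(p[i] if i in ones else 1.0 - p[i] for i in range(n)))
    z = sum(weights)                        # equals P(sum_j X_j = k)
    return -sum(w / z * log2(w / z) for w in weights if w > 0)

p = [0.9, 0.5, 0.3, 0.2, 0.1]               # placeholder probabilities
print(conditional_entropy_bits(p, k=2))
```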

Gono
  • Thanks for the answer. As far as I can see, the situation is very hopeless for typical values of my application like $n = 1000$ and $k = 20$… Any approximation? Something like finding an equivalent $n'$ for which the ${n' \choose k}$ events are almost uniformly distributed, given that most of the $p_i$'s are close to zero? – Sohrab Mar 07 '18 at 17:40

Given the comment on Gono's (apt) answer, I add that for large $k,n$ one would expect the approximation of assuming independent variables to be reasonable. This, of course, assumes that $\sum_{i=1}^n p_i = k$ (a necessary condition, given that $E (\sum X_i) = k = \sum E(X_i)=\sum p_i $).

A heuristic justification is that, for large $n$, the sum of the independent (unconditioned) variables will be near its expected value, hence the conditioning becomes rather irrelevant (an argument similar to the one used in "Poissonization" approximations).

In this approximation, then

$$H(\mathbf{X}) \approx \sum_{i=1}^n h(p_i) \tag{1}$$

where $h()$ is the binary entropy function.

Further, as a quick sanity check, notice that for the case of constant $p_i=\frac{k}{n}$, the exact entropy you got was $H=\log_2 ({n \choose k})$, which for large $n,k$ can be approximated by $ n \, h(\frac{k}{n}) $. This coincides with the above approximation, which assumes (approximate) independence.
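A quick numeric version of that sanity check (the $n,k$ pairs are illustrative):

```python
# Sanity check: for constant p_i = k/n, log2(C(n, k)) should approach n*h(k/n).
from math import comb, log2

def h(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

for n, k in [(10, 2), (100, 20), (1000, 200)]:
    print(n, k, round(log2(comb(n, k)), 2), round(n * h(k / n), 2))
```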


Edit: let's make this a little more rigorous.

Let $\mathbf{Y}$ be the vector of independent Bernoulli variables, with $\sum_{i=1}^n p_i = k$, and let $S=\sum_{i=1}^n Y_i$.

From the chain rule $H(\mathbf{Y},S)= H(\mathbf{Y})+ H(S \mid \mathbf{Y}) = H(S)+H(\mathbf{Y} \mid S) $

But $H(S \mid \mathbf{Y}) =0$, because $S$ is a deterministic function of $\mathbf{Y}$. Hence

$$ H(\mathbf{Y} \mid S)=\sum_s H( \mathbf{Y} \mid S=s) P(S=s)=H(\mathbf{Y}) - H(S) \tag{2}$$

For large $n$, $P(S)$ will be Gaussian-like, with mean $\mu_S=k$. Assuming $H( \mathbf{Y} \mid S=s)$ is well behaved, smooth and not too asymmetric around $s=\mu_S$, we should be able to approximate the weighted sum by its value at the mean. And this is precisely the entropy of $\mathbf{X}$:

$$\sum_s H( \mathbf{Y} \mid S=s) P(S=s) \approx H( \mathbf{Y} \mid S=k)=H(\mathbf{X}) \tag{3}$$

Also, we know $H( \mathbf{Y}) =\sum_{i=1}^n h(p_i) $, so

$$ H(\mathbf{X}) \approx \sum_{i=1}^n h(p_i) - H(S) \tag{4}$$

It remains to compute $H(S)$: there is no simple closed formula, but it can be bounded above (probably tightly) by the entropy of the Binomial distribution with the same mean. Hence we can refine the approximation (plugging in yet another approximation):

$$ H(\mathbf{X}) \approx \sum_{i=1}^n h(p_i) - \frac12 \log_2(2 \pi e k (1 - \frac{k}{n})) \tag{5}$$
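A sketch of how $(4)$ and $(5)$ compare numerically, with $H(S)$ computed exactly from the Poisson-binomial pmf by convolution; the $p_i$ below are placeholders chosen so that $\sum_i p_i = k$, as assumed above:

```python
# Compare approximations (4) and (5): exact H(S) from the Poisson-binomial pmf
# versus the Binomial/Gaussian-style term used in (5).
from math import e, log2, pi

def h(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def poisson_binomial_pmf(p):
    """Full pmf of S = sum of independent Bernoulli(p_i), by convolution."""
    pmf = [1.0]
    for q in p:
        new = [0.0] * (len(pmf) + 1)
        for j, v in enumerate(pmf):
            new[j] += v * (1.0 - q)
            new[j + 1] += v * q
        pmf = new
    return pmf

n, k = 200, 20
p = [0.02] * (n // 2) + [0.18] * (n // 2)        # placeholder p_i, sums to k = 20
H_Y = sum(h(q) for q in p)
H_S_exact = -sum(q * log2(q) for q in poisson_binomial_pmf(p) if q > 0)
H_S_binom = 0.5 * log2(2 * pi * e * k * (1 - k / n))
print("eq (4):", H_Y - H_S_exact)
print("eq (5):", H_Y - H_S_binom)
```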

It looks reasonable that $H(\mathbf{X}) < H(\mathbf{Y}) $, because we are adding a restriction.

Also, again, this result can be sanity-checked against the case of constant $p_i$. The term $H(S)$ gives a (fair) second order correction to $(1)$.

leonbloy
  • This is a great answer! I assume $k$ does not have to be large, it suffices only that $n$ is large. – Sohrab Mar 09 '18 at 10:42
  • @Sohrab The approximation might work for small $k$, but perhaps not this justification. The argument here (eq $(3)$ esp) looks more convincing for both $n,k$ large (say, both growing with $k/n$ approx. constant). – leonbloy Mar 09 '18 at 12:57