
I have $d$ items (say numbers 1 to $d$). I would like to uniformly randomly sample $k$ items out of $d$, without replacement. Suppose I do such draws independently $n$ times. I now want to take the union of the draws, and I'm interested in the size of this union $u \le kn$.

For example, if $d = 10$, $k = 2$, $n = 3$, I may draw [[1, 2], [4, 1], [5, 9]], and the union is [1, 2, 4, 5, 9] and $u = 5$.
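To get a feel for how $u$ behaves, the sampling procedure above can be simulated directly. The following is only a minimal Python sketch; the function name `sample_union_size`, the seed, and the number of repetitions are illustrative choices.

```python
import random

def sample_union_size(d, k, n, rng):
    """Size of the union of n independent draws of k distinct items from 1..d."""
    seen = set()
    for _ in range(n):
        # random.Random.sample draws k distinct items, i.e. without replacement
        seen.update(rng.sample(range(1, d + 1), k))
    return len(seen)

rng = random.Random(0)  # fixed seed only so that repeated runs agree
sizes = [sample_union_size(10, 2, 3, rng) for _ in range(10_000)]
# Empirical frequency of each observed union size
freq = {u: sizes.count(u) / len(sizes) for u in sorted(set(sizes))}
```

With $d = 10$, $k = 2$, $n = 3$ every simulated value must lie between $k = 2$ and $kn = 6$.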

Q: What distribution does $u$ follow, and what are its parameters? Or can we show that there's no closed-form distribution?

My intuitions:

  • If $kn \ll d$, then $u \approx kn$, since we don't expect much overlap between the draws, but it becomes tricky as $kn$ approaches a "significant" fraction of $d$, since overlap happens more often.
  • Also, since we draw $k$ items without replacement, larger $k$ should mean more frequent overlaps than larger $n$.
  • Perhaps $d - u$ follows a Binomial: if we treat each item's not being drawn (over all the draws) as an independent Bernoulli trial, then the number of items not drawn (i.e. $d - u$) should follow a Binomial.
kzliu

1 Answer


This can be done using inclusion–exclusion.

The probability that $\ell$ particular numbers haven’t been sampled after $n$ draws of $k$ items is

$$ \left(\frac{\binom{d-\ell}k}{\binom dk}\right)^n\;. $$

Thus, by the generalised inclusion–exclusion principle, the probability that exactly $t=d-u$ numbers haven’t been sampled is

$$ \sum_{\ell=t}^d(-1)^{\ell-t}\binom\ell t\binom d\ell\left(\frac{\binom{d-\ell}k}{\binom dk}\right)^n\;. $$
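The sum above can be evaluated exactly with rational arithmetic. Here is a minimal Python sketch of it, assuming $k\ge1$; the function names `prob_unsampled` and `union_size_pmf` are made up for illustration.

```python
from fractions import Fraction
from math import comb

def prob_unsampled(d, k, n, t):
    """P(exactly t of the d numbers are never sampled in n draws of k items)."""
    total = Fraction(0)
    for l in range(t, d + 1):
        # Probability that l particular numbers are missed by all n draws
        p_missing = Fraction(comb(d - l, k), comb(d, k)) ** n
        total += (-1) ** (l - t) * comb(l, t) * comb(d, l) * p_missing
    return total

def union_size_pmf(d, k, n):
    """Distribution of the union size u = d - t, as a dict {u: probability}."""
    return {u: prob_unsampled(d, k, n, d - u) for u in range(d + 1)}

pmf = union_size_pmf(10, 2, 3)
```

For the question's example, `prob_unsampled(10, 2, 3, 5)` should come out as `Fraction(112, 225)`, and the values of `pmf` sum to $1$.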

In your example, with $d=10$, $k=2$, $n=3$, $u=5$ and thus $t=10-5=5$, this probability is

\begin{eqnarray*} &&\binom{10}2^{-3}\sum_{\ell=5}^{10}(-1)^{\ell-5}\binom\ell5\binom{10}\ell\binom{10-\ell}2^3 \\ &=& \binom{10}2^{-3}\left(\binom55\binom{10}5\binom52^3-\binom65\binom{10}6\binom42^3+\binom75\binom{10}7\binom32^3-\binom85\binom{10}8\binom22^3\right) \\ &=& \frac{1\cdot252\cdot10^3-6\cdot210\cdot6^3+21\cdot120\cdot3^3-56\cdot45\cdot1^3}{45^3} \\ &=& \frac{112}{225} \\[4pt] &\approx& 50\%\;. \end{eqnarray*}

For this simple case we could also have calculated this the pedestrian way: to get $u=5$, exactly one number must appear twice. Choose one of $10$ numbers to double, choose one of $3$ draws in which it doesn’t appear, choose the remaining numbers in one of $\binom94=126$ ways, and distribute them over the three draws in one of $4\cdot3=12$ ways, for a probability of

$$ \frac{10\cdot3\cdot126\cdot12}{\binom{10}2^3}=\frac{112}{225}\;. $$
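For small parameters the count can also be checked by brute force. This is not the counting argument above but a plain enumeration of all $\binom dk^n$ equally likely sequences of draws; it is a sketch in Python, with the helper name `brute_force_prob` chosen only for illustration.

```python
from fractions import Fraction
from itertools import combinations, product

def brute_force_prob(d, k, n, u):
    """P(union size == u), by enumerating every sequence of n draws."""
    draws = list(combinations(range(1, d + 1), k))  # all k-subsets of 1..d
    hits = sum(
        1
        for seq in product(draws, repeat=n)  # ordered n-tuples of draws
        if len(set().union(*seq)) == u
    )
    return Fraction(hits, len(draws) ** n)

p = brute_force_prob(10, 2, 3, 5)
```

This is only feasible while $\binom dk^n$ stays small (here $45^3=91125$ cases); `p` should agree with the $\frac{112}{225}$ obtained above.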

Edit in response to comment:

As usual, due to the linearity of expectation, the expectation and variance are easier to determine than the entire distribution.

The probability that a given number hasn’t been sampled after $n$ draws of $k$ items is

$$ \left(1-\frac kd\right)^n\;, $$

so by linearity of expectation the expected number of unsampled numbers after $n$ draws is

$$ d\left(1-\frac kd\right)^n\;. $$

To obtain the variance, let $I_j$ be the indicator variable of the event that the number $j$ hasn’t been sampled. Then $X=\sum_jI_j$ is the number of unsampled numbers, and

\begin{eqnarray*} \mathsf E\left[X^2\right] &=& \mathsf E\left[\left(\sum_jI_j\right)^2\right] \\ &=& \mathsf E\left[\sum_jI_j^2+\sum_{i\ne j}I_iI_j\right] \\ &=& \mathsf E\left[\sum_jI_j+\sum_{i\ne j}I_iI_j\right] \\ &=& \mathsf E[X]+\sum_{i\ne j}\mathsf E\left[I_iI_j\right] \\ &=& d\left(1-\frac kd\right)^n+d(d-1)\left(\frac{\binom{d-2}k}{\binom dk}\right)^n \\ &=& d\left(1-\frac kd\right)^n+d(d-1)\left(\frac{(d-k)(d-k-1)}{d(d-1)}\right)^n \\ &=& d\left(1-\frac kd\right)^n+d(d-1)\left(1-\frac kd\right)^n\left(1-\frac{k}{d-1}\right)^n \end{eqnarray*}

(where the product $I_iI_j$ is the indicator variable of the event that neither $i$ nor $j$ has been drawn), so the variance is

\begin{eqnarray*} \mathsf{Var}\left[X\right] &=& \mathsf E\left[X^2\right]-\mathsf E[X]^2 \\ &=& d\left(1-\frac kd\right)^n+d(d-1)\left(1-\frac kd\right)^n\left(1-\frac{k}{d-1}\right)^n-\left(d\left(1-\frac kd\right)^n\right)^2 \\ &=& d\left(1-\frac kd\right)^n\left((d-1)\left(1-\frac{k}{d-1}\right)^n+1-d\left(1-\frac kd\right)^n\right)\;. \end{eqnarray*}

In your example, with $d=10$, $k=2$, $n=3$, these are

\begin{eqnarray*} \mathsf E[X] &=& 10\left(1-\frac2{10}\right)^3 \\ &=& \frac{128}{25} \\[4pt] &=& 5.12 \end{eqnarray*}

and

\begin{eqnarray*} \mathsf{Var}\left[X\right] &=& 10\left(1-\frac2{10}\right)^3\left((10-1)\left(1-\frac29\right)^3+1-10\left(1-\frac2{10}\right)^3\right) \\ &=& \frac{29696}{50625} \\[4pt] &\approx& 0.59\;. \end{eqnarray*}
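The two formulas are easy to evaluate exactly. The following Python sketch assumes $d\ge2$ (so that $d-1$ can be divided by); the function name `unsampled_mean_var` is an illustrative choice.

```python
from fractions import Fraction

def unsampled_mean_var(d, k, n):
    """Exact mean and variance of X, the number of unsampled numbers (d >= 2)."""
    # P(a given number is missed by all n draws)
    p1 = Fraction(d - k, d) ** n
    # P(two given numbers are both missed by all n draws)
    p2 = (Fraction(d - k, d) * Fraction(d - k - 1, d - 1)) ** n
    mean = d * p1
    second_moment = mean + d * (d - 1) * p2  # E[X^2]
    return mean, second_moment - mean ** 2

mean, var = unsampled_mean_var(10, 2, 3)
```

For the example this should reproduce $\frac{128}{25}$ and $\frac{29696}{50625}$. Since $u=d-X$, the union size has mean $d-\mathsf E[X]$ (here $4.88$) and the same variance as $X$, so its standard deviation is `float(var) ** 0.5`.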

joriki
  • Thank you! This looks like a hairy distribution. How would one go about characterizing the mean & std of the distribution? – kzliu Jan 24 '24 at 22:26
  • @kzliu: I've added the calculation of the mean and variance. – joriki Jan 24 '24 at 23:17