
If I generate $K$ uniform random integers from 1 to $K$ and count how many unique numbers I get, $n_\mathrm{unique}$, I empirically obtain:

  • the mean is $\frac{2K}{\pi}$
  • the variance is $\frac{K}{\pi^{2}}$.

Matching these with the mean and variance formulas of a binomial distribution, I get:

$$p = 1 - \frac{1}{2\pi}$$

$$n=\left\lfloor \frac{2K}{\pi-\frac{1}{2}}\right\rfloor$$
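Spelled out, the matching uses the binomial mean $np$ and variance $np(1-p)$ together with the empirical values above:

$$\begin{aligned} np &= \frac{2K}{\pi}, \qquad np(1-p) = \frac{K}{\pi^{2}} \\ 1-p &= \frac{K/\pi^{2}}{2K/\pi} = \frac{1}{2\pi} \\ n &= \frac{2K/\pi}{1-\frac{1}{2\pi}} = \frac{2K}{\pi-\frac{1}{2}} \end{aligned}$$

with $n$ then rounded down to an integer.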

This works, but is not perfect.
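The empirical values can be reproduced with a short simulation. This is only a sketch: it assumes that the number of draws equals $K$, and the function name `simulate_unique_counts`, the seed, and the trial count are illustrative choices, not part of the original experiment.

```python
import random
import statistics
from math import pi


def simulate_unique_counts(K, trials, seed=0):
    """Repeat `trials` times: draw K integers uniformly from 1..K
    (assumption: as many draws as possible values) and count the
    distinct values seen. Returns the list of counts."""
    rng = random.Random(seed)
    counts = []
    for _ in range(trials):
        seen = {rng.randint(1, K) for _ in range(K)}
        counts.append(len(seen))
    return counts


if __name__ == "__main__":
    K = 200
    counts = simulate_unique_counts(K, trials=2000)
    # Compare the simulated moments with the empirical guesses from the question.
    print("simulated mean    :", statistics.mean(counts))
    print("2K/pi             :", 2 * K / pi)
    print("simulated variance:", statistics.variance(counts))
    print("K/pi^2            :", K / pi**2)
```

Printing the simulated moments next to $2K/\pi$ and $K/\pi^{2}$ shows how close the guesses are for a given $K$.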

Is there a way to derive the probability distribution of $n_\mathrm{unique}$?


Attempt

The first integer is always unique and added to the training set. The second has a probability $(K-1)/K$ of being added, and a probability $1/K$ of not being added.

Therefore probability is $$P_{2}(n)=\begin{cases}1/K & n=1\\(K-1)/K & n=2\end{cases}$$

Subsequently, in each iteration $j$ there are two cases. Either nothing is added, which happens with probability $n_{j}/K$, where $n_{j}$ is the size of the set so far, or something is added, which happens with probability $$1-n_{j}/K=(K-n_{j})/K.$$

Let's then consider the probability of obtaining k unique numbers: $$P(X=k)=\frac{K}{K}\times\frac{K-1}{K}\times\frac{K-2}{K}\times...\times\frac{K-k+1}{K}$$

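This is wrong or at least incomplete: the product is only the probability that the first $k$ draws are all distinct, and it does not account for draws that repeat an earlier value.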
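One way to complete the step-by-step idea is to track the whole distribution of the set size after each draw, rather than a single product. The sketch below assumes that a draw repeats an earlier value with probability $n/K$ when the set holds $n$ values, and adds a new value with probability $(K-n)/K$. The function name `unique_count_pmf` is hypothetical, and exact fractions are used only to avoid rounding.

```python
from fractions import Fraction


def unique_count_pmf(K, draws):
    """Exact P(n_unique = n) for n = 0..K after `draws` uniform draws from 1..K.

    pmf[n] is the probability that exactly n distinct values have been seen.
    """
    pmf = [Fraction(0)] * (K + 1)
    pmf[0] = Fraction(1)  # before any draw the set is empty
    for _ in range(draws):
        new = [Fraction(0)] * (K + 1)
        for n, p in enumerate(pmf):
            if p == 0:
                continue
            # The draw repeats one of the n values already in the set.
            new[n] += p * Fraction(n, K)
            # The draw is one of the K - n values not seen yet.
            if n < K:
                new[n + 1] += p * Fraction(K - n, K)
        pmf = new
    return pmf


if __name__ == "__main__":
    K = 10
    for n, p in enumerate(unique_count_pmf(K, K)):
        print(n, float(p))
```

For example, `unique_count_pmf(2, 2)` gives probability 1/2 each for one and two unique values, which matches $P_{2}(n)$ above with $K=2$.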


Related problems

This is related to the generalised birthday problem and the number of expected collisions, previously discussed in these posts, which however obtain different formulas, possibly because they focus on the expected number of people with shared birthdays, not the number of shared birthdays:

The coupon collector's problem seems related, but there the number of trials is unbounded. I am essentially looking for the coupon collector's progress after n trials with n coupon types.

j13r
  • If you have $k$ people each independently equally likely to have any of $d$ days as their birthday (your $K$), then the probability of seeing a total of $b$ days as somebody's birthday (your $n_{\text{unique}}$) is $\frac{d!\,S_2(k,b)}{(d-b)!\,d^k}$ where $S_2(k,b)$ is a Stirling number of the second kind. When these numbers start getting large, then a binomial distribution with parameters $d$ and $1-e^{-k/d}$ provides a reasonable though not perfect approximation. – Henry May 30 '23 at 13:31
  • @Henry: where do you have this formula from? – j13r May 30 '23 at 19:06
  • @Henry: Not sure if this is imprecise wording, but I am looking for the total number of dates in collisions, not the number of days colliding with somebody's birthday. – j13r May 30 '23 at 19:06
  • That expression (which appears several times on this site, such as this) was for the probability of seeing $b$ days which are at least one person's birthday; the expectation is $d\left(1- \left(1-\frac1d\right)^k\right)$. If you are looking for days which are at least two people's birthday, then the distribution gets more complicated; the expected number is $d\left(1- \frac{k+d-1}{d-1} \left(1-\frac1d\right)^k\right)$. – Henry May 30 '23 at 21:27
  • With the query "Stirling number of the second kind probability" I found another question which is essentially identical to mine: https://math.stackexchange.com/questions/227556/the-exact-probability-of-observing-x-unique-elements-after-sampling-with-repla – j13r May 31 '23 at 07:04
  • ... which was based on https://math.stackexchange.com/a/32816/6460 – Henry May 31 '23 at 08:29
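The formula from Henry's first comment can be evaluated directly for small cases and set beside the suggested binomial approximation. This is a rough sketch: the helper names `stirling2`, `exact_pmf`, and `binomial_approx` are made up for illustration, and the Stirling numbers are computed with the standard recurrence $S_2(k,b)=b\,S_2(k-1,b)+S_2(k-1,b-1)$.

```python
from fractions import Fraction
from math import comb, exp, factorial


def stirling2(k, b):
    """Stirling number of the second kind S_2(k, b), via the recurrence
    S_2(i, j) = j * S_2(i - 1, j) + S_2(i - 1, j - 1)."""
    if b < 0 or b > k:
        return 0
    row = [1] + [0] * b  # row for i = 0: S_2(0, 0) = 1
    for i in range(1, k + 1):
        new = [0] * (b + 1)
        for j in range(1, min(i, b) + 1):
            new[j] = j * row[j] + row[j - 1]
        row = new
    return row[b]


def exact_pmf(d, k, b):
    """P(exactly b distinct days among k people with d possible days):
    d! * S_2(k, b) / ((d - b)! * d**k)."""
    if b < 0 or b > min(d, k):
        return Fraction(0)
    return Fraction(factorial(d) * stirling2(k, b), factorial(d - b) * d**k)


def binomial_approx(d, k, b):
    """Binomial(d, 1 - exp(-k/d)) probability of b, the approximation
    mentioned in the comment."""
    p = 1 - exp(-k / d)
    return comb(d, b) * p**b * (1 - p) ** (d - b)


if __name__ == "__main__":
    d = k = 20  # the question's setting corresponds to k = d = K
    for b in range(1, d + 1):
        print(b, float(exact_pmf(d, k, b)), binomial_approx(d, k, b))
```

Summing `b * exact_pmf(d, k, b)` over `b` gives the expectation $d\left(1-\left(1-\frac1d\right)^k\right)$ quoted in the later comment, which is a quick sanity check on the implementation.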
