Approximate the size of a set given random items from the set.

Question

I'd like to know how it would be possible to approximate the size of a set that has no duplicate elements.

We can make a limited number of requests. Each request gives us a random element from the set (without removing it).

How do we approximate the size of the set given the number of duplicate elements obtained from the set after $x$ requests?

I tried reasoning along the lines of the linked question. I tried using $\sum_{i=1}^{x-1}(1-\frac{N-i}{N})$ (hoping to derive a formula for $N$), where $N$ is the number of elements in the set. This would only work for finding the probability that ANY duplicate exists. I tried a sum-of-sums solution, but I realized it might not work. I thought about calculating a different equation for every possible value of $d$ (the number of duplicates), but I'm not sure how unbiased that solution would be. I'm hoping for a solution from somebody who knows how to handle probability well. — Mateon1, Feb 23 '15 at 00:41
I'd also like an explanation for all the downvotes. If my question is unclear or doesn't fit the standards here, please point it out. I believe it's generally frowned upon on Stack Exchange to downvote without leaving an explanation. — Mateon1, Feb 23 '15 at 00:45
Chances are the downvotes are from people who are unhappy that you haven't shown any evidence of having put any more work into your question than that required to copy'n'paste it from its source. Oh, and your question is related to The German Tank Problem, http://en.wikipedia.org/wiki/German_tank_problem — Gerry Myerson, Feb 23 '15 at 06:33
See also http://math.stackexchange.com/questions/455931/german-tank-problem-simple-derivation and http://math.stackexchange.com/questions/65398/why-does-this-expected-value-simplify-as-shown and http://math.stackexchange.com/questions/455840/why-are-these-estimates-to-the-german-tank-problem-different and http://math.stackexchange.com/questions/75758/estimate-the-size-of-a-set-from-which-a-sample-has-been-equiprobably-drawn and several other questions that have appeared on this website. — Gerry Myerson, Feb 23 '15 at 06:38

score 0 · Answer 1 · answered Feb 23 '15 at 05:47

There are "capture/recapture" techniques that are generally used in ecological studies (to estimate a population of birds, or what have you), but those approaches are subtly different from what you are proposing (since your "recapture" set is not predefined).

From a maximum likelihood perspective, as long as you have at least one duplicate value, you would estimate the number of elements in the set, $N$, as the number of unique values in your sample ($\hat N$). To infer any more than $\hat N$ would make your sample less likely, and to infer less than $\hat N$ would make your sample impossible. However, this approach has a negative bias, since it will, on average, underestimate the actual $N$. Of course, it is asymptotically unbiased, so larger samples will yield less biased estimates.
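The negative bias of the unique-count estimate can be checked numerically. The following is a minimal sketch, not part of the original answer: the function names are hypothetical, and the closed form $E[U] = N\left(1-(1-1/N)^M\right)$ for the expected number of distinct values in $M$ draws with replacement is a standard result brought in here for comparison.

```python
import random


def mean_unique_count(n, m, trials=2000, seed=0):
    """Simulated average number of distinct values seen in m draws
    (with replacement) from a set of size n."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        # A set comprehension keeps only the distinct draws.
        total += len({rng.randrange(n) for _ in range(m)})
    return total / trials


def expected_unique_count(n, m):
    """Exact expectation of the number of distinct values:
    n * (1 - (1 - 1/n)**m)."""
    return n * (1 - (1 - 1 / n) ** m)


if __name__ == "__main__":
    n, m = 100, 50
    print("simulated mean of unique count:", mean_unique_count(n, m))
    print("exact expectation:             ", expected_unique_count(n, m))
    print("true set size:                 ", n)
```

Both numbers fall well below the true set size whenever $M$ is not much larger than $N$, which is the underestimation described above; the gap shrinks as $M$ grows.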

Another approach is the counting approach:

If you sample $M$ times, then there are $N^M$ possible samples. Now, if you only observe $U<M$ unique values, so that there are $D=M-U$ duplicate values, then there are $U!{M \brace U}$ ways to create a sample of size $M$ from these particular $U$ unique values with each appearing at least once, where ${M \brace U}$ is the Stirling number of the second kind (the number of ways to split the $M$ draws into $U$ non-empty groups). There are also ${N \choose U}$ ways to select the $U$ unique values from the original set of size $N$. Therefore, the probability that you observe $D$ duplicates when sampling $M$ times from a set of size $N$ is:

$${N\choose U}{M \brace U} \frac{U!}{N^M}$$

Your task is then to find the value of $N$ that maximizes this probability. This will be the maximum likelihood estimate of $N$, and will likely be biased for small samples as well.
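The maximization can be done by a direct search over integer values of $N$. The sketch below is one possible way to do it, under stated assumptions: only the factor ${N\choose U}/N^M$ depends on $N$, so the remaining factors are dropped; the function names and the search cap `n_max` are hypothetical choices, not part of the answer.

```python
import math


def log_likelihood(n, m, u):
    """log of C(n, u) / n**m, the only part of the probability
    that depends on n. Returns -inf when n < u (sample impossible)."""
    if n < u:
        return float("-inf")
    log_binom = (math.lgamma(n + 1)
                 - math.lgamma(n - u + 1)
                 - math.lgamma(u + 1))
    return log_binom - m * math.log(n)


def mle_set_size(m, u, n_max=100_000):
    """Integer n in [u, n_max] that maximizes log_likelihood(n, m, u).

    Requires 0 < u < m, i.e. at least one duplicate; with no
    duplicates the likelihood keeps increasing with n and there is
    no finite maximizer. n_max is an arbitrary cap on the search.
    """
    if not 0 < u < m:
        raise ValueError("need 0 < u < m (at least one duplicate)")
    best_n = u
    best_ll = log_likelihood(u, m, u)
    for n in range(u + 1, n_max + 1):
        ll = log_likelihood(n, m, u)
        if ll > best_ll:
            best_n, best_ll = n, ll
    return best_n


if __name__ == "__main__":
    # 10 requests, 9 distinct values seen (one duplicate).
    print(mle_set_size(m=10, u=9))
```

Working with logarithms via `math.lgamma` avoids overflow in the binomial coefficient and in $N^M$. If the returned value equals `n_max`, the cap was probably too small for the data and should be raised.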
