
I am aware that for a bootstrap sample (drawing from $n$ unique objects $n$ times with replacement), the expected number of unique objects in the sample is:

$$n(1-(1-1/n)^n)$$

However I am having trouble justifying it.

I agree that the probability of a specific object $m$ being present in a given sample is $$1-(1-1/n)^n$$
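
This follows from the complement: each of the $n$ independent draws misses object $m$ with probability $1-1/n$, so

$$P(m \notin S) = (1-1/n)^n \qquad\text{and hence}\qquad P(m \in S) = 1-(1-1/n)^n$$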

But I don't understand the justification for then saying the expected number of unique objects present is just $n$ times that quantity.

I tried justifying it as follows:

Let $S$ be a random variable representing a bootstrap sample.

Let $B_m = I(m \in S)$

According to the above logic, $B_m \sim \mathrm{Bernoulli}\left(1-(1-1/n)^n\right)$

claim: The random variable for the total number of unique samples is

$$ \sum_{m=1}^n B_m$$

So by linearity of expected value, the expected value of the number of unique samples is

$$n(1-(1-1/n)^n)$$

My problem with the claim is that nothing takes into account the fact that each $B_m$ is drawn from the same bootstrap sample. I feel there should be some kind of interaction or dependence that needs to be accounted for, but I'm not sure how to make that rigorous or concrete. Any help solidifying this would be much appreciated!

Edit: Actually I now know the claim to be false, as it would imply there is a non-zero probability of having $0$ unique values, which is obviously false! So I really just have no clue how the expected value might be justified.

I did find one answer which seems to justify it more rigorously. Is this line of reasoning necessary, or is there a simpler way to justify the expected value?

1 Answer


I think your justification is correct. Linearity of expectation doesn't care about dependencies. The variables can be arbitrarily dependent and the property still holds.
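
Written out, the only probabilistic fact needed is that the expectation of an indicator is the probability of its event; no independence is used at any step:

$$\mathbb{E}\left[\sum_{m=1}^n B_m\right] = \sum_{m=1}^n \mathbb{E}[B_m] = \sum_{m=1}^n P(m \in S) = n\left(1-(1-1/n)^n\right)$$

The first equality is linearity of expectation, which holds for any joint distribution of the $B_m$.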

The $B_m$ are dependent on each other, but marginally each one is still a $\mathrm{Bernoulli}\left(1-(1-1/n)^n\right)$ random variable, and linearity of expectation only uses those marginals.

The claim is not false. Because of the dependence between these Bernoulli variables, the event that all $B_m$ equal $0$ has probability zero: every bootstrap sample contains at least one object, so the sum is always at least $1$. The dependence shows up in the joint distribution (and hence in quantities like the variance), but not in the expectation.
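
As a sanity check, here is a minimal Monte Carlo sketch (the function names and parameters are illustrative, not from the post) that estimates the expected number of unique values in a bootstrap sample, compares it with $n(1-(1-1/n)^n)$, and confirms that the count is never $0$:

```python
import random

def unique_count(n):
    """Draw one bootstrap sample of size n from {0, ..., n-1}
    and return the number of distinct values it contains."""
    return len({random.randrange(n) for _ in range(n)})

def simulate(n=100, trials=20_000):
    counts = [unique_count(n) for _ in range(trials)]
    estimate = sum(counts) / trials
    theory = n * (1 - (1 - 1 / n) ** n)
    print(f"simulated mean uniques: {estimate:.3f}")
    print(f"theoretical value:      {theory:.3f}")
    # The sum of the B_m can never be 0: every bootstrap sample
    # contains at least one object, so the minimum count is >= 1.
    print(f"minimum observed count: {min(counts)}")

if __name__ == "__main__":
    simulate()
```

For $n = 100$ the simulated mean should land close to $100(1-0.99^{100}) \approx 63.4$, and the minimum observed count will be well above $0$, matching the argument above.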
