Probability that $\frac{n}{2}$ bins are empty [close]

Question

A Bloom filter of length $n$ was built. I have only the first $\frac{n}{2}$ bits of this filter. How will the false positive probability change?

For the whole Bloom filter, the false positive probability is $\leq [1-e^{\frac{-l\cdot m}{n}}]^l$, where $l$ is the number of hash functions and $m$ is the number of elements which were inserted into the Bloom filter.

I tried to evaluate this, and I think that the false positive probability will be the same. Can anybody confirm this, or explain how I can compute this probability?

For one thing, your exponent $l$ will be halved to $l/2$, because when checking for presence, you will be able to find the bits/positions corresponding to only (on average) $l/2$ of the hash functions. The others will lie outside the first half, and it's as if they didn't exist when checking. However, they did exist when setting the bits, so the false positive probability should be even higher than that. — ShreevatsaR, Dec 11 '13 at 02:34
I started thinking about this question, and realized that you need to define what you count as "positive". When you have the full Bloom filter, a positive for an item is defined as checking for all the $l$ positions (given by the $l$ hash functions, on this item), and finding that all of them are set. Here, you'll only be able to check $l' \le l$ of the positions; the rest may fall outside the first $n/2$ bits. Suppose all of them are set: will you assume that all the rest are set as well? Or not set? — ShreevatsaR, Dec 11 '13 at 17:48
@ShreevatsaR I have no information about the remaining $\frac{n}{2}$ bits, so I can't assume that these bits are set. — SugerBoy, Dec 11 '13 at 20:11
@Mandlbrot: If you don't want false negatives, you pretty much have to do so. — ShreevatsaR, Dec 12 '13 at 02:00
@Mandlbrot: BTW, have left an answer below. Feel free to ask if you have any questions. — ShreevatsaR, Dec 13 '13 at 04:38

score 2 · Accepted Answer · edited Apr 13 '17 at 12:19

The question is not well-posed, as it doesn't specify how a "positive" is determined. You are trying to say, based on just the first $n/2$ bits, whether a particular item has been added to the Bloom filter. When you check the bit positions given by the $l$ hash functions, some of them will lie in the first half, and (unless you have been improbably lucky and all of them were in the first half) some in the second. Of course, if even one of those bits in the first half is not set, then you can immediately declare a negative. The question is what you do when all of them are set.

There are several options:

1. Declare a negative. This would be foolish, as you'd almost always declare a negative. The Bloom filter's crucial property that you never have false negatives, only false positives, is destroyed.
2. Estimate the number of bits set in the second half, and calculate the probability that the other bits are set. This would be better accuracy-wise, but this too allows false negatives.
3. Declare a positive. If you want to maintain the crucial, almost defining, property of a Bloom filter, of not having false negatives, then this is your only choice.

With the assumption that you do (3), the analysis is as in the classical case.
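To make option (3) concrete, here is a minimal Python sketch of such a query. The hash family `positions` (built from SHA-256) and the helper names `build_filter` and `query_first_half` are assumptions made up for this illustration; the question does not say how the hash functions are implemented.

```python
import hashlib

def positions(item: str, num_hashes: int, n: int) -> list[int]:
    """Hypothetical hash family: num_hashes positions in range(n), derived from SHA-256."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big") % n
        for i in range(num_hashes)
    ]

def build_filter(items, num_hashes: int, n: int) -> list[bool]:
    """Build the full n-bit Bloom filter."""
    bits = [False] * n
    for item in items:
        for p in positions(item, num_hashes, n):
            bits[p] = True
    return bits

def query_first_half(first_half, item: str, num_hashes: int, n: int) -> bool:
    """Option (3): positions beyond the visible prefix cannot be checked and count as set."""
    visible = len(first_half)
    for p in positions(item, num_hashes, n):
        if p < visible and not first_half[p]:
            return False  # a visible bit is unset: definitely not inserted
    return True  # every visible bit is set: declare a positive

n, l = 64, 4
items = [f"item{i}" for i in range(10)]
first_half = build_filter(items, l, n)[: n // 2]
# Every inserted item is still reported as present, so there are no false negatives.
print(all(query_first_half(first_half, x, l, n) for x in items))
```

Treating the unseen positions as set is what preserves the no-false-negatives property; the price is the higher false positive rate analysed next.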

Suppose, after the Bloom filter is created, that a fraction $q$ of bits are still empty (unset), among the first $n/2$. (So $q$ is some number of the form $\frac{r}{n/2}$, for $0 \le r \le n/2$.) You now test for an item which happens not to have been added.

For a particular one of the $l$ hash functions, the probability that it contributes to a negative is $\frac12q$ (the bit position given by the hash function should lie in the first half, and then moreover among the $q$ fraction of unset bits). [Or, saying this differently, the probability that it counts as positive is $\frac12 + \frac12(1-q)$: either in the second half, or in the first half and among the fraction $1-q$ of set bits. Either way, the probability of contributing to a positive is $1-\frac12q$.]

The probability that all $l$ of them count as positive is therefore $\left(1-\frac12q\right)^l$.
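As a purely illustrative calculation (the values $q = 0.6$ and $l = 5$ are made up for the example, not taken from the question):

$$ \left(1-\frac12 q\right)^l = (1 - 0.3)^5 = 0.7^5 \approx 0.168, $$

whereas a full filter with the same fraction $q$ of unset bits would give $(1-q)^l = 0.4^5 \approx 0.010$.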

Now for the approximations. The expected value of $q$ is the probability that a certain bit is left untouched by all of the $l$ hash functions for all of the $m$ items: that is $$ E[q] = \left(1 - \frac1n\right)^{lm} \approx \exp(-lm/n)$$ We can prove, as in the usual case, that $q$ is very strongly concentrated around its expected value. The probability of a false positive is therefore

$$\begin{align} \left(1-\frac12q\right)^l & \approx \left(1 - \frac{\exp(-lm/n)}{2}\right)^l \end{align} $$

as opposed to the $\left(1 - \exp(-lm/n)\right)^l$ of the full-Bloom-filter case.
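A quick numerical sanity check of these approximations can be sketched in Python. The parameter values ($n = 10000$, $l = 5$, $m = 1000$), the helper names, and the modelling of hash values as independent uniform draws are all assumptions of this sketch, not part of the question.

```python
import math
import random

def fp_full(n, l, m):
    """Classical approximation for the full filter."""
    return (1 - math.exp(-l * m / n)) ** l

def fp_first_half(n, l, m):
    """Approximation for a query that sees only the first n/2 bits, under option (3)."""
    return (1 - math.exp(-l * m / n) / 2) ** l

def simulate(n, l, m, trials, rng):
    """Return (q, rate): the unset fraction of the first half and the observed
    false positive rate. Hash values are idealised as uniform draws from range(n)."""
    bits = [False] * n
    for _ in range(l * m):
        bits[rng.randrange(n)] = True
    half = n // 2
    q = bits[:half].count(False) / half
    positives = 0
    for _ in range(trials):
        # A fresh, non-inserted item: l uniform positions; those >= half are unseen.
        if all(p >= half or bits[p] for p in (rng.randrange(n) for _ in range(l))):
            positives += 1
    return q, positives / trials

n, l, m = 10_000, 5, 1_000
q, rate = simulate(n, l, m, trials=20_000, rng=random.Random(0))
print(f"q = {q:.4f}, exp(-lm/n) = {math.exp(-l * m / n):.4f}")
print(f"simulated rate = {rate:.4f}, "
      f"half-filter formula = {fp_first_half(n, l, m):.4f}, "
      f"full-filter formula = {fp_full(n, l, m):.4f}")
```

The simulated rate should land near the half-filter formula and well above the full-filter one, since no choice of parameters can push the half-filter rate below $2^{-l}$.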

Thanks for your explanation! I went to Mitzenmacher and Upfal, studied it, and now understand everything; I can now do this evaluation on my own. — SugerBoy, Dec 15 '13 at 17:24
@Mandlbrot: The two are related by a simple relation, so at the level of approximation we are operating in, it doesn't make much of a difference; in fact it makes the analysis slightly simpler. I've updated the answer. — ShreevatsaR, Dec 16 '13 at 03:12

score 1 · Answer 2 · answered Dec 11 '13 at 01:44

A Bloom filter is...

an array of $n$ bits with
$k$ random hash functions, $f_i: S \to \{ 1, \dots , n\}$ with $i = 1, \dots, k$
not too many "collisions": $|f_i^{-1}(j)| < M$ for each $i$ and $1 \leq j \leq n$.

We then add elements of $S_0 \subseteq S$ by "flipping" the bits at each of the $k$ hash values of our inputs. So the set of flipped positions is $ T = f_1(S_0) \cup \dots \cup f_k(S_0) $.

Calculating False Positives

A "false" positive occurs when $ \{ f_1(s), \dots, f_k(s) \} \subseteq T$ but $s \notin S_0$.

For one hash of one inserted element, a particular bit gets flipped with probability $\frac{1}{n}$ and not flipped with probability $1 - \frac{1}{n}$.

Since there are $k$ independent hash functions, the probability that this element leaves the bit unflipped is $\left(1 - \frac{1}{n}\right)^k$.

Since we have inserted $|S_0| = \ell$ elements, the probability that a particular bit was not flipped is $\left(1 - \frac{1}{n}\right)^{k\ell}$.

For our test element $s \in S$, the probability that all hash functions test positive is $$ \mathbb{P}( f_1(s) \in T, \dots, f_k(s) \in T ) \approx \mathbb{P}( f_1(s) \in T ) \dots \mathbb{P}( f_k(s) \in T )$$

It's not really rigorous to say these events are independent, but continuing:

$$ \approx \left( 1 - \left(1 - \frac{1}{n}\right)^{k\ell}\right)^k \approx \left(1 - e^{-k\ell/n} \right)^k $$
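To see how close the two expressions in the last line are, one can evaluate both numerically. This is only a sketch: the function names and the sample values of $n$, $k$, $\ell$ are chosen here for illustration.

```python
import math

def fp_power_form(n, k, ell):
    """(1 - (1 - 1/n)^(k*ell))^k, the expression before the exponential approximation."""
    return (1 - (1 - 1 / n) ** (k * ell)) ** k

def fp_exp_form(n, k, ell):
    """(1 - exp(-k*ell/n))^k, the usual closed-form approximation."""
    return (1 - math.exp(-k * ell / n)) ** k

for n, k, ell in [(100, 3, 20), (10_000, 5, 1_000), (1_000_000, 7, 100_000)]:
    print(n, k, ell, fp_power_form(n, k, ell), fp_exp_form(n, k, ell))
```

Because $1 - \frac1n < e^{-1/n}$, the power form is always slightly larger than the exponential form, and the gap shrinks as $n$ grows.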

Assuming all the bits were flipped independently, it is as if we have a Bloom filter with $\frac{n}{2}$ bits, but half the time nothing gets flipped at all. So it should be more likely that we get false positives. I got

$$ \mathbb{P}[ \{f_1(s), \dots, f_k(s)\} \subseteq T \wedge s \notin S_0] = \left(1 - e^{-2k\ell/n} \right)^k$$

Is this Rigorous?

No. Anywhere we assumed independence is up for grabs... It is not clear that $k$ hash functions exist which are both "random" and have few "collisions".

I think your $k$ is the question's $l$, and your $\ell$ (except in the third line) is the question's $m$. I think your final number also needs justification (I get a different answer): the part where you say "I got" needs some explanation as to how you got that. :-) — ShreevatsaR, Dec 12 '13 at 02:38
