
A Bloom filter makes it possible to efficiently keep track of whether various values have already been encountered during processing. When there are many data items, a Bloom filter can result in a significant memory saving over a hash table. The main feature of a Bloom filter, which it shares with a hash table, is that it always says "not new" if an item is not new, but there is a non-zero probability that an item will be flagged as "not new" even when it is new.

Is there an "anti-Bloom filter", which has the opposite behaviour?

In other words: is there an efficient data structure which says "new" if an item is new, but which might also say "new" for some items which are not new?

Keeping all the previously seen items (for instance, in a sorted linked list) satisfies the first requirement but may use a lot of memory. I am hoping it is also unnecessary, given the relaxed second requirement.


For those who prefer a more formal treatment, write $b(x) = 1$ if the Bloom filter thinks $x$ is new, $b(x) = 0$ otherwise, and write $n(x) = 1$ if $x$ really is new and $n(x) = 0$ otherwise.

Then $Pr[b(x) = 0 | n(x) = 0] = 1$; $Pr[b(x) = 0 | n(x) = 1] = \alpha$; $Pr[b(x) = 1 | n(x) = 0] = 0$; $Pr[b(x) = 1 | n(x) = 1] = 1 - \alpha$, for some $0 < \alpha < 1$.

I am asking: does an efficient data structure exist, implementing a function $b'$ with some $0 < \beta < 1$, such that $Pr[b'(x) = 0 | n(x) = 0] = \beta$; $Pr[b'(x) = 0 | n(x) = 1] = 0$; $Pr[b'(x) = 1 | n(x) = 0] = 1 - \beta$; $Pr[b'(x) = 1 | n(x) = 1] = 1$?


Edit: It seems this question has been asked before on StackExchange, as https://stackoverflow.com/questions/635728 and https://cstheory.stackexchange.com/questions/6596 with a range of answers from "can't be done" through "can be done, at some cost" to "it is trivial to do, by reversing the values of $b$". It is not yet clear to me what the "right" answer is. What is clear is that an LRU caching scheme of some sort (such as the one suggested by Ilmari Karonen) works rather well, is easy to implement, and resulted in a 50% reduction in the time taken to run my code.
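For concreteness, the LRU scheme is easy to sketch. Here is a minimal Python version (the class and method names are my own, not a standard API): it remembers only the most recently seen items, so it may re-flag an evicted old item as "new", but it never says "not new" for a genuinely new item.

```python
from collections import OrderedDict

class LRUSeenSet:
    """Fixed-memory 'anti-Bloom filter': reports "new" whenever an item is
    not among the most recently seen ones. Old items may be re-flagged as
    new after eviction, but a genuinely new item is never called "not new".
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()  # insertion order tracks recency

    def check_and_add(self, item):
        """Return True if the item looks new; record it as recently seen."""
        if item in self._items:
            self._items.move_to_end(item)    # refresh recency
            return False                     # definitely seen before
        if len(self._items) >= self.capacity:
            self._items.popitem(last=False)  # evict least recently used
        self._items[item] = None
        return True                          # new, or seen but evicted
```

An old item is misreported as "new" only if it has been evicted, i.e. if at least `capacity` distinct other items were seen since its last occurrence.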

András Salamon

5 Answers


Going with Patrick87's hash idea, here's a practical construction that almost meets your requirements — the probability of falsely mistaking a new value for an old one is not quite zero, but can be easily made negligibly small.

Choose the parameters $n$ and $k$; practical values might be, say, $n = 128$ and $k = 16$. Let $H$ be a secure cryptographic hash function producing (at least) $n+k$ bits of output.

Let $a$ be an array of $2^k$ $n$-bit bitstrings. This array stores the state of the filter, using a total of $n2^k$ bits. (It does not particularly matter how this array is initialized; we can just fill it with zeros, or with random bits.)

  • To add a new value $x$ to the filter, calculate $i \,\|\, j = H(x)$, where $i$ denotes the first $k$ bits and $j$ denotes the following $n$ bits of $H(x)$. Let $a_{i} = j$.

  • To test whether a value $x'$ has been added to the filter, calculate $i' \,\|\, j' = H(x')$, as above, and check whether $a_{i'} = j'$. If yes, return true; otherwise return false.
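As a concrete illustration, here is a minimal Python sketch of the two operations above (SHA-256 stands in for $H$, and the constant and function names are mine, not part of the construction):

```python
import hashlib

N_BITS = 128  # n: bits stored per slot
K_BITS = 16   # k: number of slots = 2**k

# Filter state: 2**k slots, each holding an n-bit tag (stored here as ints).
slots = [0] * (1 << K_BITS)

def _split_hash(x: bytes):
    """Compute H(x) and split it into a k-bit index i and an n-bit tag j."""
    digest = hashlib.sha256(x).digest()  # 256 bits >= n + k = 144 bits
    bits = int.from_bytes(digest, 'big')
    i = bits >> (256 - K_BITS)                                    # first k bits
    j = (bits >> (256 - K_BITS - N_BITS)) & ((1 << N_BITS) - 1)   # next n bits
    return i, j

def add(x: bytes):
    """Add value x to the filter: store its tag in its slot."""
    i, j = _split_hash(x)
    slots[i] = j

def seen(x: bytes) -> bool:
    """Return True iff x's slot currently holds x's tag."""
    i, j = _split_hash(x)
    return slots[i] == j
```

A query returns true only when the full $(n+k)$-bit hash of the queried value matches that of the value most recently stored in its slot, which is what makes false positives so unlikely.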

Claim 1: The probability of a false positive (= new value falsely claimed to have been seen) is $1/2^{n+k}$. This can be made arbitrarily small, at a modest cost in storage space, by increasing $n$; in particular, for $n \ge 128$, this probability is essentially negligible, being, in practice, much smaller than the probability of a false positive due to a hardware malfunction.

In particular, after $N$ distinct values have been checked and added to the filter, the probability of at least one false positive having occurred is at most about $(N^2-N) / 2^{n+k+1}$. For example, with $n=128$ and $k=16$, the number of distinct values needed to get a false positive with 50% probability is about $2^{(n+k)/2} = 2^{72}$.

Claim 2: The probability of a false negative (= previously added value falsely claimed to be new) is no greater than $1-(1-2^{-k})^N \approx 1-\exp(-N/2^k) < N/2^k$, where $N$ is the number of distinct values added to the filter (or, more specifically, the number of distinct values added after the specific value being tested was most recently added to the filter).


Ps. To put "negligibly small" into perspective, 128-bit encryption is generally considered unbreakable with currently known technology. Getting a false positive from this scheme with $n+k=128$ is as likely as someone correctly guessing your secret 128-bit encryption key on their first attempt. (With $n=128$ and $k=16$, it's actually about 65,000 times less likely than that.)

But if that still leaves you feeling irrationally nervous, you can always switch to $n=256$; it'll double your storage requirements, but I can safely bet you any sum you'd care to name that nobody will ever see a false positive with $n=256$ — assuming that the hash function isn't broken, anyway.

Ilmari Karonen

No, it is not possible to have an efficient data structure with these properties if you want a guarantee that the data structure will say "new" whenever an item really is new (it must never, ever say "not new" for an item that is in fact new; no false negatives allowed). Any such data structure must effectively retain all of the data it has seen in order to ever safely respond "not new". See pents90's answer on cstheory for a precise justification.

In contrast, Bloom filters can guarantee that the data structure will say "not new" if an item is not new, in an efficient way. In particular, Bloom filters can be more efficient than storing all of the data: each individual item might be quite long, but the size of the Bloom filter scales with the number of items, not their total length. Any data structure for your problem would have to scale with the total length of the data, not the number of data items.

jbapple

What about just a hash table? When you see a new item, check the hash table. If the item's spot is empty, return "new" and add the item. Otherwise, check to see if the item's spot is occupied by the item. If so, return "not new". If the spot is occupied by some other item, return "new" and overwrite the spot with the new item.

You'll definitely always correctly get "New" if you've never seen the item's hash before. You'll definitely always correctly get "Not New" if you've only seen the item's hash when you've seen the same item. The only time you'll get "New" when the correct answer is "Not New" is if you see item A, then see item B, then see item A again, and both A and B hash to the same thing. Importantly, you can never get "Not New" incorrectly.
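A minimal Python sketch of this scheme (the bucket count and helper names are illustrative, not canonical):

```python
import hashlib

TABLE_SIZE = 1 << 16  # number of buckets; an arbitrary illustrative choice

# Each bucket stores the last item that hashed there (None = empty).
table = [None] * TABLE_SIZE

def bucket(item: str) -> int:
    """Map an item to a bucket index (using SHA-256 for the sketch)."""
    h = int.from_bytes(hashlib.sha256(item.encode()).digest(), 'big')
    return h % TABLE_SIZE

def check_and_add(item: str) -> bool:
    """Return True for "new", False for "not new"."""
    i = bucket(item)
    if table[i] == item:
        return False   # its spot holds the item itself: definitely not new
    table[i] = item    # spot empty or held a colliding item: overwrite
    return True        # "new" (possibly a repeat whose spot was overwritten)
```

Because the bucket stores the full item rather than just a fingerprint, an exact match is conclusive, so "not new" is never returned in error; only "new" can be wrong, after a collision-and-overwrite.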

Patrick87

In the case where the universe of items is finite, then yes: just use a Bloom filter that records which elements are out of the set, rather than in the set. (I.e., use a Bloom filter that represents the complement of the set of interest.)

A place where this is useful is to allow a limited form of deletion. You keep two Bloom filters. They start out empty. As you insert elements you insert them into Bloom filter A. If you later want to delete an element you insert that element into Bloom filter B. There is no way to undelete. To do a lookup you first look up in Bloom filter A. If you find no match, the item was never inserted (with probability 1). If you do find a match, the element may (or may not) have been inserted. In that case you do a lookup in Bloom filter B. If you find no match, the item was never deleted. If you do find a match in Bloom filter B, the item was probably inserted and then deleted.

This doesn't really answer your question, but, in this limited case, Bloom filter B is performing exactly the "anti-bloom filter" behaviour you are seeking.
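Here is one way the two-filter scheme might be sketched in Python, using a deliberately simple Bloom filter (all sizes and names are illustrative):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: m bits, k hash functions derived from SHA-256."""

    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _positions(self, item):
        for seed in range(self.k):
            h = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(h, 'big') % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

class DeletableSet:
    """Two-filter scheme: A records insertions, B records deletions."""

    def __init__(self):
        self.a, self.b = BloomFilter(), BloomFilter()

    def insert(self, item):
        self.a.add(item)

    def delete(self, item):
        self.b.add(item)  # there is no way to undelete

    def lookup(self, item):
        if item not in self.a:
            return "never inserted"    # certain (no false negatives in A)
        if item not in self.b:
            return "probably present"  # inserted, never deleted
        return "probably deleted"      # probably inserted, then deleted
```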

Real Bloom filter researchers use much more efficient ways of representing deletion; see Mike Mitzenmacher's publications page.

Wandering Logic

I just want to add here that if you are in the fortunate situation of knowing all of the values $v_i$ that you might possibly see, then you can use a counting Bloom filter.

An example might be IP addresses: you want to know every time one appears that you have never seen before, but the set of possible addresses is finite, so you know what to expect.

The actual solution is simple:

  1. Add all your items to the counting bloom filter.
  2. When an item arrives, check its slots: an item you have never seen before will still have values $\ge1$ in all of its slots, so report it as "new".
  3. After seeing an actual new item, subtract it from the filter.

So you might get 'false positives': values that were actually old, but recognized as new. However, you will never get "not new" for a genuinely new value, since its own contribution is still present in all of its slots, and no other item's subtraction could have taken that away.
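A minimal Python sketch of these three steps (the sizes and names are illustrative; a real counting Bloom filter would be sized to the universe):

```python
import hashlib

M, K = 1024, 4  # counter-array size and number of hash functions (illustrative)

counters = [0] * M  # the counting Bloom filter's state

def _positions(item):
    """The K counter positions for an item (derived from SHA-256)."""
    for seed in range(K):
        h = hashlib.sha256(f"{seed}:{item}".encode()).digest()
        yield int.from_bytes(h, 'big') % M

def preload(universe):
    """Step 1: add every possible value to the counting filter."""
    for item in universe:
        for p in _positions(item):
            counters[p] += 1

def check(item):
    """Steps 2-3: report "new" iff all of the item's counters are >= 1,
    then subtract it so later sightings can be recognized as old."""
    positions = list(_positions(item))
    if all(counters[p] >= 1 for p in positions):
        for p in positions:
            counters[p] -= 1
        return "new"
    return "not new"
```

A never-before-seen item's own preloaded contribution keeps all of its counters at $\ge1$, so it is always reported "new"; an already-subtracted item can only look "new" again if other items happen to keep all of its counters positive.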

Thomas Ahle