
A Bloom filter uses a hash function to test membership in a given set $S$, by checking whether the bit at the position an item hashes to is set or not.

To mitigate the effect of hash collisions, multiple hash functions are used, which yields a probabilistic error bound when universal hashing is used.

We can use about 10 bits per element to get a 'reasonable' error rate (a false positive rate on the order of 1%).
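
For concreteness, here is a minimal sketch of such a filter (the sizes, the use of SHA-256, and the double-hashing trick are illustrative assumptions, not part of the question): with roughly 10 bits per element and about 7 hash functions the false positive rate is on the order of 1%.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: m bits, k hash positions per item."""

    def __init__(self, m, k):
        self.m = m                          # number of bits
        self.k = k                          # number of hash functions
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item):
        # Derive k positions from two halves of one digest (double hashing).
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        # False positives are possible, false negatives are not.
        return all((self.bits[p // 8] >> (p % 8)) & 1 for p in self._positions(item))

# ~10 bits per element for up to 1000 elements, 7 hash functions.
bf = BloomFilter(m=10 * 1000, k=7)
bf.add("alice")
assert bf.might_contain("alice")
print(bf.might_contain("bob"))   # almost certainly False; a false positive is possible
```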

If we could directly build a perfect hash function for the set $S \cup \{\infty\}$, where $\infty$ is an element not present in $S$, then we could use only 1 bit per element and have perfect recovery.
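
As a sketch, the proposal amounts to something like the following (the `perfect_hash` builder is hypothetical; assume it returns a function that is collision-free on the given set):

```python
# Hypothetical sketch of the proposal: assume perfect_hash(S) returns a
# collision-free function h from the n+1 keys of S plus the "not in S"
# sentinel to distinct slots in [0, n].
def build_one_bit_filter(S, perfect_hash):
    h = perfect_hash(S)
    bits = [0] * (len(S) + 1)          # one bit per element, plus the sentinel slot
    for x in S:
        bits[h(x)] = 1
    return lambda y: bits[h(y)] == 1   # membership test under the proposal
```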

What are the fundamental reasons why this reasoning is wrong?

Erel Segal-Halevi
nicolas

2 Answers


I think your reasoning is in principle correct. Perfect hashing is an alternative to Bloom filters. However, classical dynamic perfect hashing is more of a theoretical result than a practical solution. Cuckoo hashing is probably the more "reasonable" alternative.

Note that for both dynamic perfect hashing and standard cuckoo hashing the performance guarantees are only expected and amortized (you might need to rebuild the data structure completely from time to time). Also, Bloom filters are easier to implement. These may be arguments for using a Bloom filter, especially if you can live with false positives.
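
As an illustration, here is a minimal two-table sketch of standard cuckoo hashing (the table size, eviction limit, and hashing via Python's built-in `hash` with random seeds are all illustrative choices); the `_rebuild` path at the end is the occasional full reconstruction that makes the bounds only expected amortized:

```python
import random

class CuckooHashTable:
    """Sketch of cuckoo hashing: each key lives in one of two possible slots."""

    MAX_KICKS = 32   # length of an eviction chain before giving up and rebuilding

    def __init__(self, size=16):
        self.size = size
        self.tables = [[None] * size, [None] * size]
        self.seeds = [random.random(), random.random()]

    def _slot(self, key, i):
        return hash((self.seeds[i], key)) % self.size

    def lookup(self, key):
        # At most two probes; no false positives and no false negatives.
        return any(self.tables[i][self._slot(key, i)] == key for i in (0, 1))

    def insert(self, key):
        if self.lookup(key):
            return
        for kick in range(self.MAX_KICKS):
            i = kick % 2
            slot = self._slot(key, i)
            # Place the key, evicting whatever was there before.
            key, self.tables[i][slot] = self.tables[i][slot], key
            if key is None:
                return                 # found an empty slot, done
        self._rebuild(key)             # eviction cycle: rebuild (amortized cost)

    def _rebuild(self, pending):
        # Fresh hash functions and larger tables, then reinsert everything.
        old = [k for t in self.tables for k in t if k is not None] + [pending]
        self.__init__(self.size * 2)
        for k in old:
            self.insert(k)
```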

A.Schulz

I think the Bloom filter gives you something the perfect hash function does not - it can test membership.

The PHFs I know of return some answer for any key you apply them to: if the key you supply is not in your set, some value is still returned. This is fine if you are storing all of the keys in your set somewhere and the PHF just gives a pointer to them, or if you are only using the PHF to look up satellite data of size $O(1)$ for keys you already know to be in your structure. Membership testing, however, is harder.
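
A sketch of that failure mode (again with a hypothetical `perfect_hash` builder): a key outside the set still hashes to some slot, so a plain bit array cannot reject it; you end up having to store the keys themselves (or at least fingerprints of them) to compare against.

```python
# Hypothetical sketch: why one bit per slot is not enough for exact membership.
# Assume perfect_hash(S) returns a function h that is collision-free on S and
# maps *any* key, member or not, to some index in range.
def exact_membership_filter(S, perfect_hash):
    h = perfect_hash(S)
    keys = [None] * (len(S) + 1)
    for x in S:
        keys[h(x)] = x             # the key itself (or a fingerprint) must be stored
    def contains(y):
        return keys[h(y)] == y     # comparing against the stored key rejects outsiders
    return contains
```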

In particular, storing a set of $n$ distinct elements from a universe of size $u$ without error requires at least $\log_2 \binom{u}{n}$ bits of storage, roughly $n \log_2(u/n)$ for $n \ll u$, so the cost per element grows with the universe size instead of staying at one bit (or the Bloom filter's roughly ten).
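
To put rough numbers on this (the universe size $u = 2^{64}$ and $n = 10^6$ are illustrative assumptions):

```python
from math import e, log2

n = 10**6     # number of stored elements (illustrative)
u = 2**64     # universe size, e.g. 64-bit keys (illustrative)

# Exact membership: log2 C(u, n) is about n * log2(e * u / n) bits when n << u.
print(log2(e * u / n))            # ~45.5 bits per element

# Approximate membership with a 1% false positive rate (Bloom filter):
# about log2(1/0.01) * log2(e) bits per element.
print(log2(1 / 0.01) * log2(e))   # ~9.6 bits per element
```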

jbapple