6

I'm trying to solve the computer science problem "Minimal perfect hash function" (MPHF). I have an algorithm that can generate a MPHF for very large sets in $O(n)$ that only needs 1.54 bits/key, very close to the theoretical limit of 1.44 bits/key, and it is relatively fast. But I would like to make it faster. To do that, I need to find MPHFs for "small" sets, lets say of size $n=16$. Right now I can do that in $O(\frac{n^n}{n!})$, but maybe there is a $O(n \log{n})$ algorithm.

I would like to reformulate the problem to another domain (for example Chinese remainder theorem, prime number factorization, Galois Field, finding the GCD), and solve it like this, or prove brute force is the the best algorithm.

Warning: I'm not good at math, so the following is probably ridiculous, but maybe somebody can help to improve my question and use the right tags.

So I came up with a (integer only) formula that has multiple solutions, and one unknown, $x$. I'm trying to find the smallest $x$ quickly. There are a number of known, randomly distributed, positive integers $h_n$, in the range [0 .. 65536). The size $s$ is quite small, let's say 16 (the higher the better). The right hand side of the equation is a constant.

$$ \sum_{n=1}^{s} {s ^ {(\lfloor\frac{h_n x}{65536}\rfloor \bmod s)}} = \sum_{n=1}^{s} {s ^ {n}} $$

How can the smallest $x$ be found quickly. For $s = 10$, $x$ is probably below 12'000, with $s = 11$ below 32'000, with $s = 12$ below 85'000, and so on.

The $\sum_{n=1}^{s} {s ^ {n}}$ is a weird way to do "bitwise OR". What I'm trying to do is ensure there are no collisions, or written in another way, in case $s=3$:

$$ \lfloor\frac{h_1 x}{65536}\rfloor \not\equiv \lfloor\frac{h_2 x}{65536}\rfloor \not\equiv \lfloor\frac{h_3 x}{65536}\rfloor \bmod 3 $$

The function $f = (\lfloor\frac{h x}{65536}\rfloor \bmod s)$ is just an example. It is a "supplemental hash function", and is supposed to returns an integer between $0$ and $s$ with about equal probability. Any other function with the same property can be used instead. A good supplemental hash function is a bit more complex, but this one is good enough for my use case (it doesn't need to be perfect).

  • 1
    If you are really close to the 1.44 bits/key limit, that is significantly better than the 2.6 bits/key of what seems to be the currently known best general purpose solution. – justanotherhagman Dec 22 '15 at 13:14
  • @justanotherhagman, Wikipedia is not up-to-date, the current best is Hash, displace, and compress with 2.07 bits/key. I'm at 1.54 bits/key right now. – Thomas Mueller Dec 22 '15 at 13:40
  • I know this is 3 years later, but for the mod 3 example, consider the case h = [0, 655363, 655366], where for all x, floor(h[i]*x/65536)%3 = 0. – kylefinn Dec 12 '18 at 20:22
  • @kylefinn yes, I forgot to say, multiplication is GF(2^64) – Thomas Mueller Dec 13 '18 at 15:17
  • @ThomasMueller Then consider 4 instead of 3, h=65536[0,4,8,12]. Now for all x, floor(h[i]x/65536)%(2^64)%4 = 0 – kylefinn Mar 07 '19 at 18:58
  • @kylefinn there is a solution with x = 16384. But let's assume there is no solution for some cases - as long as there is likely a solution for random data, then that's sufficient. If there is no solution with a low probability, then we can retry with other values. – Thomas Mueller Mar 08 '19 at 08:23
  • @kylefinn ah you mean h_1 = 4 * 65536... I used h_1 = 4. Yes, all h need to be smaller than 65536. I will edit the question. – Thomas Mueller Mar 08 '19 at 08:27

2 Answers2

2

The approach described in the paper "Constructing Minimal Perfect Hash Functions Using SAT Technology" describes a way to find an MPHF for a small set that is faster than brute force. There is also a reference implementation. It works for sets up to about 40 keys. Still, it only finds a solution with a certain probability at a certain size. In practise, it may not be faster than brute force, and it may not use less space: as far as I understand, it depends a lot on the SAT solver implementation.

1

A new algorithm has been proposed for this problem, in ShockHash: Towards Optimal-Space Minimal Perfect Hashing Beyond Brute-Force. This is done using a small, heavily overloaded cuckoo hash table. The entries of the small set (e.g. 24 entries each time) are hashed using a seed, and then added to a cuckoo hash table that is 100% full. Each entry can be in two positions. This will likely not work, so a new seed is used. At some point this works, and the seed is stored. Storing this seed uses about 0.5 bits per entry. Also stored is, for each entry, at which location it is stored. This information is stored using a retrieval data structure - e.g. a ribbon filter, with a bit more than one bit per entry.