
If I have a list of key values from 1 to 100 and I want to distribute them over an array of 11 buckets, I've been taught to use the mod function

$$ H = k \bmod 11 $$

Now the values will be spread out with roughly 9 per bucket. For example, bucket $0$ will hold $11, 22, 33, \dots$; bucket $1$ will hold $1, 12, 23, \dots$; and so on.

Let's say I decided to be a bad boy and use a non-prime as my modulus - take 12. Using the hash function

$$ H = k \bmod 12 $$

would result in a hash table with values $12, 24, 36, \dots$ in bucket $0$, $1, 13, 25, \dots$ in bucket $1$, and so on.

Essentially the two behave the same. I didn't reduce collisions, and I didn't spread things out any better by using the prime modulus, so I can't see how it is ever beneficial.
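Here is a quick check of that spread (Python, purely for illustration):

```python
from collections import Counter

keys = range(1, 101)
for m in (11, 12):
    sizes = Counter(k % m for k in keys)
    # for uniform keys 1..100, both moduli spread almost evenly:
    # every bucket ends up with floor(100/m) or ceil(100/m) keys
    print(f"m = {m}: bucket sizes = {[sizes[b] for b in range(m)]}")
```

For uniformly distributed keys the two moduli really do look the same, which is exactly the point of the question.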

Paresh
CodyBugstein

7 Answers


Consider the set of keys $K=\{0,1,...,100\}$ and a hash table where the number of buckets is $m=12$. Since $3$ is a factor of $12$, the keys that are multiples of $3$ will be hashed to buckets that are multiples of $3$:

  • Keys $\{0,12,24,36,...\}$ will be hashed to bucket $0$.
  • Keys $\{3,15,27,39,...\}$ will be hashed to bucket $3$.
  • Keys $\{6,18,30,42,...\}$ will be hashed to bucket $6$.
  • Keys $\{9,21,33,45,...\}$ will be hashed to bucket $9$.

If $K$ is uniformly distributed (i.e., every key in $K$ is equally likely to occur), then the choice of $m$ is not so critical. But, what happens if $K$ is not uniformly distributed? Imagine that the keys that are most likely to occur are the multiples of $3$. In this case, all of the buckets that are not multiples of $3$ will be empty with high probability (which is really bad in terms of hash table performance).

This situation is more common than it may seem. Imagine, for instance, that you are keeping track of objects based on where they are stored in memory. If your computer's word size is four bytes, then you will be hashing keys that are multiples of $4$. Needless to say, choosing $m$ to be a multiple of $4$ would be a terrible choice: you would have $3m/4$ buckets completely empty, and all of your keys colliding in the remaining $m/4$ buckets.

In general:

Every key in $K$ that shares a common factor with the number of buckets $m$ will be hashed to a bucket that is a multiple of this factor.

Therefore, to minimize collisions, it is important to reduce the number of common factors between $m$ and the elements of $K$. How can this be achieved? By choosing $m$ to be a number that has very few factors: a prime number.
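As a concrete sketch of the memory-address scenario (Python, chosen here just for illustration):

```python
from math import gcd

keys = range(0, 400, 4)  # word-aligned addresses: every key is a multiple of 4
for m in (12, 11):
    used = {k % m for k in keys}
    # only m // gcd(4, m) buckets are ever reachable
    print(f"m = {m}: buckets used = {sorted(used)} ({len(used)} of {m})")
```

With $m=12$ only buckets $0$, $4$, and $8$ are ever hit; with the prime $m=11$ all eleven buckets are used.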

Mario Cervera

Whether a collision is less likely using primes depends on the distribution of your keys.

If many of your keys have the form $a+k\cdot b$ and your hash function is $H(n)=n \bmod m$, then these keys land in only $m/\gcd(b,m)$ of the buckets, which is a small subset iff $\gcd(b,m)>1$. So you should minimize the number of values $b$ for which this happens, which is achieved by choosing $m$ prime.

If, on the other hand, you want about $11$ or $12$ buckets and you know that key differences which are multiples of $11$ are more likely than differences which are multiples of $2$ or $3$, you may well choose $12$ for your very special application.
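A small sketch of the first point (Python, illustrative only): keys in an arithmetic progression $a + k \cdot b$ reach exactly $m/\gcd(b,m)$ buckets.

```python
from math import gcd

def occupied_buckets(a, b, m, n=1000):
    """Distinct buckets hit by the keys a, a+b, a+2b, ... under k mod m."""
    return len({(a + k * b) % m for k in range(n)})

for m in (12, 11):
    for b in (2, 3, 4, 6):
        assert occupied_buckets(5, b, m) == m // gcd(b, m)
        print(f"m = {m}, b = {b}: {occupied_buckets(5, b, m)} buckets occupied")
```

For $m=12$ every step size $b \in \{2,3,4,6\}$ collapses the keys into a fraction of the table; for the prime $m=11$ each of them fills all eleven buckets.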

frafl

Whether this has an impact (also) depends on how you treat collisions. With some variants of open addressing, using a prime table size guarantees that an empty slot is found as long as the table is sufficiently empty.

Try to show the following, for instance:

Assume we want to insert an element that hashes to address $a$ and resolve collisions by trying positions $a + i^2$ subsequently for $i=1,2,\dots$.

Show that this procedure always yields an empty position if the hash table is of size $p$, $p$ a prime larger than $3$, and at least half of all positions are free.

Hint: Use the fact that the residue class ring modulo $p$ is a field if $p$ is prime and therefore $i^2=c$ has at most $2$ solutions.
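The counting argument behind the hint can be checked directly (Python, illustration only): since $i^2 \equiv j^2 \pmod p$ forces $i \equiv \pm j$, the probe positions for $i = 0, \dots, \lfloor p/2 \rfloor$ are pairwise distinct, giving $(p+1)/2$ distinct positions.

```python
def distinct_probes(p, a=0):
    """Number of distinct positions a + i^2 (mod p) for i = 0 .. p // 2."""
    return len({(a + i * i) % p for i in range(p // 2 + 1)})

for p in (5, 7, 11, 13, 101):   # primes larger than 3
    print(f"p = {p}: {distinct_probes(p)} distinct probe positions")
    assert distinct_probes(p) == (p + 1) // 2
```

So if at least half of the $p$ positions are free, the occupied ones cannot cover all $(p+1)/2$ probe positions, and one probe must land on an empty slot.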

Raphael

First of all, the question is phrased incorrectly. The following are equivalent and correct expressions of the intended question:

  • why must we use a prime number as the modulus of the hash value (not "in the hashing function")

or equivalently and more succinctly:

  • why must the size of a hash table be a prime number?

The proper answer to this question lies in the context, and that context is open addressing with double hashing collision resolution.

Where you compute:

hashval1 = hashfcn1(k) // for initial location
hashval2 = hashfcn2(k) // probing increment

hashval1 is used for the initial insertion probe. If that location is empty, then you insert the (k, v) and you’re done, never using hashval2. If that location is occupied, then you jump hashval2 addresses to (hashval1 + hashval2) % arraysize, and repeat to see if the location is empty. If that location is again not empty, then you jump another hashval2 addresses to (hashval1 + 2 * hashval2) % arraysize. And so on.

Here’s the key: HOW MANY TIMES DO YOU PROBE BEFORE GIVING UP?

If every element of the array is potentially a target, then the answer is: you probe n times. This is easy to visualize with linear probing: you write a for (i = 0; i < n; ++i) loop for probing which breaks early if an empty slot is found. It's no different with double hashing, except that the probe increment is hashval2 rather than 1. But this leads to a problem: you're not guaranteed to probe every slot in the array when hashval2 > 1. Simple example: array size = 10 and hashval2 = 2; you only ever probe the even slots (or only the odd slots), and in fact you probe each of those five slots twice, which is pointless. Even if you make the array size an odd number like 15, if the probe increment is 5 then the probe sequence (let's use hashval1 = 0) will be: 0, 5, 10, 0, 5, 10, ... You keep probing the same three occupied slots over and over.

What’s the problem here? The problem is that the hash table size and hashval2 share a common factor, in this case 5. So we need the hash table size and hashval2 to share no factors. Since hashval2 is a function of k and, post modulo, lies in the range 0 < hashval2 < array size, there’s nothing we can do about hashval2. That leaves the array size. The array size in double hashing must be guaranteed to share no factor with any possible hashval2, and by simple math, the only sizes with that property are the prime numbers.

Example: Array size = 11 (a prime number). hashval1 = 1.

hashval2 = 1; // 1,2,3,4,5,6,7,8,9,10,0 <-- probing sequence
hashval2 = 2; // 1,3,5,7,9,0,2,4,6,8,10
hashval2 = 3; // 1,4,7,10,2,5,8,0,3,6,9
hashval2 = 4; // 1,5,9,2,6,10,3,7,0,4,8
hashval2 = 5; // 1,6,0,5,10,4,9,3,8,2,7
hashval2 = 6; // 1,7,2,8,3,9,4,10,5,0,6
hashval2 = 7; // 1,8,4,0,7,3,10,6,2,9,5
hashval2 = 8; // 1,9,6,3,0,8,5,2,10,7,4
hashval2 = 9; // 1,10,8,6,4,2,0,9,7,5,3
hashval2 = 10;// 1,0,10,9,8,7,6,5,4,3,2

You probe at most the number of slots in the hash table: every probe sequence visits each slot exactly once, so every slot has an equal chance of hosting a value. And since the probe increment is itself a hash value, it is uniformly distributed over the size of the hash table, which eliminates the clustering you get with other collision resolution algorithms.
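The probe sequences listed above can be reproduced in a few lines (Python, illustrative only):

```python
def probe_sequence(start, step, size):
    """Slots visited by double hashing: start, start + step, ... (mod size)."""
    return [(start + i * step) % size for i in range(size)]

# prime table size: every increment permutes all 11 slots
for h2 in range(1, 11):
    assert sorted(probe_sequence(1, h2, 11)) == list(range(11))

print(probe_sequence(1, 3, 11))   # the hashval2 = 3 row above
print(probe_sequence(0, 5, 15))   # composite size: stuck on slots 0, 5, 10
```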

TL;DR: a hash table size that is a prime number is the only way to guarantee, for every possible hashval2, that you never re-probe a previously probed location.

bking

If your hash function is of the form $h(k)=a\times k \bmod m$, where $m$ is prime and $a$ is chosen at random, then the probability that two distinct keys hash to the same bucket is at most $\frac{1}{m}$. So for $m=1009$, $\Pr\{h(x)=h(y) \mid x\neq y\}\leq \frac{1}{1009} \approx 0.00099$, which is very small.

This scheme is known as universal hashing.
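For reference, the family usually called universal in textbooks is the slightly more general $h_{a,b}(k) = ((a k + b) \bmod p) \bmod m$ with $p$ prime and $p > m$; its collision bound can be verified exhaustively for small parameters (Python, illustrative only):

```python
from itertools import product

p, m = 13, 5          # small prime p and table size m, just for the check
x, y = 1, 2           # any two distinct keys in [0, p)
total = p * (p - 1)   # choices of (a, b) with a != 0
collisions = sum(
    ((a * x + b) % p) % m == ((a * y + b) % p) % m
    for a, b in product(range(1, p), range(p))
)
# universality: collision probability over random (a, b) is at most 1/m
print(f"Pr[collision] = {collisions}/{total}, bound 1/m = {1/m}")
assert collisions / total <= 1 / m
```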

mrk

Just to complement Mario Cervera's answer.

Let's consider a hash function given by hash(k) = k mod 12, so our hash table has 12 buckets.

First, we need to understand how collisions occur in hash tables. Two values with keys $k_1$ and $k_2$ are said to collide when hash(k_1) = hash(k_2), which here means $k_1 \bmod 12 = k_2 \bmod 12$.

Now imagine that we have a list of values to add to our hash table, with the following keys respectively: $ [ 0, 1, 2, \dots, 100]. $

The following plot shows the size of each bucket in the hash table after adding the above values:

[plot: all 12 buckets hold 8 or 9 values each]

As can be seen, the bucket sizes are fairly even across the hash table. In practice, however, it is seldom the case that the data is so neatly distributed.

It is, for example, common for some values to be more likely to occur than others. One case where this happens is when the keys are strings and the hash function simply adds up the ASCII code of each character and then mods the result.

In any case, let's assume the extreme case where the set of keys consists of only numbers that are multiples of 3.

So we construct the list as follows:

[0, 3, 6, 9, 12, 15, 18, ...]

Now we try filling in the hash table again, using the same hash function as before:

[plot: only buckets 0, 3, 6, and 9 are occupied]

As you can see, it is far less balanced: the only occupied buckets are 0, 3, 6, and 9.

To see why, suppose we only have multiples of 3 as keys, and observe how they are assigned to buckets:

Keys:     0  3  6  9  12 15 18 ...
          ↓  ↓  ↓  ↓  ↓  ↓  ↓ 
Bucket:   0  3  6  9  0  3  6  ...

Which explains the uneven buckets above.

Now let's try a table of 10 buckets, with the same unbalanced data.

[plot: the keys spread evenly over all 10 buckets]

It is again well balanced. And 10 is not a prime. So what's going on?

For the unbalanced data, the main difference between a table of 10 buckets and a table of 12 buckets is that the multiples of $3$ no longer line up with a fixed set of buckets, since $\gcd(3,10)=1$:

Keys:     0  3  6  9  12 15 18 ...
          ↓  ↓  ↓  ↓  ↓  ↓  ↓ 
Bucket:   0  3  6  9  2  5  8  ...

Which explains the nicely balanced result above.

So the main issue is that 12 (the first table size) is a multiple of 3.

So the general rule is: since we know the data might be biased, we want to decrease the chance that it is biased towards multiples of a number that divides the bucket count. And which numbers are famous for having few divisors? The prime numbers!
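The whole comparison above can be reproduced in a few lines (Python, illustrative only):

```python
from collections import Counter

keys = range(0, 300, 3)  # biased data: only multiples of 3
for m in (12, 10, 11):
    counts = Counter(k % m for k in keys)
    print(f"{m} buckets: {len(counts)} used -> {dict(sorted(counts.items()))}")
# 12 buckets: only 0, 3, 6, 9 are ever used; with 10 or 11 buckets, all are used
```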

Claudio

I think there is a short answer using probability. We use % (modulo) in hashing to compute the remainder of a division (key % chosen number). If the modulus is a prime number, fewer keys share a common factor with it, so the keys are spread over more distinct remainders and fewer collisions happen.

Sizerta