10

Disclaimer: I know there are similar-sounding questions already here and on Stack Overflow. But they are all about collisions, which is not what I am asking about.

My question is: why is collision-less lookup O(1) in the first place?

Let's assume I have this hashtable:

Hash  Content
-------------
ghdjg Data1
hgdzs Data2
eruit Data3
xcnvb Data4
mkwer Data5
rtzww Data6

Now I'm looking for the key k where the hash function h(k) gives h(k) = mkwer. But how does the lookup "know" that the hash mkwer is at position 5? Why doesn't it have to scan through all keys in O(n) to find it? The hashes can't be some kind of real hardware addresses, because then I'd lose the ability to move the data around. And as far as I know, the hashtable is not sorted on the hashes (and even if it were, the search would still take O(log n)).

How does knowing a hash help finding the correct place in the table?

Foo Bar
  • 203
  • 1
  • 7

3 Answers

24

The hash function doesn't return some string such as mkwer. It directly returns the position of the item in the array. If, for example, your hash table has ten entries, the hash function will return an integer in the range 0–9.
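
As a rough sketch of what that looks like in code (my own Java illustration, not part of the original answer; the character-summing hash is arbitrary), a ten-slot table maps a key straight to an index between 0 and 9, so a lookup is a single array access:

    // Minimal ten-slot table; h() returns an index 0-9 directly, never a string.
    class TinyTable {
        private final String[] slots = new String[10];

        private int h(String key) {
            int sum = 0;
            for (char c : key.toCharArray()) sum += c;   // arbitrary toy hash
            return sum % slots.length;                   // always in 0..9
        }

        void put(String key, String value) { slots[h(key)] = value; }  // one array write

        String get(String key) { return slots[h(key)]; }               // one array read: O(1)
    }

Collisions are ignored here; the point is only that h(key) produces an array index directly, not a string like mkwer.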

David Richerby
  • 82,470
  • 26
  • 145
  • 239
7

The hash function calculates an array position from the given string. If it is a perfect hash, there are guaranteed to be no collisions; most probably the array is at least twice as large as the number of elements.

For example, here is a very poor hash for letters, just to illustrate the mechanism (a Java sketch of it follows the steps):
0) $x = 0$
1) for each character in the string, take its ASCII value, subtract 'a' if it is lower case or 'A' if it is upper case, add the value to $x$, and set $x = x \bmod 52$
2) the resulting number, e.g. 15, is the index into the array.
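
In Java, that toy hash might be written as follows (my own transcription of the steps above, so treat it as a sketch):

    // Toy hash from steps 0-2: sum letter offsets and reduce mod 52.
    // Deliberately poor and collision-prone; illustration only.
    static int toyHash(String s) {
        int x = 0;                                   // step 0
        for (char c : s.toCharArray()) {             // step 1
            if (c >= 'a' && c <= 'z') x += c - 'a';
            else if (c >= 'A' && c <= 'Z') x += c - 'A';
            x %= 52;
        }
        return x;                                    // step 2: an index into a 52-slot array
    }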

This very simple hash (limited and prone to collisions) differs from other hashes only in its hashing mechanism; it does not take the characteristics of the input into account. In more advanced schemes the hash is a larger number, adjusted to the number of elements. A perfect hash is generated from the complete set of inputs to guarantee that there are no collisions.

This is $O(1)$ because the cost of computing the hash of a string depends on how sophisticated the function is, but not on the number of elements in the table.

In the case of a perfect hash, $h(k)$ is recalculated when elements are added. In the simpler case with collisions, when the load of the array becomes large the array size increases, the function takes a larger modulus of its output, and the elements are shifted to their new places.
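
To make the resizing step concrete, here is a rough sketch (my own, not from any particular library) of growing the table and re-placing the keys under the new, larger modulus; values and collision handling are omitted:

    // Double the table and re-place every key according to the new modulus.
    static String[] grow(String[] oldTable, java.util.List<String> keys) {
        String[] newTable = new String[oldTable.length * 2];
        for (String k : keys) {
            int index = (k.hashCode() & 0x7fffffff) % newTable.length;  // larger modulus
            newTable[index] = k;                                        // key moves to its new place
        }
        return newTable;
    }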

An array is a contiguous fragment of memory: to get the $n$-th element you take the address of the first element (the start of the array) and add $n \cdot (\text{size of element})$ to it, which gives you an explicit memory cell.
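
Java hides the pointer arithmetic, but the calculation can be simulated; the base address and element size below are made-up numbers, only there to show the arithmetic:

    // Element n lives at (array start) + n * (size of element): one multiply, one add, O(1).
    public class ArrayAddressDemo {
        public static void main(String[] args) {
            long arrayStart  = 0x1000;  // assumed base address
            long elementSize = 8;       // assumed element size in bytes
            int  n = 5;
            long address = arrayStart + n * elementSize;
            System.out.printf("element %d sits at address 0x%x%n", n, address);
        }
    }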

Evil
  • 9,525
  • 11
  • 32
  • 53
4

To expand on David Richerby's answer, the term "hash function" is a little overloaded. Often, when we talk about a hash function we think of MD5, SHA-1, or something like Java's .hashCode() method, which turns some input into a single number. However, the range of this number (i.e. its maximum value) is very unlikely to match the size of the hashtable you're trying to store data in. (An MD5 digest is 16 bytes, SHA-1 is 20 bytes, and .hashCode() returns an int, 4 bytes.)

So your question is about that next step - once we have a hash function that can map arbitrary inputs to numbers, how do we put them into a data structure of a particular size? With another function, also called a "hash function"!

A trivial example of such a function is modulo; you can easily map a number of arbitrary size to a specific index in an array with modulo. This is introduced in CLRS as "the division method":

In the division method for creating hash functions, we map a key $k$ into one of $m$ slots by taking the remainder of $k$ divided by $m$. That is, the hash function is

$h(k) = k \bmod m$.

...

When using the division method we usually avoid certain values of $m$. For example, $m$ should not be a power of 2, since if $m = 2^p$ then $h(k)$ is just the $p$ lowest-order bits of $k$.

~Introduction to Algorithms, §11.3.1 - CLRS

So modulo isn't a great hash function, since it restricts what sizes we can safely use for our underlying data structure. The next section introduces a slightly more complex "multiplication method", which also uses modulo but is advantageous because "the value of $m$ is not critical". However, it works best with some prior knowledge of the "characteristics of the data being hashed", something we often don't have.
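
As a concrete illustration (my own sketch, not code from CLRS), the division method reduces an arbitrary hash code to a valid slot number like this:

    // Division method: h(k) = k mod m, with floorMod to keep the index non-negative.
    public class DivisionMethod {
        static int h(int k, int m) {
            return Math.floorMod(k, m);
        }

        public static void main(String[] args) {
            int m = 31;                            // table size; deliberately not a power of 2
            int slot = h("mkwer".hashCode(), m);   // key -> big int -> slot in 0..m-1
            System.out.println("\"mkwer\" goes into slot " + slot);
        }
    }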

Java's HashMap uses a modified version of the division method that does a pre-processing step to account for weak .hashCode() implementations so that it can use power-of-two sized arrays. You can see exactly what's happening in the .getEntry() method (comments are mine):

 // hash() transforms key.hashCode() to protect against bad hash functions
 int hash = (key == null) ? 0 : hash(key.hashCode());
 // indexOf() converts the resulting hash to a value between 0 and table.length-1
 for (Entry<K,V> e = table[indexFor(hash, table.length)];
     ...

Java 8 brought along a rewrite of HashMap which is even faster, but a little harder to read. It uses the same general principle for index lookup, however.
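
The index computation itself, in both versions, comes down to a bit mask because the table length is a power of two; roughly (paraphrasing the idea rather than quoting the exact source):

    // For a power-of-two length, (h & (length - 1)) equals h mod length,
    // but needs only a bitwise AND instead of a division.
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

With a mask, only the low-order bits pick the bucket, which is exactly why HashMap smears the bits of .hashCode() first: a weak hash code that varies only in its high bits would otherwise pile everything into one slot.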

dimo414
  • 173
  • 7