
One way to deal with the problem of collisions for a hash table is to have a linked list for each bucket. But then the lookup time is no longer constant. Why not use a hash set instead of a linked list? The lookup time would then be constant. For example, in C++ the hash table could be defined as:

std::unordered_map<int, std::unordered_set<int>> m;
programmer

5 Answers


A hash set is a hash table. Using a hash set to handle collisions in a hash table is equivalent to using a bigger hash table with a hash function that is a combination of the hash functions of both levels.

In other words, you'd probably be better off with a bigger initial table (for one thing, there is no risk of resonance between the two hash functions, which could lead to a higher collision rate than expected at the second level).
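To make the equivalence concrete, here is a minimal C++ sketch (bucket counts and hash functions invented for this example, and every inner table assumed to have the same size) showing that a two-level lookup position is just one index into a single, larger table computed by a combined hash function:

    #include <cstddef>
    #include <cstdint>
    #include <functional>

    // Bucket counts chosen arbitrarily for illustration; assume every inner
    // table has the same size so the two levels flatten into one index space.
    constexpr std::size_t OUTER = 128;
    constexpr std::size_t INNER = 16;

    // Two independent hash functions (h2 just perturbs the key before hashing).
    std::size_t h1(std::uint32_t x) { return std::hash<std::uint32_t>{}(x); }
    std::size_t h2(std::uint32_t x) { return std::hash<std::uint32_t>{}(x ^ 0x9E3779B9u); }

    // Two-level position: pick an outer bucket, then a slot in its inner table.
    std::size_t two_level_index(std::uint32_t x) {
        std::size_t outer = h1(x) % OUTER;
        std::size_t inner = h2(x) % INNER;
        return outer * INNER + inner;
    }

    // The same position, seen as a single index into one table of
    // OUTER * INNER slots, computed by a hash combined from h1 and h2.
    std::size_t combined_index(std::uint32_t x) {
        return (h1(x) % OUTER) * INNER + (h2(x) % INNER);
    }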

AProgrammer

But then the lookup time is no longer constant

Not worst-case constant -- which it never is for (basic) hashtables -- but it is still average-case constant, provided the usual assumptions on input distribution and hashing function.
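For reference, the standard bound being alluded to: under simple uniform hashing (each of the $n$ keys equally likely to land in any of the $m$ buckets, independently), the expected chain length is the load factor
$$\mathbb{E}[\text{chain length}] = \frac{n}{m} = \alpha, \qquad \mathbb{E}[\text{lookup cost}] = O(1 + \alpha),$$
which stays constant as long as $\alpha$ is kept bounded, e.g. by resizing the table when it gets too full.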

Why not use a hash set instead of a linked list?

And how do you implement that one? You have created a circular definition.

The lookup time would then be constant.

Nope, see above.

Your major confusion seems to be about what people mean when they say that hashtables have constant lookup time. Read here on how this is true and false.

Raphael

You absolutely can do this. You just have to be careful with how you set things up.

There's a type of hash table called a dynamic perfect hash table that, with some modifications, is essentially what you're describing. It works by having a two-layer hash structure, where collisions in the top level are resolved by building a hash table for the second level that is guaranteed to have no collisions.

In order to get this to work, you need access not just to a single hash function, but to a family of different hash functions. There are two reasons for this: first, you need to ensure that if you get a collision in the top-level hash table, you don't then get the same collisions in the second-level hash table. Second, the second-level hash table must not have any collisions in it (after all, we're trying to resolve collisions via a second round of hashing!), so it works by choosing new hash functions until it finds one with no collisions.

This system gives guaranteed O(1) lookups - you need to do at most two hashes - with expected O(1)-time insertions and deletions.

In practice, this isn't used much because you need to have a family of hash functions available and in most programming languages objects just have a single hash function available to them.
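The full dynamic perfect hashing scheme involves more bookkeeping than fits here, but the following C++ sketch shows the central mechanism under simplified assumptions (keys below $2^{31}$, a fixed top-level size, no deletion, names invented for the example): a family of hash functions $h_{a,b}(x) = ((ax + b) \bmod p) \bmod m$, with each second-level bucket redrawing $(a, b)$ until it is collision-free. With quadratic space per bucket, a collision-free draw succeeds after an expected constant number of retries, which is where the expected O(1) update cost comes from.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <optional>
    #include <random>
    #include <vector>

    // One member of a universal family: h_{a,b}(x) = ((a*x + b) mod p) mod m.
    struct HashFn {
        std::uint64_t a = 1, b = 0;
        static constexpr std::uint64_t p = (1ULL << 31) - 1;  // prime; keys assumed < p
        std::size_t operator()(std::uint32_t x, std::size_t m) const {
            return static_cast<std::size_t>(((a * x + b) % p) % m);
        }
    };

    HashFn random_hash(std::mt19937_64& rng) {
        std::uniform_int_distribution<std::uint64_t> da(1, HashFn::p - 1), db(0, HashFn::p - 1);
        return HashFn{da(rng), db(rng)};
    }

    // Second level: redraw the hash function until the stored keys are
    // collision-free, the core trick behind (dynamic) perfect hashing.
    struct Bucket {
        HashFn h;
        std::vector<std::optional<std::uint32_t>> slots;
        std::vector<std::uint32_t> keys;

        void rebuild(std::mt19937_64& rng) {
            // Quadratic space makes a collision-free draw likely, so the
            // expected number of retries is constant.
            std::size_t m = std::max<std::size_t>(1, keys.size() * keys.size());
            for (;;) {
                h = random_hash(rng);
                std::vector<std::optional<std::uint32_t>> s(m);
                bool ok = true;
                for (auto k : keys) {
                    auto& slot = s[h(k, m)];
                    if (slot) { ok = false; break; }  // collision: try another function
                    slot = k;
                }
                if (ok) { slots = std::move(s); return; }
            }
        }
        bool contains(std::uint32_t k) const {
            return !slots.empty() && slots[h(k, slots.size())] == k;
        }
    };

    // Top level: ordinary hashing into buckets; collisions are resolved by the
    // collision-free second level, so a lookup is at most two hash evaluations.
    struct TwoLevelSet {
        HashFn h;
        std::vector<Bucket> buckets;
        std::mt19937_64 rng{42};

        explicit TwoLevelSet(std::size_t m) : buckets(m) { h = random_hash(rng); }

        void insert(std::uint32_t k) {
            Bucket& b = buckets[h(k, buckets.size())];
            if (b.contains(k)) return;
            b.keys.push_back(k);
            b.rebuild(rng);
        }
        bool contains(std::uint32_t k) const {
            return buckets[h(k, buckets.size())].contains(k);
        }
    };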

templatetypedef

It would be pointless to use a HashSet. If a number of objects land in the same bucket, it is because they hash to the same value (the same value mod nBuckets, and quite likely the same actual value). What hash value would you use for the inner HashSet? You are in danger of forcing the inner HashSet to collide as well.
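To see the danger concretely: suppose three keys ended up in the same outer bucket because they have the same full hash value (numbers invented for illustration). Any inner table that bins on that same value puts them all in one inner slot as well:

    #include <cstddef>
    #include <iostream>
    #include <vector>

    int main() {
        // Three keys that collided in the outer table with identical hash values.
        std::vector<std::size_t> colliding_hashes = {1024, 1024, 1024};
        const std::size_t inner_buckets = 8;
        for (std::size_t h : colliding_hashes) {
            // Reusing the same hash value for the inner table: every key lands
            // in the same inner slot, so the inner HashSet degenerates too.
            std::cout << h % inner_buckets << '\n';  // prints 0 three times
        }
    }

Only a second hash function that looks at the key itself, not at the already-colliding hash value, avoids this.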

OldCurmudgeon

There are hierarchical hash maps as you describe. However, there are some caveats.

The first is that a hash map is only guaranteed constant time if you can limit the cost of a collision to constant time. If there are no collisions, then using a linked list as the next layer never comes up, because you never need to deal with collisions at all. If you want to use a hash map for collision resolution rather than a linked list, you have to consider the case where there are collisions, so you either have to find a way to control the collisions in this new map or accept worse runtime bounds.

Linked lists are fast and easy to manipulate and traverse. They require no spatial locality, so it is easy to pool the spare nodes for many linked lists and draw from that pool as needed. Compare that to resolving collisions with hash maps, which require contiguous blocks of memory to be sized and managed. It would be virtually impossible to manage these collision-resolution maps without writing a malloc-like function, which is one of the more expensive operations you can put into a high-speed structure like a hash map!
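As a rough sketch of what pooling the spare nodes can look like (all names invented for this example), the chains can store indices into one shared arena instead of pointers obtained from a general-purpose allocator:

    #include <cstddef>
    #include <vector>

    // A minimal node pool for separate chaining: nodes for every bucket's list
    // come from one shared arena, so the table never allocates per node.
    struct Node {
        int key;
        int next;            // index of the next node in the chain, -1 for end
    };

    struct NodePool {
        std::vector<Node> arena;
        int free_head = -1;  // head of the free list of recycled node indices

        int allocate(int key, int next) {
            if (free_head != -1) {            // reuse a recycled node
                int idx = free_head;
                free_head = arena[idx].next;
                arena[idx] = {key, next};
                return idx;
            }
            arena.push_back({key, next});     // otherwise grow the arena
            return static_cast<int>(arena.size()) - 1;
        }
        void release(int idx) {               // push the node onto the free list
            arena[idx].next = free_head;
            free_head = idx;
        }
    };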

Also, what are you hashing with anyway? If there is too much of a relationship between the hashing and binning procedures of the outer hash map and the inner hash map, keys whose collision forces you into the collision-resolution procedure are likely to collide again in the second layer, or the third. You may find that what looks optimal actually becomes very suboptimal in realistic situations.

All in all, it tends to be easier and faster to maintain the data using a linked list or a probing technique, monitor the depth of the chains, and re-hash if they get too long. The savings in memory-management complexity far outweigh whatever you would gain from using hash maps for collision resolution.
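A bare-bones version of that last strategy, with the threshold and sizes invented for the example:

    #include <cstddef>
    #include <functional>
    #include <list>
    #include <vector>

    // Separate chaining plus a simple health check: if any chain grows past a
    // threshold, rebuild the whole table with twice as many buckets.
    class ChainedSet {
        std::vector<std::list<int>> buckets;
        static constexpr std::size_t kMaxChain = 4;   // illustrative threshold

        std::size_t index(int key, std::size_t n) const {
            return std::hash<int>{}(key) % n;
        }

        void rehash(std::size_t new_size) {
            std::vector<std::list<int>> fresh(new_size);
            for (const auto& chain : buckets)
                for (int key : chain)
                    fresh[index(key, new_size)].push_back(key);
            buckets = std::move(fresh);
        }

    public:
        ChainedSet() : buckets(8) {}

        void insert(int key) {
            auto& chain = buckets[index(key, buckets.size())];
            for (int k : chain) if (k == key) return;  // already present
            chain.push_back(key);
            if (chain.size() > kMaxChain)              // chain too deep: grow and re-hash
                rehash(buckets.size() * 2);
        }

        bool contains(int key) const {
            for (int k : buckets[index(key, buckets.size())]) if (k == key) return true;
            return false;
        }
    };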

Cort Ammon