
Is the Simple Uniform Hashing Assumption (SUHA) sufficient to show that the worst-case time complexity of hash table lookups is O(1)?

It says in the Wikipedia article that this assumption implies that the average length of a chain is $\alpha = n / m$, but...

  • ...this is true even without this assumption, right? If the distribution of chain lengths is [4, 0, 0, 0], the average length is still 1 (see the snippet after this list).
  • ...this is a probabilistic statement, which is of little use when discussing worst-case complexity, right?
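
As a quick sanity check, here is a minimal Python snippet (the bucket occupancy counts are made up for illustration) showing that the average chain length is $n/m$ no matter how the keys are spread over the buckets:

```python
# Hypothetical bucket occupancy counts for a table with m = 4 buckets.
for chains in ([4, 0, 0, 0], [1, 1, 1, 1], [2, 2, 0, 0]):
    n, m = sum(chains), len(chains)
    print(chains, "-> average chain length:", n / m)  # 1.0 in every case
```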

It seems to me like a different assumption would be needed. Something like:

The difference between the largest and smallest bucket is bounded by a constant factor.

Maybe this is implied by SUHA? If so, I don't see how.

aioobe

3 Answers


I think the answer is no, SUHA does not imply anything regarding worst-case time complexity.

The bottom line is that

  • the hashing is (even though uniform) still viewed as random, and
  • the keys are unknown.

Regardless of how small the probability is that all keys hash to the same bucket, it's still a theoretical possibility, so the theoretical worst case is still O(n).
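
To make that worst case concrete, here is a small sketch in Python (a hypothetical chained table using `hash(key) % m` as the bucket index, with deliberately adversarial keys) where every key lands in the same bucket, so a lookup degenerates into a linear scan:

```python
m = 8
table = [[] for _ in range(m)]          # chaining: one list per bucket

def insert(key):
    table[hash(key) % m].append(key)

def lookup(key):
    bucket = table[hash(key) % m]
    return key in bucket                # cost is the length of this one chain

# Adversarial keys: every multiple of m hashes to bucket 0 under hash(key) % m.
for k in range(0, 100 * m, m):
    insert(k)

print([len(b) for b in table])          # [100, 0, 0, 0, 0, 0, 0, 0]
print(lookup(792))                      # True, but only after scanning a 100-element chain
```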

When discussing hash tables and complexity, I think this is usually mentioned briefly, dismissed as irrelevant in practice, and then the discussion moves on to expected run-time complexity, where we're back in O(1) land.

With the assumption mentioned in the question...

The difference between the largest and smallest bucket is bounded by a constant factor.

...(which is strictly stronger than SUHA) it can indeed be shown that the worst-case run-time complexity is O(1). This assumption implies that even the largest bucket is within a constant factor of the load factor limit. This in turn means that every chain has length bounded by a constant, and we get O(1) as the worst case.

This assumption, however, is somewhat unreasonable. While it's quite probable that it holds true for any decent hash function and non-pathological set of keys, how would one guarantee such a property? One would have to push this assumption further down and place constraints on which keys are allowed to be stored in the hash table, and so on. This is what makes it unreasonable in a sense.
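
As an illustration (not a proof), here is a rough simulation, using Python's built-in `hash` on random 64-bit integers as a stand-in for a "decent" hash function with non-pathological keys. The largest bucket usually stays small, but nothing in the experiment bounds it by a constant as $n$ grows, and the smallest bucket is typically empty:

```python
import random

n = m = 1 << 12                          # load factor alpha = n / m = 1
buckets = [0] * m
for _ in range(n):
    key = random.getrandbits(64)         # "non-pathological" random keys
    buckets[hash(key) % m] += 1

alpha = n / m
print("largest bucket / alpha:", max(buckets) / alpha)   # usually a single-digit number here
print("empty buckets:", buckets.count(0))                # roughly n / e of them, so the smallest bucket is 0
```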

aioobe

Not only is SUHA insufficient to show that the largest bucket has size $O(\alpha)$, but if $m = n$, then for any particular set of keys there is a high probability that the longest chain has length $\Theta(\lg n / \lg \lg n)$. There is a proof of this in CLRS, and many proofs are available in lecture notes.
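
For what it's worth, here is a small empirical sketch (simulating a uniform hash directly by drawing bucket indices at random) showing the longest chain for $m = n$ staying in the same ballpark as $\lg n / \lg \lg n$; it only illustrates the claim, it does not prove it:

```python
import math, random

for n in (1 << 10, 1 << 14, 1 << 18):
    m = n
    buckets = [0] * m
    for _ in range(n):
        buckets[random.randrange(m)] += 1      # each key gets a uniformly random bucket
    estimate = math.log(n) / math.log(math.log(n))   # rough order-of-magnitude estimate
    print(f"n = m = {n}: longest chain = {max(buckets)}, lg n / lg lg n ~ {estimate:.1f}")
```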

jbapple

Let $h:[U]\to[m]$ be a uniform hash function, let $x_1,\dots,x_n\in[U]$ be the elements currently stored in the hash table, and suppose we want to look up the query $q\in[U]$. Finally, let $X_i$ be the indicator variable for the event that $x_i$ collides with $q$: $$X_i = \begin{cases}1 && h(x_i)=h(q)\\0 &&h(x_i)\ne h(q)\end{cases} \Rightarrow P(X_i=1) = \frac{1}{m}$$ Then the time $T$ to look up $q$ is equal to the number of elements that collide with it: $$T=\sum_{i=1}^n X_i$$ By linearity of expectation we have: $$\mathbb{E} [T] = \sum_i\mathbb{E} X_i = \frac{n}{m}=\alpha$$ And finally, by Markov's inequality: $$ P(T\ge t) \le \frac{\alpha}{t}$$

Now if we set $t:={\alpha}/{\delta}$, we get: $$P(\text{look-up time}\ge \frac{\alpha}{\delta}) \le \delta$$ Think of $\delta$ as the failure rate. So for $\delta=0.1$, the query time is bounded by $10\alpha$ with probability $0.9$. If you strengthen the assumption about $h$ to a $k$-wise independent hash, the failure rate roughly drops to $P(\text{fail})\approx\delta^k$, and so you can get a much better complexity/failure-rate trade-off.
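
Here is a small simulation of this bound (the table size, trial count, and $\delta$ are arbitrary choices, and the uniform hash is simulated by drawing bucket indices at random), checking that the mean lookup cost is close to $\alpha$ and that the $\alpha/\delta$ tail bound holds, in practice with plenty of room to spare since Markov's inequality is loose:

```python
import random

n, m = 1000, 500                 # alpha = n / m = 2
delta = 0.1
t = (n / m) / delta              # t = alpha / delta = 20
trials = 2000

total_cost = 0
exceed = 0
for _ in range(trials):
    hashes = [random.randrange(m) for _ in range(n)]   # hash values of the stored elements
    hq = random.randrange(m)                           # hash value of the query
    T = sum(1 for h in hashes if h == hq)              # lookup cost = number of collisions with q
    total_cost += T
    exceed += (T >= t)

print("mean lookup cost:", total_cost / trials)        # close to alpha = 2
print("P(T >= alpha/delta):", exceed / trials)         # <= delta = 0.1 by Markov; essentially 0 in practice
```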

Now, a more conservative analysis is: what is the length of the longest chain in your table? In this case, we can define indicator variables $X_{i,j}$ for $1\le i<j\le n$ that indicate the event that $x_i$ and $x_j$ collide. Clearly, $P(X_{i,j}=1)=1/m$. Then the expected total number of collisions $S$ is: $$\mathbb{E} S = \sum_{i<j} \mathbb{E} X_{i,j} = \binom{n}{2}\frac{1}{m}$$ If the maximum chain length is $\ell$, there are $\binom{\ell}{2}$ pairs of elements in that chain that collide with each other, implying $S\ge \binom{\ell}{2}$. We have: $$P(\text{max chain length}\ge \ell) \le P\left(S\ge \binom{\ell}{2}\right) \le\frac{ \mathbb{E} S} { \binom{\ell}{2}}\approx \frac{n^2}{m\ell^2} $$ Setting this bound equal to $\delta$ and solving for $\ell$ gives the probabilistic bound: $$P(\text{max chain length}\ge \ell) \le \delta \qquad \text{ for } \ell = \frac{n}{\sqrt{\delta m}}$$

So if $m\approx n$, your maximum chain length grows like $\ell\approx \sqrt{n/\delta}$ with probability $1-\delta$. This is because the simple uniform hashing assumption is really weak. The more independence you assume, the better the guarantee. For a fully random hash function (i.e., completely i.i.d. hash values), you get a maximum chain length of $O(\log n)$ for $m\approx n$ (more precisely $\Theta(\lg n/\lg\lg n)$, as in the other answer).
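
And a similar sketch for the longest chain (again with arbitrary $n$, $m$, $\delta$, and a directly simulated uniform hash), comparing the observed maximum against the $n/\sqrt{\delta m}$ bound, which is valid but very pessimistic:

```python
import math, random

delta = 0.1
n = m = 10_000
buckets = [0] * m
for _ in range(n):
    buckets[random.randrange(m)] += 1       # simulate a simple uniform hash

bound = n / math.sqrt(delta * m)            # chain length exceeded with probability at most delta
print("observed longest chain:", max(buckets))      # usually a single-digit number
print("pairwise-collision bound:", round(bound))    # ~316 here: valid, but very pessimistic
```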

Ameer Jewdaki