6

I asked a similar question here on the Rabin-Karp algorithm. My present question is: how do we find the best $q$ (i.e., the modulus)? What is the criterion? We need to choose a $q$ which is quick to compute with and which also results in fewer spurious hits, right?

How do we ensure these things?

gpuguy

2 Answers

5

The Rabin-Karp algorithm looks for a substring by computing a rolling hash of the form $h(a_{n - 1} a_{n - 2} \ldots a_0) = \sum_{0 \le k \le n - 1} a_k q^k$ for a prime $q$. Note that the algorithm works just the same if $h$ is computed modulo $2^w$, where $w$ is the word size in bits, so using e.g. unsigned int in C would be wise.

What we'd like is for $q^{n - 1}$ not to be too large, so that the needed $q^{n - 1} a_{n - 1}$ doesn't overflow and computations stay in normal unsigned integers. Clearly $n$ (the length of the pattern) is the limiting factor here, so you want a smallish prime. One option is to use the largest prime $q$ such that, with $A$ the largest character value (for UTF-8 it's essentially $2^8$),

$$ 2^w > q^{n - 1} A $$

This keeps the leading term $q^{n - 1} a_{n - 1}$ from overflowing, but is very restrictive.

Perhaps just taking $q = 3$ or $q = 5$ is enough (the hash value is still a full word, so collisions should be rare anyway). As they have just two bits set (they are Fermat primes; the next ones are 17 and 257), perhaps the compiler (or even the programmer) replaces the multiplication by the constant prime with a shift and an add, but whether that is a net gain depends on the architecture.
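To get a feel for how restrictive the bound $2^w > q^{n - 1} A$ is, here is a worked instance. The values $w = 32$ and $A = 2^8$ are assumptions chosen for illustration (32-bit words, byte-sized characters); the bound then reads $q^{n - 1} < 2^{24}$:

$$
\begin{aligned}
2^{24} &= 16\,777\,216 \\
q = 3:&\quad 3^{15} = 14\,348\,907 < 2^{24} < 43\,046\,721 = 3^{16} \;\Rightarrow\; n \le 16 \\
q = 5:&\quad 5^{10} = 9\,765\,625 < 2^{24} < 48\,828\,125 = 5^{11} \;\Rightarrow\; n \le 11
\end{aligned}
$$

So under these assumptions even the smallest odd primes only allow patterns of about a dozen characters before the leading term can wrap around.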

Note that the cited article says this makes sense only for multiple pattern search, as there are faster alternatives for single patterns.

To really answer this, you'd have to run tests with representative haystacks and patterns.
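For such experiments, a minimal sketch of the rolling hash described above might look as follows. It assumes 32-bit unsigned arithmetic (so everything is implicitly reduced modulo $2^{32}$) and takes $q$ as the base of the polynomial, as in this answer; the names `poly_hash` and `rk_search` are made up for illustration. Every hash match is confirmed with `memcmp`, so spurious hits cost time but never produce wrong results.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: hash of the first n bytes of s in base q.
   uint32_t arithmetic wraps modulo 2^32, so no explicit % is needed. */
static uint32_t poly_hash(const unsigned char *s, size_t n, uint32_t q)
{
    uint32_t h = 0;
    for (size_t i = 0; i < n; i++)
        h = h * q + s[i];                 /* Horner's rule */
    return h;
}

/* Returns the index of the first occurrence of pat in text, or -1. */
static long rk_search(const char *text, const char *pat, uint32_t q)
{
    assert(text != NULL && pat != NULL);

    size_t n = strlen(pat), m = strlen(text);
    if (n == 0) return 0;
    if (n > m) return -1;

    const unsigned char *t = (const unsigned char *)text;
    const unsigned char *p = (const unsigned char *)pat;

    uint32_t top = 1;                     /* q^(n-1) modulo 2^32 */
    for (size_t i = 1; i < n; i++)
        top *= q;

    uint32_t hp = poly_hash(p, n, q);     /* pattern hash */
    uint32_t ht = poly_hash(t, n, q);     /* hash of the current window */

    for (size_t i = 0; ; i++) {
        /* Verify on a hash match: equal hashes may be a spurious hit. */
        if (hp == ht && memcmp(t + i, p, n) == 0)
            return (long)i;
        if (i + n >= m)
            return -1;                    /* no further window fits */
        /* Drop the leading character, shift by q, append the next one. */
        ht = (ht - (uint32_t)t[i] * top) * q + t[i + n];
    }
}
```

Calling `rk_search(haystack, needle, 3)` and `rk_search(haystack, needle, 5)` on representative data, and counting how often the hash comparison succeeds while `memcmp` fails, would be one way to compare candidate values of $q$.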

vonbrand
3

Although @vonbrand has given a satisfactory explanation, in simple words: if

  • $d$ is the number of characters in the alphabet
  • $q$ is the modulus

then $dq$ should be representable in the available word size.
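As an illustration of that rule, here is a minimal sketch in C. It assumes a 32-bit word, a byte alphabet ($d = 256$), and the prime $q = 16777213 = 2^{24} - 3$ as the modulus; these values and the function names are choices made for this example only. Because every intermediate value stays below $dq$, plain 32-bit arithmetic never overflows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define D 256u          /* alphabet size d: one byte per character */
#define Q 16777213u     /* assumed modulus q: a prime just below 2^24 */

/* The rule from the text: d*q must be representable in the word. */
static_assert((uint64_t)D * Q <= UINT32_MAX, "d*q must fit in 32 bits");

/* Hash of the first n bytes of s, reduced modulo Q after every step. */
static uint32_t hash_mod(const unsigned char *s, size_t n)
{
    uint32_t h = 0;
    for (size_t i = 0; i < n; i++)
        h = (h * D + s[i]) % Q;   /* h < Q, so h*D + s[i] < D*Q */
    return h;
}

/* D^(n-1) mod Q: the weight of the leading character of an n-byte window. */
static uint32_t top_power(size_t n)
{
    uint32_t top = 1;
    for (size_t i = 1; i < n; i++)
        top = (top * D) % Q;      /* top < Q, so top*D < D*Q */
    return top;
}

/* Slide the window by one byte: remove `out`, append `in`. */
static uint32_t roll(uint32_t h, unsigned char out, unsigned char in,
                     uint32_t top)
{
    uint32_t drop = ((uint32_t)out * top) % Q;  /* out < D, top < Q */
    h = (h + Q - drop) % Q;       /* add Q first to avoid going negative */
    return (h * D + in) % Q;
}
```

A larger $q$ gives fewer spurious hits, so under this rule one would typically pick a prime close to the largest value for which $dq$ still fits in the word.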

David Richerby
Saurabh Jain