I'm looking for an algorithm that will help me determine substring matches at scale.

I have a pool of 100+ million "needles" (strings). I can do as much pre-processing on them as I want, and storage is cheap.

On the detection side, I have a very large pool (hundreds of TB) of strings to search for needles in, and I also want to be able to run detection on streaming text as it comes in. So it's important that the detection be very fast (algorithmically). I can also pre-process this text as part of the detection, obviously.

I can store a copy of all needles, so a probabilistic algorithm would be fine (say, instead of the correct N strings, the algorithm returns some false positives -- I can always do a plain string search afterwards).

There is significant structure to the strings (they happen to be source code -- snippets on the needle side and files in the haystack).

Appreciate any thoughts on where to explore.

Scovetta

1 Answer

Rabin-Karp

Rabin-Karp string search would be a good candidate, because it can use a rolling hash function.

You pick a segment length $\ell$. For each "needle", you hash its first $\ell$ characters, and store the needle in a hash table keyed on that hash.

Then, as each character of text arrives, you compute the rolling hash of the last $\ell$ characters, look it up in the hash table, and possibly find one or more candidate needles that might match; then you test them directly.

This requires that the segment length be no longer than the shortest needle, but long enough to make false candidates rare (i.e., so that the first $\ell$ characters of a needle mostly determine the needle uniquely). If you have a wide range of needle lengths, I suppose you could pick two segment lengths $\ell_1,\ell_2$, have two hash tables, and store each needle in the appropriate hash table according to its length.
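To make the scheme concrete, here is a minimal Python sketch of the single-table variant. The names (`build_index`, `rk_search`), the polynomial hash, and the constants `BASE` and `MOD` are my own illustrative choices, not anything prescribed by Rabin-Karp; it also works on an in-memory string rather than a true stream, though the rolling update only needs the last $\ell$ characters.

```python
from collections import defaultdict

BASE = 257            # assumed hash base (illustrative choice)
MOD = (1 << 61) - 1   # assumed large prime modulus (illustrative choice)


def prefix_hash(s, ell):
    """Polynomial hash of the first ell characters of s."""
    h = 0
    for ch in s[:ell]:
        h = (h * BASE + ord(ch)) % MOD
    return h


def build_index(needles, ell):
    """Map hash(first ell chars) -> list of needles with that prefix hash."""
    index = defaultdict(list)
    for needle in needles:
        if len(needle) < ell:
            raise ValueError("segment length exceeds a needle's length")
        index[prefix_hash(needle, ell)].append(needle)
    return index


def rk_search(text, index, ell):
    """Yield (start, needle) for every verified occurrence in text."""
    if len(text) < ell:
        return
    top = pow(BASE, ell - 1, MOD)  # weight of the character leaving the window
    h = prefix_hash(text, ell)
    for start in range(len(text) - ell + 1):
        if start > 0:
            # Roll the window: drop text[start-1], append text[start+ell-1].
            h = ((h - ord(text[start - 1]) * top) * BASE
                 + ord(text[start + ell - 1])) % MOD
        # Candidates share the prefix hash; confirm each with a direct compare.
        for needle in index.get(h, ()):
            if text.startswith(needle, start):
                yield start, needle


index = build_index(["foo(", "bar(x)", "foobar"], ell=4)
matches = sorted(rk_search("x = foo(bar(x)); foobar", index, ell=4))
```

The direct comparison at the end is what turns hash hits (possible false positives) into confirmed matches, so collisions cost time but never correctness.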

Aho-Corasick

The Aho-Corasick algorithm would also be a good candidate. It solves exactly this problem, and doesn't require any assumptions on the lengths of the needles.
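As a rough illustration, here is a small dictionary-based Python sketch of the automaton: a trie of the needles plus failure links computed breadth-first. The function names (`build_automaton`, `ac_search`) are hypothetical, and a list-of-dicts layout like this is only meant to show the idea; with 100+ million needles you would want a compact array-based representation or an existing library.

```python
from collections import deque


def build_automaton(needles):
    """Build the trie, failure links, and output lists for the needles."""
    goto = [{}]   # goto[state][char] -> next state (trie edges)
    fail = [0]    # fail[state] -> state for the longest proper suffix in the trie
    out = [[]]    # out[state] -> needles that end at this state
    for needle in needles:
        state = 0
        for ch in needle:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            state = nxt
        out[state].append(needle)

    # Breadth-first pass: depth-1 states keep fail = 0; deeper states
    # derive their failure link from their parent's failure link.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit needles that end at the suffix state.
            out[nxt] = out[nxt] + out[fail[nxt]]
            queue.append(nxt)
    return goto, fail, out


def ac_search(text, automaton):
    """Yield (start, needle) for every occurrence, reading text once."""
    goto, fail, out = automaton
    state = 0
    for end, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for needle in out[state]:
            yield end - len(needle) + 1, needle


automaton = build_automaton(["he", "she", "his", "hers"])
matches = sorted(ac_search("ushers", automaton))
```

Because the search keeps only the current state between characters, the same loop can be fed text as it streams in, and every reported match is exact, so no verification pass is needed.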

You might also look at Commentz-Walter.

D.W.