7

This question arose from a practical problem: given a set of texts, find one that contains a given string (not necessarily a whole word).

Let $S$ be a set of $n$ strings, and let $l$ be the length of the longest string in $S$. What is the best data structure for efficiently finding a string in $S$ that contains a given substring of length $k$?

A sequential scan with the Knuth–Morris–Pratt algorithm gives $O(n(l+k))$ search time. I thought maybe something similar to tries exists.

D.W.
Somnium

3 Answers

6

You could concatenate all $n$ strings, separating them with an arbitrary character such as '\$' that does not occur in the pattern. Then you could apply the Z algorithm to your original pattern and this new string, which takes $O(nl+k)$ time; that is more efficient than your suggested $O(nl+nk)$.
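
To make the idea concrete, here is a minimal sketch in Python. The function names (`z_array`, `find_containing`) and the choice of `"\x00"` as the separator are my own assumptions, not anything fixed by the approach; the only requirement is that the separator does not occur in the pattern, so a match can never straddle two strings.

```python
def z_array(s):
    """z[i] = length of the longest common prefix of s and s[i:]."""
    n = len(s)
    z = [0] * n
    if n == 0:
        return z
    z[0] = n
    left = right = 0  # [left, right) is the rightmost matched window so far
    for i in range(1, n):
        if i < right:
            # Reuse what is already known from inside the window.
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[i + z[i]] == s[z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def find_containing(texts, pattern, sep="\x00"):
    """Return the indices of all texts that contain pattern (hypothetical helper)."""
    if sep in pattern:
        raise ValueError("separator must not occur in the pattern")
    k = len(pattern)
    if k == 0:
        return list(range(len(texts)))
    combined = pattern + sep + sep.join(texts)
    z = z_array(combined)
    hits = []
    start = k + len(sep)  # offset of the first text inside combined
    for idx, text in enumerate(texts):
        end = start + len(text)
        # A z-value of at least k means the pattern occurs at that offset.
        if any(z[i] >= k for i in range(start, end - k + 1)):
            hits.append(idx)
        start = end + len(sep)  # skip the separator
    return hits
```

The comparison is `z[i] >= k` rather than `== k` because the character after the pattern in `combined` is the separator, which can also appear in the joined texts.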

Actually, you could still iterate with KMP in $O(nl+k)$ time, because the $k$ term is the cost of preprocessing the pattern you are trying to find. This only has to be done once, regardless of the number of strings you are checking.
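
A rough sketch of that, again in Python and with names I made up for illustration (`failure_table`, `kmp_contains`, `first_containing`): the failure table is built once in $O(k)$ and then reused for every string in the scan.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper border of pattern[:i + 1]."""
    fail = [0] * len(pattern)
    j = 0
    for i in range(1, len(pattern)):
        while j > 0 and pattern[i] != pattern[j]:
            j = fail[j - 1]
        if pattern[i] == pattern[j]:
            j += 1
        fail[i] = j
    return fail


def kmp_contains(text, pattern, fail):
    """Return True if pattern occurs in text, using a precomputed table."""
    if not pattern:
        return True
    j = 0  # number of pattern characters currently matched
    for ch in text:
        while j > 0 and ch != pattern[j]:
            j = fail[j - 1]
        if ch == pattern[j]:
            j += 1
            if j == len(pattern):
                return True
    return False


def first_containing(texts, pattern):
    """Index of the first text containing pattern, or None."""
    fail = failure_table(pattern)  # O(k), done once for all texts
    for idx, text in enumerate(texts):
        if kmp_contains(text, pattern, fail):  # O(len(text)) per text
            return idx
    return None
```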

Riley
4

You can use a suffix tree, which is a compressed trie that stores all suffixes of the strings in $S$. To search for a substring $q$, you simply traverse the suffix tree downward from the root¹, comparing the characters of $q$ against the edge labels (each of which represents one or more characters). If at some point no matching branch exists, then $q$ is not a substring of any string in $S$; otherwise, it is.

If you want to find out the exact string in $S$ that contains $q$, you can annotate the nodes of the suffix tree so that they contain this information (e.g., you can add a string number and a starting position). This is the idea generalized suffix trees are built upon.

The running time of this approach is $O(n \: l)$ for the construction of the suffix tree, plus $O(k)$ for the substring search, for a total of $O(n \:l + k)$.
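
For illustration only, here is a small Python sketch of the lookup and of the annotation idea. Be aware that it is a deliberately simplified stand-in: it builds an uncompressed generalized suffix *trie* by inserting every suffix character by character, so construction takes time and space quadratic in the string lengths, not the $O(n \: l)$ of a real suffix tree built with a linear-time algorithm such as Ukkonen's. The class and function names are hypothetical. Only the query, which walks $k$ characters from the root, behaves like the real structure.

```python
class SuffixTrieNode:
    """One node of an uncompressed suffix trie (hypothetical helper class)."""
    __slots__ = ("children", "owners")

    def __init__(self):
        self.children = {}   # character -> child node
        self.owners = set()  # ids of the strings with a suffix passing through here


def build_generalized_suffix_trie(strings):
    """Insert every suffix of every string; NOT linear time, for illustration only."""
    root = SuffixTrieNode()
    for sid, s in enumerate(strings):
        root.owners.add(sid)
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                child = node.children.get(ch)
                if child is None:
                    child = SuffixTrieNode()
                    node.children[ch] = child
                child.owners.add(sid)  # annotate the node with the string number
                node = child
    return root


def strings_containing(root, q):
    """Walk q from the root; return the ids of the strings that contain q."""
    node = root
    for ch in q:
        node = node.children.get(ch)
        if node is None:
            return set()  # no matching branch: q is not a substring
    return set(node.owners)
```

With `root = build_generalized_suffix_trie(["banana", "bandana", "cab"])`, the call `strings_containing(root, "nd")` walks two edges and reports only the second string.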


¹ Note that every substring of a string $T$ is a prefix of at least one suffix of $T$ (one for each occurrence). Therefore, a substring search can always be performed starting at the root of a suffix tree.

Mario Cervera
3

The answer depends on the number of patterns, the total size of the strings $nl$, and how many times you expect the pattern to occur in each string.

  • If you have a few patterns, a standard string matching algorithm such as KMP is sufficient.
  • If you have many patterns at the same time, you should use an algorithm that preprocesses the patterns (e.g. Aho-Corasick).
  • If you have even more patterns or the patterns come in multiple batches, you should preprocess the strings instead. The suffix tree is nice in theory, but it's slow and memory-hungry in practice. The suffix array works better, and the FM-index is a nice substitute if you don't have enough memory for the suffix array.
  • If you have a very large number of patterns in a single batch, you should preprocess both the strings and the patterns and match the resulting structures against each other. The basic idea is straightforward, but you'll have to do some work to get the details right.
  • If you preprocess the strings and expect the patterns to have many occurrences in each string they match, you should augment the index (e.g. suffix array or FM-index) with a document retrieval structure.
  • If you preprocess the strings and even the FM-index won't fit into memory, the string B-tree should work reasonably well.
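
As a rough illustration of the "preprocess the strings" option, here is a toy suffix array in Python. Everything about it is an assumption made for the sketch: the function names, the `"\x00"` separator (assumed not to occur in any pattern), and above all the construction, which just sorts the suffixes directly and is far too slow and memory-hungry for real data. A practical index would use a proper suffix array construction library. The query is a binary search over the sorted suffixes followed by a scan of the matching range.

```python
def build_index(texts, sep="\x00"):
    """Naive suffix array over all texts; each text is followed by sep."""
    combined = "".join(text + sep for text in texts)
    doc_of = []  # doc_of[i] = index of the text that position i belongs to
    for idx, text in enumerate(texts):
        doc_of.extend([idx] * (len(text) + len(sep)))
    # Naive construction: sort suffix start positions by the suffix itself.
    suffix_array = sorted(range(len(combined)), key=lambda i: combined[i:])
    return combined, suffix_array, doc_of


def documents_containing(index, pattern):
    """Return the set of text indices containing pattern (pattern must not contain sep)."""
    combined, suffix_array, doc_of = index
    k = len(pattern)
    lo, hi = 0, len(suffix_array)
    # Find the first suffix whose length-k prefix is >= pattern.
    while lo < hi:
        mid = (lo + hi) // 2
        pos = suffix_array[mid]
        if combined[pos:pos + k] < pattern:
            lo = mid + 1
        else:
            hi = mid
    docs = set()
    # All occurrences are adjacent in the suffix array; visit each one.
    while lo < len(suffix_array):
        pos = suffix_array[lo]
        if combined[pos:pos + k] != pattern:
            break
        docs.add(doc_of[pos])
        lo += 1
    return docs
```

Note that the final loop touches every occurrence of the pattern, even when many of them fall in the same text. That per-occurrence cost is what a document retrieval structure on top of the index is meant to avoid.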
Jouni Sirén