
Consider the following algorithmic problem: given a list of strings $L = [s_1, s_2, \dots, s_n]$, we want to find all pairs $(x,y)$ where $x$ is a substring of $y$. We can assume that all strings have length at most $m$, where $m \ll n$, and that they are all over a finite alphabet $\Sigma$ with $|\Sigma| \ll n$. We may also assume that the number of pairs $(x,y)$ where $x$ is a substring of $y$ is much smaller than $n$.

A trivial algorithm would be this:

    foreach x in L:
      foreach y in L:
        if x is substring of y:
          OUTPUT x,y

However, this has complexity $O(n^2 \cdot m)$, assuming a linear-time substring test. Is there a faster algorithm?

Edit: As pointed out in the comments, there can be as many as $n^2$ such pairs, so in the worst case no algorithm can be faster than $O(n^2)$. However, I was wondering whether there is something like a $P-FPT$ algorithm where the quadratic term depends on the number of output pairs rather than on $n$, or at least an algorithm that improves on $O(n^2 \cdot m)$.

1 Answer


This can be solved with the Aho-Corasick algorithm in $O(nm + Mm)$ time, where $M$ is the number of pairs output.

First build the Aho-Corasick automaton for the set of strings in $O(nm)$ time. Then run each string through the automaton. This takes $O(nm)$ time for the traversal itself and $O(Mm)$ time for outputting the matches, because the same pair can be reported up to $m$ times in the worst case: $x$ may occur at many positions of $y$. (For example, "ab" matches "ababab" 3 times.)
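To make the extra factor of $m$ concrete, here is a small Python sketch that counts the overlapping occurrences of one string inside another. The helper name `count_occurrences` is made up for illustration, and it uses plain `str.find` rather than the automaton; each occurrence it counts corresponds to one report of the same pair by a scan that outputs every match.

```python
def count_occurrences(x, y):
    """Count the (possibly overlapping) occurrences of non-empty x in y."""
    count = 0
    start = y.find(x)
    while start != -1:
        count += 1
        # Resume one position later so overlapping occurrences are counted.
        start = y.find(x, start + 1)
    return count


print(count_occurrences("ab", "ababab"))  # 3
print(count_occurrences("a", "aaaaaa"))   # 6, one per position of y
```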


This can be improved to truly linear $O(L + M)$ time, where $L$ now denotes the total length of the strings (at most $nm$) and $M$ is the number of matches:

While running the strings through the automaton, store for each node the index of the most recent string for which that node was visited. When outputting matches, stop following dictionary links (the links from a node to the nearest proper suffix that is itself one of the strings) as soon as a link leads to a node that has already been visited for the current string: all matches further up the chain from that node have already been output. Each match is then output exactly once, and dictionary links are traversed only when new matches are output.
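A minimal Python sketch of this idea follows; it is one possible reading, not a reference implementation. The names `build_automaton`, `substring_pairs`, `dict_link`, `last_seen` and the `skip_seen` flag are illustrative choices. The sketch assumes non-empty strings, reports index pairs `(i, j)` with `i != j` meaning `strings[i]` is a substring of `strings[j]`, and relies on Python dicts for transitions, so the linear bound holds only in the expected sense.

```python
from collections import deque


def build_automaton(strings):
    """Return (goto, fail, dict_link, ends) for the trie of `strings`."""
    goto, fail, dict_link, ends = [{}], [0], [-1], [[]]
    for i, s in enumerate(strings):
        v = 0
        for ch in s:
            if ch not in goto[v]:
                goto[v][ch] = len(goto)
                goto.append({})
                fail.append(0)
                dict_link.append(-1)
                ends.append([])
            v = goto[v][ch]
        ends[v].append(i)  # equal strings end at the same node

    # Breadth-first pass; depth-1 nodes keep fail = root (node 0).
    queue = deque(goto[0].values())
    while queue:
        v = queue.popleft()
        for ch, u in goto[v].items():
            f = fail[v]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[u] = goto[f].get(ch, 0)
            # Nearest proper suffix of u at which some input string ends.
            dict_link[u] = fail[u] if ends[fail[u]] else dict_link[fail[u]]
            queue.append(u)
    return goto, fail, dict_link, ends


def substring_pairs(strings, skip_seen=True):
    """List (i, j), i != j, such that strings[i] occurs in strings[j].

    With skip_seen=False every occurrence is reported, so a pair can repeat.
    """
    goto, fail, dict_link, ends = build_automaton(strings)
    last_seen = [-1] * len(goto)  # last string index that reported this node
    pairs = []
    for j, y in enumerate(strings):
        v = 0
        for ch in y:
            while v and ch not in goto[v]:
                v = fail[v]
            v = goto[v].get(ch, 0)
            # Start at v if a string ends here, else at its dictionary link.
            u = v if ends[v] else dict_link[v]
            # Stop at the first node already reported for string j.
            while u != -1 and not (skip_seen and last_seen[u] == j):
                last_seen[u] = j
                pairs.extend((i, j) for i in ends[u] if i != j)
                u = dict_link[u]
    return pairs


print(substring_pairs(["ab", "ababab"]))                   # [(0, 1)]
print(substring_pairs(["ab", "ababab"], skip_seen=False))  # [(0, 1), (0, 1), (0, 1)]
```

Stamping nodes with the current string index avoids resetting a visited set between strings, which would otherwise cost time proportional to the automaton size per string.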

Laakeri