2

Consider the following algorithmic problem: Given a list of strings $L = [s_1, s_2, \dots, s_n]$, we want to know all pairs $(x,y)$ where $x$ is a subsequence of $y$. We can assume all strings are of length at maximum $m$, where $m << n$ and are all over a finite alphabet $\Sigma$ with $|\Sigma| << n$. We may also assume that the number of pairs $(x,y)$ where $x$ is a subsequence of $y$ is much smaller than $n$.

A trivial algorithm would be this:

1. foreach x in L:
2.   foreach y in L:
3.      if x is subsequence of y:
4.         OUTPUT x,y

However, this has complexity $O(n^2 \cdot m)$ - I am curious to know whether there is a faster algorithm (Faster given that the number of pairs $(x,y)$ is much smaller than $n$, so for example an algorithm with complexity depending on the number of output pairs).

Note that this question is a follow up to this question, which is about the same problem but for substrings (not subsequences). There, the Aho-Corasick algorithm solved my problem perfectly - is there maybe somethign like this but for subsequences?

1 Answers1

4

No, it is not possible to do better unless the Strong Exponential Time Hypothesis (SETH) fails. If we could solve this problem substantially faster than $O(n^2)$ we would immediately obtain a much faster algorithm for solving the NP-complete problem Satisfiability. This is true even for $m$ slightly more than $\log(n)$ and the case in which we want to decide whether such a pair $(x,y)$ exists at all.

See, e.g., these lecture notes under section 3 "Tight Lower Bounds for Orthogonal Vectors". The proof is analogous to the proof of Theorem 2 in these lecture notes.

First, we consider the more general problem of given two sets of strings $X,Y$, finding whether some string in $X$ is a subsequence of a string in $Y$.

Given a SAT formula, we split its $n$ variables into two equal sets of $n/2$ variables. In $\Sigma$ we take a character corresponding to every clause. In $X$ we add a string for every possible assignment to the first half of the variables, with a character corresponding to every clause not satisfied by those variables. Meanwhile, in $Y$, we add a string for every assignment to the second half of the variables, with a character for every clause that is satisfied by those variables. Clearly, the formula is satisfiable if and only if some string in $X$ is a subsequence of some string in $Y$.

If this problem can be solved substantially faster than $O(n^2)$, then this gives a substantially faster algorithm for Satisfiability than $2^n$. Suppose the problem could be solved in $O(n^{1.99})$ time, then Satisfiability could be solved in $(2^{n/2})^{1.99}=O(2^{0.996n})$ which contradicts SETH.

In your problem, there is only a single set of strings, all of which may be a subsequence of each other. This is however not a problem, as we can simply modify the strings in our instance such that no string $Y$ is a subsequence of any other string (for instance by padding all strings in $Y$ to have the same length), and similarly padding every string in $X$ to the same length as other strings in $X$ (but substantially shorter than strings in $Y$).

This can probably also be done with a constant-size (likely even binary) alphabet but this requires more clever encoding.

Tom van der Zanden
  • 13,493
  • 1
  • 39
  • 56