Count the unique subsequences

Question

I'm trying to count the number of distinct subsequences of a long string.

e.g. BANANA -> B, A, N, BA, BN, AA, NN, NA, BAN, BAA, BNA, BNN, ANA, AAA, NAN, NNA, etc

There's the horribly inefficient way, which is to iterate over all of the subsequences and keep a hash set of some sort to track the ones already seen. The size of the hash set explodes for strings of any non-trivial length, so I need to figure out something more efficient. I tried dynamic programming, but could not find a good recursive formula.

This problem is originally from http://ahslaaks.users.cs.helsinki.fi/oh/, and they don't use the term "subsequences" when describing it.

j_random_hacker · Answer 1 · 2021-08-29T09:23:51.637

Let the given string be $S$. For a string $T$ that appears as a subsequence of $S$, we say that the list of positions $(x_1, \dots, x_{|T|})$ is an appearance of $T$ in $S$ if $S[x_1]S[x_2]\dots S[x_{|T|}] = T$. The leftmost appearance of $T$ in $S$ is the unique one for which each position is minimal over corresponding positions in all appearances -- that is, $X$ is the leftmost appearance of $T$ in $S$ iff for all $1 \le i \le |T|$ and all appearances $Y$ of $T$ in $S$, $X[i] \le Y[i]$. This leftmost appearance can be calculated with a simple greedy algorithm in which we loop through the characters of $T$, pairing each with the next available matching character in $S$. Define $L(T)$ to be the last (rightmost) position in the leftmost appearance of $T$ in $S$.

Example: $S = abbaba$, $T = ba$. The appearances of $T$ in $S$ are $(2, 4), (2, 6), (3, 4), (3, 6)$ and $(5, 6)$. The unique leftmost appearance is $(2, 4)$, so $L(T) = 4$.
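The greedy computation of the leftmost appearance can be sketched in Python as follows. This is only an illustration: the function names `leftmost_appearance` and `L` are my own, and positions are reported 1-based to match the notation above.

```python
def leftmost_appearance(s, t):
    """Return the 1-based positions of the leftmost appearance of t in s,
    or None if t is not a subsequence of s."""
    positions = []
    start = 0  # 0-based index in s from which the next match may be taken
    for ch in t:
        idx = s.find(ch, start)  # next available matching character in s
        if idx == -1:
            return None
        positions.append(idx + 1)  # convert to 1-based positions
        start = idx + 1
    return positions


def L(s, t):
    """Last position of the leftmost appearance of a non-empty t in s."""
    appearance = leftmost_appearance(s, t)
    return appearance[-1] if appearance else None
```

With the example above, `leftmost_appearance("abbaba", "ba")` gives `[2, 4]` and `L("abbaba", "ba")` gives `4`.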

We say a subsequence $T$ of $S$ requires position $i$ of $S$ iff $L(T) = i$. What this means is that $T$ is not a subsequence of $S[1 \dots i-1]$ -- the only ways to produce $T$ from $S$ involve including either the character at position $i$ in $S$, or some identical character further to the right. The important thing to notice is that any non-empty subsequence $T$ of $S$ requires a unique position: the one given by $L(T)$. So if for each $1 \le i \le |S|$ we can count the number of subsequences of $S$ that require $i$, we can add these all up to obtain the total number of unique non-empty subsequences of $S$.

Let $r(i)$ be the number of subsequences of $S$ that require position $i$. We can compute $r(i)$ by counting the total number of subsequences of $S$ that include position $i$ and end there, and then subtracting off the number of these subsequences that don't require $i$. Calculating the number of subsequences that end at $i$ but don't require it is surprisingly easy to do, thanks to the following observation:

For every $1 \le j < i$ such that $S[j] = S[i]$, every subsequence that requires $j$ can be turned into an appearance of the same subsequence that ends at $i$ but doesn't require $i$ -- and there are no other subsequences that end at $i$ but don't require it. (Specifically, we could remove the final character (taken from position $j$ of $S$) and append position $i$, and the appearance formed this way yields the same subsequence.)

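So we have $r(i) = \sum_{0\le k<i} r(k) - \sum_{1\le j<i, S[j]=S[i]} r(j)$, where we set $r(0) = 1$ to account for the empty subsequence (appending $S[i]$ to it gives the one-character subsequence $S[i]$).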

(The first summation is the number of subsequences we get by appending $S[i]$ to each of the unique subsequences we have already seen; the second subtracts off subsequences we have seen before.)
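As a small worked check, here is the recurrence applied to the illustrative string $S = aba$ (my own example, not part of the original problem), using the convention $r(0) = 1$ for the empty subsequence so that the first sum starts at $k = 0$:

- $r(1) = r(0) = 1$ (the subsequence $a$)
- $r(2) = r(0) + r(1) = 2$ (the subsequences $b$ and $ab$)
- $r(3) = (r(0) + r(1) + r(2)) - r(1) = 4 - 1 = 3$ (the subsequences $aa$, $ba$ and $aba$; the candidate $a$ is subtracted because it already requires position 1)

Adding up $r(1) + r(2) + r(3) = 6$ matches the six distinct non-empty subsequences $a, b, aa, ab, ba, aba$.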

The first summation is easy to optimise: It's just a running total. The second summation can also be optimised further by tracking, for each distinct character, the sum so far of unique subsequences ending with that character:

$r(0) \gets 1, t \gets 1, u[x] \gets 0$ for each character $x$ appearing in $S$
For $i$ from 1 to $|S|$:
- $r(i) \gets t - u[S[i]]$
- $t \gets t + r(i)$
- $u[S[i]] \gets u[S[i]] + r(i)$

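Summing $r(i)$ values into $t$ as they are computed allows the answer to be computed in linear time. The final value of $t$ includes $r(0) = 1$ for the empty subsequence, so subtract 1 if the empty subsequence should not be counted.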
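The procedure above translates almost line for line into Python. The following is a minimal sketch; the function name `count_distinct_subsequences` is my own choice, and a `defaultdict` stands in for the table $u$ so that characters need not be listed in advance.

```python
from collections import defaultdict


def count_distinct_subsequences(s):
    """Count the distinct subsequences of s, including the empty one."""
    t = 1                 # running total; starts at r(0) = 1 for the empty subsequence
    u = defaultdict(int)  # u[c]: sum of r(j) over earlier positions j with s[j] == c
    for ch in s:
        r = t - u[ch]     # number of subsequences that require the current position
        t += r
        u[ch] += r
    return t
```

For the string in the question, `count_distinct_subsequences("BANANA")` returns 40, i.e. 39 non-empty distinct subsequences. Note that the counts can grow exponentially with the length of the string; Python integers handle this, but the arithmetic on very large numbers is then no longer constant-time per step.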

score 0 · Answer 2 · answered Mar 20 '18 at 06:47

One approach is to build an NFA (nondeterministic finite automaton) that describes all subsequences of the long string, then determinize it. Once you have a DFA, you can count the number of strings it accepts (see "Why isn't it simple to count the number of words in a regular language?" and https://cstheory.stackexchange.com/q/8200/5038).

Whether this is effective depends on how large the resulting DFA turns out to be.
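To make the automaton idea concrete, here is a rough Python sketch. It makes an assumption that goes beyond the description above: instead of building an NFA and running a generic subset construction, it builds a DFA directly, with one state per prefix length and a transition on character $c$ to the first occurrence of $c$ after the current position. Every state is accepting, and the accepted strings are then counted by dynamic programming over the transitions. The function name `count_via_subsequence_dfa` is hypothetical.

```python
def count_via_subsequence_dfa(s):
    """Count the strings accepted by a 'next occurrence' DFA for the
    subsequences of s. The empty string is included in the count."""
    n = len(s)
    # nxt[i][c] is the state reached from state i on character c: the 1-based
    # position of the first occurrence of c strictly after position i.
    nxt = [dict() for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        nxt[i] = dict(nxt[i + 1])
        nxt[i][s[i]] = i + 1
    # Transitions only lead to higher-numbered states, so process states from
    # the last one backwards: the number of accepted strings starting in state
    # i is 1 (stop here) plus the counts of all successor states.
    paths = [0] * (n + 1)
    for i in range(n, -1, -1):
        paths[i] = 1 + sum(paths[j] for j in nxt[i].values())
    return paths[0]
```

In this particular construction the DFA has only $|S| + 1$ states, at the cost of storing one transition table per state.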
