4

Given two strings $x$ and $y$ over the alphabet $\{A,C,G,T\}$, I'm trying to determine a shortest string $z$ such that $z$ is a subsequence of $x$ and not a subsequence of $y$.

Example: a shortest string that is a subsequence of AATGAAC but not a subsequence of ATCGATC is AAG.

D.W.
  • 167,959
  • 22
  • 232
  • 500
Mike
  • 205
  • 2
  • 6

1 Answers1

5

A dynamic programming algorithm seems to work.

Given a prefix $X_k = x_1 x_2\dots x_k$ of string $X$ ($X$ is of length $s$, so $X = X_s$), a prefix $Y_m = y_1 y_2 \dots y_m$ of string $Y$ ($Y$ is of length $t$, so $Y_t = Y$), and a length $L$, we determine if there is a subsequence of $X_k$ which is also a subsequence of $Y_m$, such that the sub-sequence is of length exactly $L$ and ends at $x_k$. Call that (boolean) value: $\text{is_there}[k,m,L]$.

We will also need to know if there is a subsequence of $X_k$ which is also a subsequence of $Y_m$ such that the subsequence is of length exactly $L$ and ends at or before $x_k$. Call that boolean value $\text{is_there_any}[k,m,L]$.

These satisfy the recurrences:

$$\text{is_there}[k,m,L] = \text{is_there}[k, m-1, L] \vee (x_k = y_m \wedge \text{is_there_any}[k-1,m-1,L-1])$$

and

$$\text{is_there_any}[k, m, L] = \text{is_there}[k, m, L] \vee \text{is_there_any}[k, m-1, L]$$

The smallest $L$ such that $\text{is_there}[k, t, L] = false$ for all $k$ gives you the result. (Note that $t$ is the length of $Y$).

If the length of the shortest such subsequence is $S$, This can be implemented in $O(stS)$ time and $O(stS)$ space by a triply nested for-loop with $L$ on the outside, then $k$, then $m$.

Something like:

Compute isThere for L = 1.
foreach L in (2 ... s)
    foreach k in (1 ... s)
       foreach m in (1 ... t)
         is_there[k,m,L] = blah
         is_there_any[k,m,L] = blah
j_random_hacker
  • 5,509
  • 1
  • 17
  • 22
Aryabhata
  • 6,291
  • 2
  • 36
  • 47