
This question has been prompted by Efficient data structures for building a fast spell checker.

Given two strings $u,v$, we say they are $k$-close if their Damerau–Levenshtein distance¹ is small, i.e. $\operatorname{LD}(u,v) \leq k$ for a fixed $k \in \mathbb{N}$. Informally, $\operatorname{LD}(u,v)$ is the minimum number of deletion, insertion, substitution and (neighbour) swap operations needed to transform $u$ into $v$. It can be computed in $\Theta(|u|\cdot|v|)$ time by dynamic programming. Note that $\operatorname{LD}$ is a metric; in particular, it is symmetric.
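A minimal sketch of that dynamic program (this is the restricted variant, often called optimal string alignment, which charges one operation for a single adjacent swap):

```python
def ld(u: str, v: str) -> int:
    """Damerau-Levenshtein distance (restricted / optimal string
    alignment variant): minimum number of insertions, deletions,
    substitutions and adjacent swaps, via O(|u|*|v|) dynamic programming."""
    m, n = len(u), len(v)
    # d[i][j] = distance between the prefixes u[:i] and v[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all of u[:i]
    for j in range(n + 1):
        d[0][j] = j          # insert all of v[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if u[i - 1] == v[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
            if i > 1 and j > 1 and u[i - 1] == v[j - 2] and u[i - 2] == v[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[m][n]
```

For instance, `ld("ab", "ba")` is $1$ (one swap), and the function is symmetric, as noted above.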

The question of interest is:

Given a set $S$ of $n$ strings over $\Sigma$ with lengths at most $m$, what is the cardinality of

$\qquad \displaystyle S_k := \{ w \in \Sigma^* \mid \exists v \in S.\ \operatorname{LD}(v,w) \leq k \}$?

As even two strings of the same length can have different numbers of $k$-close strings², a general formula/approach may be hard (impossible?) to find. Therefore, we might have to compute the number explicitly for every given $S$, leading us to the main question:

What is the (time) complexity of finding the cardinality of the set $\{w\}_k$ for (arbitrary) $w \in \Sigma^*$?

Note that the desired quantity is exponential in $|w|$, so explicit enumeration is not desirable. An efficient algorithm would be great.

If it helps, it can be assumed that we indeed have a (large) set $S$ of strings, that is, we solve the first highlighted question.


  1. Possible variants include using the Levenshtein distance instead.
  2. Consider $aa$ and $ab$. The sets of $1$-close strings over $\{a,b\}$ are $\{a, aa, ab, ba, aaa, baa, aba, aab\}$ (8 words) and $\{a, b, aa, bb, ab, ba, aab, bab, abb, aba\}$ (10 words), respectively.
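For tiny instances, the footnote can be checked by brute force: every operation changes a string's length by at most one, so any string within distance $k$ of $w$ has length at most $|w| + k$. A sketch (again using the restricted swap variant of the distance; `k_close_count` is a name chosen here for illustration):

```python
from itertools import product

def ld(u, v):
    """Edit distance with insertions, deletions, substitutions
    and adjacent swaps (restricted variant), by dynamic programming."""
    m, n = len(u), len(v)
    d = [[i + j if 0 in (i, j) else 0 for j in range(n + 1)]
         for i in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (u[i - 1] != v[j - 1]))
            if i > 1 and j > 1 and u[i - 1] == v[j - 2] and u[i - 2] == v[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]

def k_close_count(w, sigma, k):
    """Cardinality of {w}_k by explicit enumeration; only lengths
    up to |w| + k can occur, since each edit changes the length by <= 1."""
    return sum(ld(w, "".join(t)) <= k
               for n in range(len(w) + k + 1)
               for t in product(sigma, repeat=n))
```

This reproduces the counts from the footnote: `k_close_count("aa", "ab", 1)` yields $8$ and `k_close_count("ab", "ab", 1)` yields $10$; but it performs $\Theta(|\Sigma|^{|w|+k})$ distance computations, which is exactly the explicit enumeration the question wants to avoid.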
Raphael

2 Answers


See Levenshtein's paper. It contains bounds on the number of strings obtainable from a given string by insertions and deletions. If $n$ is the length of the string and the string is binary, then the maximum number of nearest neighbours in the Levenshtein distance is $\Theta(n^2)$. It is comparatively harder to say anything about $k$-nearest neighbours, but one can get bounds. These should give you an estimate of the complexity.

Ankur

If your $k$ is fixed and you are allowed to do pre-processing, then this is something you might try:

  1. Construct a graph whose nodes are words and in which an edge joins two nodes if the distance between the corresponding words is $1$.
  2. Compute the adjacency matrix $M$ of that graph.
  3. Compute $M^k$.

Now you may be able to use the final matrix to answer all the queries. If you can store $M, M^2, M^4, M^8, \ldots$, you might be able to answer queries for a larger range of $k$ instead of a fixed $k$; of course, one pays here with the cost of matrix multiplication.
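The three steps above can be sketched for a toy alphabet; the names and parameters here are illustrative assumptions. `L` is a cutoff on the word length, and it should be chosen comfortably larger than $|w| + k$, since a shortest edit path between two short words may pass through longer intermediate words. To count nodes within graph distance at most $k$ (rather than exactly $k$), the sketch uses $(M + I)^k$:

```python
import numpy as np
from itertools import product

def ld(u, v):
    """Edit distance with insertions, deletions, substitutions
    and adjacent swaps (restricted variant)."""
    m, n = len(u), len(v)
    d = [[i + j if 0 in (i, j) else 0 for j in range(n + 1)]
         for i in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (u[i - 1] != v[j - 1]))
            if i > 1 and j > 1 and u[i - 1] == v[j - 2] and u[i - 2] == v[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]

sigma, L, k = "ab", 3, 1

# Step 1: nodes are all words over sigma of length at most L.
words = ["".join(t) for n in range(L + 1) for t in product(sigma, repeat=n)]
idx = {w: i for i, w in enumerate(words)}
N = len(words)

# Step 2: adjacency matrix M, with an edge iff the distance is exactly 1.
M = np.zeros((N, N), dtype=np.int64)
for i, u in enumerate(words):
    for j, v in enumerate(words):
        if ld(u, v) == 1:
            M[i, j] = 1

# Step 3: entry (i, j) of (M + I)^k is nonzero iff words[j] lies within
# graph distance k of words[i], i.e. within edit distance k (provided L
# is large enough that shortest edit paths stay inside the vocabulary).
R = np.linalg.matrix_power(M + np.eye(N, dtype=np.int64), k)
close_to_aa = {words[j] for j in np.nonzero(R[idx["aa"]])[0]}
```

With these toy parameters, `close_to_aa` reproduces the eight $1$-close strings of $aa$ from the question's footnote.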

TenaliRaman