Count of distinct substrings in string inside range

Question

Given a string $S$ of length $n$, the number of distinct substrings can be computed in linear time using the LCP array. Instead of asking for the number of distinct substrings of the whole string $S$, a query $q$ gives a pair of indices $(i,j)$ with $0 \le i \le j < n$ and asks for the number of distinct substrings of $S[i..j]$.

My approach is simply to run the linear-time LCP array construction on each query range, which costs $O(n)$ per query and $O(|q|n)$ overall, where $|q|$ is the number of queries. The number of queries can be of order $n$, so answering all of them takes $O(n^2)$.
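For reference, the per-query baseline can be sketched as follows. This is a minimal illustration, not a linear-time implementation: it sorts suffixes naively and compares adjacent suffixes character by character, but it uses the same identity (all $m(m+1)/2$ substrings of a length-$m$ string minus the sum of the LCP array). The function names are made up for this example.

```python
def count_distinct_substrings(t):
    """Count distinct non-empty substrings of t via suffix order and LCPs."""
    m = len(t)
    # Naive suffix array: sort start positions by the suffix they begin.
    order = sorted(range(m), key=lambda k: t[k:])
    total = m * (m + 1) // 2  # number of (start, end) pairs
    for a, b in zip(order, order[1:]):
        # LCP of two lexicographically adjacent suffixes counts duplicates.
        lcp = 0
        while a + lcp < m and b + lcp < m and t[a + lcp] == t[b + lcp]:
            lcp += 1
        total -= lcp
    return total


def answer_query(s, i, j):
    """Answer one query (i, j), inclusive and 0-indexed, from scratch."""
    return count_distinct_substrings(s[i:j + 1])
```

With a genuinely linear-time suffix array and LCP construction substituted in, this is the $O(n)$-per-query approach described above.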

Can it be done better than linear time per query?

More generally, if one processes a substring of a string for which we already have the suffix array, suffix tree, or LCP array, are those structures no longer relevant, so that they must be built from scratch again?

Dmitri Urbanowicz · Answer 1 · 2018-07-07T07:07:00.153

There is an $O(n \sqrt{n} + |Q| \sqrt{n})$ offline solution.

1. Sort the elements $(i,j)$ of $Q$ in ascending order of $j$.
2. Distribute them into $\sqrt{n}$ buckets so that $(i,j)$ goes into bucket number $\lfloor \frac{i}{\sqrt{n}} \rfloor$.
3. For each bucket starting at $b$, process its queries $(i,j)$ in ascending order of $j$, extending a suffix tree so that it covers $S[b..j]$.
4. For each query in a bucket, remove the redundant characters on the left (positions $b$ to $i-1$) and report the answer.

Step 3 takes $O(n)$ for each bucket, because we use Ukkonen's algorithm and $j$ only increases within a bucket. Over all $\sqrt{n}$ buckets this gives $O(n \sqrt{n})$.

Step 4 takes $O(\sqrt{n})$ for each query, because removing at most $\sqrt{n}$ longest suffixes from the tree takes $O(\sqrt{n})$. Note that you can use an indirection layer to avoid modifications to the original suffix tree.
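The sorting and bucketing steps can be sketched as below. This only organizes the queries; the suffix tree construction and the removal of leading suffixes are not shown. The function name `bucket_queries` and the choice of `math.isqrt(n)` as the bucket width are assumptions made for this example.

```python
import math
from collections import defaultdict


def bucket_queries(queries, n):
    """Group queries (i, j) by floor(i / width), each group sorted by j.

    Returns {bucket_index: (b, queries_in_bucket)}, where b is the first
    string position covered by the bucket.
    """
    width = max(1, math.isqrt(n))  # bucket width of about sqrt(n)
    buckets = defaultdict(list)
    # Sorting by j first means every bucket receives its queries in
    # ascending order of j, as an online construction needs.
    for i, j in sorted(queries, key=lambda q: q[1]):
        buckets[i // width].append((i, j))
    return {k: (k * width, qs) for k, qs in sorted(buckets.items())}
```

Each bucket can then be processed independently: the tree starts at position `b` and is only ever extended to the right as `j` grows.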

user11171 · Answer 2 · 2018-07-05T21:00:13.657

The question does not motivate the number of queries being $O(n)$, which seems an arbitrary worst case since the number of unique possible queries is the number of ordered pairs and thus $O(n^2)$.

Here are two different solutions with better time complexity for the $O(n^2)$ case based on (implicit) suffix trees constructed incrementally with Ukkonen's algorithm. Both solutions are based on preprocessing and have complexity $O(n^2 + |Q|)$ where $Q$ is the set of queries. The second solution runs in $O(n + |Q|)$ if all queries have the same width.

Solution 1 - Preprocess all unique queries

Iterate over the suffixes of $S$. For each suffix $S_i=S[i..n-1]$, build the suffix tree of $S_i$ with Ukkonen's algorithm. After the $j$-th update of the current suffix tree (that is, after appending the $j$-th character of $S_i$), store the tree size in a matrix at position $(i,i+j-1)$. A query for the range $[x,y]$ is answered by the matrix element at $(x,y)$.

Here the tree size is the total length of all edge labels, which equals the number of distinct substrings. It can be stored along with the suffix tree and updated in constant time at each step by modifying the update procedure in Ukkonen's algorithm. For each update the size increases by the current number of leaves, since every leaf edge grows by one character.
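A minimal sketch of this precomputation follows. Note the substitution: instead of Ukkonen's suffix tree it uses a suffix automaton, which is an assumption of this example rather than the structure described above. The automaton also supports online extension, and appending a character adds exactly `length[cur] - length[link[cur]]` new distinct substrings, which plays the role of the leaf count. The function name `distinct_substring_table` is made up for illustration.

```python
def distinct_substring_table(s):
    """table[i][j] = number of distinct substrings of s[i..j], for i <= j."""
    n = len(s)
    table = [[0] * n for _ in range(n)]
    for i in range(n):
        # Fresh suffix automaton for the suffix s[i:]; state 0 is the root.
        length, link, trans = [0], [-1], [{}]
        last = 0
        total = 0
        for j in range(i, n):
            c = s[j]
            cur = len(length)
            length.append(length[last] + 1)
            link.append(-1)
            trans.append({})
            p = last
            while p != -1 and c not in trans[p]:
                trans[p][c] = cur
                p = link[p]
            if p == -1:
                link[cur] = 0
            else:
                q = trans[p][c]
                if length[p] + 1 == length[q]:
                    link[cur] = q
                else:
                    # Split state q by cloning it with a shorter length.
                    clone = len(length)
                    length.append(length[p] + 1)
                    link.append(link[q])
                    trans.append(dict(trans[q]))
                    while p != -1 and trans[p].get(c) == q:
                        trans[p][c] = clone
                        p = link[p]
                    link[q] = clone
                    link[cur] = clone
            last = cur
            # New distinct substrings contributed by appending s[j].
            total += length[cur] - length[link[cur]]
            table[i][j] = total
    return table
```

After this preprocessing a query $(x,y)$ is a single lookup, `table[x][y]`. The table itself takes $O(n^2)$ memory, so this only makes sense when the number of queries is large relative to $n$.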

Solution 2 - Preprocess unique query widths

This solution is harder to implement but requires less preprocessing work if there are few query widths. Preprocessing takes $O(n)$ time if there is only one query width.

For each query width $w$, slide a window of width $w$ over $S$ and maintain a suffix tree of the window incrementally: append the character entering on the right, and remove the suffix starting one character to the left of the window by removing the longest suffix from the tree. At each step, the number of distinct substrings within the sliding window is the tree size.

All queries can then be answered in time linear in $|Q|$ by looking up the precomputed results.

Note: removing the longest suffix can be done by removing the oldest leaf of the suffix tree. It is not easy to implement correctly.
