Efficiently find longest common substring for all substring pairs of S and T

Question

I am trying to find the Gestalt similarity of a string $S$ and all substrings of $T$ using Gestalt Pattern Matching (the Ratcliff/Obershelp algorithm).

This algorithm requires me to find the matches of $S$ and $T$, which are defined as follows:

$$ matches(S,T) = \begin{cases} matches(S1,T1) + [LCS(S,T)] + matches(S2,T2)&\text{if } LCS(S,T) \neq \epsilon\\ [] &\text{if } LCS(S,T) = \epsilon\\ \end{cases} $$ where

$LCS(S,T)$ is the longest common substring of $S$ and $T$

$S = S1 + LCS(S,T) + S2$

and

$T = T1 + LCS(S,T) + T2$

Informally, matches are calculated by finding $LCS(S,T)$ and then recursively finding the matches to the left and to the right of $LCS(S,T)$.

The similarity ratio is defined as $$ sim(S,T) = \begin{cases} \frac{2 * K}{length(S) + length(T)}&\text{if }length(S) + length(T) \neq 0\\ 1 &\text{otherwise} \end{cases} $$

where $K$ is the sum of the length of all matches

Example

$S$ = "Altanta", $T$ = "Atlantis"

$$ \begin{equation*} \begin{split} matches(S,T) &= matches(\text{'Alt', 'Atl'}) + [\text{'ant'}] + matches(\text{'a', 'is'}) \\ &= matches(\text{'Alt', 'Atl'}) + [\text{'ant'}] + [] \\ &= matches(\text{'Alt', 'Atl'}) + [\text{'ant'}] \\ &= (matches(\epsilon, \epsilon) + [\text{'A'}] + matches(\text{'lt', 'tl'})) + [\text{'ant'}] \\ &= [] + [\text{'A'}] + matches(\text{'lt', 'tl'}) + [\text{'ant'}] \\ &= [\text{'A'}] + matches(\text{'lt', 'tl'}) + [\text{'ant'}] \\ &= [\text{'A'}] + (matches(\epsilon,\text{'t'}) + [\text{'l'}] + matches(\text{'t'},\epsilon)) + [\text{'ant'}] \\ &= [\text{'A'}] + [] + [\text{'l'}] + [] + [\text{'ant'}] \\ &= [\text{'A'}, \text{'l'}, \text{'ant'}] \\ \end{split} \end{equation*} $$

Thus $K= length(\text{'A'}) + length(\text{'l'}) + length(\text{'ant'}) = 5$

Hence,

$$ sim(S,T) = \frac{2*K}{length(S) + length(T)} = \frac{2 * 5}{7 + 8} = 0.666... $$
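For reference, the definitions of $matches$ and $sim$ above can be sketched directly in Python. This is only a minimal, unoptimized sketch: the function names are my own, the longest common substring is found with a plain quadratic dynamic program, and ties between equally long common substrings are broken by taking the first one found (the definition above does not say how ties should be resolved).

```python
def longest_common_substring(s, t):
    """Return (i, j, n): s[i:i+n] == t[j:j+n] is a longest common substring.

    Ties are broken by the first maximum found while scanning s, then t
    (an assumption; the definition leaves tie-breaking open).
    """
    best_i = best_j = best_len = 0
    prev = [0] * (len(t) + 1)  # prev[j]: common suffix length of s[:i-1], t[:j]
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len = cur[j]
                    best_i, best_j = i - cur[j], j - cur[j]
        prev = cur
    return best_i, best_j, best_len


def matches(s, t):
    """Recursive matches(S, T): left matches + [LCS] + right matches."""
    i, j, n = longest_common_substring(s, t)
    if n == 0:
        return []
    return matches(s[:i], t[:j]) + [s[i:i + n]] + matches(s[i + n:], t[j + n:])


def sim(s, t):
    """Similarity ratio 2K / (len(S) + len(T)), or 1 when both are empty."""
    total = len(s) + len(t)
    if total == 0:
        return 1.0
    k = sum(len(m) for m in matches(s, t))
    return 2 * k / total


print(matches("Altanta", "Atlantis"))
print(sim("Altanta", "Atlantis"))
```

On the example strings this reproduces the matches `['A', 'l', 'ant']` derived above.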

Now, I want to be able to find the similarity for $S$ and all substrings of $T$

E.g. $sim(\text{'Altanta'}, \text{'Alta'})$, $sim(\text{'Altanta'}, \text{'lantis'})$, etc.
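To make the goal concrete, here is a brute-force baseline that simply recomputes the similarity for every substring of $T$. It is a sketch under an assumption: it uses Python's `difflib.SequenceMatcher` as a stand-in for the similarity above. The `difflib` documentation describes its algorithm as a somewhat fancier relative of Ratcliff/Obershelp gestalt pattern matching, so tie-breaking may differ from other implementations. The function name `similarity_to_all_substrings` is my own.

```python
from difflib import SequenceMatcher


def similarity_to_all_substrings(s, t):
    """Map each distinct non-empty substring of t to its similarity with s.

    Brute force: one full SequenceMatcher run per distinct substring.
    """
    scores = {}
    for start in range(len(t)):
        for end in range(start + 1, len(t) + 1):
            sub = t[start:end]
            if sub not in scores:
                # autojunk=False disables difflib's "popular element" heuristic
                matcher = SequenceMatcher(None, s, sub, autojunk=False)
                scores[sub] = matcher.ratio()  # 2 * matched / (len(s) + len(sub))
    return scores


scores = similarity_to_all_substrings("Altanta", "Atlantis")
print(scores["Atla"], scores["lantis"])
```

This is the computation I would like to speed up: nothing is shared between the runs for different substrings.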

But to calculate these pairs efficiently, I believe that I'll need to be able to quickly find the matches of $S$ and substrings of $T$, which in turn means that I'll need to be able to find $LCS(S',T')$ for all $S'$ and $T'$ where $S'$ is a substring of $S$ and $T'$ is a substring of $T$.

How can I efficiently find $LCS(S',T')$ for all $S'$ and $T'$ where $S'$ is a substring of $S$ and $T'$ is a substring of $T$?

I went through "Dynamic and Internal Longest Common Substring", which looked promising at first but only explains algorithms for constrained cases.

clinux · Answer 1 · 2022-06-11T03:30:29.200

I'll throw a naive algorithm out there.

One way would be to create a hash set of all possible substrings from your first input. Then check if each possible substring from the second input is in that hash set.

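Time complexity: $O(N_1^2 + N_2^2)$ hash-set operations (hashing each substring also costs time proportional to its length, unless a rolling hash is used)

Memory usage: $O(\min(N_1^2, N_2^2))$ stored substrings (use the shorter input as the first input)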

You can also break the problem down by length: find the common substrings of length 1, then length 2, then length 3, etc., clearing and reusing the hash set between each length change. That will help lower the memory usage.
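A rough sketch of the length-by-length variant, for a single pair of strings, might look like the following. The function name is just something I made up, and the early exit is an addition on top of the description above: if no common substring of some length exists, none of any greater length can exist either, so the loop can stop.

```python
def longest_common_substring_by_length(a, b):
    """Return one longest common substring of a and b ('' if none).

    Only substrings of a single length are held in the hash set at a time.
    """
    if len(a) > len(b):
        a, b = b, a  # build the set from the shorter input
    best = ""
    seen = set()
    for length in range(1, len(a) + 1):
        seen.clear()  # reuse the same set for each length
        for i in range(len(a) - length + 1):
            seen.add(a[i:i + length])
        found = None
        for j in range(len(b) - length + 1):
            candidate = b[j:j + length]
            if candidate in seen:
                found = candidate
                break
        if found is None:
            break  # nothing of this length, so nothing longer either
        best = found
    return best


print(longest_common_substring_by_length("Altanta", "Atlantis"))
```

Note that slicing and hashing each substring still costs time proportional to its length, so this is naive in running time even though the set stays small.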
