
I have an array of 100,000 strings, all of length $k$. I want to compare each string to every other string to see if any two strings differ by 1 character. Right now, as I add each string to the array, I'm checking it against every string already in the array, which has a time complexity of $\frac{n(n-1)}{2} k$.

Is there a data structure or algorithm that can compare strings to each other faster than what I'm already doing?

Some additional information:

  • Order matters: abcde and xbcde differ by 1 character, while abcde and edcba differ by 4 characters.

  • For each pair of strings that differ by one character, I will be removing one of those strings from the array.

  • Right now, I'm looking for strings that differ by only 1 character, but it would be nice if that 1 character difference could be increased to, say, 2, 3, or 4 characters. However, in this case, I think efficiency is more important than the ability to increase the character-difference limit.

  • $k$ is usually in the range of 20-40.

JGut

12 Answers


My solution is similar to j_random_hacker's but uses only a single hash set.

I would create a hash set of strings. For each string in the input, add $k$ strings to the set; in each of these strings, replace one of the letters with a special character not found in any of the strings. While you add them, check whether they are already in the set. If one is, then you have two strings that differ by (at most) one character.

An example with strings 'abc', 'adc'

For abc we add '*bc', 'a*c' and 'ab*'

For adc we add '*dc', 'a*c' and 'ad*'

When we add 'a*c' the second time we notice it is already in the set, so we know that there are two strings that only differ by one letter.

The total running time of this algorithm is $O(nk^2)$. This is because we create $k$ new strings for each of the $n$ strings in the input, and for each of those strings we need to calculate the hash, which typically takes $O(k)$ time.

Storing all the strings takes $O(nk^2)$ space.
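
A minimal Python sketch of this idea, assuming `*` does not occur in any input string; the helper name `find_near_duplicate_pair` is mine, not part of the answer:

    def find_near_duplicate_pair(strings, k):
        """Return a pair of distinct strings that differ in exactly one position, or None."""
        seen = {}  # masked variant -> original string it came from
        for s in strings:
            for i in range(k):
                masked = s[:i] + '*' + s[i + 1:]   # replace position i with the special character
                if masked in seen and seen[masked] != s:
                    return seen[masked], s         # they agree everywhere except position i
                seen[masked] = s
        return None

    # Example from the answer: 'abc' and 'adc' collide on the masked key 'a*c'.
    print(find_near_duplicate_pair(['abc', 'adc'], 3))   # -> ('abc', 'adc')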

Further improvements

We can improve the algorithm further by not storing the modified strings directly but instead storing an object with a reference to the original string and the index of the character that is masked. This way we do not need to create all of the strings and we only need $O(nk)$ space to store all of the objects.

You will need to implement a custom hash function for the objects. We can take the Java implementation as an example (see the Java documentation): `hashCode` multiplies the Unicode value of each character by $31^{k-i}$ (with $k$ the string length and $i$ the one-based index of the character). Note that each altered string only differs in one character from the original, so we can easily compute that character's contribution to the hash code: subtract it and add the contribution of the masking character instead, which takes $O(1)$ time. This brings the total running time down to $O(nk)$.
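
A hedged sketch of this improvement in Python, using a Java-style polynomial hash with base 31; the class name `MaskedString`, the sentinel `'*'`, and the driver function are my own choices, not part of the answer:

    class MaskedString:
        """View of a string with one position masked; its hash is derived in O(1) from the base hash."""
        __slots__ = ('s', 'idx', '_hash')

        def __init__(self, s, idx, base_hash, power):
            # base_hash = sum(ord(s[i]) * 31**(k-1-i)); power = 31**(k-1-idx)
            self.s = s
            self.idx = idx
            self._hash = base_hash + (ord('*') - ord(s[idx])) * power

        def __hash__(self):
            return self._hash

        def __eq__(self, other):
            # Equal iff the same position is masked and all other characters agree (O(k)).
            if self.idx != other.idx:
                return False
            i = self.idx
            return self.s[:i] == other.s[:i] and self.s[i + 1:] == other.s[i + 1:]

    def find_pair_with_masked_views(strings, k):
        powers = [31 ** (k - 1 - i) for i in range(k)]
        seen = {}
        for s in strings:
            base = 0
            for ch in s:                      # O(k), once per string
                base = base * 31 + ord(ch)
            for i in range(k):                # k masked views, each hashed in O(1)
                key = MaskedString(s, i, base, powers[i])
                if key in seen and seen[key] != s:
                    return seen[key], s
                seen[key] = s
        return None

A genuine hash-set hit still pays an $O(k)$ equality check, but no masked strings are ever materialized, so the space drops to $O(nk)$ as described.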

Simon Prins

It's possible to achieve $O(nk \log k)$ worst-case running time.

Let's start simple. If you care about a solution that is easy to implement and will be efficient on many inputs, but not all, here is a simple, pragmatic approach that may suffice in practice for many situations. It does fall back to quadratic running time in the worst case, though.

Take each string and store it in a hashtable, keyed on the first half of the string. Then, iterate over the hashtable buckets. For each pair of strings in the same bucket, check whether they differ in 1 character (i.e., check whether their second half differs in 1 character).

Then, take each string and store it in a hashtable, this time keyed on the second half of the string. Again check each pair of strings in the same bucket.

Assuming the strings are well-distributed, the running time will likely be about $O(nk)$. Moreover, if there exists a pair of strings that differ by 1 character, it will be found during one of the two passes (since they differ by only 1 character, that differing character must be in either the first or second half of the string, so the second or first half of the string must be the same). However, in the worst case (e.g., if all strings start or end with the same $k/2$ characters), this degrades to $O(n^2 k)$ running time, so its worst-case running time is not an improvement on brute force.
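
Here is a minimal Python sketch of the basic two-pass version (without the recursive refinement described next); `find_pair_two_pass` and `differ_by_one` are names I made up:

    from collections import defaultdict

    def differ_by_one(a, b):
        """True if a and b differ in exactly one position."""
        return sum(x != y for x, y in zip(a, b)) == 1

    def find_pair_two_pass(strings, k):
        half = k // 2
        # Pass 1: bucket by the first half; pass 2: bucket by the second half.
        for key_of in (lambda s: s[:half], lambda s: s[half:]):
            buckets = defaultdict(list)
            for s in strings:
                buckets[key_of(s)].append(s)
            for bucket in buckets.values():
                # This inner loop is what degrades to quadratic time in the worst case.
                for i in range(len(bucket)):
                    for j in range(i + 1, len(bucket)):
                        if differ_by_one(bucket[i], bucket[j]):
                            return bucket[i], bucket[j]
        return None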

As a performance optimization, if any bucket has too many strings in it, you can repeat the same process recursively to look for a pair that differ by one character. The recursive invocation will be on strings of length $k/2$.

If you care about worst-case running time:

With the above performance optimization I believe the worst-case running time is $O(nk \log k)$.

D.W.

I would make $k$ hashtables $H_1, \dots, H_k$, each of which has a $(k-1)$-length string as the key and a list of numbers (string IDs) as the value. The hashtable $H_i$ will contain all strings processed so far but with the character at position $i$ deleted. For example, if $k=6$, then $H_3[ABDEF]$ will contain a list of all strings seen so far that have the pattern $AB\cdot DEF$, where $\cdot$ means "any character". Then to process the $j$-th input string $s_j$:

  1. For each $i$ in the range 1 to $k$:
    • Form string $s_j'$ by deleting the $i$-th character from $s_j$.
    • Look up $H_i[s_j']$. Every string ID here identifies an original string that is either equal to $s_j$, or differs at position $i$ only. Output these as matches for string $s_j$. (If you wish to exclude exact duplicates, make the value type of the hashtables a (string ID, deleted character) pair, so that you can test for those that had the same character deleted as we just deleted from $s_j$.)
    • Insert $j$ into $H_i$ for future queries to use.
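
A sketch of the explicit-key version in Python (the implicit, space-saving representation discussed below is not shown); the function name and the reporting format are mine:

    from collections import defaultdict

    def matches_per_string(strings, k):
        """Yield (j, i, kind): string j matches the earlier string i exactly or at one position."""
        # tables[p] maps (string with position p deleted) -> list of (string ID, deleted character).
        tables = [defaultdict(list) for _ in range(k)]
        for j, s in enumerate(strings):
            for p in range(k):
                key = s[:p] + s[p + 1:]                 # delete the p-th character
                for i, deleted in tables[p][key]:
                    # Same deleted character -> exact duplicate (reported once per position);
                    # different -> the two strings differ at position p only.
                    yield j, i, ('duplicate' if deleted == s[p] else p)
                tables[p][key].append((j, s[p]))

    # Example: 'xbcde' matches 'abcde' at position 0 only.
    print(list(matches_per_string(['abcde', 'xbcde', 'edcba'], 5)))   # -> [(1, 0, 0)]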

If we store each hash key explicitly, then we must use $O(nk^2)$ space and thus have time complexity at least that. But as described by Simon Prins, it's possible to represent a series of modifications to a string (in his case described as changing single characters to *, in mine as deletions) implicitly in such a way that all $k$ hash keys for a particular string need just $O(k)$ space, leading to $O(nk)$ space overall, and opening the possibility of $O(nk)$ time too. To achieve this time complexity, we need a way to compute the hashes for all $k$ variations of a length-$k$ string in $O(k)$ time: for example, this can be done using polynomial hashes, as suggested by D.W. (and this is likely much better than simply XORing the deleted character with the hash for the original string).

Simon Prins's implicit representation trick also means that the "deletion" of each character is not actually performed, so we can use the usual array-based representation of a string without a performance penalty (rather than linked lists as I had originally suggested).

j_random_hacker

Here is a hashtable approach that is more robust than the polynomial-hash method. First generate $k$ random positive integers $r_{1..k}$ that are coprime to the hashtable size $M$; namely, $0 < r_i < M$. Then hash each string $x_{1..k}$ to $(\sum_{i=1}^k x_i r_i ) \bmod M$. There is almost nothing an adversary can do to cause very uneven collisions, since you generate $r_{1..k}$ at run time, and so as $k$ increases the maximum probability of collision of any given pair of distinct strings goes quickly to $1/M$. It is also obvious how to compute, in $O(k)$ time, all the possible hashes for each string with one character changed.

If you really want to guarantee uniform hashing, you can generate one random natural number $r(i,c)$ less than $M$ for each pair $(i,c)$ for $i$ from $1$ to $k$ and for each character $c$, and then hash each string $x_{1..k}$ to $(\sum_{i=1}^k r(i,x_i) ) \bmod M$. Then the probability of collision of any given pair of distinct strings is exactly $1/M$. This approach is better if your character set is relatively small compared to $n$.
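
A Python sketch of the first scheme; the value of `M`, the helper names, and the use of a wildcard character for the changed position are my own illustrative choices:

    import math
    import random

    def make_hasher(k, M):
        """Draw k random coefficients in (0, M) coprime to M, and return them with the hash function."""
        r = []
        while len(r) < k:
            c = random.randrange(1, M)
            if math.gcd(c, M) == 1:
                r.append(c)
        def h(s):
            return sum(ord(ch) * ri for ch, ri in zip(s, r)) % M
        return r, h

    def hashes_with_one_change(s, r, h_s, c, M):
        """All k hashes obtained by replacing position i of s with character c -- O(1) each, O(k) total."""
        return [(h_s + (ord(c) - ord(s[i])) * r[i]) % M for i in range(len(s))]

    # Example usage (M is the hashtable size):
    M = 1 << 20
    r, h = make_hasher(k=5, M=M)
    s = 'abcde'
    print(hashes_with_one_change(s, r, h(s), '*', M))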

user21820

A lot of the algorithms posted here use quite a bit of space on hash tables. Here's a simple algorithm with $O(1)$ auxiliary storage and $O((n \lg n) \cdot k^2)$ running time.

The trick is to use $C_i(a, b)$, a comparator between two values $a$ and $b$ that returns true if $a < b$ (lexicographically) while ignoring the $i$-th character. The algorithm is then as follows.

First, simply sort the strings regularly and do a linear scan to remove any duplicates.

Then, for each position $i$ from $1$ to $k$:

  1. Sort the strings with $C_i$ as comparator.

  2. Strings that differ only at position $i$ are now adjacent and can be detected in a linear scan (see the sketch below).
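
A hedged Python sketch of this procedure: sorting with a key that drops position $i$ yields the same order as the comparator $C_i$, at the cost of temporary key strings (an in-place comparator avoids that extra memory). The function name is mine:

    def find_pairs_by_sorting(strings, k):
        """Report pairs that differ in exactly one position using k comparator-style sorts."""
        strings = sorted(set(strings))                 # sort once and drop exact duplicates
        pairs = []
        for i in range(k):
            drop_i = lambda s: s[:i] + s[i + 1:]       # ordering that ignores position i
            strings.sort(key=drop_i)
            for a, b in zip(strings, strings[1:]):
                if drop_i(a) == drop_i(b):             # adjacent and equal everywhere except position i
                    pairs.append((a, b))
        return pairs

    print(find_pairs_by_sorting(['abcde', 'xbcde', 'edcba'], 5))   # -> [('abcde', 'xbcde')]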

orlp

Two strings of length k, differing in one character, share a prefix of length l and a suffix of length m such that k=l+m+1.

The answer by Simon Prins encodes this by storing all prefix/suffix combinations explicitly, i.e. abc becomes *bc, a*c and ab*. That's k=3, l=0,1,2 and m=2,1,0.

As valarMorghulis points out, you can organize words in a prefix tree. There's also the very similar suffix tree. It's fairly easy to augment the tree with the number of leaf nodes below each prefix or suffix; this can be updated in O(k) when inserting a new word.
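
A minimal sketch of a prefix tree augmented with leaf counts (the class and function names are mine); the analogous suffix-side structure can be obtained by inserting the reversed strings:

    class TrieNode:
        __slots__ = ('children', 'leaves')
        def __init__(self):
            self.children = {}   # character -> child node
            self.leaves = 0      # number of stored words passing through this node

    def insert(root, word):
        """O(k) insertion that keeps the leaf counts along the path up to date."""
        node = root
        node.leaves += 1
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
            node.leaves += 1

    def count_with_prefix(root, prefix):
        """Number of stored words that start with the given prefix."""
        node = root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return 0
        return node.leaves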

The reason you want these sibling counts is so you know, given a new word, whether you want to enumerate all strings with the same prefix or all strings with the same suffix. E.g. for "abc" as input, the possible prefixes are "", "a" and "ab", while the corresponding suffixes are "bc", "c" and "". As is obvious, for short suffixes it's better to enumerate siblings in the prefix tree, and vice versa.

As @einpoklum points out, it's certainly possible that all strings share the same k/2 prefix. That's not a problem for this approach; the prefix tree will be linear up to depth k/2, with each node at depth up to k/2 being an ancestor of all 100,000 leaf nodes. As a result, the suffix tree will be used up to depth (k/2-1), which is good because the strings have to differ in their suffixes given that they share prefixes.

[edit] As an optimization, once you've determined the shortest unique prefix of a string, you know that if there's one different character, it must be the last character of the prefix, and you'd have found the near-duplicate when checking a prefix that was one shorter. So if "abcde" has a shortest unique prefix "abc", that means there are other strings that start with "ab?" but not with "abc". I.e. if they'd differ in only one character, that would be that third character. You don't need to check for "abc?e" anymore.

By the same logic, if you would find that "cde" is a unique shortest suffix, then you know you need to check only the length-2 "ab" prefix and not length 1 or 3 prefixes.

Note that this method works only for one-character differences and does not generalize to two-character differences; it relies on a single character being the separation between identical prefixes and identical suffixes.

MSalters

Storing strings in buckets is a good way (there are already different answers outlining this).

An alternative solution could be to store strings in a sorted list. The trick is to sort by a locality-sensitive hashing algorithm. This is a hash algorithm which yields similar results when the input is similar[1].

Each time you want to investigate a string, you could calculate its hash and look up the position of that hash in your sorted list (taking $O(\log n)$ for arrays or $O(n)$ for linked lists). If you find that the neighbours of that position (considering all close neighbours, not only those at index +/- 1) are similar (off by one character), you have found your match. If there are no similar strings, you can insert the new string at the position you found (which takes $O(1)$ for linked lists and $O(n)$ for arrays).

One possible locality-sensitive hashing algorithm is Nilsimsa (with an open-source implementation available, for example, in Python).
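
A rough Python sketch of the sorted-list idea using `bisect`; here `lsh` stands for whatever locality-sensitive hash you choose (e.g. a Nilsimsa digest), and the neighbour `window` is a tunable guess rather than something prescribed above:

    import bisect

    def insert_or_match(entries, s, lsh, window=8):
        """entries is a list of (hash, string) kept sorted by hash.
        Return a stored string that differs from s in exactly one position, else insert s."""
        h = lsh(s)
        pos = bisect.bisect_left(entries, (h, s))
        lo, hi = max(0, pos - window), min(len(entries), pos + window)
        for _, t in entries[lo:hi]:                   # check the close neighbours of that position
            if len(t) == len(s) and sum(a != b for a, b in zip(s, t)) == 1:
                return t
        entries.insert(pos, (h, s))                   # O(n) for a Python list, as noted above
        return None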

[1]: Note that hash algorithms such as SHA-1 are often designed for the opposite: producing greatly differing hashes for similar, but not equal, inputs.

Disclaimer: To be honest, I would personally implement one of the nested/tree-organized bucket solutions for a production application. However, the sorted-list idea struck me as an interesting alternative. Note that this algorithm depends heavily on the chosen hash algorithm. Nilsimsa is one algorithm I found; there are many more (for example TLSH, Ssdeep and Sdhash). I haven't verified that Nilsimsa works with my outlined algorithm.

tessi

One could achieve the solution in $O(nk + n^2)$ time and $O(nk)$ space using an enhanced suffix array (a suffix array together with the LCP array) that allows constant-time LCP (Longest Common Prefix) queries (i.e. given two indices of a string, what is the length of the longest common prefix of the suffixes starting at those indices). Here, we can take advantage of the fact that all strings are of equal length. Specifically,

  1. Build the enhanced suffix array of all the $n$ strings concatenated together. Let $X = x_1.x_2.x_3 .... x_n$ where $x_i, \forall 1 \le i \le n $ is a string in the collection. Build the suffix array and LCP array for $X$.

  2. Now each $x_i$ starts at position $(i-1)k$ in zero-based indexing. For each string $x_i$, take the LCP with each of the strings $x_j$ such that $j<i$. If the LCP goes beyond the end of $x_j$ then $x_i = x_j$. Otherwise, there is a mismatch (say $x_i[p] \ne x_j[p]$); in this case take another LCP starting at the corresponding positions following the mismatch. If the second LCP goes beyond the end of $x_j$ then $x_i$ and $x_j$ differ by only one character; otherwise there is more than one mismatch.

    for (i = 2; i <= n; ++i) {
        i_pos = (i-1)*k;                             // start of x_i in X (zero-based)
        for (j = 1; j < i; ++j) {
            j_pos = (j-1)*k;                         // start of x_j in X
            lcp_len = LCP(i_pos, j_pos);             // constant-time LCP query
            if (lcp_len < k) {                       // mismatch at offset lcp_len
                if (lcp_len == k-1) {                // mismatch at the last position
                    // Output the pair (i, j)
                } else {
                    second_lcp_len = LCP(i_pos + lcp_len + 1, j_pos + lcp_len + 1);
                    if (lcp_len + second_lcp_len >= k-1) {  // second LCP reaches the end of x_j
                        // Output the pair (i, j)
                    }
                }
            }
        }
    }

You could use the SDSL library to build the suffix array in compressed form and answer the LCP queries.

Analysis: Building the enhanced suffix array is linear in the length of $X$ i.e. $O(nk)$. Each LCP query takes constant time. Thus, querying time is $O(n^2)$.

Generalisation: This approach can also be generalised to more than one mismatches. In general, running time is $O(nk + qn^2)$ where $q$ is the number of allowed mismatches.

If you wish to remove a string from the collection, instead of checking every $j<i$, you could keep a list of only 'valid' $j$.

Ritu Kundu

An improvement to all the solutions proposed: they all require $O(nk)$ memory in the worst case. You can reduce it by computing hashes of strings with * in place of each character, i.e. *bcde, a*cde, ..., and processing in each pass only the variants whose hash value falls in a certain integer range, e.g. variants with even hash values in the first pass and odd hash values in the second one.

You can also use this approach to split the work among multiple CPU/GPU cores.
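
A small Python sketch of the two-pass variant, selecting masked variants by hash parity; the function name and the use of Python's built-in `hash` are my own choices:

    def find_pairs_low_memory(strings, k, passes=2):
        pairs = []
        for p in range(passes):
            seen = {}                                    # holds only this pass's share of the variants
            for s in strings:
                for i in range(k):
                    masked = s[:i] + '*' + s[i + 1:]
                    if hash(masked) % passes != p:       # this variant belongs to another pass
                        continue
                    if masked in seen and seen[masked] != s:
                        pairs.append((seen[masked], s))
                    seen[masked] = s
        return pairs

Both members of a near-duplicate pair produce the same masked variant, hence the same hash, so they always land in the same pass; peak memory is roughly a $1/\text{passes}$ fraction of the single-pass version.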

Bulat

This is a short version of @SimonPrins' answer not involving hashes.

Assuming none of your strings contain an asterisk:

  1. Create a list of size $nk$ where each of your strings occurs in $k$ variations, each having one letter replaced by an asterisk (runtime $\mathcal{O}(nk^2)$)
  2. Sort that list (runtime $\mathcal{O}(nk^2\log nk)$)
  3. Check for duplicates by comparing subsequent entries of the sorted list (runtime $\mathcal{O}(nk^2)$)

An alternative solution with implicit usage of hashes in Python (can't resist the beauty):

    def has_almost_repeats(strings, k):
        # Note: exact duplicates in the input also produce a collision here.
        variations = [s[:i] + '*' + s[i+1:] for s in strings for i in range(k)]
        return len(set(variations)) < k * len(strings)
Bananach

I work every day on inventing and optimizing algos, so if you need every last bit of performance, this is the plan:

  • Check with * in each position independently, i.e. instead of a single job processing n*k string variants, start k independent jobs, each checking n strings. You can spread these k jobs among multiple CPU/GPU cores. This is especially important if you are going to check 2+ character diffs. A smaller job size also improves cache locality, which by itself can make the program 10x faster.
  • If you are going to use hash tables, use your own implementation employing linear probing and a ~50% load factor. It's fast and pretty easy to implement. Or use an existing implementation with open addressing. STL hash tables are slow due to their use of separate chaining.
  • You may try to prefilter data using a 3-state Bloom filter (distinguishing 0/1/1+ occurrences), as proposed by @AlexReynolds.
  • For each i from 0 to k-1 run the following job:
    • Generate 8-byte structs containing a 4-5 byte hash of each string (with * at the i-th position) and the string index, and then either sort them or build a hash table from these records.

For sorting, you may try the following combo:

  • first pass is MSD radix sort in 64-256 ways employing TLB trick
  • second pass is MSD radix sort in 256-1024 ways w/o TLB trick (64K ways total)
  • third pass is insertion sort to fix remaining inconsistencies
Bulat

Here is my take on a 2+ mismatches finder. Note that in this post I consider each string as circular: e.g. the substring of length 2 at index k-1 consists of the symbol str[k-1] followed by str[0], and the substring of length 2 at index -1 is the same!

If we have M mismatches between two strings of length k, they have a matching substring of length at least $mlen(k,M) = \lceil k/M \rceil - 1$, since in the worst case the mismatched symbols split the (circular) string into M equal-sized segments. E.g. with k=20 and M=4 the "worst" match may have the pattern abcd*efgh*ijkl*mnop*.

Now, the algorithm for searching all mismatches up to M symbols among strings of k symbols:

  • for each i from 0 to k-1
    • split all strings into groups by str[i..i+L-1], where L = mlen(k,M). E.g. if L=4 and you have an alphabet of only 4 symbols (as in DNA), this will make 256 groups.
    • Groups smaller than ~100 strings can be checked with a brute-force algorithm.
    • For larger groups, we should perform a secondary division:
      • Remove from every string in the group the L symbols we already matched.
      • for each j from i-L+1 to k-L-1
        • split all strings into groups by str[j..j+L1-1], where L1 = mlen(k-L,M). E.g. if k=20, M=4 and the alphabet has 4 symbols, then L=4 and L1=3, and this will make 64 groups.
        • the rest is left as an exercise for the reader :D

Why don't we start j from 0? Because we already made these groups with the same value of i, so a job with j<=i-L would be exactly equivalent to the job with the i and j values swapped.
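
A Python sketch of the first level of this grouping (circular substrings, brute force on small groups); the secondary division is omitted, and the names and the group-size threshold are illustrative:

    from collections import defaultdict
    from math import ceil

    def mlen(k, M):
        return ceil(k / M) - 1

    def circular_substring(s, i, L):
        return (s + s)[i:i + L]                  # the string is treated as circular

    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    def find_pairs_up_to_M(strings, k, M, small=100):
        L = mlen(k, M)
        pairs = set()
        for i in range(k):
            groups = defaultdict(list)
            for s in strings:
                groups[circular_substring(s, i, L)].append(s)
            for group in groups.values():
                if len(group) <= small:          # brute-force small groups
                    for a_idx in range(len(group)):
                        for b_idx in range(a_idx + 1, len(group)):
                            a, b = group[a_idx], group[b_idx]
                            if 0 < hamming(a, b) <= M:
                                pairs.add((min(a, b), max(a, b)))
                # larger groups would get the secondary division described above
        return pairs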

Further optimizations:

  • At every position, also consider the strings str[i..i+L-2] & str[i+L]. This only doubles the number of jobs created, but allows L to be increased by 1 (if my math is correct). So, e.g., instead of 256 groups, you will split the data into 1024 groups.
  • If some $L[i]$ becomes too small, we can always use the * trick: for each i in 0..k-1, remove the i-th symbol from each string and create a job searching for M-1 mismatches in those strings of length k-1.
Bulat