
I have around 1,000,000 strings of variable length (from 200 to 50,000 characters) over a 5-character alphabet (A, B, C, D, E).

What I actually want is to cluster them together if they are similar enough. By similar enough I mean they have an edit distance $< t$. This $t$ should depend on the length of the strings being compared. For example, when comparing two strings of length 200, $t$ could be 20, while for two strings of length 1000, $t$ could be 100.
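For concreteness, the kind of rule I have in mind looks roughly like this (the 10% ratio is just what the two examples above happen to correspond to; the helper name is made up):

```python
def edit_distance_threshold(m: int, n: int) -> int:
    """Hypothetical length-dependent threshold: roughly 10% of the shorter
    string's length (20 for length 200, 100 for length 1000)."""
    return min(m, n) // 10
```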

I've been reading a lot of literature lately, but I can only seem to find approaches that are not viable for this many strings, and I'm not very familiar with probabilistic approaches, so maybe you can help me out there.

The problem is that comparing them pairwise to cluster them would already be very costly even if each comparison could be done in $O(1)$ time (which is obviously not the case): with $10^6$ strings there are on the order of $5 \cdot 10^{11}$ pairs.

I've been reading about using indexes such as the BWT and combining them with things like the pigeonhole principle (given two strings $P$ and $T$, let $p_1, p_2, \ldots, p_{k+1}$ be a partitioning of $P$ into $k+1$ non-overlapping, non-empty substrings; if $P$ occurs in $T$ with up to $k$ edits, then at least one of $p_1, p_2, \ldots, p_{k+1}$ must match exactly), but as I said before, I think this is infeasible because of the number of strings and their length.
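Just to make the pigeonhole idea concrete, here is a rough sketch of how it could serve as a filter (the function name and the equal-length split are my own choices; it can only rule pairs out, never confirm a match):

```python
def pigeonhole_filter(P: str, T: str, k: int) -> bool:
    """If P occurs in T with at most k edits, at least one of k+1 pieces of P
    must occur in T exactly. Returns False only when that is impossible;
    True just means the pair cannot be ruled out."""
    pieces = k + 1
    base, extra = divmod(len(P), pieces)
    parts, start = [], 0
    for i in range(pieces):
        length = base + (1 if i < extra else 0)
        parts.append(P[start:start + length])
        start += length
    return any(part and part in T for part in parts)
```

In practice the exact matches would come from an index (e.g. a BWT/FM-index) rather than a naive `in` scan; the snippet only shows the structure of the filter.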

Also, some algorithms perform the comparison in $O(t \cdot \min(m,n))$ time, where $t$ is the desired edit distance (two strings are clustered together if their edit distance is $\le t$) and $m$ and $n$ are the lengths of the strings. But this can also be a problem, because with large strings $t$ can also be large.

So, do you know of anything that could help me in my case? Algorithms that can be parallelized are also very welcome, as I intend to run this on a large cluster with 400+ cores. Is there any fast probabilistic algorithm that can reduce the number of comparisons needed, for example?

Ivan

1 Answer


The length of each string is at least 200, and you only consider two strings similar if their edit distance is less than 20. So if $S,T$ are similar, then there must exist $U$ such that (1) $U$ is a substring of $S$, (2) $U$ is a substring of $T$, and (3) the length of $U$ is at least 10. To see why, partition $S$ into 20 non-overlapping blocks of length at least 10 each; fewer than 20 edits can touch at most 19 of them, so at least one block survives unchanged and appears exactly in $T$.

This suggests the following heuristic: find all length-10 substrings and look for collisions. In other words, build a hashmap whose keys are length-10 substrings and whose values are the containing strings: for each string $S$, enumerate all of its length-10 substrings $U$ and add a mapping from $U$ to $S$. This can be done in linear time; a rolling hash may improve efficiency.
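A minimal sketch of that construction, assuming plain Python with raw string slices as keys (in practice a rolling hash or a packed integer encoding of the 10-grams would be preferable for strings this long):

```python
from collections import defaultdict

K = 10  # substring length from the argument above

def build_kmer_index(strings):
    """Map each length-K substring to the set of indices of strings containing it."""
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for i in range(len(s) - K + 1):
            index[s[i:i + K]].add(sid)
    return index
```

Note that over a 5-letter alphabet there are only $5^{10} \approx 9.8$ million distinct length-10 strings, so the keys fit into small integers, and the per-key buckets can be processed independently, which is where the 400+ cores come in.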

Now iterate through the hashmap and find all pairs of strings $S,T$ that share a common length-10 substring. For each such pair, compute the edit distance and check whether it is $< 20$. I suggest using the optimized edit distance computation with an "early out": if the edit distance is $\ge 20$, you don't care what its actual value is. This can be done in $O(20 \cdot (n+m)) = O(n+m)$ time per pair, where $m,n$ are the lengths of $S,T$. Length 10 is large enough that there probably won't be too many pairs $S,T$ that share a common length-10 substring but have edit distance $\ge 20$, so you probably won't waste too much time computing the edit distance of pairs you aren't interested in.
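Here is a sketch of what the early-out computation could look like, restricting the dynamic program to a band of width $2t+1$ around the main diagonal (the function name and interface are my own):

```python
def edit_distance_below(s: str, t: str, limit: int):
    """Return the edit distance of s and t if it is < limit, otherwise None.
    Only a band of width 2*limit+1 around the main diagonal is computed, and
    the loop exits early once every cell in the band is already >= limit."""
    if abs(len(s) - len(t)) >= limit:          # distance is at least the length gap
        return None
    INF = limit + 1                            # any value >= limit is "don't care"
    prev = list(range(len(t) + 1))             # DP row for the empty prefix of s
    for i in range(1, len(s) + 1):
        lo, hi = max(1, i - limit), min(len(t), i + limit)
        curr = [INF] * (len(t) + 1)
        if i <= limit:
            curr[0] = i                        # column 0 is still inside the band
        for j in range(lo, hi + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # delete s[i-1]
                          curr[j - 1] + 1,     # insert t[j-1]
                          prev[j - 1] + cost,  # substitute or match
                          INF)                 # clamp: exact values above limit don't matter
        if min(curr[lo - 1:hi + 1]) >= limit:  # early out: row minima never decrease
            return None
        prev = curr
    d = prev[len(t)]
    return d if d < limit else None
```

Each candidate pair from the hashmap would then be checked with something like `edit_distance_below(S, T, 20)` (or whatever length-dependent threshold you settle on), and only pairs that return a number are clustered together.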

There are other approaches based on metric trees, BK-trees, and similar data structures; see, e.g., Efficient map data structure supporting approximate lookup. You might also look into the computational biology literature, which deals with a similar problem when clustering DNA sequences. There the alphabet has 4 characters rather than 5 and the strings can be much longer, but they also need to find similar pairs in a large database, and practical, efficient algorithms have been developed for it.

D.W.