Algorithm for grouping "similar" strings

Question

I am trying to group similar log error messages so that I can enable metrics on them. Here are some live sample strings:

GROUP 1:
Error:ESS-07033 Job logic indicated a system error occurred while executing an asynchronous java job for request 50958581. Job error is: <env:Body xmlns:env="http://schemas.xmlsoap.org/soap/envelope/">;  <ns0:processOrderImportRequestAsyncResponse xmlns:ns0="http://xmlns.oracle.com/apps/scm/doo/decomposition/receiveTransform/receiveSalesOrder/service/types/">;    <ns1:result xmlns:ns0="http://xmlns.oracle.com/apps/scm/doo/decomposition/receiveTransform/receiveSalesOrder/service/" xmlns:xsi="http
Error:ESS-07033 Job logic indicated a system error occurred while executing an asynchronous java job for request 50958580. Job error is: <env:Body xmlns:env="http://schemas.xmlsoap.org/soap/envelope/">;  <ns0:processOrderImportRequestAsyncResponse xmlns:ns0="http://xmlns.oracle.com/apps/scm/doo/decomposition/receiveTransform/receiveSalesOrder/service/types/">;    <ns1:result xmlns:ns0="http://xmlns.oracle.com/apps/scm/doo/decomposition/receiveTransform/receiveSalesOrder/service/" xmlns:xsi="http
Error:ESS-07033 Job logic indicated a system error occurred while executing an asynchronous java job for request 50958576. Job error is: <env:Body xmlns:env="http://schemas.xmlsoap.org/soap/envelope/">;  <ns0:processOrderImportRequestAsyncResponse xmlns:ns0="http://xmlns.oracle.com/apps/scm/doo/decomposition/receiveTransform/receiveSalesOrder/service/types/">;    <ns1:result xmlns:ns0="http://xmlns.oracle.com/apps/scm/doo/decomposition/receiveTransform/receiveSalesOrder/service/" xmlns:xsi="http
GROUP 2:
Error:ESS-07034 Job logic indicated a business error occurred while executing an asynchronous java job for request 28928106. Job error is: BIP job failed.
Error:ESS-07034 Job logic indicated a business error occurred while executing an asynchronous java job for request 987285. Job error is: BIP job failed.
Error:ESS-07034 Job logic indicated a business error occurred while executing an asynchronous java job for request 2893694. Job error is: BIP job failed.
GROUP 3:
Error:ESS-07034 Job logic indicated a business error occurred while executing an asynchronous java job for request 1967906. Job error is: BIP job failed with delivery error.
Error:ESS-07034 Job logic indicated a business error occurred while executing an asynchronous java job for request 6396771. Job error is: BIP job failed with delivery error.
Error:ESS-07034 Job logic indicated a business error occurred while executing an asynchronous java job for request 3759205. Job error is: BIP job failed with delivery error.

As you can see, these are very similar; only IDs, or URLs that might be specific to a client, differentiate them.

In general, uniqueness may be due to timestamps, UUIDs, URLs, etc.

If I take this same sample and "group" the messages, I get 3 distinct groups:

GROUP 1:
Error:ESS-07033 Job logic indicated a system error occurred while executing an asynchronous java job for request {request_id}. Job error is: <env:Body xmlns:env="{url}">; <ns0:processOrderImportRequestAsyncResponse xmlns:ns0="{url}">; <ns1:result xmlns:ns0="{url}" xmlns:xsi="http
GROUP 2:
Error:ESS-07034 Job logic indicated a business error occurred while executing an asynchronous java job for request {request_id}. Job error is: BIP job failed.
GROUP 3:
Error:ESS-07034 Job logic indicated a business error occurred while executing an asynchronous java job for request {request_id}. Job error is: BIP job failed with delivery error.

Right now, these logs all go through a series of reductive regular expressions that are extremely CPU-intensive and slow, and that require maintenance (the logs can change, and so the regular expressions must change too).

I have been trying to think of a way to have these automatically group by some level of similarity. Initially, it seemed like a trie might work, but then what would the similarity scoring be? At some point, anything past a certain length is just gibberish, and similarity could be measured up to that point.

So, the question:

What algorithm would:

be a lot less CPU-intensive than a series of reductive regular expressions,
need a lot less maintenance than hand-coding and evaluating regular expressions,
account for a subjective, and morphing, understanding of "similar", and
make each grouping uniquely identifiable?

I was thinking that the "similarity score" would be something like, in words:

Break the string into tokens
More identical tokens in identical positions means a higher score
Tokens matching further down the tree carry less and less weight
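
A minimal sketch of that scoring idea in Python. The whitespace tokenization, the function name `positional_similarity`, and the geometric `decay` weighting are all assumptions made for illustration; the sample messages are shortened versions of the ones above.

```python
def positional_similarity(a: str, b: str, decay: float = 0.9) -> float:
    """Return a score in [0, 1].

    Tokens are compared position by position. A match at position i
    contributes decay**i, so matches further along carry less weight.
    """
    ta, tb = a.split(), b.split()
    n = max(len(ta), len(tb))
    if n == 0:
        return 1.0  # two empty messages are treated as identical
    total = 0.0
    matched = 0.0
    for i in range(n):
        weight = decay ** i
        total += weight
        if i < len(ta) and i < len(tb) and ta[i] == tb[i]:
            matched += weight
    return matched / total


# Shortened sample messages, one from each group.
g1 = ("Error:ESS-07033 Job logic indicated a system error occurred while "
      "executing an asynchronous java job for request 50958581. "
      "Job error is: <env:Body")
g2 = ("Error:ESS-07034 Job logic indicated a business error occurred while "
      "executing an asynchronous java job for request 28928106. "
      "Job error is: BIP job failed.")
g3 = ("Error:ESS-07034 Job logic indicated a business error occurred while "
      "executing an asynchronous java job for request 1967906. "
      "Job error is: BIP job failed with delivery error.")

score_2_vs_3 = positional_similarity(g2, g3)
score_2_vs_1 = positional_similarity(g2, g1)
print(score_2_vs_3, score_2_vs_1)
```

With this weighting, a mismatch in the first token (the error code) costs far more than a mismatch in the trailing words, which is one way to make groups 2 and 3 score closer to each other than to group 1.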

So, if you look at these 3 groupings, the messages within each group are identical except for the ID.

But groups 2 and 3 are more similar to each other than they are to group 1, because only the last few words are different.

score 1 · Answer 1 · answered Jan 28 '25 at 20:19

There are many approaches you could experiment with or tweak. It's hard to know what might work best a priori; you might need to try several approaches and see which seems to perform the best.

You might try a locality-sensitive hash, such as MinHash or the Nilsimsa hash, combined with clustering.
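
As a rough illustration of the MinHash idea, here is a standard-library-only sketch. The word-shingle size `k`, the number of hash functions `num_perm`, and the use of salted BLAKE2b as the hash family are assumptions chosen for the example, not recommendations; a real setup would more likely use a dedicated library and feed the signatures into LSH buckets or a clustering step.

```python
import hashlib


def shingles(message: str, k: int = 3) -> set[str]:
    """Split a message into overlapping runs of k whitespace tokens."""
    tokens = message.split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash_signature(items: set[str], num_perm: int = 128) -> list[int]:
    """For each of num_perm salted hash functions, keep the minimum hash."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of agreeing slots estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


# Two messages from the same group and one (shortened) from another group.
msg_a = ("Error:ESS-07034 Job logic indicated a business error occurred while "
         "executing an asynchronous java job for request 28928106. "
         "Job error is: BIP job failed.")
msg_b = ("Error:ESS-07034 Job logic indicated a business error occurred while "
         "executing an asynchronous java job for request 987285. "
         "Job error is: BIP job failed.")
msg_c = ("Error:ESS-07033 Job logic indicated a system error occurred while "
         "executing an asynchronous java job for request 50958581. "
         "Job error is: <env:Body")

sig_a, sig_b, sig_c = (minhash_signature(shingles(m)) for m in (msg_a, msg_b, msg_c))
same_group = estimated_jaccard(sig_a, sig_b)
cross_group = estimated_jaccard(sig_a, sig_c)
print(same_group, cross_group)
```

The signatures are fixed-length, so comparing two messages costs the same regardless of message length, and only the signature needs to be stored per message.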

You might try an embedding model combined with clustering.

You could store the messages in a trie and use the length of the longest common prefix of two messages as the similarity metric between them. You would probably want to supplement this with other metrics, though, and I suspect that a better metric might be something like the Jaccard distance applied to the set of words/tokens in each message.
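
Both metrics are short to write down. The sketch below computes them directly on pairs of strings rather than through an actual trie, and assumes whitespace tokenization for the Jaccard part; the function names are hypothetical.

```python
def common_prefix_length(a: str, b: str) -> int:
    """Number of leading characters shared by a and b.

    In a character trie this is the depth at which the two paths diverge.
    """
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def jaccard_distance(a: str, b: str) -> float:
    """1 - |A ∩ B| / |A ∪ B| over the sets of whitespace tokens."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 0.0  # two empty messages are treated as identical
    return 1.0 - len(sa & sb) / len(sa | sb)


# Shortened sample messages, one from each group.
g1 = ("Error:ESS-07033 Job logic indicated a system error occurred while "
      "executing an asynchronous java job for request 50958581. "
      "Job error is: <env:Body")
g2 = ("Error:ESS-07034 Job logic indicated a business error occurred while "
      "executing an asynchronous java job for request 28928106. "
      "Job error is: BIP job failed.")
g3 = ("Error:ESS-07034 Job logic indicated a business error occurred while "
      "executing an asynchronous java job for request 1967906. "
      "Job error is: BIP job failed with delivery error.")

print(common_prefix_length(g2, g3), common_prefix_length(g2, g1))
print(jaccard_distance(g2, g3), jaccard_distance(g2, g1))
```

On these samples the prefix metric stops at the request ID for groups 2 and 3 and at the last digit of the error code for groups 2 and 1, which shows why it is sensitive to where the variable part sits and benefits from a second metric.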

You could also try a log-parsing method: infer a small number of templates from these log messages, match each log message to a template, and then use this parsed form to compute a similarity metric and cluster the log messages. See, e.g., LILAC (code).
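
To make the template idea concrete, here is a deliberately simplified sketch. It is not LILAC; it is a naive scheme that buckets messages by token count and merges a message into an existing template when enough positions agree, replacing the differing positions with a wildcard. The class names, the `<*>` wildcard, and the 0.7 threshold are all assumptions for illustration.

```python
from dataclasses import dataclass

WILDCARD = "<*>"


@dataclass
class Template:
    group_id: int       # stable identifier, assigned when the template is created
    tokens: list[str]   # constant tokens, with WILDCARD at variable positions
    count: int = 0      # number of messages matched so far


class TemplateMiner:
    def __init__(self, threshold: float = 0.7):
        self.threshold = threshold
        self.by_length: dict[int, list[Template]] = {}
        self.next_id = 0

    def add(self, message: str) -> Template:
        tokens = message.split()
        bucket = self.by_length.setdefault(len(tokens), [])
        best, best_score = None, 0.0
        for template in bucket:
            same = sum(1 for x, y in zip(template.tokens, tokens)
                       if x == y or x == WILDCARD)
            score = same / len(tokens) if tokens else 1.0
            if score > best_score:
                best, best_score = template, score
        if best is not None and best_score >= self.threshold:
            # Generalize: positions that differ become wildcards.
            best.tokens = [x if x == y else WILDCARD
                           for x, y in zip(best.tokens, tokens)]
        else:
            best = Template(self.next_id, tokens)
            self.next_id += 1
            bucket.append(best)
        best.count += 1
        return best


prefix = ("Error:ESS-07034 Job logic indicated a business error occurred while "
          "executing an asynchronous java job for request ")
messages = [
    prefix + "28928106. Job error is: BIP job failed.",
    prefix + "987285. Job error is: BIP job failed.",
    prefix + "1967906. Job error is: BIP job failed with delivery error.",
    prefix + "6396771. Job error is: BIP job failed with delivery error.",
]

miner = TemplateMiner()
group_ids = [miner.add(m).group_id for m in messages]
templates = {t.group_id: " ".join(t.tokens)
             for bucket in miner.by_length.values() for t in bucket}
print(group_ids)
print(templates)
```

Each message costs one pass over the templates of the same length, with no regular expressions involved, and the `group_id` gives every grouping a unique identifier. The obvious weakness is that messages whose variable parts change the token count land in different buckets, which is one of the problems dedicated log parsers try to address.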
