I am trying to group similar log error messages so that I can enable metrics on them. Here are some live sample strings:
GROUP 1:
Error:ESS-07033 Job logic indicated a system error occurred while executing an asynchronous java job for request 50958581. Job error is: <env:Body xmlns:env="http://schemas.xmlsoap.org/soap/envelope/">; <ns0:processOrderImportRequestAsyncResponse xmlns:ns0="http://xmlns.oracle.com/apps/scm/doo/decomposition/receiveTransform/receiveSalesOrder/service/types/">; <ns1:result xmlns:ns0="http://xmlns.oracle.com/apps/scm/doo/decomposition/receiveTransform/receiveSalesOrder/service/" xmlns:xsi="http
Error:ESS-07033 Job logic indicated a system error occurred while executing an asynchronous java job for request 50958580. Job error is: <env:Body xmlns:env="http://schemas.xmlsoap.org/soap/envelope/">; <ns0:processOrderImportRequestAsyncResponse xmlns:ns0="http://xmlns.oracle.com/apps/scm/doo/decomposition/receiveTransform/receiveSalesOrder/service/types/">; <ns1:result xmlns:ns0="http://xmlns.oracle.com/apps/scm/doo/decomposition/receiveTransform/receiveSalesOrder/service/" xmlns:xsi="http
Error:ESS-07033 Job logic indicated a system error occurred while executing an asynchronous java job for request 50958576. Job error is: <env:Body xmlns:env="http://schemas.xmlsoap.org/soap/envelope/">; <ns0:processOrderImportRequestAsyncResponse xmlns:ns0="http://xmlns.oracle.com/apps/scm/doo/decomposition/receiveTransform/receiveSalesOrder/service/types/">; <ns1:result xmlns:ns0="http://xmlns.oracle.com/apps/scm/doo/decomposition/receiveTransform/receiveSalesOrder/service/" xmlns:xsi="http
GROUP 2:
Error:ESS-07034 Job logic indicated a business error occurred while executing an asynchronous java job for request 28928106. Job error is: BIP job failed.
Error:ESS-07034 Job logic indicated a business error occurred while executing an asynchronous java job for request 987285. Job error is: BIP job failed.
Error:ESS-07034 Job logic indicated a business error occurred while executing an asynchronous java job for request 2893694. Job error is: BIP job failed.
GROUP 3:
Error:ESS-07034 Job logic indicated a business error occurred while executing an asynchronous java job for request 1967906. Job error is: BIP job failed with delivery error.
Error:ESS-07034 Job logic indicated a business error occurred while executing an asynchronous java job for request 6396771. Job error is: BIP job failed with delivery error.
Error:ESS-07034 Job logic indicated a business error occurred while executing an asynchronous java job for request 3759205. Job error is: BIP job failed with delivery error.
As you can see, these are very similar, with only IDs differentiating them, or URLs that might be specific to a client.
In general, uniqueness may be due to timestamps, UUIDs, URLs, etc.
If I take this same sample and "group" the messages, I get 3 distinct groups:
GROUP 1:
Error:ESS-07033 Job logic indicated a system error occurred while executing an asynchronous java job for request {request_id}. Job error is: <env:Body xmlns:env="{url}">; <ns0:processOrderImportRequestAsyncResponse xmlns:ns0="{url}">; <ns1:result xmlns:ns0="{url}" xmlns:xsi="http
GROUP 2:
Error:ESS-07034 Job logic indicated a business error occurred while executing an asynchronous java job for request {request_id}. Job error is: BIP job failed with delivery error.
GROUP 3:
Error:ESS-07034 Job logic indicated a business error occurred while executing an asynchronous java job for request {request_id}. Job error is: BIP job failed.
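To make the target output concrete, here is a minimal Python sketch of producing such templates by masking tokens instead of running regular expressions. The function names (`mask_token`, `to_template`, `group_id`) and the masking rules (any all-digit token is treated as an ID, anything containing `://` is treated as a URL) are assumptions for illustration only, not an existing implementation.

```python
import hashlib


def mask_token(token: str) -> str:
    """Replace a variable-looking token with a placeholder (assumed rules)."""
    # Assumption: a token that is all digits (ignoring trailing punctuation) is an ID.
    core = token.rstrip(".,;:")
    if core.isdigit():
        return "{request_id}" + token[len(core):]
    # Assumption: anything containing "://" holds a URL, possibly inside quotes.
    idx = token.find("://")
    if idx != -1:
        start = token.rfind('"', 0, idx) + 1  # just after the opening quote, or 0
        end = token.find('"', idx)
        if end == -1:
            end = len(token)
        return token[:start] + "{url}" + token[end:]
    return token


def to_template(line: str) -> str:
    """Mask every whitespace-separated token and rejoin with single spaces."""
    return " ".join(mask_token(t) for t in line.split())


def group_id(line: str) -> str:
    """Stable identifier for a group: a short hash of its template."""
    return hashlib.sha1(to_template(line).encode("utf-8")).hexdigest()[:12]
```

Two lines that differ only in masked tokens get the same template and therefore the same `group_id`, which covers the "uniquely identifiable" requirement but not the fuzzier notion of similarity.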
Right now, these logs all go through a series of reductive regular expressions that are extremely CPU intensive and slow, and that require maintenance (the logs can change, and so must the regular expressions).
I have been trying to think of a way to have these group automatically by some level of similarity. Initially, it seemed like a trie might work, but then what would the similarity scoring be? Past a certain length the rest of the message is just gibberish, so similarity could be measured only up to that point.
So, the question:
What algorithm would:
- be a lot less CPU intensive than a series of reductive regular expressions,
- need a lot less maintenance than hand-coding and evaluating regular expressions,
- account for a subjective, and morphing, understanding of "similar", and
- give each grouping a unique identifier?
I was thinking that the "similarity score" would be something like, in words:
- Break the string into tokens
- More identical tokens in identical positions means a higher score
- Tokens matching further down the tree carry less-and-less weight
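A rough Python sketch of that scoring idea, to show what I mean. The function name `similarity`, the whitespace tokenization, and the geometric `decay` weighting are assumptions chosen for illustration; other weightings would fit the description equally well.

```python
def similarity(a: str, b: str, decay: float = 0.9) -> float:
    """Position-weighted token overlap between two messages, in [0, 1]."""
    tokens_a, tokens_b = a.split(), b.split()
    n = max(len(tokens_a), len(tokens_b))
    if n == 0:
        return 1.0  # two empty messages are trivially identical
    # The token at position i has weight decay**i, so later tokens count less.
    weights = [decay ** i for i in range(n)]
    # zip stops at the shorter message, so positions that exist in only
    # one message contribute nothing to the matched weight.
    matched = sum(w for w, x, y in zip(weights, tokens_a, tokens_b) if x == y)
    return matched / sum(weights)
```

Note that with raw messages the differing request IDs count as a mismatch at their position, so even two lines from the same group score slightly below 1.0 unless the IDs are masked first.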
So, if you look at these 3 groupings, the messages are identical within their group, with the exception of the ID.
But groups 2 and 3 are more similar to each other than they are to group 1, because only the last few words are different.