Higher level sentence similarity (meaning instead of 'just' embeddings)

Question

I am looking for the right model or approach for checking whether two sentences have the same meaning.

I know I can use embeddings to check similarity, but that is not what I am after. I suspect BERT-style LLMs have useful higher-level vectors, but I'm not sure how to apply them.

For example this sentence:

I am very lazy

has a meaning somewhat similar to:

I don't like to work hard

but not to:

A lazy horse is not very useful

Using 'just' embeddings (for example, the Hugging Face model all-MiniLM-L6-v2) gives results that are not useful.

What would be a good approach?

Valentas · Accepted Answer · 2023-12-08T12:17:15.770

The similarity used to train this model might be different from the similarity you expect.
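One way to see what the model considers similar is to score the sentences from the question directly. The following is a minimal sketch, assuming the sentence-transformers package is installed and that the model meant in the question is `sentence-transformers/all-MiniLM-L6-v2`. The helper names `cosine`, `rank_by_similarity` and `load_encoder` are illustrative, not part of the library.

```python
import math
from typing import Callable, List, Sequence, Tuple

Encoder = Callable[[Sequence[str]], List[Sequence[float]]]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_similarity(
    encode: Encoder, anchor: str, candidates: Sequence[str]
) -> List[Tuple[str, float]]:
    """Return (candidate, score) pairs, most similar to the anchor first."""
    vectors = encode([anchor, *candidates])
    anchor_vec = vectors[0]
    scored = [
        (text, cosine(anchor_vec, vec))
        for text, vec in zip(candidates, vectors[1:])
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def load_encoder(model_name: str = "sentence-transformers/all-MiniLM-L6-v2") -> Encoder:
    """Assumption: sentence-transformers is installed; the model is downloaded on first use."""
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return lambda texts: model.encode(list(texts)).tolist()


# Example usage (needs the library and network access, so it is left commented out):
# encode = load_encoder()
# for text, score in rank_by_similarity(
#     encode,
#     "I am very lazy",
#     ["I don't like to work hard", "A lazy horse is not very useful"],
# ):
#     print(f"{score:.3f}  {text}")
```

If the ranking disagrees with your own judgement, that is a sign the model was trained on a different notion of similarity than the one you need.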

A better approach would be to create your own large, good-quality training set of similar and dissimilar sentence pairs and fine-tune a pretrained model (the one from your question or another) using the same Sentence Transformers library (https://www.sbert.net).
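A rough sketch of what such fine-tuning could look like, assuming the sentence-transformers and torch packages are installed. The pairs and their scores below are hypothetical labels made up for illustration, `validate_pairs` and `fine_tune` are illustrative helper names, and the choice of `CosineSimilarityLoss` and the hyperparameters are assumptions rather than a recommendation; a real training set would need far more pairs.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str, float]

# Hypothetical labelled pairs: (sentence_a, sentence_b, similarity in [0, 1]).
# The scores encode *your* notion of "same meaning"; these are illustrative only.
EXAMPLE_PAIRS: List[Pair] = [
    ("I am very lazy", "I don't like to work hard", 0.9),
    ("I am very lazy", "A lazy horse is not very useful", 0.1),
]


def validate_pairs(pairs: Iterable[Pair]) -> List[Pair]:
    """Strip whitespace, skip pairs with an empty sentence, reject scores outside [0, 1]."""
    cleaned: List[Pair] = []
    for a, b, score in pairs:
        a, b, score = a.strip(), b.strip(), float(score)
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"similarity score out of range: {score}")
        if a and b:
            cleaned.append((a, b, score))
    return cleaned


def fine_tune(
    pairs: Iterable[Pair],
    base_model: str = "sentence-transformers/all-MiniLM-L6-v2",
    output_dir: str = "finetuned-similarity-model",  # hypothetical output path
    epochs: int = 1,
    batch_size: int = 16,
):
    """Fine-tune a pretrained sentence-transformers model on labelled pairs."""
    # Imported lazily so the data helpers above work without the heavy dependencies.
    from sentence_transformers import InputExample, SentenceTransformer, losses
    from torch.utils.data import DataLoader

    model = SentenceTransformer(base_model)
    examples = [
        InputExample(texts=[a, b], label=score)
        for a, b, score in validate_pairs(pairs)
    ]
    loader = DataLoader(examples, shuffle=True, batch_size=batch_size)
    # Trains the cosine similarity of the two embeddings to match the label.
    loss = losses.CosineSimilarityLoss(model)
    model.fit(train_objectives=[(loader, loss)], epochs=epochs, warmup_steps=10)
    model.save(output_dir)
    return model


# Example usage (downloads the base model and trains, so it is left commented out):
# model = fine_tune(EXAMPLE_PAIRS)
```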

Another currently available alternative is to experiment with prompts for the large commercial models (ChatGPT, GPT-4, Google Bard, etc.) in the hope that they understand what you want and do the task without any additional effort. For example, in my test ChatGPT said sentence B is more similar to A in 10 out of 10 retries.
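The retry test can be sketched as a simple vote. This assumes that A, B and C stand for the three sentences from the question in order, and that `ask_model` is a hypothetical function you supply that sends a prompt to whichever chat model you use and returns its reply as text; no particular vendor API is implied, and the prompt wording is only one possibility.

```python
from collections import Counter
from typing import Callable

PROMPT_TEMPLATE = (
    "Sentence A: {a}\n"
    "Sentence B: {b}\n"
    "Sentence C: {c}\n"
    "Which sentence is closer in meaning to sentence A? "
    "Answer with the single letter B or C."
)


def build_prompt(a: str, b: str, c: str) -> str:
    """Fill the comparison prompt with the three sentences."""
    return PROMPT_TEMPLATE.format(a=a, b=b, c=c)


def vote(
    ask_model: Callable[[str], str], a: str, b: str, c: str, retries: int = 10
) -> Counter:
    """Ask the same question several times and count the answers.

    Replies that do not start with B or C are counted under "?".
    """
    prompt = build_prompt(a, b, c)
    tally: Counter = Counter()
    for _ in range(retries):
        reply = ask_model(prompt).strip().upper()
        choice = reply[:1] if reply[:1] in ("B", "C") else "?"
        tally[choice] += 1
    return tally


# Example usage with your own ask_model implementation:
# tally = vote(
#     ask_model,
#     "I am very lazy",
#     "I don't like to work hard",
#     "A lazy horse is not very useful",
# )
# print(tally)
```

Repeating the question guards against a single lucky or unlucky answer, since these models do not always reply the same way to the same prompt.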
