
I need to find a way to group a corpus of texts by document similarity. Since I have no experience in ML (only a few readings), I'd like to know whether calculating the tf-idf of each text is the right approach. I've read something about calculating and comparing the cosine similarity of tf-idf vectors, but I don't know how to process the results further.

By "almost equal" I mean that the order of the words matters; it's not just the same words in a different order.

I was thinking about using MLlib from Apache Spark to do the job (or at least use a Scala library).

Can anybody point me in the right direction, or even better add a link to a tutorial to this page?

Brandon Loudermilk
Max

3 Answers


Well, after further googling I found the solution: MinHash or SimHash will do the job. I also found a MinHash implementation written in Scala on GitHub, right at this link.
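For anyone landing here later, a minimal self-contained sketch of the MinHash idea in plain Scala (no external library; seeded `MurmurHash3` calls stand in for a family of independent hash functions, and all names are illustrative):

```scala
import scala.util.hashing.MurmurHash3

// MinHash sketch: for each of k seeded hash functions, keep the minimum hash
// value over a document's token set. The fraction of positions where two
// signatures agree estimates the Jaccard similarity of the token sets.
def minHashSignature(tokens: Set[String], numHashes: Int = 64): Vector[Int] =
  (0 until numHashes).toVector.map { seed =>
    tokens.map(t => MurmurHash3.stringHash(t, seed)).min
  }

def estimatedJaccard(a: Vector[Int], b: Vector[Int]): Double =
  a.zip(b).count { case (x, y) => x == y }.toDouble / a.length

val doc1 = "the quick brown fox jumps over the lazy dog".split(" ").toSet
val doc2 = "the quick brown fox jumped over a lazy dog".split(" ").toSet

// Approximates the true Jaccard similarity of the two token sets (0.7 here);
// more hash functions means a tighter estimate.
val similarity = estimatedJaccard(minHashSignature(doc1), minHashSignature(doc2))
```

The point of MinHash over pairwise cosine on tf-idf is scalability: signatures are small and fixed-size, and with locality-sensitive hashing on signature bands you can find candidate near-duplicate pairs without comparing every pair of documents.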

Max

You have to build a term matrix with TF-IDF and n-grams. Once the matrix is built, calculate the proximity between strings, and based on that proximity group the similar strings together.
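A minimal sketch of that pipeline in plain Scala (no Spark, to keep it self-contained; in MLlib you would use its own TF-IDF transformers instead, and the names here are mine): build tf-idf weighted term vectors, then score proximity with cosine similarity.

```scala
// Term frequency: how often each term occurs in one document, normalized.
def termFreq(doc: Seq[String]): Map[String, Double] =
  doc.groupBy(identity).view.mapValues(_.size.toDouble / doc.size).toMap

// tf-idf vectors for a whole corpus. Note: a term occurring in every
// document gets idf = ln(N/N) = 0 and drops out of the comparison.
def tfIdf(corpus: Seq[Seq[String]]): Seq[Map[String, Double]] = {
  val n = corpus.size.toDouble
  // document frequency: in how many documents does each term occur?
  val df = corpus.flatMap(_.distinct).groupBy(identity).view.mapValues(_.size).toMap
  corpus.map { doc =>
    termFreq(doc).map { case (term, tf) => term -> tf * math.log(n / df(term)) }
  }
}

// Cosine similarity over sparse term-weight maps.
def cosine(a: Map[String, Double], b: Map[String, Double]): Double = {
  val dot = a.keySet.intersect(b.keySet).toSeq.map(k => a(k) * b(k)).sum
  def norm(v: Map[String, Double]) = math.sqrt(v.values.map(x => x * x).sum)
  if (norm(a) == 0.0 || norm(b) == 0.0) 0.0 else dot / (norm(a) * norm(b))
}

val corpus = Seq(
  "apache spark cluster",
  "spark cluster computing",
  "banana bread recipe"
).map(_.split(" ").toSeq)

val vectors = tfIdf(corpus)
// The two Spark-related documents score higher with each other than
// either does with the unrelated one.
```

Grouping is then a matter of thresholding these pairwise scores (or feeding them to a clustering algorithm such as hierarchical clustering or k-means).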


If the order of words matters, then I would say try tf-idf, but on phrases: take two-word or three-word n-grams as features, generate tf-idf for them, and do the similarity matching on those.
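A sketch of why phrase features capture word order (plain Scala; Jaccard stands in here for whatever tf-idf/cosine scoring you put on top of the n-gram features, and all names are illustrative):

```scala
// Word bigrams: overlapping two-word phrases, via sliding(2).
def bigrams(text: String): Set[String] =
  text.toLowerCase.split("\\s+").toSeq.sliding(2).map(_.mkString(" ")).toSet

def unigrams(text: String): Set[String] =
  text.toLowerCase.split("\\s+").toSet

def jaccard(a: Set[String], b: Set[String]): Double =
  a.intersect(b).size.toDouble / a.union(b).size

val s1 = "the cat sat on the mat"
val s2 = "on the mat the cat sat" // same words, different order

val wordSim   = jaccard(unigrams(s1), unigrams(s2)) // 1.0: identical word sets
val phraseSim = jaccard(bigrams(s1), bigrams(s2))   // < 1.0: order differs
```

Single-word features see the two sentences as identical, while bigram features penalize the reordering, which is exactly the behavior the question asks for.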

The Wanderer