
I need to find a way to group a corpus of texts by document similarity. Since I have no experience in ML (only a few readings), I'd like to know whether calculating the tf-idf of each text is the right approach. I've read something about calculating and comparing the cosine similarity of tf-idf vectors, but I don't know how to process the results further.

By "almost equal" I mean that the order of the words matters; it's not just the same words in a different order.

I was thinking about using MLlib from Apache Spark to do the job (or at least use a Scala library).

Can anybody point me in the right direction, or even better add a link to a tutorial to this page?

Brandon Loudermilk
Max

3 Answers


Well, after further googling I found the solution: MinHash or SimHash will do the job. I also found a MinHash implementation written in Scala on GitHub, right at this link.
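For anyone landing here later, a minimal self-contained sketch of the MinHash idea in plain Scala (no external library; seeded `MurmurHash3` calls stand in for a family of independent hash functions, and all names are illustrative):

```scala
import scala.util.hashing.MurmurHash3

// MinHash sketch: for each of k seeded hash functions, keep the minimum hash
// value over a document's token set. The fraction of positions where two
// signatures agree estimates the Jaccard similarity of the token sets.
def minHashSignature(tokens: Set[String], numHashes: Int = 64): Vector[Int] =
  (0 until numHashes).toVector.map { seed =>
    tokens.map(t => MurmurHash3.stringHash(t, seed)).min
  }

def estimatedJaccard(a: Vector[Int], b: Vector[Int]): Double =
  a.zip(b).count { case (x, y) => x == y }.toDouble / a.length

val doc1 = "the quick brown fox jumps over the lazy dog".split(" ").toSet
val doc2 = "the quick brown fox jumped over a lazy dog".split(" ").toSet

// Approximates the true Jaccard similarity of the two token sets (0.7 here);
// more hash functions means a tighter estimate.
val similarity = estimatedJaccard(minHashSignature(doc1), minHashSignature(doc2))
```

The point of MinHash over pairwise cosine on tf-idf is scalability: signatures are small and fixed-size, and with locality-sensitive hashing on signature bands you can find candidate near-duplicate pairs without comparing every pair of documents.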

Max

You have to build a term matrix with TF-IDF and n-grams. Once the matrix is built, calculate the proximity between strings, and based on that proximity group the similar strings together.
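A minimal sketch of that pipeline in plain Scala (no Spark, to keep it self-contained; in MLlib you would use its own TF-IDF transformers instead, and the names here are mine): build tf-idf weighted term vectors, then score proximity with cosine similarity.

```scala
// Term frequency: how often each term occurs in one document, normalized.
def termFreq(doc: Seq[String]): Map[String, Double] =
  doc.groupBy(identity).view.mapValues(_.size.toDouble / doc.size).toMap

// tf-idf vectors for a whole corpus. Note: a term occurring in every
// document gets idf = ln(N/N) = 0 and drops out of the comparison.
def tfIdf(corpus: Seq[Seq[String]]): Seq[Map[String, Double]] = {
  val n = corpus.size.toDouble
  // document frequency: in how many documents does each term occur?
  val df = corpus.flatMap(_.distinct).groupBy(identity).view.mapValues(_.size).toMap
  corpus.map { doc =>
    termFreq(doc).map { case (term, tf) => term -> tf * math.log(n / df(term)) }
  }
}

// Cosine similarity over sparse term-weight maps.
def cosine(a: Map[String, Double], b: Map[String, Double]): Double = {
  val dot = a.keySet.intersect(b.keySet).toSeq.map(k => a(k) * b(k)).sum
  def norm(v: Map[String, Double]) = math.sqrt(v.values.map(x => x * x).sum)
  if (norm(a) == 0.0 || norm(b) == 0.0) 0.0 else dot / (norm(a) * norm(b))
}

val corpus = Seq(
  "apache spark cluster",
  "spark cluster computing",
  "banana bread recipe"
).map(_.split(" ").toSeq)

val vectors = tfIdf(corpus)
// The two Spark-related documents score higher with each other than
// either does with the unrelated one.
```

Grouping is then a matter of thresholding these pairwise scores (or feeding them to a clustering algorithm such as hierarchical clustering or k-means).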


If the order of words matters, then I would say try tf-idf, but on phrases: take two-word or three-word n-grams as features, generate tf-idf for them, and do the similarity matching on those.
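A sketch of why phrase features capture word order (plain Scala; Jaccard stands in here for whatever tf-idf/cosine scoring you put on top of the n-gram features, and all names are illustrative):

```scala
// Word bigrams: overlapping two-word phrases, via sliding(2).
def bigrams(text: String): Set[String] =
  text.toLowerCase.split("\\s+").toSeq.sliding(2).map(_.mkString(" ")).toSet

def unigrams(text: String): Set[String] =
  text.toLowerCase.split("\\s+").toSet

def jaccard(a: Set[String], b: Set[String]): Double =
  a.intersect(b).size.toDouble / a.union(b).size

val s1 = "the cat sat on the mat"
val s2 = "on the mat the cat sat" // same words, different order

val wordSim   = jaccard(unigrams(s1), unigrams(s2)) // 1.0: identical word sets
val phraseSim = jaccard(bigrams(s1), bigrams(s2))   // < 1.0: order differs
```

Single-word features see the two sentences as identical, while bigram features penalize the reordering, which is exactly the behavior the question asks for.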

The Wanderer