
Do people use only n-grams, or 1,2,3,...,n-grams, in both matrix factorisation and generative models in Topic Modeling?

I've been trying to understand the basics of topic modeling and learned that there are two main approaches: matrix factorisation methods like LSA and NNMF, and generative models like LDA and pLSA.

However, while reading, a question came up: do people use only n-grams, or all of 1,2,...,n-grams, in both matrix factorisation and generative models? For example, if n=5, do people use only 5-grams, or do they use all unigrams, bigrams, trigrams, 4-grams and 5-grams when creating the document-term matrix? The sketch below shows the difference I mean.
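To make that concrete, here is a minimal sketch using scikit-learn's CountVectorizer (the toy corpus is made up; `ngram_range` controls which n-gram sizes go into the vocabulary):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["topic models cluster words into topics",
            "matrix factorisation decomposes the document term matrix"]

    # Only 5-grams: every feature is a sequence of exactly 5 words
    only_5grams = CountVectorizer(ngram_range=(5, 5)).fit(docs)
    print(len(only_5grams.vocabulary_))   # few, very sparse features

    # All of 1-grams up to 5-grams in one vocabulary
    up_to_5grams = CountVectorizer(ngram_range=(1, 5)).fit(docs)
    print(len(up_to_5grams.vocabulary_))  # many more features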

If the answer depends on context, what are the reasons for preferring one approach over the other?

Thanks in advance.

2 Answers


It is best to use 1,2,...,n-grams. Giving the model more features allows it to learn patterns in the data better. Often a threshold on the number of occurrences is used to filter out infrequent n-grams, as in the sketch below.
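For example, a minimal sketch with scikit-learn, where `min_df` acts as the occurrence threshold (the corpus and threshold value are made up for illustration):

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = [
        "topic models find topics",
        "topic models use word counts",
        "matrix factorisation uses word counts",
    ]

    # Keep all 1-, 2- and 3-grams that occur in at least 2 documents;
    # min_df filters out the infrequent n-grams
    vectorizer = CountVectorizer(ngram_range=(1, 3), min_df=2)
    doc_term_matrix = vectorizer.fit_transform(corpus)
    print(sorted(vectorizer.vocabulary_))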

Brian Spiering

For traditional topic modelling approaches like LSA, LDA, etc., no, it's done only with unigrams. These methods rely on word co-occurrences to cluster words into semantically meaningful topics. In a regular text most words appear only once or twice, so they contribute few usable co-occurrences. If one considers n-grams instead of unigrams, there would be even fewer co-occurrences left in a text, so it's very unlikely that these models would be able to correctly cluster the n-grams by topic. A standard unigram pipeline is sketched below.
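For reference, here is what the usual unigram-only setup looks like with scikit-learn (a minimal sketch on a made-up corpus, not a full preprocessing pipeline):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    corpus = [
        "the cat sat on the mat",
        "dogs and cats are pets",
        "stock markets fell sharply today",
        "investors sold shares on the market",
    ]

    # Unigram document-term matrix: each feature is a single word
    vectorizer = CountVectorizer(stop_words="english")
    dtm = vectorizer.fit_transform(corpus)

    # LDA clusters words into topics based on which words co-occur in documents
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    lda.fit(dtm)

    # Print the top words of each topic
    words = vectorizer.get_feature_names_out()
    for k, weights in enumerate(lda.components_):
        top = [words[i] for i in weights.argsort()[::-1][:4]]
        print(f"topic {k}: {top}")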

Mixing different levels of n-grams might work (I'm not sure whether it has been done before), but I suspect that this could cause inconsistencies in the conditional probabilities of a word given a topic, p(word | topic).

Erwan