
I want to compare two corpora (two different collections of texts) using Topic Modeling. I trained the model separately on the two collections and manually matched similar topics based on their frequent words.

I was wondering if there is a systematic way of comparing the topics across two corpora and measuring their similarity.

saghi

1 Answer


In my view, training separate models and matching their topics by hand is not a valid way to compare the corpora.

Note that a corpus does not have one unique topic model, even for fixed parameters such as the number of topics and the topic modelling algorithm. Different runs with different random seeds will give you different topic models for the same corpus.

So any such comparison is a comparison of two specific topic models, not a comparison of the corpora themselves.

A more valid approach is to combine both corpora into one super-corpus, fit a single topic model on it, and then investigate how the topics are distributed across the two sub-corpora formed by the original corpora 1 and 2.
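As a rough illustration of the super-corpus idea, here is a minimal sketch. It assumes scikit-learn's `LatentDirichletAllocation` as the topic model (any other implementation would do), and the helper names `fit_joint_topics` and `topic_shares_by_corpus`, the stop-word setting, and the default number of topics are all my own choices for the example, not something prescribed above.

```python
import numpy as np


def fit_joint_topics(docs, n_topics=10, seed=0):
    """Fit one LDA model on the combined corpus.

    Returns (doc_topic, vocabulary, model); each row of doc_topic is the
    topic distribution of one document.
    """
    # Imported lazily so the aggregation helper below works without scikit-learn.
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer(stop_words="english")  # assumption: English texts
    counts = vectorizer.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    doc_topic = lda.fit_transform(counts)
    return doc_topic, vectorizer.get_feature_names_out(), lda


def topic_shares_by_corpus(doc_topic, labels):
    """Average the per-document topic distributions within each sub-corpus.

    labels[i] names the sub-corpus that document i came from.
    """
    doc_topic = np.asarray(doc_topic, dtype=float)
    labels = list(labels)
    label_array = np.asarray(labels)
    shares = {}
    for name in dict.fromkeys(labels):  # unique labels, original order
        shares[name] = doc_topic[label_array == name].mean(axis=0)
    return shares
```

With two lists of texts, say `docs_1` and `docs_2`, you would call `fit_joint_topics(docs_1 + docs_2)`, build `labels = ["corpus1"] * len(docs_1) + ["corpus2"] * len(docs_2)`, and pass both to `topic_shares_by_corpus`. Topics whose average share differs strongly between the two sub-corpora are the ones that separate them. Because of the seed issue mentioned above, it is probably worth repeating this with several values of `seed` and checking that the differences you see are stable.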