5

I have two sets of newspaper articles. I train a model on the first newspaper dataset on its own to get the topics for each newspaper article.

E.g., first newspaper dataset
article_1 = {'politics': 0.1, 'nature': 0.8, ..., 'sports':0, 'wild-life':1}

I then train a separate model on my second newspaper dataset (from a different distributor) to get the topics for each of its articles.

E.g., second newspaper dataset (from a different distributor)
article_2 = {'people': 0.3, 'animals': 0.7, ...., 'business':0.7, 'sports':0.2}

As shown in the examples, the topics I get from the two datasets are different, so I manually matched similar topics based on their most frequent words.

I want to identify whether the two newspaper distributors publish the same news each week.

Hence, I would like to know whether there is a systematic way to compare the topics across the two corpora and measure their similarity. Please help me.

Smith

2 Answers

2

One way to compare the topics across two corpora and measure their similarity is Kullback-Leibler (KL) divergence, also known as relative entropy. KL divergence measures how one probability distribution diverges from a second, reference probability distribution.
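As a rough illustration, here is a minimal sketch of how topics from two models could be compared this way. It assumes each topic is available as a word-probability vector over a vocabulary shared by both corpora; the vocabulary and the numbers below are made up for the example. Because KL divergence is not symmetric, the sketch also computes the Jensen-Shannon divergence, a symmetrized variant built from two KL terms, and uses it to pair each topic from the first model with its closest topic in the second.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions on the same vocabulary."""
    # eps avoids log(0) and division by zero; both vectors are renormalised afterwards.
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by log(2)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical topic-word distributions over a shared toy vocabulary.
vocab = ["election", "vote", "goal", "match"]
topics_a = np.array([[0.50, 0.40, 0.05, 0.05],   # a "politics"-like topic
                     [0.05, 0.05, 0.50, 0.40]])  # a "sports"-like topic
topics_b = np.array([[0.10, 0.00, 0.45, 0.45],   # a "sports"-like topic
                     [0.45, 0.45, 0.10, 0.00]])  # a "politics"-like topic

# Pairwise distances: rows are topics of model A, columns are topics of model B.
dist = np.array([[js_divergence(a, b) for b in topics_b] for a in topics_a])

# For each topic in A, the index of the closest topic in B.
best_match = dist.argmin(axis=1)
print(dist)
print(best_match)
```

A small distance suggests that two topics describe the same theme. Greedy `argmin` matching is only one option: it can assign the same topic in B to several topics in A, so a one-to-one assignment may be preferable in practice.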

Another, more scalable algorithm is described in the paper "Topic Model Diagnostics: Assessing Domain Relevance via Topical Alignment".

Brian Spiering
1

Since you extracted the news using a TF-IDF approach, what you have is just one kind of feature (term frequency). I think you would need to add more features to your corpus to be able to match two news items as the same (or similar).

One new feature would be temporal: add a timestamp to each news item. This lets you check whether two news items (from different publishers) were published in the same period (one week, two weeks, etc.).
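To make the temporal idea concrete, here is a minimal sketch of a week-by-week comparison. It assumes every article carries a publication date and a topic-weight vector, and that the topics of both publishers have already been aligned to the same order (for example by the manual matching described in the question). The function names, dates and weights are hypothetical.

```python
from collections import defaultdict
from datetime import date

import numpy as np

def weekly_topic_profile(articles):
    """Average the topic vectors of all articles falling in each ISO (year, week)."""
    buckets = defaultdict(list)
    for published, topics in articles:
        iso = published.isocalendar()  # (ISO year, ISO week, ISO weekday)
        buckets[(iso[0], iso[1])].append(topics)
    return {week: np.mean(vectors, axis=0) for week, vectors in buckets.items()}

def cosine(u, v):
    """Cosine similarity between two topic-weight vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical articles: (publication date, weights over two aligned topics).
paper_a = [(date(2018, 1, 1), [0.9, 0.1]),
           (date(2018, 1, 3), [0.7, 0.3]),
           (date(2018, 1, 8), [0.2, 0.8])]
paper_b = [(date(2018, 1, 2), [0.8, 0.2]),
           (date(2018, 1, 10), [0.9, 0.1])]

profile_a = weekly_topic_profile(paper_a)
profile_b = weekly_topic_profile(paper_b)

# Compare only the weeks in which both publishers have at least one article.
shared_weeks = sorted(profile_a.keys() & profile_b.keys())
weekly_similarity = {week: cosine(profile_a[week], profile_b[week])
                     for week in shared_weeks}

for week, similarity in weekly_similarity.items():
    print(week, round(similarity, 3))
```

A similarity close to 1 for a given week indicates that the two publishers covered a similar mix of topics that week. Averaging per week is a design choice made for brevity; comparing individual articles within each week would give a finer-grained picture.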

A second one could be spatial: if you have the geolocation of the news, for example, you can add it to your training dataset. Something similar was done by Chi-Chun Pan. This would make you more confident that the news items refer to events in the same place.

Elder Santos