I am trying to compute the probability that a document belongs to each of the topics found by the LDA model. I have succeeded in producing the LDA model, but now I am stuck. My code is as follows:
## Libraries to download
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
## Tokenizing
tokenizer = RegexpTokenizer(r'\w+')
# create English stop words list
en_stop = stopwords.words('english')
# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()
import json
import nltk
import re
import pandas
appended_data = []
for i in range(2005, 2016):
    if i > 2013:
        df0 = pandas.DataFrame([json.loads(l) for l in open('SDM_%d.json' % i)])
        appended_data.append(df0)
    df1 = pandas.DataFrame([json.loads(l) for l in open('Scot_%d.json' % i)])
    df2 = pandas.DataFrame([json.loads(l) for l in open('APJ_%d.json' % i)])
    df3 = pandas.DataFrame([json.loads(l) for l in open('TH500_%d.json' % i)])
    df4 = pandas.DataFrame([json.loads(l) for l in open('DRSM_%d.json' % i)])
    appended_data.append(df1)
    appended_data.append(df2)
    appended_data.append(df3)
    appended_data.append(df4)
appended_data = pandas.concat(appended_data)
doc_set = appended_data.body
# list for tokenized documents in loop
texts = []
# loop through document list
for i in doc_set:
    # clean and tokenize document string
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)
    # remove stop words from tokens
    stopped_tokens = [tok for tok in tokens if tok not in en_stop]
    # add tokens to list
    texts.append(stopped_tokens)
# turn our tokenized documents into an id <-> term dictionary
dictionary = corpora.Dictionary(texts)
# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]
# generate LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=15, id2word = dictionary, passes=50)
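From the gensim docs, I believe I can also inspect the full per-document topic distribution directly (a minimal sketch of my own; I am assuming that minimum_probability=0 makes get_document_topics return every topic rather than only those above the default cutoff):

# Sketch: print the topic distribution for a few sample documents
for doc_id, bow in enumerate(corpus[:3]):
    doc_topics = ldamodel.get_document_topics(bow, minimum_probability=0)
    print(doc_id, doc_topics)  # list of (topic_id, probability) pairs

Each document's probabilities should sum to 1 across the 15 topics, which is the quantity I want to extract.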
I am trying to follow the method here, but I find it confusing. For example, when I try the following code:
from itertools import chain

# Assigning the topics to the documents in the corpus
lda_corpus = ldamodel[corpus]
# Find the threshold; let's set the threshold to be 1/#clusters.
# To prove that the threshold is sane, we average the sum of all probabilities:
scores = list(chain(*[[score for topic_id, score in topic]
                      for topic in [doc for doc in lda_corpus]]))
threshold = sum(scores)/len(scores)
print(threshold)
cluster1 = [j for i, j in zip(lda_corpus, doc_set) if i[0][1] > threshold]
print(cluster1)
It seems to work, since it retrieves the articles that belong to topic 1. Nevertheless, can someone explain the intuition behind this and whether there are other alternatives? For example, what is the intuition behind the threshold level here? Also, would the hard-assignment alternative sketched below be a reasonable substitute?
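To make "other alternatives" concrete, this is the hard-assignment variant I have in mind (a hypothetical sketch of my own, not taken from the tutorial; most_likely_topic is just a name I made up):

# Hypothetical alternative: hard-assign each document to its single most
# probable topic instead of filtering by a probability threshold
assignments = []
for bow in corpus:
    doc_topics = ldamodel.get_document_topics(bow)  # (topic_id, probability) pairs
    most_likely_topic = max(doc_topics, key=lambda pair: pair[1])[0]
    assignments.append(most_likely_topic)
print(assignments[:10])  # topic id assigned to each of the first ten documents

Thanks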