Questions tagged [lda]

Latent Dirichlet Allocation (LDA) is a generative probabilistic model used for topic modeling.

If observations are words collected into documents, LDA posits that each document is a mixture of a small number of topics and that each word's creation is attributable to one of the document's topics; in other words, it represents documents as mixtures of topics that emit words with certain probabilities.

Popular software packages to perform LDA include gensim and scikit-learn in Python and MALLET in Java.

It should not be confused with Linear Discriminant Analysis, a supervised learning procedure for classifying observations into a set of categories.
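
To make the generative picture concrete, here is a minimal sketch of fitting LDA in Python with scikit-learn; the toy corpus and parameter values below are illustrative assumptions, not recommendations:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus with two rough themes (pets, finance).
docs = [
    "the cat sat on the mat",
    "dogs and cats make friendly pets",
    "stock markets fell sharply today",
    "investors worry about rising inflation",
]

# LDA models raw term counts, so use a count vectorizer rather than TF-IDF.
counts = CountVectorizer(stop_words="english").fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # each row is the document's topic mixture
print(doc_topics.round(2))
```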

118 questions
62 votes, 6 answers

Latent Dirichlet Allocation vs Hierarchical Dirichlet Process

Latent Dirichlet Allocation (LDA) and Hierarchical Dirichlet Process (HDP) are both topic modeling processes. The major difference is that LDA requires the number of topics to be specified, while HDP does not. Why is that so? And what are the…
alvas
25 votes, 2 answers

What do the alpha and beta hyperparameters contribute to in Latent Dirichlet Allocation?

LDA has two hyperparameters, and tuning them changes the induced topics. What do the alpha and beta hyperparameters contribute to LDA? How do the topics change if one or the other hyperparameter increases or decreases? Why are they hyperparameters…
alvas
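
For orientation, a minimal gensim sketch of where these priors enter; note that gensim names the topic-word prior eta rather than beta, and the values below are illustrative assumptions:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["cat", "mat"], ["dog", "cat", "pet"], ["stock", "market", "inflation"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# alpha is the Dirichlet prior on document-topic mixtures; eta (beta in most
# papers) is the prior on topic-word distributions. Smaller values favor
# sparser mixtures: fewer topics per document, fewer dominant words per topic.
lda = LdaModel(corpus, num_topics=2, id2word=dictionary,
               alpha=0.1, eta=0.01, random_state=0)
```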
17 votes, 3 answers

Why should we not feed LDA with TF-IDF input?

Can someone explain why we cannot feed an LDA topic model with TF-IDF input? What is wrong with this approach conceptually?
sariii
10 votes, 3 answers

Clustering of documents using the topics derived from Latent Dirichlet Allocation

I want to use Latent Dirichlet Allocation for a project and I am using Python with the gensim library. After finding the topics I would like to cluster the documents using an algorithm such as k-means (ideally I would like to use a good one for…
Swan87
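
One common recipe, sketched here under the assumption of a fitted gensim model named lda and its corpus, is to turn each document's topic distribution into a dense vector and cluster those vectors with k-means:

```python
import numpy as np
from sklearn.cluster import KMeans

# Assumes `lda` and `corpus` come from a fitted gensim LdaModel.
# minimum_probability=0.0 keeps every topic, so all vectors share length num_topics.
X = np.array([
    [prob for _, prob in lda.get_document_topics(bow, minimum_probability=0.0)]
    for bow in corpus
])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```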
9 votes, 1 answer

Gensim LDA model: return keywords based on relevance (λ - lambda) value

I am using the gensim library for topic modeling, more specifically LDA. I created my corpus, my dictionary, and my LDA model. With the help of the pyLDAvis library I visualized the results. When I print the words with the highest probability on…
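
The relevance pyLDAvis displays is relevance(w, t | λ) = λ log p(w|t) + (1 − λ) log[p(w|t)/p(w)], which can be recomputed by hand; a sketch, assuming a fitted gensim model lda and a hypothetical array term_freq of corpus-wide term counts aligned with its vocabulary:

```python
import numpy as np

lam = 0.6                          # the lambda value from the pyLDAvis slider
topic_word = lda.get_topics()      # shape (num_topics, vocab_size); rows sum to 1
p_w = term_freq / term_freq.sum()  # marginal word probabilities (hypothetical input)

# lambda = 1 ranks terms by raw probability; lambda = 0 ranks them by lift.
relevance = lam * np.log(topic_word) + (1 - lam) * np.log(topic_word / p_w)
top_term_ids = np.argsort(-relevance[0])[:10]  # top-10 term ids for topic 0
```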
9 votes, 5 answers

Tutorials on topic models and LDA

I would like to know if you people have some good tutorials (fast and straightforward) about topic models and LDA, teaching intuitively how to set some parameters, what they mean, and, if possible, with some real examples.
pedrobisp
7 votes, 4 answers

BERT: is it possible to use it for topic modeling?

I'm struggling to understand the full capabilities of BERT: is it possible to do topic modeling of text, like what we can achieve with LDA?
xcsob
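
A frequent answer is to cluster contextual embeddings rather than fit a generative model; a sketch using the sentence-transformers package, where the model name and cluster count are illustrative assumptions:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

docs = [
    "the cat sat on the mat",
    "dogs make friendly pets",
    "stock markets fell sharply",
    "investors fear rising inflation",
]

# Embed each document with a BERT-family encoder, then cluster the vectors;
# each cluster plays the role of a "topic".
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
```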
6 votes, 1 answer

How to split natural language script into segments?

I have a bunch of .txt and .srt files extracted from a MOOC website; they are the scripts of the videos. I would like to segment the scripts into parts such that each part falls into one of the following categories: MainConceptDescription->…
A.D.
6 votes, 1 answer

Can I use Euclidean distance for Latent Dirichlet Allocation document similarity?

I have a Latent Dirichlet Allocation (LDA) model with $K$ topics trained on a corpus with $M$ documents. Due to my hyperparameter configuration, the output topic distribution for each document is concentrated on only 3-6 topics, and all the…
PyRsquared
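
Euclidean distance is geometrically valid on these vectors, but because each one is a probability distribution, a divergence-based measure such as Jensen-Shannon is often preferred; a small comparison on toy distributions:

```python
import numpy as np
from scipy.spatial.distance import euclidean, jensenshannon

# Two toy document-topic distributions (each sums to 1).
p = np.array([0.70, 0.20, 0.10])
q = np.array([0.10, 0.20, 0.70])

print(euclidean(p, q))      # treats the rows as plain points in R^K
print(jensenshannon(p, q))  # distribution-aware, symmetric, and bounded
```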
5 votes, 1 answer

Calculating optimal number of topics for topic modeling (LDA)

I am going to do topic modeling via LDA. I ran my commands to see the optimal number of topics. The output was as follows: It is a bit different from any other plots that I have ever seen. Do you think it is okay, or is it better to use other…
5 votes, 1 answer

How to choose threshold for gensim Phrases when generating bigrams?

I'm generating bigrams with gensim.models.phrases, which I'll use downstream with TF-IDF and/or gensim's LDA: from gensim.models.phrases import Phrases, Phraser # 7k documents, ~500-1k tokens each. Already ran cleanup, stop_words, lemmatization,…
lefnire
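
For reference, a minimal sketch of the knobs in question; the toy corpus and parameter values are illustrative (gensim's defaults are min_count=5, threshold=10.0), and raising the threshold admits fewer, higher-scoring bigrams:

```python
from gensim.models.phrases import Phrases, Phraser

# Tiny tokenized corpus, repeated so bigram counts clear min_count.
sentences = [
    ["new", "york", "is", "big"],
    ["i", "love", "new", "york"],
    ["machine", "learning", "is", "fun"],
] * 10

# Pairs whose co-occurrence score exceeds `threshold` get joined with '_'.
bigram = Phrases(sentences, min_count=2, threshold=0.1)
phraser = Phraser(bigram)  # frozen, lighter-weight object for applying the model
print(phraser[["i", "love", "new", "york"]])  # high-scoring pairs come back joined
```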
5 votes, 2 answers

Why do my Latent Dirichlet Allocation Topics mix words that never co-occurred?

I have one corpus of documents on diabetes, another on Leonardo da Vinci, and another on animation and computer graphics. I combined all of these documents into a single LDA model and got a topic like the one below. I'm listing the top 30 terms, in descending…
Matt
5 votes, 3 answers

Need help with LDA for selecting features

I am currently selecting features of products by using LDA to group 6,000 product keywords into topics. Here is a sample of my dataset after being organized into a list of keywords for each product id. I consider each id as a "document" and each…
sylvia
5 votes, 2 answers

Topic modeling for short length sentences

I have a graph which was already separated into clusters. Each node in the graph has a label (typically, it's a function's name like org.java.somepackage.validateLogin). What I want to do is to give a representative label for each cluster. For the…
Elimination
5 votes, 3 answers

scikit-learn - Should I fit model with TF or TF-IDF?

I am trying to find out the best way to fit different probabilistic models (like Latent Dirichlet Allocation, Non-negative Matrix Factorization, etc.) in sklearn (Python). Looking at the example in the sklearn documentation, I was wondering why the…
Luca P.
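
The usual reading of that example: LDA is a probabilistic model of term counts, whereas NMF carries no such assumption and is commonly fit on TF-IDF; a minimal sketch of the two pairings with a toy corpus and illustrative parameters:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

docs = [
    "the cat sat on the mat",
    "dogs and cats make friendly pets",
    "stock markets fell sharply today",
    "investors worry about rising inflation",
]

# LDA's generative story is about word counts, so feed it raw term frequencies.
tf = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(tf)

# NMF is a matrix factorization without that assumption; TF-IDF is the usual input.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
nmf = NMF(n_components=2, random_state=0).fit(tfidf)
```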