Questions tagged [topic-model]

A topic model describes each text in a large corpus as a probability distribution over topics, where each topic is itself a probability distribution over words. Every topic contributes a quantified share to a specific text.

A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, "cat" and "meow" will appear more often in documents about cats (source: Wikipedia).
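As a rough illustration of the description above, the following Python sketch generates a toy document the way an LDA-style generative model assumes text is produced: first draw the document's topic proportions, then draw each word from a topic chosen according to those proportions. The two topics, their word probabilities (including the extra words "walk" and "nap"), the function names and the alpha value are all made up for this example; a real topic model learns these distributions from a corpus rather than taking them as input.

```python
import random

# Hypothetical toy topics: each topic is a probability distribution over words.
TOPICS = {
    "dogs": {"dog": 0.5, "bone": 0.3, "walk": 0.2},
    "cats": {"cat": 0.5, "meow": 0.3, "nap": 0.2},
}


def sample_topic_mixture(topic_names, alpha, rng):
    """Draw a document's topic proportions from a symmetric Dirichlet(alpha).

    A Dirichlet sample is obtained by normalising independent Gamma draws.
    """
    draws = [rng.gammavariate(alpha, 1.0) for _ in topic_names]
    total = sum(draws)
    return {name: d / total for name, d in zip(topic_names, draws)}


def generate_document(topics, n_words, alpha=0.5, seed=None):
    """Generate one document: pick a topic per word, then a word from that topic."""
    rng = random.Random(seed)
    names = list(topics)
    mixture = sample_topic_mixture(names, alpha, rng)
    words = []
    for _ in range(n_words):
        # Choose which topic this word comes from, weighted by the mixture.
        topic = rng.choices(names, weights=[mixture[n] for n in names])[0]
        # Choose the word itself from that topic's word distribution.
        vocab = list(topics[topic])
        word = rng.choices(vocab, weights=[topics[topic][w] for w in vocab])[0]
        words.append(word)
    return mixture, words


if __name__ == "__main__":
    mixture, words = generate_document(TOPICS, n_words=10, seed=0)
    print(mixture)          # topic proportions for this document (sum to 1)
    print(" ".join(words))  # the generated bag of words
```

With a small alpha, most sampled documents lean heavily towards one topic; a larger alpha gives more even mixtures.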

Models used for topic modelling (the first two are generative probabilistic models; non-negative matrix factorisation is not generative)

  • Latent Dirichlet Allocation (LDA)
  • Hierarchical Dirichlet process (HDP)
  • Non-Negative Matrix Factorisation

Software / Libraries

142 questions
62 votes, 6 answers

Latent Dirichlet Allocation vs Hierarchical Dirichlet Process

Latent Dirichlet Allocation (LDA) and Hierarchical Dirichlet Process (HDP) are both topic modeling processes. The major difference is LDA requires the specification of the number of topics, and HDP doesn't. Why is that so? And what are the…
alvas
31 votes, 1 answer

NLP - why is "not" a stop word?

I am trying to remove stop words before performing topic modeling. I noticed that some negation words (not, nor, never, none, etc.) are usually considered to be stop words. For example, NLTK, spaCy and sklearn include "not" on their stop word lists.…
E.K.
30 votes, 3 answers

What is difference between text classification and topic models?

I know the difference between clustering and classification in machine learning, but I don't understand the difference between text classification and topic modeling for documents. Can I use topic modeling over documents to identify a topic? Can I…
Ali
25 votes, 2 answers

What does the alpha and beta hyperparameters contribute to in Latent Dirichlet allocation?

LDA has two hyperparameters; tuning them changes the induced topics. What do the alpha and beta hyperparameters contribute to LDA? How do the topics change if one or the other hyperparameter increases or decreases? Why are they hyperparameters…
alvas
17 votes, 3 answers

Why should we not feed LDA with TF-IDF input?

Can someone explain why we cannot feed an LDA topic model with TF-IDF input? What is wrong with this approach conceptually?
sariii
12 votes, 1 answer

What is the difference between topic modeling and clustering?

I know that topic modeling and clustering are related but distinct techniques. Can anyone explain what the main differences are?
sara
9 votes, 1 answer

Gensim LDA model: return keywords based on relevance (λ - lambda) value

I am using the gensim library for topic modeling, more specifically LDA. I created my corpus, my dictionary, and my LDA model. With the help of the pyLDAvis library I visualized the results. When I print the words with the highest probability on…
9 votes, 5 answers

Tutorials on topic models and LDA

I would like to know if anyone has some good tutorials (fast and straightforward) about topic models and LDA that teach intuitively how to set some parameters and what they mean, if possible with some real examples.
pedrobisp
8 votes, 4 answers

How to give name to topics created using LDA?

I have categorized 800,000 documents into 500 categories using Mahout topic modelling. Instead of representing each topic by its top 5/10 words, I want to infer a generic name for the group using any existing algorithm. For the…
adihere
8 votes, 1 answer

Resume Parsing - extracting skills from resume using Machine Learning

I am trying to extract the skill set of an employee from his/her resume. I have resumes stored as plain text in a database. I do not have predefined skills in this case. How should I approach this problem? I can think of two ways: Using unsupervised…
Sociopath
7 votes, 4 answers

BERT: it is possible to use it for topic modeling?

I'm struggling to understand what the full capabilities of BERT are: is it possible to do topic modeling of text, like what we can achieve with LDA?
xcsob
6 votes, 1 answer

How to split natural language script into segments?

I have a bunch of .txt and .srt files extracted from a MOOC website; they are the scripts of the videos. I would like to segment the scripts into parts such that each part falls into one of the following categories: MainConceptDescription->…
A.D.
6 votes, 1 answer

Comparing two Corpora using Topic Model

I want to compare two corpora (two different collections of texts) using Topic Modeling. I trained the model separately on the two collections and manually matched similar topics based on their frequent words. I was wondering if there is a…
saghi
5 votes, 1 answer

Calculating optimal number of topics for topic modeling (LDA)

I am going to do topic modeling via LDA. I ran my commands to see the optimal number of topics. The output was as follows: It is a bit different from any other plots that I have ever seen. Do you think it is okay? Or is it better to use other…
5 votes, 2 answers

Why do my Latent Dirichlet Allocation Topics mix words that never co-occurred?

I have one corpus of documents on diabetes, another on Leonardo da Vinci, and another on animation and computer graphics. I combined all of these documents into an LDA model and got a topic like the one below. I'm listing the top 30 terms, in descending…
Matt