
I have categorized 800,000 documents into 500 categories using Mahout topic modelling.

Instead of representing each topic by its top 5/10 words, I want to infer a generic name for the group using an existing algorithm. For the time being, I have used the following approach to arrive at a name for each topic:

For each topic:

  • Take all the documents belonging to the topic (using the document-topic distribution output)
  • Run Python NLTK to extract the noun phrases
  • Build a term-frequency (TF) file from the extracted phrases
  • Take the top phrase from the TF file (limited to a maximum of 5 words) as the name for the topic (a rough sketch of this pipeline is shown below)
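
A rough sketch of that per-topic step, assuming `topic_docs` is a list of plain-text documents assigned to the topic and that the NLTK punkt tokenizer and perceptron tagger data are downloaded (the NP grammar and the 5-word cap are illustrative choices, not your exact setup):

```python
# Sketch: noun-phrase extraction + term-frequency naming for one topic.
from collections import Counter
import nltk

# Simple illustrative noun-phrase grammar: optional adjectives followed by nouns.
NP_GRAMMAR = "NP: {<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(NP_GRAMMAR)

def topic_name(topic_docs, max_words=5):
    phrase_counts = Counter()
    for doc in topic_docs:
        tokens = nltk.word_tokenize(doc)
        tagged = nltk.pos_tag(tokens)
        tree = chunker.parse(tagged)
        for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
            phrase = " ".join(word for word, tag in subtree.leaves())
            if len(phrase.split()) <= max_words:
                phrase_counts[phrase.lower()] += 1
    # Name the topic after its most frequent noun phrase.
    return phrase_counts.most_common(1)[0][0] if phrase_counts else None
```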

Please suggest an approach to arrive at more relevant names for the topics.

adihere

4 Answers


I can suggest several papers on this topic:

  • Automatic Labelling of Topic Models
  • Automatic Labeling Hierarchical Topics
  • Representing Topics Labels for Exploring Digital Libraries

You can find more by looking at their citations.

Emre

If you don't want to dig too deep into NLP for this task, I suggest generating a set of the most frequent n-grams (of length 2-5) from your documents and finding the most distinctive n-grams for each category using a TF*IDF metric as the measure of how important a particular n-gram is (normalized by word count), selecting those n-grams that are used in a particular category and are not (or only rarely) used in the others.
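
A minimal sketch of that idea with scikit-learn, assuming `category_docs` maps each category id to the concatenated text of its documents (the n-gram range, stop-word list, and number of labels kept are illustrative):

```python
# Sketch: pick the most distinctive n-grams per category via TF-IDF.
from sklearn.feature_extraction.text import TfidfVectorizer

def label_categories(category_docs, top_k=3):
    cat_ids = list(category_docs)
    # Treat each category as one "document" so IDF penalizes n-grams shared across categories.
    vectorizer = TfidfVectorizer(ngram_range=(2, 5), sublinear_tf=True, stop_words="english")
    tfidf = vectorizer.fit_transform(category_docs[c] for c in cat_ids)
    terms = vectorizer.get_feature_names_out()  # get_feature_names() on older scikit-learn

    labels = {}
    for row, cat in enumerate(cat_ids):
        scores = tfidf.getrow(row).toarray().ravel()
        top = scores.argsort()[::-1][:top_k]
        labels[cat] = [terms[i] for i in top if scores[i] > 0]
    return labels
```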

chewpakabra

You might try using word vectors: average the vectors of the top N words in a topic and then use cosine similarity to find the closest word in the corpus.

Just a quick and dirty idea...
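
A minimal sketch of that idea, assuming pre-trained word2vec vectors loaded with gensim (the model path is hypothetical, and `topic_words` stands for the top N words of a topic):

```python
# Sketch: average the vectors of a topic's top words and find the nearest vocabulary word.
import numpy as np
from gensim.models import KeyedVectors

# Any word2vec-format model works here; this file name is just an example.
vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def name_topic(topic_words, topn=5):
    known = [w for w in topic_words if w in vectors]
    if not known:
        return []
    centroid = np.mean([vectors[w] for w in known], axis=0)
    # Nearest words to the centroid by cosine similarity, skipping the input words themselves.
    candidates = vectors.similar_by_vector(centroid, topn=topn + len(known))
    return [w for w, _ in candidates if w not in known][:topn]
```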

CpILL

A few ideas you'll often see:

  • Generate a list from Wikipedia titles: extract keyphrases, predict the related Wikipedia pages, and use the keyphrases.
  • Generate a hand-labeled dataset.
  • Use a graph populated with topics and the relations between words and topics to predict the most likely topics.
  • Abstractive summarization and keyphrase extraction.