
I have a graph that was already separated into clusters. Each node in the graph has a label (typically a function's name, like org.java.somepackage.validateLogin). What I want to do is assign a representative label to each cluster.

For the sake of simplicity, let's assume I'm able to clean the data (e.g. break 'validateLogin' into 'validate' and 'login').

I've done a little research on the subjects of topic modeling and cluster labeling and encountered a few algorithms, such as LDA, NMF and TF-IDF (not quite an algorithm in itself).

Basically, many of these algorithms are oriented toward documents that are rich in words, not toward short texts/labels.

It is worth mentioning:

  • We can use the fact that different clusters tend to contain different labels, so a proper label for a cluster could be the words that are unique to it within the overall bag of words (that can be done with TF-IDF, I guess; see the sketch below this list)

  • Labels can be a single word but can also imply a hierarchy (e.g. packageA.packageB.packageC.funcName)
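
To make the TF-IDF idea concrete, here is a rough sketch of what I have in mind: treat each cluster as one "document" built from its cleaned words and pick the highest-weighted terms (the cluster contents below are made-up examples, not real data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up example: one "document" per cluster, built by concatenating the
# cleaned words of every node label in that cluster.
cluster_docs = {
    "A": "org java auth validate login check password hash credentials",
    "B": "org java io read file write file close stream buffer",
}

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(cluster_docs.values())
vocab = vectorizer.get_feature_names_out()

for cluster, row in zip(cluster_docs, tfidf.toarray()):
    top = [vocab[i] for i in row.argsort()[::-1][:2]]  # two highest-weighted words
    print(cluster, top)
```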

I'd be glad if you could give me your insights on this problem and what approach could fit here.


2 Answers


A possible approach might be to use the most predictable and predictive word(s) in each cluster as its name(s). The following is inspired by the category utility measure used by Fisher's COBWEB algorithm.

An attribute is highly predictable in a cluster $C_l$ if most elements in that cluster have the same value $a$ for it; thus a word represented by attribute $A_i$ has a predictability of $P(A_i=a|C_l)$.

An attribute is highly predictive for a cluster $C_l$ if knowing its value $a$ lets you say with high certainty to which cluster an element belongs, expressed as $P(C_l|A_i=a)$.

Now assume you process each label org.java.somepackage.validateLogin as a sentence, "org java somepackage validate login", and apply one-hot encoding to all the sentences in your dataset. The occurrence of a word is then represented by a value of 1 for its corresponding attribute.
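
For concreteness, a minimal sketch of that preprocessing step (the example labels and the camelCase splitting are just assumptions for illustration):

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

labels = [
    "org.java.somepackage.validateLogin",
    "org.java.somepackage.checkPassword",
]

def tokenize(label):
    """Split a qualified name on dots, then split camelCase parts into words."""
    words = []
    for part in label.split("."):
        words.extend(re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", part))
    return " ".join(w.lower() for w in words)

sentences = [tokenize(l) for l in labels]   # e.g. "org java somepackage validate login"

# binary=True gives the presence/absence (one-hot style) encoding described above
vectorizer = CountVectorizer(binary=True, token_pattern=r"\S+")
X = vectorizer.fit_transform(sentences)
print(vectorizer.get_feature_names_out())
print(X.toarray())
```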

The task of representing each cluster by a word can now be formulated as finding, for each cluster $C_l$, the word represented by an attribute $A_i$ that equals 1 and has the highest predictability and predictiveness, weighted by the total probability of this word appearing in a sentence:

$$ P(A_i=1|C_l)\times P(C_l|A_i=1) \times P(A_i=1)$$

These probabilities can be estimated simply by counting word occurrences per cluster and cluster occurrences per word.
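
As a rough sketch of that computation (the clusters dict below is made up for illustration; the counts give maximum-likelihood estimates of the three probabilities):

```python
from collections import Counter, defaultdict

# Made-up input: cluster id -> list of tokenized labels (each label is a list of words)
clusters = {
    0: [["org", "java", "auth", "validate", "login"],
        ["org", "java", "auth", "check", "password"]],
    1: [["org", "java", "io", "read", "file"],
        ["org", "java", "io", "write", "file"]],
}

total_labels = sum(len(labels) for labels in clusters.values())

word_in_cluster = defaultdict(Counter)  # word_in_cluster[c][w] = #labels in cluster c containing w
word_total = Counter()                  # word_total[w] = #labels overall containing w

for c, labels in clusters.items():
    for words in labels:
        for w in set(words):            # set(): presence, not frequency (the one-hot view)
            word_in_cluster[c][w] += 1
            word_total[w] += 1

def cluster_name(c):
    """Return the word maximizing P(A_i=1|C_l) * P(C_l|A_i=1) * P(A_i=1) for cluster c."""
    n_c = len(clusters[c])
    best_word, best_score = None, -1.0
    for w, count in word_in_cluster[c].items():
        predictability = count / n_c            # P(A_i=1 | C_l)
        predictiveness = count / word_total[w]  # P(C_l | A_i=1)
        prior = word_total[w] / total_labels    # P(A_i=1)
        score = predictability * predictiveness * prior
        if score > best_score:
            best_word, best_score = w, score
    return best_word

for c in clusters:
    print(c, cluster_name(c))
```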


Your research has yielded results like LDA, NMF, etc. But since you want to focus on short texts and generate short labels, I would recommend checking out an algorithm called the Biterm Topic Model (BTM) for short texts.

Here is the paper from the author: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.402.4032&rep=rep1&type=pdf

And here is the code from the author: https://github.com/xiaohuiyan/BTM

I have used this personally and can vouch for its performance.
