
I have a clustering-related problem: I need to cluster skill sets from the job domain.

Let's say a candidate mentions their familiarity with Amazon S3 buckets in a resume. Different people can write it in different ways. For example,

  1. amazon s3
  2. s3
  3. aws s3

A human can easily see that these three are exactly equivalent. I can't use k-means-style clustering because it can fail in a lot of cases.

For example,

  1. spring
  2. spring framework
  3. Spring MVC
  4. Spring Boot

These may fall into the same cluster, which is wrong. A candidate who knows the Spring Framework might not know Spring Boot, etc.

Word similarity based on embeddings or a bag-of-words model fails here.

What options do I have? Currently, I have manually collected a lot of word variations in a dict, where the key is the root word and the value is an array of variations of that root word.

Any help is really appreciated.

Shayan Shafiq
Sai Kumar

1 Answer


This is commonly called entity linking: the task of assigning a unique identity to entities. Your issue in particular is name variation: the same entity might appear with different textual representations (surface forms).

Clustering is not the most useful way to resolve name variations because it is unsupervised: nothing tells the algorithm which surface forms should map to the same entity.

There are many ways to approach resolving name variations. Given that job skills are a relatively common domain, you can find or pay for an existing mapping of job skill entities. If you want to build your own system, most systems start with hand-coded rules (typically a combination of regular expressions and hash maps). Once hand-coded rules hit diminishing returns, other models can be applied. A knowledge base can be used to disambiguate textual entities. Again, since job skills are a common domain, there are many existing knowledge bases. You could create your own job skill knowledge base, but that would be a complex, slow, and error-prone process.
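
As a rough illustration of the hand-coded starting point, here is a minimal Python sketch that combines a regular expression for normalization with a hash map for lookup. The `VARIATIONS` entries and the `normalize` and `link_skill` helpers are hypothetical names made up for this example; a real mapping would come from your own collected dict or an existing skills resource.

```python
import re

# Hypothetical hand-collected mapping: canonical skill -> known surface forms.
# The entries are illustrative only, not a real skills taxonomy.
VARIATIONS = {
    "amazon s3": ["amazon s3", "s3", "aws s3", "amazon s3 bucket"],
    "spring framework": ["spring", "spring framework"],
    "spring mvc": ["spring mvc"],
    "spring boot": ["spring boot", "springboot"],
}

# Invert into a hash map: surface form -> canonical skill.
LOOKUP = {
    form: canonical
    for canonical, forms in VARIATIONS.items()
    for form in forms
}


def normalize(text):
    """Lowercase and collapse runs of punctuation/whitespace into one space."""
    text = text.lower()
    # Keep letters, digits, '+' and '#' so names like "c++" or "c#" survive.
    text = re.sub(r"[^a-z0-9+#]+", " ", text)
    return text.strip()


def link_skill(mention):
    """Return the canonical skill for a mention, or None if no rule matches."""
    return LOOKUP.get(normalize(mention))


if __name__ == "__main__":
    for mention in ["AWS S3", "Spring-Boot", "Spring MVC", "Kubernetes"]:
        print(mention, "->", link_skill(mention))
```

Because the lookup is an exact match on the normalized string, "Spring MVC" and "Spring Boot" stay separate entities instead of collapsing into "spring", which is the failure mode described in the question. Mentions that return `None` are the ones to review by hand or pass on to a later model.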

Shayan Shafiq
Brian Spiering