I'm trying to define a metric between job titles in the IT field. For this I need a metric between words in job titles that do not appear together in the same job title, e.g. a metric between the words

senior, primary, lead, head, vp, director, stuff, principal, chief,

or the words

analyst, expert, modeler, researcher, scientist, developer, engineer, architect.

How can I get all such words together with their pairwise distances?

Mher

4 Answers

That's an interesting problem; thanks for bringing it up here.

I think this problem is similar to applying LSA (Latent Semantic Analysis) in sentiment analysis to find lists of positive and negative words, with polarity measured with respect to some predefined positive and negative words.

So, in my opinion, LSA is the best approach to start with in this situation, as it learns the underlying relations between words from the corpus, which is probably what you are looking for.

Ankit

If I understand your question correctly, you can build a co-occurrence matrix from the terms that follow each title word, e.g., senior FOO, primary BAR, etc. Then you can compute the similarity between any pair of words, such as "senior" and "primary", using a suitable metric, e.g., cosine similarity.
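A minimal sketch of this idea in Python, using only the standard library. The `titles` list is made-up toy data and the helper names (`context_vectors`, `cosine_similarity`) are hypothetical; a real corpus would contain many thousands of titles.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data, invented for illustration only.
titles = [
    "senior data scientist",
    "lead data scientist",
    "senior software engineer",
    "lead software engineer",
    "chief data officer",
]

def context_vectors(titles):
    """Map each word to a Counter of the words that follow it in the same title."""
    vectors = defaultdict(Counter)
    for title in titles:
        words = title.lower().split()
        for i, word in enumerate(words):
            vectors[word].update(words[i + 1:])
    return vectors

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

vectors = context_vectors(titles)
sim_senior_lead = cosine_similarity(vectors["senior"], vectors["lead"])
sim_senior_chief = cosine_similarity(vectors["senior"], vectors["chief"])
print(f"senior vs lead:  {sim_senior_lead:.3f}")
print(f"senior vs chief: {sim_senior_chief:.3f}")
```

In this toy data "senior" and "lead" never share a title, yet they are followed by exactly the same terms, so their similarity is maximal, while "chief" only shares the term "data" with them. A distance can be taken as one minus the similarity.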

Emre

Not sure if this is exactly what you're looking for, but base R has a function called "adist" which computes a matrix of approximate string distances (the generalized Levenshtein edit distance). Type '?adist' for more.

# Pairwise Levenshtein (edit) distances between the title words
words <- c("senior", "primary", "lead", "head", "vp", "director", "stuff", "principal", "chief")
adist(words)

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
 [1,]    0    6    5    5    6    5    5    7    5
 [2,]    6    0    6    6    7    7    7    6    6
 [3,]    5    6    0    1    4    7    5    8    5
 [4,]    5    6    1    0    4    7    5    8    4
 [5,]    6    7    4    4    0    8    5    8    5
 [6,]    5    7    7    7    8    0    8    8    7
 [7,]    5    7    5    5    5    8    0    9    4
 [8,]    7    6    8    8    8    8    9    0    8
 [9,]    5    6    5    4    5    7    4    8    0

Also, if R isn't an option, the Levenshtein distance algorithm is implemented in many languages here: http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance
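For readers without R, here is a rough Python sketch of the same computation. The function name `levenshtein` is my own; it is a standard dynamic-programming edit distance with unit costs, not a port of R's adist.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insertion, deletion and substitution."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

words = ["senior", "primary", "lead", "head", "vp",
         "director", "stuff", "principal", "chief"]

# Full pairwise distance matrix, in the same order as the word list.
matrix = [[levenshtein(w1, w2) for w2 in words] for w1 in words]

for row in matrix:
    print(" ".join(f"{d:2d}" for d in row))
```

For example, "lead" and "head" differ by a single substitution, so their distance is 1, matching the table above. Keep in mind that this measures spelling similarity only, not similarity of meaning.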

nfmcclure

(too long for a comment)

Basically, @Emre's answer is correct: a simple co-occurrence matrix and cosine distance should work well*. There's one subtlety, though: job titles are too short to carry much context. Let me explain.

Imagine LinkedIn profiles (which are a pretty good source of data). Normally, they contain 4-10 sentences describing a person's skills and qualifications. You are quite likely to find phrases like "lead data scientist" and "professional knowledge of Matlab and R" in the same profile, but very unlikely to also see "junior Java developer" in it. So we can say that "lead" and "professional" (as well as "data scientist", "Matlab" and "R") often occur in the same contexts, but they are rarely found together with "junior" and "Java".

A co-occurrence matrix captures exactly this. The more often two words occur in the same contexts, the more similar their vectors in the matrix will be, and cosine distance is simply a good way to measure this similarity.

But what about job titles? Normally they are much shorter and don't provide enough context to capture similarities. Luckily, the source data doesn't need to be the titles themselves: you need similarities between skills in general, not specifically within titles. So you can simply build the co-occurrence matrix from (long) profiles and then use it to measure the similarity of titles.
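To make this concrete, here is a small Python sketch under stated assumptions: the `profiles` are invented toy texts, a title is represented by the sum of its words' co-occurrence vectors (one possible choice, not the only one), and the helper names are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical toy profiles; real ones would be much longer and more numerous.
profiles = [
    "lead data scientist with professional knowledge of matlab and r",
    "principal data scientist professional experience in r and statistics",
    "junior java developer learning spring and sql",
    "junior developer writing java and sql",
]

def cooccurrence(profiles):
    """Count, for each word, how many profiles it shares with every other word."""
    matrix = defaultdict(Counter)
    for profile in profiles:
        words = set(profile.lower().split())
        for w1, w2 in permutations(words, 2):
            matrix[w1][w2] += 1
    return matrix

def title_vector(title, matrix):
    """Represent a title as the sum of its words' co-occurrence vectors (an assumption)."""
    total = Counter()
    for word in title.lower().split():
        total.update(matrix.get(word, Counter()))
    return total

def cosine_similarity(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

matrix = cooccurrence(profiles)
lead = title_vector("lead scientist", matrix)
principal = title_vector("principal scientist", matrix)
junior = title_vector("junior developer", matrix)

sim_close = cosine_similarity(lead, principal)
sim_far = cosine_similarity(lead, junior)
print(f"lead scientist vs principal scientist: {sim_close:.3f}")
print(f"lead scientist vs junior developer:    {sim_far:.3f}")
```

In this toy data the only word linking the two groups is "and", so in practice you would probably want to drop stop words (or apply some weighting such as TF-IDF) before building the matrix.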

* In fact, this has already worked for me on a similar project.

ffriend