9

I intend to tokenize a number of job description texts. I have tried standard tokenization using whitespace as the delimiter. However, I noticed that some multi-word expressions are split by whitespace, which may well cause accuracy problems in subsequent processing. So I want to get all the most interesting/informative collocations in these texts.

Are there any good packages for doing multi-word tokenization, regardless of the specific programming language? For example, "He studies Information Technology" ===> "He" "studies" "Information Technology".

I've noticed that NLTK (Python) has some related functionality.

What is the difference between these two?

The MWETokenizer class in the nltk.tokenize.mwe module seems to work towards my objective. However, MWETokenizer seems to require me to use its constructor and the .add_mwe() method to add multi-word expressions one by one. Is there a way to use an external multi-word expression lexicon to achieve this? If so, are there any such multi-word lexicons?
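For illustration, something like the following sketch is what I have in mind; the lexicon file name and its one-expression-per-line format are made up:

from nltk.tokenize import MWETokenizer

# Hypothetical external lexicon file with one multi-word expression per line,
# e.g. "Information Technology".
with open("mwe_lexicon.txt") as f:
    mwes = [tuple(line.split()) for line in f if line.strip()]

tokenizer = MWETokenizer(mwes, separator=" ")
print(tokenizer.tokenize("He studies Information Technology".split()))
# ['He', 'studies', 'Information Technology']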

Thanks!

CyberPlayerOne

6 Answers

2

From what I understood of the API documentation, the multi-word tokenizer nltk.tokenize.mwe basically merges tokens of an already tokenized string, based on a lexicon.

One thing you can do is tokenize, tag all words with their associated part-of-speech (PoS) tags, and then define regular expressions over the PoS tags to extract interesting key phrases.

For instance, here is an example adapted from Chapter 7 of the NLTK Book and this blog post:

import nltk
from nltk import pos_tag, tokenize


def extract_phrases(my_tree, phrase):
    # Recursively collect all sub-trees whose label matches `phrase`.
    my_phrases = []
    if my_tree.label() == phrase:
        my_phrases.append(my_tree.copy(True))

    for child in my_tree:
        if isinstance(child, nltk.Tree):
            list_of_phrases = extract_phrases(child, phrase)
            if len(list_of_phrases) > 0:
                my_phrases.extend(list_of_phrases)

    return my_phrases


def main():
    sentences = ["The little yellow dog barked at the cat",
                 "He studies Information Technology"]

    grammar = "NP: {<DT>?<JJ>*<NN>|<NNP>*}"
    cp = nltk.RegexpParser(grammar)

    for x in sentences:
        sentence = pos_tag(tokenize.word_tokenize(x))
        tree = cp.parse(sentence)
        print("\nNoun phrases:")
        list_of_noun_phrases = extract_phrases(tree, 'NP')
        for phrase in list_of_noun_phrases:
            print(phrase, "_".join([x[0] for x in phrase.leaves()]))


if __name__ == "__main__":
    main()

You define a grammar based on regular expressions over the PoS tags:

grammar = "NP: {<DT>?<JJ>*<NN>|<NNP>*}"
cp = nltk.RegexpParser(grammar)

Then you apply it to a tokenized and PoS-tagged sentence, generating a Tree:

sentence = pos_tag(tokenize.word_tokenize(x))
tree = cp.parse(sentence)

Then you use extract_phrases(my_tree, phrase) to recursively traverse the Tree and extract the sub-trees labeled NP. The example above extracts the following noun phrases:

Noun phrases:
(NP The/DT little/JJ yellow/JJ dog/NN) The_little_yellow_dog
(NP the/DT cat/NN) the_cat

Noun phrases:
(NP Information/NNP Technology/NNP) Information_Technology

There is a great blog post by Burton DeWilde about many more ways to extract interesting keyphrases: Intro to Automatic Keyphrase Extraction

David Batista
2

For your problem I think gensim can be very useful: what the Gensim library offers is phrase detection. It is similar to n-grams, but instead of collecting all the n-grams with a sliding window, it detects frequently used phrases and sticks them together. It statistically walks through the text corpus and identifies words that commonly occur side by side.
Following is the way it scores the best-suited multi-word tokens. [image in the original post: gensim's phrase-scoring formula]
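As far as I understand it (this is my paraphrase, not the formula from the missing image), the default scorer is the one from Mikolov et al.'s phrases paper: a pair of adjacent words is merged when its score exceeds the threshold parameter of Phrases. Roughly:

# Rough sketch of the default phrase score, as I understand it
# (not gensim's actual function or signature):
def phrase_score(count_a, count_b, count_ab, min_count, vocab_size):
    # Pairs scoring above the Phrases `threshold` get merged into one token.
    return (count_ab - min_count) * vocab_size / (count_a * count_b)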

Following is the code to use it; it builds two-word tokens (bigrams).

from gensim.models.phrases import Phrases, Phraser

# x_train is assumed to be a list of raw text strings (e.g. your job descriptions)
tokenized_train = [t.split() for t in x_train]
phrases = Phrases(tokenized_train)
bigram = Phraser(phrases)

and this is how you would use it

[image in the original post: applying the bigram model to an example sentence]
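Since the screenshot is not reproduced here, the following is a rough sketch of that kind of usage; the example sentence is my own, not from the original post:

# Apply the trained bigram model to a new tokenized sentence.
tokens = ["he", "moved", "to", "new", "york", "last", "year"]
print(bigram[tokens])
# If "new" and "york" co-occurred often enough in the training corpus,
# the output looks like: ['he', 'moved', 'to', 'new_york', 'last', 'year']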

Notice the token "new_york", which is concatenated because, in the corpus, the statistical evidence of "new" and "york" occurring together was significant.

Moreover, you can go up to longer n-grams with this, not just bigrams (see the sketch below). Here is the article which explains it in detail.
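For instance, a common pattern (a sketch, assuming the bigram model built above) is to run Phrases a second time over the bigram-transformed corpus to pick up three-word phrases:

# Each pass can glue one more word onto existing phrases:
trigram_phrases = Phrases(bigram[tokenized_train])
trigram = Phraser(trigram_phrases)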

1

The tokenization process shouldn't change even when you are interested in multi-word expressions. After all, the words are still the basic tokens. What you should do is find a way to combine the proper words into a term.

A simple way to do so is to look for terms in which the probability of the term is higher than that of the independent tokens, e.g. P("White House") > P("White") * P("House"). The proper values for the required lift, the number of occurrences, and the term classification can be deduced if you have a dataset of terms from the domain. If you don't have such a dataset, then requiring at least 10 occurrences and a lift of at least 2 (usually it is much higher, since each token probability is low) will work quite well.
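A minimal sketch of that idea (the function name and default thresholds here are my own illustration, not a standard API):

from collections import Counter

def find_terms(tokenized_docs, min_count=10, min_lift=2.0):
    # Score adjacent word pairs by lift: P(a b) / (P(a) * P(b)).
    unigrams, bigrams = Counter(), Counter()
    for tokens in tokenized_docs:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())

    terms = {}
    for (a, b), n_ab in bigrams.items():
        if n_ab < min_count:
            continue
        p_ab = n_ab / total_bi
        p_a = unigrams[a] / total_uni
        p_b = unigrams[b] / total_uni
        lift = p_ab / (p_a * p_b)
        if lift >= min_lift:
            terms[(a, b)] = lift
    return terms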

In your case, you can also extract terms by using contexts relevant to your domain (e.g., "studied X", "practiced Y").

Again, you can build complex and elegant models for that, but usually, looking at the few words that follow such context indicators will be very beneficial.
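A toy sketch of that last idea (the indicator words and window size are just an illustration for the job-description domain):

indicators = {"studied", "studies", "practiced"}

def terms_after_indicators(tokens, window=2):
    # Collect the few tokens that immediately follow a context indicator.
    found = []
    for i, tok in enumerate(tokens):
        if tok.lower() in indicators:
            found.append(" ".join(tokens[i + 1:i + 1 + window]))
    return found

print(terms_after_indicators("He studies Information Technology".split()))
# ['Information Technology']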

DaL
0

This extension of Stanford CoreNLP for capturing multi-word expressions (MWEs) worked like a charm for one of my tasks. For Python users, gear up to write some connector code or to hack the Java code.

quettabit
0

Use the Stanford CoreNLP library for multi-word tokenization. I found it when I was working on a similar task, and it worked pretty well!

Update: you can use the Stanford CoreNLP pipeline, which includes a multi-word tokenization model. A demo for training neural networks with your own data is linked here.

Noman
0

Gensim is one of the best libraries for NLP tasks on text data: Gensim Python library

JAbr