Named Entity Recognition: NLTK using Regular Expression

Question

Many times Named Entity Recognition (NER) doesn't tag consecutive NNPs as one NE. I think editing the NER to use RegexpTagger also can improve the NER.

For example, consider the following input:

"Barack Obama is a great person."

And the output:

Tree('S', [Tree('PERSON', [('Barack', 'NNP')]), Tree('ORGANIZATION', [('Obama', 'NNP')]),
    ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'), ('person', 'NN'), ('.', '.')])

where as for the input:

'Former Vice President Dick Cheney told conservative radio host Laura Ingraham that he "was honored" to be compared to Darth Vader while in office.'

the output is:

Tree('S', [('Former', 'JJ'), ('Vice', 'NNP'), ('President', 'NNP'),
    Tree('NE', [('Dick', 'NNP'), ('Cheney', 'NNP')]), ('told', 'VBD'), ('conservative', 'JJ'),
    ('radio', 'NN'), ('host', 'NN'), Tree('NE', [('Laura', 'NNP'), ('Ingraham', 'NNP')]),
    ('that', 'IN'), ('he', 'PRP'), ('``', '``'), ('was', 'VBD'), ('honored', 'VBN'),
    ("''", "''"), ('to', 'TO'), ('be', 'VB'), ('compared', 'VBN'), ('to', 'TO'),
    Tree('NE', [('Darth', 'NNP'), ('Vader', 'NNP')]), ('while', 'IN'), ('in', 'IN'),
    ('office', 'NN'), ('.', '.')])

Here Vice/NNP, President/NNP, (Dick/NNP, Cheney/NNP) is correctly extracted. So, I think if nltk.ne_chunk is used first, and then if two consecutive trees are NNP, there are higher chances that both refer to one entity.

I have been playing with NLTK toolkit, and I came across this problem a lot, but couldn't find a satisfying answer. Any suggestion will be really appreciated. I'm looking for flaws in my approach.

score 1 · Accepted Answer · answered Nov 12 '14 at 18:41

You have a great idea going, and it might work for your specific project. However there are a few considerations you should take into account:

In your first sentence, Obama in incorrectly classified as an organization, instead of a person. This is because the training model used my NLTK probably does not have enough data to recognize Obama as a PERSON. So, one way would be to update this model by training a new model with a lot of labeled training data. Generating labeled training data is one of the most expensive tasks in NLP - because of all the man hours it takes to tag sentences with the correct Part of Speech as well as semantic role.
In sentence 2, there are 2 concepts - "Former Vice President", and "Dick Cheney". You can use co-reference to identify the relation between the 2 NNPs. Both the NNP are refering to the same entity, and the same entity could be referenced as - "former vice president" as well as "Dick Cheney". Co-reference is often used to identify the Named entity that pronouns refer to. e.g. "Dick Cheney is the former vice president of USA. He is a Republican". Here the pronoun "he" refers to "Dick Cheney", and it should be identified by a co-reference identification tool.

Named Entity Recognition: NLTK using Regular Expression

1 Answers1