
I am trying to generate an intelligent model which can scan a set of words or strings and classify them as names, mobile numbers, addresses, cities, states, countries and other entities using machine learning or deep learning.

I searched for approaches but unfortunately didn't find one to follow. I tried a bag-of-words model and GloVe word embeddings to predict whether a string is a name, a city, etc.

However, I didn't succeed with the bag-of-words model, and with GloVe many names are not covered by the embedding. For example, lauren is present in GloVe but laurena isn't.

I did find this post here, which had a reasonable answer, but I couldn't work out the approach used to solve that problem, apart from the fact that NLP and SVM were used.

Any suggestions are appreciated.

Thanks and Regards, Sai Charan Adurthi.

2 Answers

2

You could use character n-grams. Intuitively, the character sets of a phone number and an email address can differ a lot. You can then pass the character n-gram vector to an SVM to make a prediction. You could implement this in sklearn using the feature extractors below.

  1. TfidfVectorizer(analyzer='char')

  2. CountVectorizer(analyzer='char')

Cross-validate over the n-gram range and the SVM regularization parameter C (which sets the penalty on the slack variables) to fine-tune your model.
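As a rough illustration of this idea, here is a minimal sketch that chains a character n-gram TF-IDF vectorizer with a linear SVM and cross-validates over the n-gram range and C. The training strings, the four labels, and the grid values are all made up for the example; real data would need far more samples per class.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy, invented training data: four examples per class (an assumption,
# only here so that 3-fold cross-validation can run).
texts = [
    "+1 415 555 0132", "9876543210", "(020) 7946 0958", "555-867-5309",
    "lauren@example.com", "sai.charan@example.org", "info@example.net", "j_doe99@example.com",
    "Lauren", "Sai Charan", "Brian", "Maria Gonzalez",
    "Hyderabad", "San Francisco", "London", "Mumbai",
]
labels = ["phone"] * 4 + ["email"] * 4 + ["name"] * 4 + ["city"] * 4

# Character n-gram features followed by a linear SVM.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char", lowercase=True)),
    ("svm", LinearSVC()),
])

# Hypothetical search grid over the n-gram range and the SVM's C parameter.
param_grid = {
    "tfidf__ngram_range": [(1, 2), (1, 3), (2, 4)],
    "svm__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(texts, labels)

# Classify unseen strings with the best configuration found.
predictions = search.predict(["+1 202 555 0199", "maria@example.net"])
print(search.best_params_, list(predictions))
```

Character n-grams also help with the out-of-vocabulary problem from the question, since laurena shares most of its n-grams with lauren even though it has no word-level embedding.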

2

Applying common categorical labels to words is typically called Named-entity recognition (NER).

NER can be done by static rules (e.g., regular expressions) or learned rules (e.g., decision trees). These rules are often brittle and do not generalize. Conditional Random Fields (CRF) are often a better solution because they are able to model the latent states of languages. Current state-of-the-art performance in NER is achieved with a combination of deep learning models.
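To make the static-rules option concrete, here is a small sketch using regular expressions. The two patterns and the `classify` helper are illustrative assumptions, not a complete or robust rule set, and they show why such rules tend to be brittle: anything that does not fit a pattern falls through to "unknown".

```python
import re

# Hypothetical, deliberately simple rules; real formats vary far more.
RULES = [
    ("email", re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")),
    ("phone", re.compile(r"\+?\d[\d\s().-]{6,}\d")),
]


def classify(text):
    """Return the label of the first rule that matches the whole string."""
    stripped = text.strip()
    for label, pattern in RULES:
        if pattern.fullmatch(stripped):
            return label
    return "unknown"


print(classify("+91 98765 43210"))  # phone
print(classify("sai@example.com"))  # email
print(classify("Laurena"))          # unknown
```

Names, cities, and addresses have no fixed character pattern, which is where learned models such as CRFs or the packages mentioned in this answer become more useful.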

The Stanford Named Entity Recognizer and spaCy are packages to perform NER.

Brian Spiering