
The book Natural Language Processing with Python is a really good resource for understanding the basics of NLP. One of its chapters walks through training a sentence segmenter using a Naive Bayes classifier and provides a method for performing sentence segmentation on an unseen corpus.
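
For reference, the chapter's approach looks roughly like the sketch below; the treebank_raw corpus and the punct_features feature set follow the book's example, and nltk.download('treebank') may be needed first.

    import nltk

    def punct_features(tokens, i):
        # Features describing a candidate sentence boundary at token i ('.', '?' or '!')
        return {
            'next-word-capitalized': tokens[i + 1][0].isupper(),
            'prev-word': tokens[i - 1].lower(),
            'punct': tokens[i],
            'prev-word-is-one-char': len(tokens[i - 1]) == 1,
        }

    # Flatten the corpus into one token stream and record the true sentence boundaries
    sents = nltk.corpus.treebank_raw.sents()
    tokens, boundaries, offset = [], set(), 0
    for sent in sents:
        tokens.extend(sent)
        offset += len(sent)
        boundaries.add(offset - 1)

    # Label each candidate boundary, then train and score a Naive Bayes classifier
    featuresets = [(punct_features(tokens, i), i in boundaries)
                   for i in range(1, len(tokens) - 1) if tokens[i] in '.?!']
    size = int(len(featuresets) * 0.1)
    train_set, test_set = featuresets[size:], featuresets[:size]
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print(nltk.classify.accuracy(classifier, test_set))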

NLTK provides word_tokenize and sent_tokenize. Creating our own tokenizers can help us understand how one works, but in a production environment why would we want a custom tokenizer? And if I built a custom tokenizer, how could I measure if it was better than NLTK's tokenizer?
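
For context, the built-in tokenizers are used roughly like this (the sample sentence is just an illustration, and the punkt models may need to be downloaded first):

    import nltk
    # nltk.download('punkt')  # sentence tokenizer models, if not already installed
    from nltk.tokenize import sent_tokenize, word_tokenize

    text = "Dr. Smith went to Washington. He arrived at 3 p.m. and left early."
    print(sent_tokenize(text))  # list of sentences
    print(word_tokenize(text))  # list of word tokens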


1 Answer

Why would we want a custom tokenizer?

Segmentation is a very large topic, and as such there is no perfect natural language tokenizer. Any toolkit needs to be flexible, and the ability to swap out the tokenizer is useful and important: it lets you experiment, and it lets you replace the default when requirements differ or when better approaches are found for specific problems.
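
For example, a common reason is domain text full of abbreviations that the default models mishandle. Here is a rough sketch of training and swapping in a domain-specific sentence tokenizer with NLTK's Punkt trainer; the file path and sample text are hypothetical placeholders.

    from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

    # Hypothetical plain-text corpus from your own domain
    domain_text = open('my_domain_corpus.txt', encoding='utf-8').read()

    trainer = PunktTrainer()
    trainer.INCLUDE_ALL_COLLOCS = True  # also learn collocations such as abbreviations
    trainer.train(domain_text)

    # Drop the learned parameters into a sentence tokenizer and use it as a replacement
    custom_sent_tokenizer = PunktSentenceTokenizer(trainer.get_params())
    print(custom_sent_tokenizer.tokenize("See Fig. 3 for details. Results follow."))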

How could I measure if it was better than NLTK's tokenizer?

Any time you are trying to quantify performance (i.e. "better"), you first need to define what better means. Once that is done, you would typically run the segmentation task with each method under evaluation and compare the results against your definition of better. A couple of links which discuss these topics:
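
As a rough illustration, if "better" means placing sentence boundaries in the same positions as a hand-annotated gold standard, you could score each tokenizer with precision, recall and F1 over boundary positions; the boundary sets below are hypothetical.

    def boundary_scores(gold_boundaries, predicted_boundaries):
        # Precision/recall/F1 over sets of boundary positions
        # (e.g. token indices in the same token stream for both tokenizers)
        gold, pred = set(gold_boundaries), set(predicted_boundaries)
        true_positives = len(gold & pred)
        precision = true_positives / len(pred) if pred else 0.0
        recall = true_positives / len(gold) if gold else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1

    # Hypothetical example: gold boundaries after tokens 5, 12 and 20
    print(boundary_scores({5, 12, 20}, {5, 11, 20}))  # roughly (0.67, 0.67, 0.67)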
