9

I have a huge file of customer complaints about my company's products. I would like to analyze those descriptions and tag each one with a category.

For example, I need to count how many complaints concern the software side and how many concern the hardware side of my product. Currently, I am using Excel for the analysis, which requires a significant amount of manual work to assign a tag to each complaint.

Is there a way to build and train an NLP model to automate this process? I have been reading about NLP for the past couple of days, and it looks like it has a lot of features that would give me a head start on this problem. Could someone please guide me on how I should use NLP here?

SRS

2 Answers

7

One way to handle this is to use 'supervised classification'. In this approach, you manually classify a subset of the data and use it to train a classifier. Then, you feed the remaining data into the trained classifier so that it assigns the categories for you.

This can be done with NLTK for Python (nltk.org).

If you are simply looking for strings like "hardware" and "software", this is a simple use case, and you will likely get decent results using a 'feature extractor', which informs your classifier which phrases in the document are relevant.
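As a rough illustration of the idea, here is a minimal sketch using NLTK's Naive Bayes classifier with a keyword-based feature extractor. The keyword list, the example complaints, and the function name complaint_features are all made up for this sketch; you would substitute your own keywords and your own manually labeled subset.

```python
import nltk

# Hypothetical keyword list; replace with terms that matter in your data.
KEYWORDS = ["screen", "battery", "charger", "crash", "update", "login"]

def complaint_features(text):
    """Map a complaint to a dict of boolean 'contains(keyword)' features."""
    words = set(text.lower().split())
    return {f"contains({kw})": kw in words for kw in KEYWORDS}

# Made-up complaints standing in for the manually classified subset.
labeled = [
    ("the screen cracked after one week", "hardware"),
    ("battery drains within two hours", "hardware"),
    ("charger stopped working", "hardware"),
    ("app crash after the latest update", "software"),
    ("cannot login since the update", "software"),
    ("login page shows an error and then a crash", "software"),
]

# Train on (feature dict, label) pairs.
train_set = [(complaint_features(text), label) for text, label in labeled]
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Classify complaints that nobody has tagged yet.
unlabeled = ["my battery will not charge", "crash on every login"]
predictions = [classifier.classify(complaint_features(t)) for t in unlabeled]
print(predictions)
```

With real data you would want far more labeled examples than this, and you would hold some of them back to check how often the classifier agrees with your manual tags.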

While it's possible to implement an automated method for finding the keywords, it sounds like you have a list in mind already, so you can skip that step and just use the tags you are aware of. (If your results aren't satisfactory the first time, this is something you might try later on).

That's an overview for getting started. If you are unhappy with the initial results, you can refine your classifier by introducing more complex methods, such as sentence segmentation, identification of dialogue act types, and decision trees. The sky is the limit (or more likely, your time is the limit)!

More info here.

Zephyr
sheldonkreger
1

Sheldon is correct: this sounds like a fairly typical use case for supervised classification. If every customer complaint is either software or hardware (i.e., no complaint covers both categories, and none falls outside these two classes), then all you need is a binary classifier, which makes things simpler than they otherwise could be.
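One way to check whether the binary assumption holds is to look at a hand-annotated sample and count how many complaints received zero, one, or several tags. The sketch below assumes a hypothetical annotations mapping from complaint IDs to the set of tags an annotator assigned; the data and the function name are invented for illustration.

```python
from collections import Counter

# Made-up annotations: complaint id -> set of tags assigned by a human.
annotations = {
    "c1": {"hardware"},
    "c2": {"software"},
    "c3": {"hardware", "software"},  # covers both categories
    "c4": set(),                     # fits neither category
}

def label_count_summary(annotations):
    """Count complaints with no tag, exactly one tag, and several tags."""
    counts = Counter(len(tags) for tags in annotations.values())
    multiple = sum(n for size, n in counts.items() if size > 1)
    return {"none": counts[0], "exactly_one": counts[1], "multiple": multiple}

summary = label_count_summary(annotations)

# A plain binary classifier only fits if every complaint has exactly one tag.
binary_is_enough = summary["none"] == 0 and summary["multiple"] == 0
print(summary, binary_is_enough)
```

If some complaints land in "none" or "multiple", you would probably need an extra "other" class or a multi-label setup rather than a single binary classifier.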

If you're looking for a Java-based NLP toolkit that supports something like this, you should check out the Stanford Classifier -- it's licensed as open source software under the GPL.

Their wiki page should help you get started using the classifier -- keep in mind that you'll need to manually annotate a large sample of your data as a training set, as Sheldon mentioned.
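Whichever toolkit you pick, the annotation step can be sketched independently of it. The following Python sketch (standard library only) draws a random sample of complaints to annotate by hand and then holds back part of the labeled sample for evaluation. The function names, the sample size of 100, the 20% test fraction, and the placeholder labels are all assumptions for illustration, not recommendations from the Stanford Classifier documentation.

```python
import random

def sample_for_annotation(complaints, sample_size, seed=0):
    """Pick a random subset of complaints to label by hand."""
    rng = random.Random(seed)
    return rng.sample(complaints, min(sample_size, len(complaints)))

def train_test_split(labeled, test_fraction=0.2, seed=0):
    """Shuffle labeled examples and hold back a fraction for evaluation."""
    rng = random.Random(seed)
    shuffled = labeled[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Made-up data standing in for the real complaint file.
complaints = [f"complaint {i}" for i in range(1000)]
to_annotate = sample_for_annotation(complaints, 100)

# Placeholder labels; in practice a person assigns these.
labeled = [(text, "hardware" if i % 2 == 0 else "software")
           for i, text in enumerate(to_annotate)]

train, test = train_test_split(labeled)
print(len(train), len(test))
```

Sampling at random, rather than taking the first rows of the file, helps the training set reflect the mix of complaints in the full data.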

Zephyr
Charlie Greenbacker