I have a spreadsheet with thousands of support-request records, each with fields such as case number and issue description.

Our goal is to classify these records into categories so that each one can be assigned the right priority.

Example:

  • Customer can't use pickup feature.
  • Customer can't dial 911 or Long Distance numbers.

For the first item I have decided to use a category called Best Effort, and for the second an Urgent category.

  • Customer can't use pickup feature, BEST_EFFORT
  • Customer can't dial 911 or Long Distance numbers, URGENT

I'm planning to set up a dictionary of keywords for each category.

best_effort = ['pickup', 'record', 'conference']
urgent = ['system is down', '911', "can't dial emergency", 'call center is down']

My plan is to use TF-IDF and then cosine similarity to find the best-matching category for each record. Does this make sense? Is there a better way to classify this type of information?

Karlo
gogasca

1 Answer


Rather than using an external dictionary of keywords that are indicative of each target class, you may want to take your raw data (or rather a random subset of it) and hand-label the instances, assigning each row a label such as BEST_EFFORT or URGENT. This becomes your training data: each row can be transformed into a bag-of-words vector indicating the presence or absence of each word in that particular text. You can then train a classifier on this data, for example a Naive Bayes classifier, and evaluate it on a held-out portion of the labeled data that the classifier has not seen during training.

The advantages of this approach are: (1) features are computed automatically rather than taken from a hand-built dictionary; (2) each word becomes a probabilistic, weighted indicator of a class rather than a binary dictionary match.
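As a rough illustration of that workflow, here is a small sketch of a multinomial Naive Bayes classifier over bag-of-words counts, written with only the Python standard library. The class name, the add-one smoothing, and every training and test row except the two examples from the question are made up for illustration; real data would need far more labeled rows.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase and keep runs of letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_scores(self, text):
        # Log prior plus summed log likelihoods; unseen words are ignored.
        total_docs = sum(self.class_counts.values())
        tokens = [t for t in tokenize(text) if t in self.vocab]
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical hand-labeled rows; only the first two come from the question.
TRAIN = [
    ("Customer can't use pickup feature", "BEST_EFFORT"),
    ("Customer can't dial 911 or Long Distance numbers", "URGENT"),
    ("Conference bridge drops the recording", "BEST_EFFORT"),
    ("Call recording not saved for agent", "BEST_EFFORT"),
    ("System is down, nobody can make calls", "URGENT"),
    ("Call center is down since this morning", "URGENT"),
]

# Hypothetical held-out rows, never shown to the classifier during training.
TEST = [
    ("Unable to dial emergency 911", "URGENT"),
    ("Pickup group not working on conference phone", "BEST_EFFORT"),
]

model = NaiveBayes().fit([t for t, _ in TRAIN], [y for _, y in TRAIN])

if __name__ == "__main__":
    correct = sum(model.predict(text) == label for text, label in TEST)
    print(f"held-out accuracy: {correct}/{len(TEST)}")
```

In practice a library such as scikit-learn (`CountVectorizer` with `MultinomialNB`) would replace most of this code; the hand-rolled version is only meant to show what the classifier learns from the labeled rows.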

Brandon Loudermilk