26

The problem I am tackling is categorizing short texts into multiple classes. My current approach is to use tf-idf weighted term frequencies and to learn a simple linear classifier (logistic regression). This works reasonably well (around 90% macro F1 on the test set, nearly 100% on the training set). A big problem is unseen words/n-grams.

I am trying to improve the classifier by adding other features, e.g. a fixed-size vector computed from distributional similarities (as computed by word2vec) or other categorical features of the examples. My idea was to simply add these features to the sparse input features from the bag of words. However, this results in worse performance on both the test and training set. The additional features by themselves give about 80% F1 on the test set, so they aren't garbage. Scaling the features didn't help either. My current thinking is that this kind of feature doesn't mix well with the (sparse) bag-of-words features.
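For reference, here is a simplified sketch of what I'm doing (not my exact code; `docs`, `labels` and the `w2v` model are placeholders, and the word2vec averaging is simplified):

    import numpy as np
    from scipy.sparse import hstack, csr_matrix
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    X_tfidf = vectorizer.fit_transform(docs)   # sparse bag of n-grams

    def doc_vector(doc, w2v, dim=300):
        # average the word2vec vectors of the words present in the model
        vecs = [w2v[w] for w in doc.split() if w in w2v]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    X_w2v = np.vstack([doc_vector(d, w2v) for d in docs])

    # what I tried: simply append the dense vectors to the sparse matrix
    X_combined = hstack([X_tfidf, csr_matrix(X_w2v)]).tocsr()

    clf = LogisticRegression()
    clf.fit(X_combined, labels)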

So the question is: assuming the additional features provide additional information, what is the best way to incorporate them? Could training separate classifiers and combining them in some kind of ensemble work (this would probably have the drawback that no interaction between the features of the different classifiers could be captured)? Are there other, more complex models I should consider?

elmille
  • 361
  • 1
  • 3
  • 4

2 Answers

17

If I understand correctly, you essentially have two types of features for your models: (1) text data that you have represented as a sparse bag of words and (2) more traditional dense features. If that is the case, then there are three common approaches:

  1. Perform dimensionality reduction (such as LSA via TruncatedSVD) on your sparse data to make it dense, then combine the features into a single dense matrix to train your model(s).
  2. Add your few dense features to your sparse matrix using something like scipy's hstack, producing a single sparse matrix to train your model(s) (there are sketches of (1) and (2) just after this list).
  3. Create a model using only your sparse text data and then combine its predictions (probabilities if it's classification) as a dense feature with your other dense features to create a model (i.e. ensembling via stacking). If you go this route, remember to only use cross-validation predictions as features to train your model, otherwise you'll likely overfit quite badly (you can write a quick class to do all of this within a single Pipeline if desired).
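Rough sketches of (1) and (2) with scikit-learn/scipy; `X_text` (your sparse tf-idf matrix), `X_dense` (your other dense features) and `y` are placeholders for your own data:

    import numpy as np
    from scipy.sparse import hstack, csr_matrix
    from sklearn.decomposition import TruncatedSVD
    from sklearn.linear_model import LogisticRegression

    # (1) reduce the sparse text features to a dense matrix via LSA,
    #     then concatenate with the other dense features
    svd = TruncatedSVD(n_components=200)
    X_text_lsa = svd.fit_transform(X_text)           # dense (n_samples, 200)
    X_all_dense = np.hstack([X_text_lsa, X_dense])
    clf_dense = LogisticRegression().fit(X_all_dense, y)

    # (2) keep the text features sparse and stack the dense ones onto them
    X_all_sparse = hstack([X_text, csr_matrix(X_dense)]).tocsr()
    clf_sparse = LogisticRegression().fit(X_all_sparse, y)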

All three approaches are valid and have their own pros and cons. Personally, I find (1) to typically be the worst because it is, relatively speaking, extremely slow. I also find (3) to usually be the best, being both sufficiently fast and resulting in very good predictions. You can obviously do a combination of them as well if you're willing to do some more extensive ensembling.
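And a minimal sketch of the stacking idea in (3), using cross_val_predict so the second-stage model only ever sees out-of-fold predictions (again, `X_text`, `X_dense` and `y` are placeholders):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict

    text_clf = LogisticRegression()

    # out-of-fold class probabilities from the text-only model
    text_probs = cross_val_predict(text_clf, X_text, y, cv=5,
                                   method='predict_proba')

    # the probabilities become dense features alongside the others
    X_stacked = np.hstack([text_probs, X_dense])
    meta_clf = LogisticRegression().fit(X_stacked, y)

    # for new data, refit the text model on all of the training data first,
    # then feed its predict_proba output into meta_clf
    text_clf.fit(X_text, y)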

As for the algorithms you use, they can essentially all fit within that framework. Logistic regression performs surprisingly well most of the time, but others may do better depending on the problem at hand and how well you tune them. I'm partial to GBMs myself, but the bottom line is that you can try as many algorithms as you would like and even doing simple weighted ensembles of their predictions will almost always lead to a better overall solution.
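For the weighted-ensemble part, blending predicted probabilities is only a few lines; `clf_lr` and `clf_gbm` are placeholder fitted models and the weights are made up, so tune them on a validation set:

    import numpy as np

    # class probabilities from two already-fitted models on the same test set
    probs_lr = clf_lr.predict_proba(X_test)
    probs_gbm = clf_gbm.predict_proba(X_test)

    # made-up weights; tune them on a validation set
    blend = 0.6 * probs_lr + 0.4 * probs_gbm
    y_pred = blend.argmax(axis=1)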

David
  • 810
  • 5
  • 9
10

Linear models simply add up their features multiplied by the corresponding weights. If, for example, you have 1000 sparse features, only 3 or 4 of which are active in each instance (while the others are zero), and 20 dense features that are all non-zero, then it's pretty likely that the dense features will make most of the impact while the sparse features add only a little value. You can check this by looking at the feature weights for a few instances and how they influence the resulting sum.
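For example, with a fitted scikit-learn LogisticRegression you could do that check roughly like this (`clf`, `X` and the assumption that the first 1000 columns are the bag of words are all placeholders):

    import numpy as np

    # one instance as a dense vector (assuming X is a scipy sparse matrix)
    x = X[0].toarray().ravel()

    # per-feature contribution to the decision function: weight * value
    # (binary case; with multiple classes coef_ has one row per class)
    contrib = clf.coef_[0] * x

    n_sparse = 1000   # assume the first 1000 columns are the bag of words
    print("sparse contribution:", contrib[:n_sparse].sum())
    print("dense contribution: ", contrib[n_sparse:].sum())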

One way to fix this is to move away from the additive model. Here are a couple of candidate models.

SVM is based on separating hyperplanes. Though the hyperplane is a linear model itself, SVM doesn't just sum up its parameters; instead, it tries to split the feature space in an optimal way. Given the number of features, I'd say that a linear SVM should work fine, while more complicated kernels may tend to overfit the data.

Despite its name, Naive Bayes is a pretty powerful statistical model that has shown good results for text classification. It's also flexible enough to capture the imbalance in frequency between sparse and dense features, so you should definitely give it a try.

Finally, random forests may work as a good ensemble method in this case. Randomization ensures that different kinds of features (sparse/dense) will be used as primary decision nodes in different trees. RF/decision trees are also good for inspecting the features themselves, so it's worth looking at their structure anyway.
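If you want to try all three quickly, a rough comparison loop could look like this (`X` is your combined feature matrix and `y` your labels; note that MultinomialNB needs non-negative features, so it may not work directly with embedding-style features):

    from sklearn.svm import LinearSVC
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    models = {
        "linear SVM": LinearSVC(),
        "naive Bayes": MultinomialNB(),      # needs non-negative features
        "random forest": RandomForestClassifier(n_estimators=200),
    }

    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5, scoring="f1_macro")
        print(name, scores.mean())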

Note that all of these methods have drawbacks that may turn them into garbage in your case. Combining sparse and dense features isn't really a well-studied task, so let us know which of these approaches works best for your case.

ffriend
  • 2,831
  • 19
  • 19