
I have a binary classification task and the data has an imbalance issue (99% of instances are negative and 1% are positive). I was able to build a decision tree that is carefully tuned, class-weighted, and post-pruned. Call this tree1: it has high recall and medium-high precision, so it performs well at detecting positive instances.

I wonder how I can improve its performance by incorporating ensemble methods (bagging, boosting, stacking, etc.).

One important thing to note is that using a large number of trees (e.g., a Random Forest with 100+ trees) is not allowed in our production environment because of real-time serving requirements. I am looking for an incremental performance gain by adding only 1 or 2 trees at most. Is this possible?

I know that ensemble methods usually start with a large group of weak learners (default or lightly tuned) and then take a majority vote, assuming all trees are weighted roughly equally. In my case, however, I have a fine-tuned decision tree as a "strong" base learner, so I would probably need soft voting (with tree1 weighted more heavily). But does ensembling still make sense with only three trees?

Let me ask from another perspective: if tree1 has high recall and low precision, how can I build a tree2 that improves precision while keeping the high recall? If tree2 is tuned for high precision and low recall, would it be possible to use ensemble learning to balance out the weaknesses of both trees and obtain a final model with both high recall and high precision?
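
To make the idea concrete, here is roughly what I have in mind, written with scikit-learn's VotingClassifier (the hyper-parameters and weights below are placeholders, not my actual tuned values):

```python
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier

# tree1: the carefully tuned, class-weighted, post-pruned tree (placeholder parameters)
tree1 = DecisionTreeClassifier(class_weight="balanced", max_depth=8, ccp_alpha=1e-3)
# tree2: a hypothetical second tree tuned toward precision instead of recall
tree2 = DecisionTreeClassifier(class_weight={0: 1, 1: 20}, max_depth=6)

# Soft voting averages predicted probabilities; the weights let tree1 count more.
ensemble = VotingClassifier(
    estimators=[("tree1", tree1), ("tree2", tree2)],
    voting="soft",
    weights=[2, 1],  # placeholder weights, to be tuned on a validation set
)
# ensemble.fit(X_train, y_train)
# probs = ensemble.predict_proba(X_test)[:, 1]
```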

szheng

2 Answers


One option is cascading: putting machine learning models in a row where the output of one model becomes the input of the next. The first model is typically high recall, the second high precision. The first model reduces the search space by selecting all likely candidates; the second model makes sure the items that pass through are labeled as accurately as possible. This hierarchical modeling is common in search engineering, where models have to be both fast and correct.
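
A minimal sketch of what such a cascade could look like with two scikit-learn decision trees (the hyper-parameters are placeholders, and the second stage would need to be tuned for precision on your own data):

```python
from sklearn.tree import DecisionTreeClassifier

# Stage 1: the existing high-recall tree. Stage 2: a second tree trained only
# on the candidates that stage 1 lets through, so it learns to separate
# true positives from stage-1 false positives.
stage1 = DecisionTreeClassifier(class_weight="balanced")
stage2 = DecisionTreeClassifier(class_weight={0: 1, 1: 5})

def fit_cascade(X, y):
    stage1.fit(X, y)
    mask = stage1.predict(X) == 1
    stage2.fit(X[mask], y[mask])

def predict_cascade(X):
    # An example is positive only if both stages agree: this trades a little
    # recall for (ideally) a large gain in precision.
    pred = stage1.predict(X)
    flagged = pred == 1
    if flagged.any():
        pred[flagged] = stage2.predict(X[flagged])
    return pred
```

At serving time only two trees are evaluated, and the second one only on the (rare) examples flagged by the first, so latency stays close to that of a single tree.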

Brian Spiering

You could try some incremental boosting techniques such as AdaBoost or Gradient Boosting. These methods incrementally train the next weak learner so that it compensates for the flaws of the previous ones.

Still, your requirements are not ideal:

  • Having just 1 or 2 additional trees is quite a small number, and you will probably have to play with the hyper-parameters of these algorithms to get a good solution.
  • Most frameworks (i.e., all that I know of) train all of the weak learners themselves. Already having one learner at hand will restrict your options; you may have to code this part on your own (see the sketch below).
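
If you do end up coding it yourself, a rough sketch of a single AdaBoost-style step on top of an already-fitted tree1 might look like this (the function name and hyper-parameters are illustrative, not part of any existing framework):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def add_boosted_tree(tree1, X, y):
    # Up-weight the examples tree1 gets wrong, as one round of AdaBoost would.
    pred1 = tree1.predict(X)
    wrong = pred1 != y
    eps = max(wrong.mean(), 1e-10)            # error of tree1 (uniform start weights)
    alpha1 = 0.5 * np.log((1 - eps) / eps)    # vote weight of tree1
    sample_weight = np.where(wrong, np.exp(alpha1), np.exp(-alpha1))
    sample_weight /= sample_weight.sum()

    tree2 = DecisionTreeClassifier(max_depth=3)  # keep the extra tree small and fast
    tree2.fit(X, y, sample_weight=sample_weight)
    return tree2, alpha1
```

At prediction time the two trees would then vote with weights alpha1 and alpha2 (the latter computed analogously for tree2), mapping the labels to {-1, +1} and taking the sign of the weighted sum.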
Broele