
I have a dataset of around 180k observations with 13 variables (a mix of numerical and categorical features). It is a binary classification problem, but the classes are imbalanced (roughly 25:1 in favour of the negative class). I wanted to deploy XGBoost (in R) and reach the best possible Precision & Recall. To deal with the imbalance I tried upsampling the positive class, as well as giving the positive class a high weight in XGB. However, although Recall is pretty high, Precision is very poor (around 0.10).

My parameter tuning for XGB (roughly as in the sketch after this list):

  • Random search over the parameters - 10 iterations

  • 5-fold CV

  • Parameter intervals: max_depth = 3-10, lambda = 0-50, gamma = 0-10, min_child_weight = 1-10, eta = 0.01-0.20
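The tuning loop looks roughly like this (a simplified sketch only: feature preparation is omitted, `dtrain` stands for the xgb.DMatrix built from my training data, and the fixed `nrounds` is just a placeholder):

```r
library(xgboost)

set.seed(42)
results <- vector("list", 10)

for (i in 1:10) {
  # draw one random parameter combination from the intervals above
  params <- list(
    objective        = "binary:logistic",
    eval_metric      = "aucpr",             # PR-AUC, since Precision/Recall is the goal
    max_depth        = sample(3:10, 1),
    lambda           = runif(1, 0, 50),
    gamma            = runif(1, 0, 10),
    min_child_weight = sample(1:10, 1),
    eta              = runif(1, 0.01, 0.20)
  )

  # 5-fold cross-validation for this parameter combination
  cv <- xgb.cv(
    params  = params,
    data    = dtrain,
    nrounds = 500,                          # placeholder
    nfold   = 5,
    verbose = 0
  )

  results[[i]] <- list(params = params, cv = cv)
}
```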

Then I tried a Random Forest on the upsampled dataset and it performed surprisingly well, with Recall of 0.88 and Precision of 0.73 (on the test dataset).

Could someone please tell me whether it is plausible for RF to outperform XGB by this much, or whether it is a sign that I am doing something wrong? Thank you very much.

Filip

1 Answer


There are two important things in random forests: "bagging" and "random". Broadly speaking, bagging means that only a part of the "rows" is used for each tree (see details here), while "random" means that only a small fraction of the "columns" (features, usually $\sqrt{m}$ by default) is considered for a single split. This also lets seemingly "weak" features have a say in the prediction and avoids a few features dominating the model (and thus helps to avoid overfitting).
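As a rough illustration only (using the randomForest package, with a placeholder data frame `train` and factor target `y` that are not from your post), these two mechanisms correspond to the bootstrap sampling of rows and the `mtry` argument:

```r
library(randomForest)

p <- ncol(train) - 1                 # number of predictor columns (placeholder data)

rf <- randomForest(
  y ~ .,
  data    = train,
  ntree   = 500,                     # each tree sees a bootstrap sample of the rows ("bagging")
  mtry    = floor(sqrt(p)),          # columns tried at each split; sqrt(m) is the classification default
  replace = TRUE                     # sample the rows with replacement
)
```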

Looking at your XGB parameters, I notice that you do not subsample rows and columns, which is possible via the parameters subsample and colsample_bytree. You could also use scale_pos_weight to tackle the imbalanced classes. Subsampling columns and rows can be useful if you have some dominant features or observations in your data. I suspect that with subsampling (this would be "stochastic gradient boosting"), the XGB results would improve and come "closer" to the results obtained with the random forest.
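A minimal sketch of what I mean (placeholder values only, with `dtrain` again standing for your xgb.DMatrix; these parameters should be tuned like the others):

```r
library(xgboost)

params <- list(
  objective        = "binary:logistic",
  eval_metric      = "aucpr",
  eta              = 0.05,
  max_depth        = 6,
  subsample        = 0.8,    # use 80% of the rows per boosting round ("stochastic" gradient boosting)
  colsample_bytree = 0.7,    # use 70% of the columns per tree, analogous to mtry in a random forest
  scale_pos_weight = 25      # roughly #negatives / #positives for your 25:1 imbalance
)

bst <- xgb.train(params = params, data = dtrain, nrounds = 500)
```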

Also make sure you have enough boosting rounds (to allow good learning progress). You can add a watchlist and an early_stopping_rounds criterion to stop boosting once no more progress is made. In that case you would set nrounds to a "high" number and let boosting stop after early_stopping_rounds rounds without further improvement, as in the generic sketch below.
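A generic sketch (assuming the `params` list from above and placeholder xgb.DMatrix objects `dtrain` and `dvalid` from a train/validation split):

```r
watchlist <- list(train = dtrain, valid = dvalid)

bst <- xgb.train(
  params                = params,
  data                  = dtrain,
  nrounds               = 5000,      # set "high"; early stopping determines the effective number
  watchlist             = watchlist,
  early_stopping_rounds = 50,        # stop if the "valid" metric does not improve for 50 rounds
  maximize              = TRUE,      # aucpr should be maximized
  print_every_n         = 100
)

bst$best_iteration                   # boosting round with the best validation score
```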

Peter