
I have read in some papers that the subset of features chosen for a boosting tree algorithm can make a big difference in performance, so I have been trying several selection methods: RFE, Boruta, variable clustering, correlation filtering, WOE & IV, and Chi-square.
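
For reference, here is roughly how I ran the Boruta step (a minimal sketch; `df` and `target` are placeholder names for my modeling data frame and binary label column):

```r
# Rough sketch of the Boruta run; `df` and `target` are placeholder names.
library(Boruta)

set.seed(42)
# Boruta compares each real feature's importance against permuted
# "shadow" copies and confirms features that beat them consistently.
bor <- Boruta(target ~ ., data = df, doTrace = 0)

# Force a decision on tentative features, then extract the confirmed set
bor <- TentativeRoughFix(bor)
selected <- getSelectedAttributes(bor, withTentative = FALSE)
print(selected)
```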

Let's say I have a classification problem with over 40 variables. My best results after a long, long time of testing:

  • all variables for LightGBM (except one highly collinear variable)
  • for XGBoost, I removed correlated variables (around 8 of them)
  • for CatBoost, I removed variables based on an elastic net model (around 7 of them); a sketch of both filtering steps follows this list
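
Here is a minimal sketch of those two filters (`X` is a numeric feature matrix and `y` a 0/1 label, both placeholder names; the 0.9 cutoff and alpha = 0.5 are illustrative, not my tuned values):

```r
# Sketch of the correlation and elastic net filters; `X` and `y` are
# placeholder names, and the cutoff/alpha values are illustrative.
library(caret)
library(glmnet)

# Correlation filter (the XGBoost set): drop one variable from each
# highly correlated pair.
cor_mat <- cor(X)
drop_idx <- findCorrelation(cor_mat, cutoff = 0.9)
X_reduced <- if (length(drop_idx) > 0) X[, -drop_idx] else X

# Elastic net filter (the CatBoost set): keep features with nonzero
# coefficients at the cross-validated lambda.
fit <- cv.glmnet(as.matrix(X_reduced), y, family = "binomial", alpha = 0.5)
coefs <- as.matrix(coef(fit, s = "lambda.1se"))
keep <- setdiff(rownames(coefs)[coefs[, 1] != 0], "(Intercept)")
print(keep)
```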

My question is: what is the proper way to choose candidate variables when modeling a boosting tree (especially LightGBM)?

I'm using R, in case there are any suggestions for packages.

