
I have a dataset with 20 variables and ~50K observations, and I created several new features from those 20 variables.

I compared the results of a GBM model (using Python xgboost and LightGBM) and found that, no matter which hyper-parameters I use, the 'thinner' version leads to better results (AUC), even though all 20 original variables are included in the wider version.

When I run the same comparison with a Lasso model, the wider version is better (~1% higher), as expected.
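For reference, this is roughly how I set up the GBM comparison (a minimal sketch; `df`, `original_cols`, `engineered_cols` and the `target` column name are placeholders for my actual data):

```python
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def fit_and_score(feature_cols):
    # Train on one feature set and report AUC on a held-out split
    X_train, X_test, y_train, y_test = train_test_split(
        df[feature_cols], df["target"], test_size=0.2, random_state=42
    )
    model = xgb.XGBClassifier(n_estimators=500)
    model.fit(X_train, y_train)
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

auc_thin = fit_and_score(original_cols)                    # 20 original variables only
auc_wide = fit_and_score(original_cols + engineered_cols)  # originals + new features
print(f"thin: {auc_thin:.4f}  wide: {auc_wide:.4f}")
```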

I guess it could be related to the randomness in the GBM, but I was surprised to see that the GBM doesn't fix it along the way.

Any explanation of this phenomenon would be appreciated.

Yaron

2 Answers


To put it shortly: xgboost tries to fix it, and although it is very good at controlling overfitting, it is not perfect.

Adding new features is not always beneficial, because you increase the dimension of your search space and thus make the problem harder. In your particular case the increased complexity outweighs the added value of the extra features.

I understand you've tested quite a wide range of hyper-parameters and enough combinations. If you apply regularisation via colsample_bytree and/or colsample_bylevel, it might happen that at some stage the randomly chosen columns (features) are less informative than your original features, and the algorithm is forced to use them for further splits. Does that make sense to you?
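To illustrate the column subsampling I mean (a minimal sketch; `X_train`, `y_train`, `X_valid`, `y_valid` are assumed to be your prepared data):

```python
import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=1000,
    colsample_bytree=0.7,   # each tree sees only a random 70% of the columns
    colsample_bylevel=0.7,  # each tree level again samples 70% of those columns
    eval_metric="auc",
)
# With subsampling below 1.0, some trees are built mostly from the weaker
# engineered features, which is the situation described above.
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)
```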

The number of added features might be crucial: if it is too high relative to the original 20, the new features become just too dominant. For example, this can happen when one adds nominal features with high cardinality, which are then dummy-encoded.

In order to improve the results with the wider data, you might want to play with the parameters controlling early stopping and stop the fitting even earlier.
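Something along these lines (a sketch; the split variables are assumed, and with older xgboost versions early_stopping_rounds has to be passed to fit() instead of the constructor):

```python
import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=2000,
    learning_rate=0.05,
    eval_metric="auc",
    early_stopping_rounds=20,  # stop after 20 rounds without validation AUC improvement
)
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)
print("best iteration:", model.best_iteration)
```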

Edit

The rationale for my last suggestion: I assumed that the worse performance on the wider dataset is due to overfitting, i.e. significantly worse performance on the test dataset than on the training data. Early stopping should prevent / control the overfitting, but it seems it didn't work so well in your case and thus should be tuned further.

You can and should test different combinations of new features, but a trustworthy quality metric is crucial for the model choice. If you overfit significantly, the performance on the training data (or even cross-validation) will not help you choose the model (or feature combination) with good performance on the test dataset.
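For example, a quick way to see whether overfitting is the culprit is to compare training and held-out AUC for the wide model (a sketch, assuming `X_train`, `X_test`, `y_train`, `y_test` come from a split that was not used for hyper-parameter tuning):

```python
import xgboost as xgb
from sklearn.metrics import roc_auc_score

model = xgb.XGBClassifier(n_estimators=500)
model.fit(X_train, y_train)

# A large gap between the two numbers indicates overfitting
auc_train = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
auc_test = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"train AUC {auc_train:.4f}  vs  test AUC {auc_test:.4f}")
```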

aivanov

Welcome to the site!

If I understand your question correctly, you want to know why a model would perform worse when new features are added?

Every time you do feature engineering (adding new columns, deriving columns, standardizing the data, normalizing the data, etc.) there is a flip side of the coin. If the added features explain something about the target variable, they help increase the accuracy; on the other hand, if a feature doesn't have much relation to the target variable, it doesn't help.
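One quick sanity check (a sketch, assuming `model` is an XGBClassifier already fitted on the wide feature set) is to look at gain-based feature importance and see whether the engineered features carry any signal:

```python
import pandas as pd

# Gain-based importance per feature; features never used in a split are omitted
gains = model.get_booster().get_score(importance_type="gain")
importance = pd.Series(gains).sort_values(ascending=False)
print(importance.head(20))  # engineered features near the bottom contribute little
```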

Now, before going to modeling, there are a few things you can go through (you may already have done all of these):

  1. Eliminating Unnecessary Features using Business Understanding
  2. Removing Outliers
  3. Imputing Missing Data (XGBoost can handle missing values natively)
  4. Standardizing data
  5. Correlation Analysis, both with the target variable and among the variables themselves: eliminate one of each pair of highly correlated variables (as you want the predictors to be independent), and eliminate variables that have little relation to the target variable (see the sketch below).
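A small sketch of point 5, assuming `df` is a pandas DataFrame whose feature columns are numeric and which contains a `target` column:

```python
import pandas as pd

corr = df.corr()  # pairwise Pearson correlations over numeric columns

# Correlation of each feature with the target, strongest first
print(corr["target"].drop("target").sort_values(key=abs, ascending=False))

# Pairs of features that are highly correlated with each other (candidates to drop)
high = (corr.abs() > 0.9) & (corr.abs() < 1.0)
pairs = [(a, b) for a in high.index for b in high.columns if a < b and high.loc[a, b]]
print(pairs)
```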

Once the above steps are done, you can get a model which explains the data well. To improve the accuracy, you need to do more feature engineering and try to find whether there are external factors which might affect your model.

There are many reasons why a model doesn't work well, and some of the points above might explain why your model is not performing well in your scenario. Have a closer look at the data -- this might give you a better idea. This is a top-level view.

Toros91