9

If I understand tree-based methods correctly, more important features should end up toward the top of the tree.

Is there a way I can dictate this in xgboost? Similar to how I can assign a weight to each sample/row, can I somehow make certain features more likely to be placed near the top than other features (or otherwise enforce that certain columns play a more important role in the model)?

I am aware that you would normally want the data to guide the model rather than enforcing this, but one logical case for it is when you have limited data and would like to help the process along with your subject-matter expertise.

user2677285

6 Answers

4

You may duplicate some features in your dataframe. I was looking for exactly the two things you mention: putting some features closer to the root node, and having some features appear more often on branches. I understand your concern is to make the trees more aware of certain features, and I can suggest something for that. XGBoost samples features uniformly, so there is currently no way to tell it that some features are more important and should be used more often; the documentation describes no weighted sampling scheme for features (columns). A short hack is to duplicate the favoured columns while decreasing the colsample_bytree ratio, so that the duplicates are drawn more often. See colsample_bytree: https://xgboost.readthedocs.io/en/latest/parameter.html (Credit: I heard this technique in a webinar by https://www.kaggle.com/aerdem4.)
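
Here is a minimal sketch of that hack with made-up data (the column names "rain"/"wind_*" are only illustrative); duplicating a column only changes anything if colsample_bytree is below 1, so that column sampling actually happens:

    import numpy as np
    import pandas as pd
    import xgboost as xgb

    # Toy data: "rain" is the feature we want the trees to see more often.
    rng = np.random.default_rng(0)
    X = pd.DataFrame({
        "rain": rng.normal(size=500),
        "wind_1": rng.normal(size=500),
        "wind_2": rng.normal(size=500),
        "wind_3": rng.normal(size=500),
    })
    y = 2.0 * X["rain"] + 0.1 * X["wind_1"] + rng.normal(scale=0.5, size=500)

    # Duplicate the favoured column so uniform column sampling draws it more
    # often, and lower colsample_bytree so sampling actually takes place.
    X_dup = X.copy()
    X_dup["rain_copy"] = X_dup["rain"]

    model = xgb.XGBRegressor(n_estimators=200, colsample_bytree=0.5)
    model.fit(X_dup, y)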

Generally, newcomers like me create a lot of features through engineering, and the genuinely important features can end up being a small minority of the columns. Hyperparameter search over XGBoost's sampling parameters can partly deal with this, but it really does not seem beneficial to grow trees that never see the rain data because they are dominated by a large number of engineered wind features, when the problem is predicting drought.

If your concern is encoding domain knowledge, you can also express it via "feature interaction constraints" in XGBoost; see the documentation. I hope to edit this answer once I am 100% sure what I am doing with feature interactions: https://xgboost.readthedocs.io/en/latest/tutorials/feature_interaction_constraint.html#:~:text=Feature%20interaction%20constraints%20are%20expressed,but%20with%20no%20other%20variable.
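
For reference, a minimal sketch of how such constraints can be passed to the scikit-learn wrapper (the data and the grouping [[0, 1], [2, 3, 4]] are made up; constraints are given as lists of feature indices that are allowed to interact):

    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))
    y = X[:, 0] * X[:, 1] + X[:, 2] + rng.normal(scale=0.1, size=500)

    # Features 0 and 1 may only interact with each other; features 2, 3 and 4
    # form a second group that never shares a branch with the first group.
    model = xgb.XGBRegressor(
        n_estimators=100,
        tree_method="hist",
        interaction_constraints=[[0, 1], [2, 3, 4]],
    )
    model.fit(X, y)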

I am aware that none of this answers your primary question directly; it does not seem possible for now. Also, I am not sure that the features placed at the top are the more important ones; that does not always seem to be true.

Edit: XGBoost has added feature_weights to the DMatrix in 1.3.0! However, 1.3.0 is not stable yet. I will edit this answer once I have tried feature weights. https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBRegressor.fit

2

Sometimes, when you have important categories in your population, it is best to split the data set by category and train a different model on each part. This can be the way to go if you really have distinct populations with distinct behaviour, and it can help avoid some category-imbalance problems. However, it may not be practical, as you would have to tune multiple models.
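
A minimal sketch of that per-category approach, assuming a toy dataframe with a hypothetical "segment" column and a binary "target" label:

    import numpy as np
    import pandas as pd
    import xgboost as xgb

    # Made-up data: "segment" defines sub-populations with different behaviour.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "segment": rng.choice(["retail", "corporate"], size=600),
        "x1": rng.normal(size=600),
        "x2": rng.normal(size=600),
    })
    df["target"] = (df["x1"] + (df["segment"] == "retail") * df["x2"] > 0).astype(int)

    # One model per segment, tuned and evaluated separately.
    models = {}
    for segment, df_seg in df.groupby("segment"):
        models[segment] = xgb.XGBClassifier(n_estimators=100).fit(
            df_seg[["x1", "x2"]], df_seg["target"]
        )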

Regarding xgboost, it is designed to handle large data sets quickly. It is usually not desirable to try to influence the learning process to choose which feature it will pick first.

If there is one feature that the model should pick first according to expert knowledge, but it doesn't:

  • look at which features rank above it in terms of importance and whether there might be a problem with them;
  • try to improve your feature with feature engineering.
Lucas Morin
1

In fact, tree algorithms choose the feature they split on by computing a metric that evaluates which split is best. The best-known metrics are Gini impurity and entropy.

So the goal is to make the best splits automatically, given what is in the data. Forcing features to the top of the tree would mean degrading performance, since the tree already makes its cuts on the variables that provide the greatest improvement.
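
To illustrate the point, a small sketch of the impurity calculation behind such a split choice (XGBoost itself uses a gain criterion computed from gradients rather than Gini, but the principle is the same; the data below are made up):

    import numpy as np

    def gini(labels):
        """Gini impurity of a set of class labels."""
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def impurity_decrease(y, mask):
        """How much a boolean split reduces the weighted Gini impurity."""
        left, right = y[mask], y[~mask]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        return gini(y) - weighted

    y = np.array([0, 0, 0, 1, 1, 1])
    x1 = np.array([1, 2, 3, 7, 8, 9])   # informative feature
    x2 = np.array([5, 1, 9, 2, 8, 3])   # uninformative feature

    # The split on x1 separates the classes far better, so a tree puts x1
    # near the top no matter what the modeller would prefer.
    print(impurity_decrease(y, x1 < 5))  # 0.5
    print(impurity_decrease(y, x2 < 5))  # ~0.056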

Adept
0

It's not possible to set feature importances in advance, but you can reduce your model's dependence on specific features. Since the earlier trees have a greater influence on the final predictions, you can reduce the importance of certain features by excluding them from the initial trees. For example, suppose you train an XGBoost model with 1,000 trees, and features F1 and F2 have the highest feature importances. If you exclude F1 and F2 from the first 200 trees, the importances of these features will be significantly reduced.

This strategy can help improve generalization, especially if your model heavily relies on F1 and F2, and you expect a distribution shift between the training and test data. Here’s how to implement this:

  1. Train the first 200 trees without F1 and F2.

  2. Save the model's JSON file.

  3. Modify the JSON file:

    • Append the features you excluded from the first 200 trees back onto the end of the feature list (if you insert them anywhere else, you would need to change the feature indices inside every tree structure).
    • Adjust the feature_num to reflect the total number of features (after adding F1 and F2 back).
    • You don’t need to modify the tree index splits; they will remain the same.
  4. Retrain the model by loading the pre-trained model with 200 trees, but this time, include F1 and F2 in the training data:

    model.fit(ddf_X_train, ddf_y_train, xgb_model=previous_model)

As a result, the final model will show much smaller feature importances for F1 and F2.
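
A rough sketch of the two training stages with made-up data and feature names (F1–F5); note that step 4 only succeeds after the manual JSON edit of steps 2–3, otherwise the feature counts will not match:

    import numpy as np
    import pandas as pd
    import xgboost as xgb

    # Made-up data where F1 and F2 dominate the target.
    rng = np.random.default_rng(0)
    X = pd.DataFrame(rng.normal(size=(1000, 5)),
                     columns=["F1", "F2", "F3", "F4", "F5"])
    y = 3 * X["F1"] + 3 * X["F2"] + X["F3"] + rng.normal(scale=0.5, size=1000)

    # Step 1: train the first 200 trees without F1 and F2.
    first_stage = xgb.XGBRegressor(n_estimators=200)
    first_stage.fit(X.drop(columns=["F1", "F2"]), y)
    first_stage.save_model("first_stage.json")

    # Steps 2-3: patch first_stage.json by hand as described above
    # (append F1 and F2 to the feature list and update the feature count).

    # Step 4: continue training from the patched model on the full feature
    # set; the 800 new trees are added on top of the existing 200.
    final_model = xgb.XGBRegressor(n_estimators=800)
    final_model.fit(X, y, xgb_model="first_stage.json")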

0

In XGBoost, it is currently possible to influence how important a feature becomes by biasing the feature-selection process during tree construction.

Instead of considering all available features for each split, the algorithm can be told to select a random subset of features before each split. You can control the size of this subset as a proportion of the total number of features, and you can also assign a different weight to each feature, affecting how likely it is to be sampled and therefore how much it contributes to the model.

Note this is not the same "feature importance" that you can obtain from the trained model with get_score().

See parameters colsample_bynode and feature_weights in the documentation. Also see a very basic example.
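
A minimal sketch of how those two parameters might be combined (not the linked example; the data and weights are made up, and this assumes XGBoost ≥ 1.3). The feature weights only have an effect while one of the colsample_by* parameters is below 1:

    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4))
    y = X @ np.array([1.0, 1.0, 0.5, 0.5]) + rng.normal(scale=0.3, size=500)

    dtrain = xgb.DMatrix(X, label=y)
    # Columns are sampled with probability proportional to these weights.
    dtrain.set_info(feature_weights=np.array([4.0, 4.0, 1.0, 1.0]))

    booster = xgb.train(
        {"objective": "reg:squarederror", "colsample_bynode": 0.5},
        dtrain,
        num_boost_round=100,
    )
    # Note: this is the post-hoc importance, not the weights set above.
    print(booster.get_score(importance_type="weight"))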

Fanta
0

XGBoost calculates feature importance automatically; I did not find any method to set it manually. You have two options to work around the problem:

  • set a higher weight on the important samples (see the sketch below);
  • split your data on the very important feature and train multiple models.
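
A minimal sketch of the first option with made-up data; the weighting rule (up-weighting rows with extreme values of the first column) is only illustrative:

    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))
    y = (X[:, 0] > 0).astype(int)

    # Up-weight the rows considered important, so that the splits which
    # separate them contribute more to the training loss.
    sample_weight = np.where(np.abs(X[:, 0]) > 1.0, 5.0, 1.0)

    model = xgb.XGBClassifier(n_estimators=100)
    model.fit(X, y, sample_weight=sample_weight)
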
user1941407