
At each node of a decision tree, we must choose a collection of features to split along.

Suppose we know a priori that the features can be partitioned into 'correlated' subsets, e.g. one subset describes someone's hat and another subset describes their shoes.

Is there any way to force this partitioning to be respected when choosing which features to split along?

For example, if $k$ features are chosen at a node, ensure that all $k$ come from the same subset of the partition.

3 Answers


Maybe you can try running a principal component analysis (PCA) on your data set first, and then use the resulting components as the variables for building your tree. That way, at each split, the tree algorithm will be selecting from specific combinations of your original features.

PCA will build components that describe structure present in your data, such as contrasts between variables, overall size, and so on.
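As a rough illustration of this idea, here is a minimal sketch assuming scikit-learn; the toy data, number of components, and tree depth are all made-up choices, not a prescription:

```python
# Run PCA on the raw features, then train a decision tree on the components.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))               # toy data: 6 raw features
y = (X[:, 0] + X[:, 3] > 0).astype(int)     # toy binary target

model = make_pipeline(PCA(n_components=4), DecisionTreeClassifier(max_depth=3))
model.fit(X, y)
# Each split in the tree now uses a principal component,
# i.e. a linear combination of the original features.
```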

Nick

One simple way would be to create new (composite) features associated with each subgroup of original features, and feed these new composite features to the tree model instead.

Otherwise, there is no built-in way for current tree algorithms to treat a subgroup of correlated features as a single super-feature.
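A minimal sketch of the composite-feature approach, assuming the feature groups are known up front; the "hat"/"shoes" partition, the group mean as the composite, and the toy data are all hypothetical choices:

```python
# Collapse each group of correlated features into one composite feature
# (here simply the group mean) and train the tree on the composites.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
y = (X[:, 0] - X[:, 5] > 0).astype(int)

groups = {"hat": [0, 1, 2], "shoes": [3, 4, 5]}   # hypothetical partition
X_composite = np.column_stack(
    [X[:, cols].mean(axis=1) for cols in groups.values()]
)

tree = DecisionTreeClassifier(max_depth=3).fit(X_composite, y)
# Every split now refers to a whole subgroup rather than a single raw column.
```

Any other summary of a group (a principal component per group, a sum, a learned embedding) would work the same way; the point is only that the tree sees one column per subgroup.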

Nikos M.

This can be achieved, and it is already implemented in XGBoost. See a full description here.
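The answer does not say which XGBoost feature it means, but it most likely refers to feature interaction constraints, which restrict which features may appear together on the same root-to-leaf path. A hedged sketch using that parameter (the partition and toy data are made up):

```python
# Restrict each tree path to features from a single group using XGBoost's
# interaction_constraints parameter (nested list of feature indices).
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 6))
y = (X[:, 1] * X[:, 4] > 0).astype(int)

# Hypothetical partition: features 0-2 ("hat") and 3-5 ("shoes").
model = xgb.XGBClassifier(
    n_estimators=50,
    max_depth=3,
    interaction_constraints=[[0, 1, 2], [3, 4, 5]],
)
model.fit(X, y)
# Within any single path of any tree, only features from one of the two
# groups can be combined, which is close to the behaviour asked about.
```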