
I was trying to use feature importances from Random Forests to perform some empirical feature selection for a regression problem where all the features are categorical and many of them have a lot of levels (on the order of 100-1000). Since one-hot encoding creates a dummy variable for each level, the importances are reported per level rather than per original feature (column). What is a good way to aggregate these feature importances?

I thought about summing or averaging the importances over all levels of a feature (the former will probably be biased towards features with more levels). Are there any references on this issue?

What else can one do to reduce the number of features? I am aware of the group lasso, but I could not find an easy-to-use implementation for scikit-learn.

jncraton
user90772

1 Answer


It depends on how you're one-hot encoding them. Many automated tools name the resulting boolean columns with a consistent pattern, so a categorical variable called "letter" with values A-Z would end up as:

letter_A, letter_B, letter_C, letter_D,....

Once you've computed the feature importances, you'll have an array of feature names and their associated weights/importances. You can then scan that array and sum the importance weights of every column whose name starts with the "letter_" prefix, as in the sketch below.
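For example, with pandas and scikit-learn, a minimal sketch of that prefix-based aggregation could look like the following. The data, column names, and target values are placeholders for illustration; adapt the prefix matching to however your encoder names its dummy columns.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy data, purely for illustration: two categorical columns and a numeric target.
X = pd.DataFrame({
    "letter": ["A", "B", "C", "A", "B", "C", "A"],
    "color":  ["red", "blue", "red", "green", "blue", "red", "green"],
})
y = [1.0, 2.0, 1.5, 3.0, 2.5, 1.2, 2.8]

# pd.get_dummies names the dummy columns "<feature>_<level>" by default.
X_encoded = pd.get_dummies(X)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_encoded, y)

# One importance value per dummy column.
importances = pd.Series(model.feature_importances_, index=X_encoded.columns)

# Aggregate by prefix: sum the importances of every dummy column that
# belongs to the same original categorical feature.
aggregated = {
    feature: importances[[c for c in X_encoded.columns if c.startswith(feature + "_")]].sum()
    for feature in X.columns
}
print(sorted(aggregated.items(), key=lambda kv: kv[1], reverse=True))
```

Replacing `.sum()` with `.mean()` gives the per-level average the question mentions, which is less biased towards high-cardinality features.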

CalZ