I have a data set with ordinal features. Each feature has 6 or 7 levels. From what I have read, R's rpart treats ordinal and nominal factors differently: https://stats.stackexchange.com/questions/94502/decision-tree-splitting-factor-variables

But now I'm implementing the decision tree in Python, and there is nothing comparable to rpart for handling ordinal data. It seems sklearn does not handle categorical data well, so I would have to use one-hot encoding. In that case, the order of the levels (level 1 < level 2 < level 3 < ... < level 6) is lost.

https://stackoverflow.com/questions/38108832/passing-categorical-data-to-sklearn-decision-tree

Any suggestions? Thanks.

newleaf

3 Answers

Decision trees treat ordinal variables exactly like numerical ones: a split is a threshold on the value, so only the order of the levels matters, not the spacing between them. (And so, you might as well encode them as consecutive integers.)
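A minimal sketch of that encoding with scikit-learn, assuming the levels are stored as strings named "level1" to "level6" (the level names and the toy data are made up for illustration). The category order is passed to `OrdinalEncoder` explicitly, because its default ordering is alphabetical rather than semantic.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical level names, listed from lowest to highest.
levels = [f"level{i}" for i in range(1, 7)]

# Toy data: two ordinal features; the label is 1 when the first
# feature is level4 or higher.
rng = np.random.default_rng(0)
X_raw = rng.choice(levels, size=(200, 2))
y = (np.array([levels.index(v) for v in X_raw[:, 0]]) >= 3).astype(int)

# One ordered list per feature, so level1 -> 0, ..., level6 -> 5.
encoder = OrdinalEncoder(categories=[levels, levels])
X = encoder.fit_transform(X_raw)

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# The root split is a threshold on the encoded value. Here it falls
# between level3 (2.0) and level4 (3.0), so the order is preserved.
root_feature = tree.tree_.feature[0]
root_threshold = tree.tree_.threshold[0]
print(root_feature, root_threshold)
```

Any other strictly increasing codes (say 1, 2, 4, 8, ...) would lead to the same partition of the training rows, which is why the spacing between the integers does not matter.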

As for (unordered) categorical variables, LightGBM (and maybe H2O's GBM?) supports the optimal rpart-style splits [using the response-ordering trick when suitable, else trying all splits when not too expensive]. If you want a single decision tree, just set hyperparameters accordingly.
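To make the bracketed remark concrete, here is a rough pure-Python sketch of the two strategies for one unordered categorical feature with a binary 0/1 response. It is not LightGBM's or rpart's actual code; the function names and the toy data are invented for illustration. Exhaustive search tries every two-way grouping of the k levels, which is 2^(k-1) - 1 candidates. The response-ordering trick sorts the levels by mean response and only tries the k - 1 cut points along that order, which for a binary response and Gini impurity is known to reach the same best split.

```python
from itertools import combinations

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def split_impurity(xs, ys, left_levels):
    """Weighted Gini impurity after sending `left_levels` to the left child."""
    left = [y for x, y in zip(xs, ys) if x in left_levels]
    right = [y for x, y in zip(xs, ys) if x not in left_levels]
    n = len(ys)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_split_exhaustive(xs, ys):
    """Try every two-way grouping of the levels."""
    levels = sorted(set(xs))
    first, rest = levels[0], levels[1:]
    # Keep `first` on the left to skip mirror-image duplicates.
    candidates = [
        frozenset((first,) + subset)
        for r in range(len(rest))
        for subset in combinations(rest, r)
    ]
    best = min(split_impurity(xs, ys, s) for s in candidates)
    return best, len(candidates)

def best_split_ordered(xs, ys):
    """Sort levels by mean response, then try only the cut points."""
    levels = sorted(set(xs))
    mean = {
        lv: sum(y for x, y in zip(xs, ys) if x == lv) / xs.count(lv)
        for lv in levels
    }
    ordered = sorted(levels, key=lambda lv: mean[lv])
    candidates = [frozenset(ordered[:i]) for i in range(1, len(ordered))]
    best = min(split_impurity(xs, ys, s) for s in candidates)
    return best, len(candidates)

# Toy data: five unordered levels, three rows each.
xs = ["a"] * 3 + ["b"] * 3 + ["c"] * 3 + ["d"] * 3 + ["e"] * 3
ys = [1, 1, 0,  0, 0, 0,  1, 1, 1,  0, 1, 0,  1, 0, 1]

exhaustive_score, n_exhaustive = best_split_exhaustive(xs, ys)
ordered_score, n_ordered = best_split_ordered(xs, ys)
print(n_exhaustive, n_ordered)  # 15 candidate splits versus 4
```

For an ordinal feature the order is given in advance, so the k - 1 cut points are all a tree ever needs to consider, with no sorting by response.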

See also:
Why decision tree needs categorical variable to be encoded?
Ordinal Attributes in a Decision Tree

Ben Reiniger

First, one-hot encoding with tree-based models is usually a bad idea. It explodes the feature space and often gives worse results than ordinal encoding. Try this library, which offers many encoding schemes for categorical data: you can quickly implement each encoding and see for yourself how performance differs between them.
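A quick way to see the feature-space growth, sketched with scikit-learn's built-in encoders rather than a dedicated encoding library, on made-up data (three features with six levels each; the level names are assumptions). This only shows the change in shape. Whether accuracy actually suffers depends on the data, so it is worth cross-validating both encodings on your own problem.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical level names, listed from lowest to highest.
levels = [f"level{i}" for i in range(1, 7)]

# Toy data: 30 rows, 3 features, every level present in every column.
X_raw = np.array([[lv, lv, lv] for lv in levels] * 5)

# Ordinal encoding keeps one column per feature.
X_ordinal = OrdinalEncoder(categories=[levels] * 3).fit_transform(X_raw)

# One-hot encoding creates one column per (feature, level) pair.
X_onehot = OneHotEncoder(categories=[levels] * 3).fit_transform(X_raw)

print(X_ordinal.shape)  # (30, 3)
print(X_onehot.shape)   # (30, 18)
```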

And yes, I share your view on nominal vs. ordinal variables. I'm having trouble with nominal data as well (I posted about that recently), because the existing encodings impose an arbitrary order on the levels, which is wrong since nominal levels have no order.

Blenz

Are you bound to use decision trees? I don't think trees handle ordinal and categorical features differently; at least, I'm not aware of any ordinal tree-like implementation. See also: https://datascience.stackexchange.com/a/14038/71442

I have worked with ordinal probit/logit and generalized ordinal probit/logit models in Stata in the past. These models may be worth a look (I'm not sure what exactly you are doing). There are also Python implementations of such models, e.g.: https://pythonhosted.org/mord/

If you are interested in generalized methods, have a look here: https://www.stata-journal.com/article.html?article=st0097

Peter