
I want to build an n-ary decision tree with categorical features, using the ordinary ID3 algorithm.

Let's take the following dataset as the training set for building the tree:

doors  age  cost
1      4    0
2      4    1
3      5    0
3      6    1

The resulting decision tree looks like this:

[image: decision tree built from the dataset above]

Let's say that at test or production time an example arrives with doors=3 and age=4. Our tree cannot classify this example, even though it has seen examples where doors=3 and examples where age=4. My implementation throws an error that the value is missing, and sklearn's decision tree implementations are always binary trees. In general we cannot expect the training set to cover all possible combinations of feature values, so there can always be test examples with a combination of values that our tree cannot classify. How can this problem be solved, and what are some solutions for it?
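For concreteness, here is a minimal sketch (hypothetical, not my actual code) of the kind of n-ary tree that ID3 builds on this dataset and the lookup that fails:

```python
# n-ary tree built on the dataset above: ID3 splits on 'doors' first,
# and under doors=3 it splits on 'age' with branches only for 5 and 6.
tree = {
    "feature": "doors",
    "branches": {
        1: {"label": 0},
        2: {"label": 1},
        3: {"feature": "age",
            "branches": {5: {"label": 0}, 6: {"label": 1}}},
    },
}

def classify(node, example):
    if "label" in node:
        return node["label"]
    value = example[node["feature"]]
    if value not in node["branches"]:
        # Unseen value for this feature on this path: no branch to follow.
        raise KeyError(f"no branch for {node['feature']}={value}")
    return classify(node["branches"][value], example)

classify(tree, {"doors": 3, "age": 4})  # KeyError: no branch for age=4
```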

dzi
1 Answer


It looks like your features are not really categorical, at least not 'age': with categorical features the possible values are known at training time, so normally this case cannot happen (otherwise it means that the training set is not large enough, hence not representative).
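As a hypothetical sketch of how this plays out with truly categorical features (the fallback here is not stated in the answer; it is the classical ID3 convention): if each feature's full domain is fixed up front, every split can create a branch for every known value, and branches with no training rows fall back to the parent's majority class:

```python
# Assumed fixed domains for each categorical feature (hypothetical).
DOMAINS = {"doors": {1, 2, 3}, "age": {4, 5, 6}}

def majority(rows):
    """Most common label; ties are broken arbitrarily."""
    labels = [r["cost"] for r in rows]
    return max(set(labels), key=labels.count)

def split(rows, feature):
    """One branch per value in the feature's full domain, not just observed values."""
    return {value: [r for r in rows if r[feature] == value] or None
            for value in DOMAINS[feature]}

# Subset reaching the doors=3 node, where ID3 then splits on 'age':
rows = [{"doors": 3, "age": 5, "cost": 0}, {"doors": 3, "age": 6, "cost": 1}]
for value, subset in split(rows, "age").items():
    print(value, majority(subset) if subset else majority(rows))
# age=4 now gets the parent's majority class instead of raising an error.
```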

However, this can happen with numerical features. With a numerical feature, the condition at the 'age' node would not be `if age == 5 then ...` as in the categorical case; it would be, for instance, `if age < 5.5 then ...`. Such a condition can handle the case age=4.
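For instance, a quick sketch with sklearn's DecisionTreeClassifier, which treats both columns as numeric and learns threshold splits (the exact prediction depends on the splits it picks):

```python
from sklearn.tree import DecisionTreeClassifier

# Training data from the question: columns are (doors, age), target is cost.
X = [[1, 4], [2, 4], [3, 5], [3, 6]]
y = [0, 1, 0, 1]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# The unseen combination doors=3, age=4 is still classifiable, because the
# tree tests thresholds like "age <= 5.5" rather than equality "age == 5".
print(clf.predict([[3, 4]]))
```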

Erwan