
I have heard that decision trees can have a high amount of variance, and that for a data set $D$, split into test/train, the decision tree could be quite different depending on how the data was split. Apparently, this provides motivation for algorithms such as random forest.

Is this correct? Why does a decision tree suffer from high variability?

Edit:

Just a note: I do not really follow the current answer and have not been able to resolve this in the comments.

baxx

4 Answers


It is relatively simple if you understand what variance refers to in this context. A model has high variance if it is very sensitive to (small) changes in the training data.

A decision tree has high variance because, if you imagine a very large tree, it can basically adjust its predictions to every single input.

Suppose you wanted to predict the outcome of a soccer game. A decision tree could make decisions like:

IF

  1. player X is on the field AND
  2. team A has a home game AND
  3. the weather is sunny AND
  4. the number of attending fans >= 26000 AND
  5. it is past 3pm

THEN team A wins.

If the tree is very deep, it will get very specific and you may only have one such game in your training data. It probably would not be appropriate to base your predictions on just one example.

Now, if you make a small change e.g. set the number of attending fans to 25999, a decision tree might give you a completely different answer (because the game now doesn't meet the 4th condition).

Linear regression, for example, would not be so sensitive to a small change because it is limited ("biased" -> see bias-variance tradeoff) to linear relationships and cannot represent sudden changes from 25999 to 26000 fans.
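Here is a minimal sketch of that intuition (toy data with a made-up 26000-fan threshold, scikit-learn assumed; this is not from the original answer): a fully grown tree flips its prediction on a one-fan change, while a linear model barely moves.

```python
# Toy illustration: sensitivity of a fully grown tree vs. a linear model
# to a tiny change in one input feature.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
fans = np.concatenate([rng.uniform(0, 50000, 200), [25999.0, 26000.0]])
X = fans.reshape(-1, 1)                      # single feature: number of attending fans
y = (fans >= 26000).astype(int)              # toy label: 1 = "team A wins"

tree = DecisionTreeClassifier().fit(X, y)    # fully grown, unrestricted depth
lin = LinearRegression().fit(X, y)           # linear probability model for comparison

for n in (25999, 26000):
    x = [[float(n)]]
    print(n, "tree:", tree.predict(x)[0], "linear:", round(lin.predict(x)[0], 5))
# The tree's prediction flips from 0 to 1 across the one-fan change;
# the linear prediction changes only by a tiny amount.
```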

That's why it is important not to grow decision trees arbitrarily large/deep: limiting the depth limits their variance.

(See e.g. here for more on how random forests can help with this further.)

oW_

The point is that if your training data does not contain identical input features with different labels (i.e., the training data can be fit with $0$ Bayes error), a decision tree can learn it entirely, and that leads to overfitting, also known as high variance. This is why people usually prune trees, using cross-validation to choose the amount of pruning, to keep them from overfitting the training data.
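A minimal sketch of that idea (synthetic data, scikit-learn assumed; not part of the original answer): choose the cost-complexity pruning strength by cross-validation instead of growing the tree to full depth.

```python
# Pick the pruning strength ccp_alpha by cross-validation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"ccp_alpha": np.linspace(0.0, 0.05, 11)},
    cv=5,
)
grid.fit(X, y)
print("best ccp_alpha:", grid.best_params_["ccp_alpha"])
print("cross-validated accuracy:", round(grid.best_score_, 3))
```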

Decision trees are powerful classifiers. Algorithms such as bagging combine many such powerful classifiers into an ensemble whose aggregate prediction no longer has high variance. One way is to train each tree on a random subset of the features, as random forests do, so that the individual trees are decorrelated and the ensemble generalizes better. Another is to train each tree on a random sample drawn from the training data with replacement, i.e. bootstrapping.
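As a rough illustration of the variance reduction (synthetic data, scikit-learn assumed; my own sketch, not the answer's code), one can compare how much the prediction at a fixed point changes across independently drawn training sets for a single deep tree versus a bagged ensemble of trees:

```python
# Variance of the prediction at one fixed test point, over many training sets,
# for a single deep tree vs. a random forest (bagging + random feature subsets).
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
x_test = np.array([[0.5]])
tree_preds, forest_preds = [], []

for seed in range(50):                       # 50 independently drawn training sets
    X = rng.uniform(0, 1, (100, 1))
    y = np.sin(4 * X[:, 0]) + rng.normal(0, 0.3, 100)
    tree_preds.append(DecisionTreeRegressor().fit(X, y).predict(x_test)[0])
    forest_preds.append(
        RandomForestRegressor(n_estimators=100, random_state=seed)
        .fit(X, y).predict(x_test)[0]
    )

print("single tree prediction variance:", round(np.var(tree_preds), 4))
print("random forest prediction variance:", round(np.var(forest_preds), 4))
# The averaged ensemble typically shows noticeably lower variance at the same point.
```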

The reason that decision trees can overfit is their large VC dimension. Although it is not infinite, unlike 1-NN, it is very large, which leads to overfitting; in practice, it means you need a lot of data to avoid overfitting. For more on the VC dimension of decision trees, take a look at Are decision tree algorithms linear or nonlinear.

Green Falcon

I may be a bit late to the party. Yes, I would say this is correct.

Decision trees are prone to overfitting. Models that exhibit overfitting are usually non-linear and have low bias as well as high variance (see the bias-variance trade-off). Decision trees are non-linear; the question now is why they should have high variance.

In order to illustrate this, consider yourself in a time-series regression setting. The usual procedure is to choose your features and split your data into training and test sets. As features, we will use time-lagged values of the target. Moreover, let's assume the data exhibits a trend, meaning that the values are steadily growing. Unavoidably, the test set will contain larger values than the training data.

Now, you train a decision tree regressor on the training set. The decision tree finds splits based on the training data, i.e. it partitions the feature space into regions, and predicts a constant value within each region. In this sense, the model is bounded by the range of values seen during training, which is obviously smaller for the training set. Consequently, when testing the model on the test data, the decision tree is not able to predict the larger values: it "cuts off" at the highest values encountered in the training data. This shows that decision trees are sensitive to changes in the data and therefore have high variability.
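A minimal sketch of this effect (synthetic trending data, scikit-learn assumed; using the time index as the single feature for brevity rather than lagged values):

```python
# A tree trained on the earlier, smaller values cannot predict beyond
# the largest target it has seen; a linear model can extrapolate.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

t = np.arange(100, dtype=float)
y = 2.0 * t + np.random.default_rng(0).normal(0, 5, 100)   # upward trend
X = t.reshape(-1, 1)

X_train, y_train = X[:80], y[:80]            # earlier (smaller) values
X_test = X[80:]                              # later (larger) values

tree = DecisionTreeRegressor().fit(X_train, y_train)
lin = LinearRegression().fit(X_train, y_train)

print("max target seen in training:", round(y_train.max(), 1))
print("tree predictions on test:", np.round(tree.predict(X_test)[:5], 1))   # plateau
print("linear predictions on test:", np.round(lin.predict(X_test)[:5], 1))  # keep growing
```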

Chipsen

I do not fully agree with the answer from @oW_. Let's use linear regression as an example (and also assume the true model is a linear regression with Gaussian error).

Based on the normal equation, we could easily infer the model variance such as in this post: https://stats.stackexchange.com/a/306790

The variance of the model here means: given an input row $X^{*}$, how much would the prediction vary due to the true error? In layman's terms, if we collected the same data again (with $X$ unchanged), the observed outputs would differ because of the irreducible error; the question is then how the fitted model, and hence its prediction at $X^{*}$, would change.

I think we should keep the features unchanged, introduce small errors into the labels $y$, and see how the model's predictions vary. I put together a simple simulation as a proof of the idea: https://github.com/raingstar/BiasVarianceAnalysis/blob/main/Bias-Variance%20analysis.ipynb
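A minimal sketch of that procedure (synthetic data, scikit-learn assumed; this is my own simplification, not the linked notebook): keep $X$ fixed, redraw only the Gaussian noise in $y$, refit both a linear regression and a deep tree, and compare the variance of their predictions at a fixed point.

```python
# Fix X, redraw only the noise in y, refit, and measure prediction variance
# at a single query point x* for a linear model vs. a deep tree.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, (200, 1)), axis=0)   # features stay fixed
f_true = 3.0 * X[:, 0] + 1.0                       # true linear signal
x_star = np.array([[0.5]])

lin_preds, tree_preds = [], []
for _ in range(200):                               # redraw only the irreducible error
    y = f_true + rng.normal(0, 1.0, 200)
    lin_preds.append(LinearRegression().fit(X, y).predict(x_star)[0])
    tree_preds.append(DecisionTreeRegressor().fit(X, y).predict(x_star)[0])

print("variance of linear prediction at x*:", round(np.var(lin_preds), 4))
print("variance of tree prediction at x*:", round(np.var(tree_preds), 4))
```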

rain keyu