Does the random forest implementation in scikit-learn use mean accuracy as the scoring method for its out-of-bag estimate of generalization error? The documentation does not mention it, but the score() method reports the mean accuracy.

I have a highly imbalanced dataset, and I am using ROC AUC as my scoring metric in the grid search. Is there a way to tell the classifier to use the same scoring method on the OOB samples as well?


1 Answer

In general, the performance of classifiers is compared using accuracy: the number of correctly classified instances divided by the total number of instances. However, when we use ensemble learning or bagging techniques, the training data itself can give us a better approximation of the expected error of our classifier.

Out-of-bag error

For each example $x_i$, this metric computes the prediction using only the trees in the random forest ensemble whose bootstrap samples omitted $x_i$ during training. Each example thus acts as a kind of semi-testing instance, and this metric gives you a sense of how well your classifier can generalize.
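
To make the mechanics concrete, here is a minimal hand-rolled sketch of the OOB idea (the synthetic dataset and all names here are illustrative, not sklearn's actual implementation):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=200, random_state=0)
n_samples, n_trees = len(X), 50

# votes[i, c] counts OOB votes for class c on example i
votes = np.zeros((n_samples, 2))
for _ in range(n_trees):
    boot = rng.randint(0, n_samples, n_samples)  # bootstrap sample indices
    oob = np.ones(n_samples, dtype=bool)
    oob[boot] = False                            # examples this tree never saw
    tree = DecisionTreeClassifier().fit(X[boot], y[boot])
    votes[np.flatnonzero(oob), tree.predict(X[oob])] += 1

covered = votes.sum(axis=1) > 0                  # OOB for at least one tree
oob_pred = votes[covered].argmax(axis=1)
print('Hand-rolled OOB accuracy:', np.mean(oob_pred == y[covered]))

Each example only receives votes from the trees that never saw it, which is exactly what makes the resulting accuracy an honest estimate.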

To enable the OOB estimate in sklearn you need to request it when creating your random forest object as

from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=100, oob_score=True)

Then we can train the model

forest.fit(X_train, y_train)
print('Score: ', forest.score(X_train, y_train))

Score: 0.979921928817

As expected, the accuracy of the model when evaluated on the training set is very high. However, this number is close to meaningless: you may very well be overfitting your data, in which case your model is rubbish. Instead, we can use the out-of-bag score as

print(forest.oob_score_)

0.86453272101

This is the accuracy obtained by evaluating each instance in the training set using only the trees for which it was omitted. Note that this also answers the original question: for classifiers, oob_score_ is a mean accuracy, just like score(). Now let's calculate the score on the testing set as

print('Score: ', forest.score(X_test, y_test))

Score: 0.86517733935

We see that the accuracy measured by OOB is very close to that obtained on the testing set. This agrees with the theory: the OOB accuracy is a better metric for evaluating the performance of your model than the training score alone. It is a by-product of bagging models and is not available for other types of classifiers.
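
For completeness, since X_train and X_test above were left undefined, here is a self-contained sketch that reproduces the comparison on synthetic data (make_classification and all parameter values are stand-ins for your own setup):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data standing in for your own dataset
X, y = make_classification(n_samples=2000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
forest.fit(X_train, y_train)

print('Train score:', forest.score(X_train, y_train))  # optimistic
print('OOB score:  ', forest.oob_score_)               # honest in-training estimate
print('Test score: ', forest.score(X_test, y_test))    # held-out check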

Calculating oob using different metrics

Yes, you can do this! However, it depends on how exactly your code is structured. I am not sure how you can combine the OOB estimate and AUC with the cross_val_score function. However, if you are doing the cross-validation folds manually, you can: the random forest algorithm in sklearn provides you the decision function of the OOB samples as

print(forest.oob_decision_function_)

The predicted class can then be obtained using

import numpy as np
from sklearn import metrics

pred_train = np.argmax(forest.oob_decision_function_, axis=1)

Then we can calculate the AUC using

metrics.roc_auc_score(y_train, pred_train)

0.86217157846471204
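
Two caveats on this last step. First, roc_auc_score is usually given scores or probabilities rather than hard class labels, so for a more informative AUC you would typically pass the positive-class column of the OOB decision function directly. Second, if you are running the folds by hand as mentioned above, you can compute this OOB AUC inside each fold. A sketch, assuming a binary problem and illustrative parameter values:

from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced data standing in for your own dataset
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, _) in enumerate(skf.split(X, y)):
    forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
    forest.fit(X[train_idx], y[train_idx])
    # Column 1 holds the OOB probability of the positive class;
    # ranking these gives a proper ROC AUC, unlike hard 0/1 labels.
    oob_proba = forest.oob_decision_function_[:, 1]
    print('Fold %d OOB AUC: %.4f' % (fold, metrics.roc_auc_score(y[train_idx], oob_proba)))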
