
I am confused about what I should take into account when trying to detect overfitting of a model.

Let's say I have a classification problem whose main metric is ROC-AUC. I split the data into train and test sets, perform cross-validation (CV) on the training set, and collect the average metric and the model with the best parameters. Then I use this model to predict on X_test.
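In code, the setup is roughly this (a simplified sketch; GradientBoostingClassifier, the parameter grid, and the split settings are stand-ins for my actual model and data):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# X, y = my feature matrix and labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

# Cross-validate on the training set and keep the best parameters.
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),              # placeholder estimator
    param_grid={"max_depth": [2, 4, 8], "n_estimators": [100, 300]},
    scoring="roc_auc",
    cv=5,
)
search.fit(X_train, y_train)
print("CV ROC-AUC:", search.best_score_)                     # mean over the validation folds

# The refit best model is used to score the held-out test set.
test_probs = search.predict_proba(X_test)[:, 1]
print("Test ROC-AUC:", roc_auc_score(y_test, test_probs))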

CV metric: ~0.75 ROC-AUC

Test: ~0.74 ROC-AUC

But when I do:

model = Model(**best_parameters)          # Model = whatever estimator class I am using
model.fit(X_train, y_train)
roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])

I get ROC-AUC = 1.0. Also, during cross-validation, the train-folds metric is 1.0.
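For the train-fold number, I look at something like this (again a sketch; the estimator is a placeholder for my actual model, and X_train, y_train come from the split above):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_validate

scores = cross_validate(
    GradientBoostingClassifier(random_state=0),   # placeholder for my actual model
    X_train, y_train,
    scoring="roc_auc",
    cv=5,
    return_train_score=True,
)
print("train folds:     ", scores["train_score"].mean())   # comes out as 1.0 for me
print("validation folds:", scores["test_score"].mean())    # ~0.75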

Does it mean the model is overfitted if my training metrics are 1.0? Or should I not judge by the training metrics at all? Should I also monitor the loss function?

Mario

1 Answer


But when I fit the model with the best parameters on X_train and then score predict_proba(X_train), I get ROC-AUC = 1.0.

Yes, this means the model is fitting your training data perfectly, and that is overfitting: the model is flexible enough that its predictions on the training set reproduce the training labels exactly.

Also, during cross-validation, the train-folds metric is 1.0.

This is strange. When cross-validating, the model is not trained and evaluated on the same data, so while the cross-validation results may be a little higher than your final test results, a gap this large should not occur. Are you sure this is what you mean? I read it as contradicting

CV metric: ~0.75 ROC-AUC

Test: ~0.74 ROC-AUC

A value of 0.75 would make sense for the cross-validation. The training score of 1.0 means the model is very flexible, so it can overfit the training data. Conventional wisdom is to reduce overfitting by making the model less flexible, which should bring the training error closer to the test error. But also see Can we use a model that overfits?.
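As a sketch of what making the model less flexible could look like (the estimator and the specific values below are only illustrative, not tuned for your data):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Constrain the model: shallow trees, fewer boosting rounds, row subsampling.
constrained = GradientBoostingClassifier(
    max_depth=2, n_estimators=100, subsample=0.8, random_state=0
)
constrained.fit(X_train, y_train)

train_auc = roc_auc_score(y_train, constrained.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, constrained.predict_proba(X_test)[:, 1])
# With a less flexible model, train_auc should drop below 1.0 and move
# closer to test_auc; the held-out score may stay the same or even improve.
print(train_auc, test_auc)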

Gijs