3

For lack of a better term, "overfit" here means a larger discrepancy between the training and validation scores, and "non-overfit" means a smaller discrepancy.

This "dilemma" just showed in neural network model I've recently working on. I trained the network with 10-fold cross-validation and got overfitted model (0.118 score difference):

  • 0.967 accuracy for training set and
  • 0.849 for validation set.

Then I added a dropout layer with a dropout rate of 0.3 after each hidden layer and got a "less overfitted" model (0.057 score difference):

  • 0.875 accuracy for training set and
  • 0.818 for validation set

which is supposedly good, since it has a lower discrepancy and should therefore be more reliable on unknown data. The problem is, it has a lower validation score. My uninformed intuition says that no matter how overfitted your model is, the validation score is what matters, because it indicates how well your model handles new data, so I would choose the first model.

Is that the right intuition? How should I handle this situation?

kneejar

2 Answers

3

What library are you using? Dropout is used during training to prevent overfitting.

Make sure dropout is not applied at validation time (this is the standard behaviour in Keras). If it were, it would artificially decrease your validation accuracy.
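
For example, in Keras the `Dropout` layer is only active when the training flag is set; `model.evaluate()` and `model.predict()` run in inference mode with dropout turned off. A quick way to convince yourself (a minimal sketch; the toy model, layer sizes and 0.5 rate are only there to make the effect visible):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Toy model with an aggressively high dropout rate so the effect is obvious.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    layers.Dense(8, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])

x = np.random.rand(5, 4).astype("float32")

# Dropout is applied only when training=True; evaluate()/predict() use
# training=False, so repeated inference calls give identical outputs.
print(model(x, training=True))   # varies from call to call (dropout active)
print(model(x, training=False))  # deterministic (dropout disabled)
print(model(x, training=False))
```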

Also, accuracy is a poor metric for evaluating your model's performance; see this answer to find out why. Try ROC-AUC instead.
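
For example, with scikit-learn you can compute it directly from predicted probabilities rather than hard class labels (a minimal sketch with made-up data standing in for one validation fold; in practice you would pass `model.predict(X_val)` as the scores):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Placeholder validation fold: true 0/1 labels and predicted probabilities.
rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=200)
val_probs = np.clip(y_val * 0.6 + rng.normal(0.2, 0.25, size=200), 0.0, 1.0)

# ROC-AUC is threshold-independent, so unlike accuracy it is not distorted
# by class imbalance or by the arbitrary 0.5 cutoff.
print("validation ROC-AUC:", roc_auc_score(y_val, val_probs))
```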

PascalIv
2

TLDR: I think you can do that as long as you understand why this is happening.

First, I think you should be really sure your validation set is not in any way polluted by your training data. This can sometimes happen very indirectly, and in that case you would still be at risk. Otherwise, there is nothing fundamentally wrong with using an overtrained predictor that still generalizes well enough.
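
On the leakage point, a crude check for its most direct form (exact duplicate rows appearing in both sets) could look like this (a minimal pandas sketch; `train_df` and `val_df` are placeholder frames, and subtler leakage such as features derived from the full dataset needs a domain-specific check):

```python
import pandas as pd

# Placeholder frames standing in for the training and validation folds.
train_df = pd.DataFrame({"age": [22, 35, 58, 41], "fare": [7.25, 53.1, 8.05, 13.0]})
val_df = pd.DataFrame({"age": [35, 19], "fare": [53.1, 7.75]})

# Rows that appear verbatim in both sets are an obvious red flag.
overlap = train_df.merge(val_df, how="inner")
print(f"{len(overlap)} exact duplicate rows shared between train and validation")
```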

Think of examples like the Titanic dataset. It's pretty small, so it's not hard to memorize every survivor in your training sample and still get the general trend right.

Another point to consider is how big your samples are. If they are small (maybe a few hundred data points), the differences you observe could largely be statistical noise.
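
To get a feel for how large that noise is, you can look at the binomial standard error of an accuracy estimate (a small sketch; the fold size of 300 and the 0.85 accuracy are made-up numbers in the same ballpark as the question):

```python
import math

n = 300     # hypothetical size of one validation fold
acc = 0.85  # hypothetical measured accuracy on that fold

# Standard error of a proportion estimated from n Bernoulli trials.
se = math.sqrt(acc * (1 - acc) / n)
print(f"standard error ≈ {se:.3f}, so a ±2·SE band is roughly ±{2 * se:.3f}")
# At n = 300 this is about ±0.04, which is larger than the 0.031 gap between
# the two validation accuracies in the question (0.849 vs 0.818).
```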

El Burro