
Among my colleagues I have noticed a curious insistence on training with, say, 70% or 80% of the data and validating on the remainder. It is curious to me because there is no theoretical reasoning behind it, and it smacks of a habit carried over from five-fold cross-validation.

Is there any reason for choosing a larger training set when attempting to detect overfitting during training? In other words, why not use $n^{0.75}$ for training and $n - n^{0.75}$ for validation if the influence really is from cross-validation practices carried over from linear modeling theory as I suggest in this answer?
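
For concreteness, here is a minimal sketch (plain Python; the dataset sizes are made up purely for illustration) of how an $n^{0.75}$ split compares with the conventional 80/20 split as $n$ grows:

```python
# Hypothetical comparison of an n^0.75 training split vs. the usual 80/20 split.
# The dataset sizes below are illustrative only.
for n in (1_000, 100_000, 10_000_000):
    n_train = int(n ** 0.75)          # training-set size under the n^0.75 rule
    n_val = n - n_train               # everything else goes to validation
    print(f"n={n:>10,}  train={n_train:>8,}  validate={n_val:>10,}  "
          f"(80/20 would train on {int(0.8 * n):,})")
```

The training fraction $n^{0.75}/n = n^{-0.25}$ shrinks as $n$ grows, which is exactly the opposite of the fixed-percentage habit.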

I posted a similar question on stats.stackexchange.com but, based on the response, thought I might have a more interesting discussion here. The concept of training for multiple epochs is, in my opinion, inherently Bayesian, and thus cross-validation may be unnecessary at best and ill-suited at worst, for reasons I suggest in that post.

tdMJN6B2JtUe

1 Answer


The usual reasoning is: "the more data for training, the better." You also have to keep in mind that the validation/hold-out set should resemble the data the model will face in production/testing. The theory is that the larger the training set, the better the model should generalize.

The validation set can be much smaller; on an extremely large dataset you can make it as little as 0.01% of the data and there should be no problem.
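
A minimal sketch of that point, assuming scikit-learn's `train_test_split` and a synthetic dataset standing in for a large one:

```python
# Sketch: a 0.01% validation split on a (synthetic) 10-million-row dataset.
# The features and labels are placeholders; only the split sizes matter here.
import numpy as np
from sklearn.model_selection import train_test_split

n = 10_000_000
X = np.arange(n, dtype=np.float32).reshape(-1, 1)   # placeholder features
y = np.zeros(n, dtype=np.int8)                      # placeholder labels

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.0001, random_state=0
)
print(len(X_train), len(X_val))   # 9,999,000 training rows, 1,000 validation rows
```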

In basic cases you don't even need k-fold cross-validation; it makes training more expensive and is really only needed for hyperparameter search, where the folds should be drawn from within the training set.
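
As an illustration of keeping k-fold inside the training set for hyperparameter search only (a sketch assuming scikit-learn; the model and parameter grid are arbitrary choices):

```python
# Sketch: 5-fold CV runs only inside the training set to pick a hyperparameter;
# the held-out validation set stands in for production data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1_000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,                      # folds are taken only from X_train / y_train
)
search.fit(X_train, y_train)

# The untouched validation set estimates generalization of the chosen model.
print(search.best_params_, search.score(X_val, y_val))
```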

For your case, you can use whatever split you want. Just keep the training set as large as possible and make the validation data resemble the production environment as closely as possible.

Carlos Mougan