Is data subsampling appropriate for hyperparameter optimisation?

Question

Fundamentally, under what circumstances is it reasonable to run HPO on only a subsample of the training set?

I am using Population Based Training to optimise hyperparameters for a sequence model. My dataset consists of 20M sequences, and I was wondering whether it would make sense to optimise on a subsample because of my restricted budget.

score 1 · Answer 1 · answered Dec 16 '20 at 12:00

Your subsample has to be representative of your original dataset.

To achieve that in a supervised setting, I would draw a random subsample that preserves the class distribution, i.e. stratified sampling (for instance, randomly taking 40% of each class).
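
As a rough illustration of the idea, here is a minimal sketch using only the Python standard library. The function name `stratified_subsample` and the toy labels are made up for this example; it assumes one class label per training example and that the labels fit in memory.

```python
import random
from collections import Counter, defaultdict


def stratified_subsample(labels, fraction, seed=0):
    """Return sorted indices of a random subsample that keeps each class's share."""
    rng = random.Random(seed)

    # Group example indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Draw the same fraction from every class (at least one example each).
    chosen = []
    for indices in by_class.values():
        k = max(1, round(len(indices) * fraction))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


# Toy data: 900 examples of class "a", 100 of class "b".
labels = ["a"] * 900 + ["b"] * 100
subset = stratified_subsample(labels, fraction=0.4)
counts = Counter(labels[i] for i in subset)
print(counts)  # 360 of "a" and 40 of "b": the 9:1 ratio is preserved
```

The returned indices can then be used to select the matching sequences for the HPO runs, while the final model is still trained on the full dataset.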

Note
If some classes have very few examples, I would not subsample them at all. Even with random sampling, you risk losing information when a class (cluster) is too small. And if your concern is computation time, keeping the small classes whole costs little as long as you subsample the large ones.
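
A minimal sketch of that variant, again with only the standard library: the function name `subsample_keep_rare` and the `min_per_class` threshold are hypothetical choices for illustration, not a fixed recipe. Note that keeping rare classes whole changes the class proportions relative to the full dataset.

```python
import random
from collections import Counter, defaultdict


def subsample_keep_rare(labels, fraction, min_per_class, seed=0):
    """Subsample large classes, but keep classes with few examples whole."""
    rng = random.Random(seed)

    # Group example indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    chosen = []
    for indices in by_class.values():
        if len(indices) <= min_per_class:
            # Rare class: keep every example.
            chosen.extend(indices)
        else:
            # Large class: sample a fraction, but never drop below the threshold.
            k = max(min_per_class, round(len(indices) * fraction))
            chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


# Toy data: one large class and one rare class.
labels = ["common"] * 10_000 + ["rare"] * 30
subset = subsample_keep_rare(labels, fraction=0.1, min_per_class=50)
kept = Counter(labels[i] for i in subset)
print(kept)  # 1000 "common" examples and all 30 "rare" examples
```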
