
Fundamentally, under what circumstances is it reasonable to run HPO on only a subsample of the training set?

I am using Population Based Training to optimise hyperparameters for a sequence model. My dataset consists of 20M sequences, and I was wondering whether it would make sense to optimise on a subsample because of a restricted budget.

hH1sG0n3

1 Answer


Your subsample has to be representative of your original dataset.

To achieve this, since you are in a supervised setting, I would draw a random subsample that preserves the class distribution (for instance, by randomly taking 40% of each class).
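As a rough illustration of that idea, here is a minimal sketch using only the Python standard library. It assumes each sequence carries a single class label; the function name `stratified_subsample` and the toy label counts are made up for the example and do not come from the question.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, fraction, seed=0):
    """Return sorted indices of a random subsample that keeps each class's share.

    Hypothetical helper: `labels` is assumed to hold one class label per example.
    """
    rng = random.Random(seed)

    # Group example indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    chosen = []
    for indices in by_class.values():
        # Take the same fraction of every class, but at least one example.
        k = max(1, round(fraction * len(indices)))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


# Toy labels standing in for the real dataset.
labels = ["a"] * 1000 + ["b"] * 200 + ["c"] * 50
subset = stratified_subsample(labels, fraction=0.4)
```

The returned indices can then be used to select the matching sequences for the HPO runs, while the final model is still trained on the full set.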

Note
If some classes have very few examples, I would not subsample them at all. The risk is that, even with random sampling, you could lose information when a class is too small. Besides, if your constraint is computation time, keeping the small classes whole while subsampling the large ones costs very little.
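One possible way to apply this note is sketched below, again with the standard library only. The threshold `min_count` and the name `subsample_keep_rare` are assumptions for illustration; what counts as "too few" depends on your data. Keeping rare classes whole makes them over-represented relative to the original distribution, so this trades some representativeness for coverage.

```python
import random
from collections import defaultdict


def subsample_keep_rare(labels, fraction, min_count, seed=0):
    """Subsample large classes by `fraction`; keep classes of size <= min_count whole.

    Hypothetical helper: `min_count` is an illustrative threshold, not a recommended value.
    """
    rng = random.Random(seed)

    # Group example indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    chosen = []
    for indices in by_class.values():
        if len(indices) <= min_count:
            # Rare class: keep every example.
            chosen.extend(indices)
        else:
            # Large class: sample a fraction, but never go below min_count.
            k = max(min_count, round(fraction * len(indices)))
            chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


# Toy labels: class "c" is small enough to be kept in full.
labels = ["a"] * 1000 + ["b"] * 200 + ["c"] * 30
subset = subsample_keep_rare(labels, fraction=0.4, min_count=50)
```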

etiennedm