
If I know how long a single validation takes for a fixed set of parameter values, can I estimate how long GridSearchCV will take for the n values I want to cross-validate?

Ethan
Nathan Furnal

3 Answers


You could fit your model/pipeline (with default parameters) to your data once and see how long it takes to train. Then you would multiply that by how many times you want to train the model through grid search.

E.g. suppose you want to use a grid search to select the hyperparameters a, b and c of your pipeline.

params = {'a': [1, 2, 3, 4, 5],
          'b': [1, 2, 3, 4],
          'c': [1, 2, 3]}

cv = GridSearchCV(pipeline, params)

By default, this runs a search over a grid of $5 \cdot 4 \cdot 3 = 60$ different parameter combinations. At the time of writing, the default cross-validation was 3-fold, so the above code would train your model $60 \cdot 3 = 180$ times. If you run the search in parallel (e.g. with n_jobs=-1), you can divide the number of fits by the number of processing units available. Say, for example, I have 4 processors available: each processor would fit the model $180 / 4 = 45$ times. Now, if my model takes on average $10$ seconds to train, I estimate around $45 \cdot 10 / 60 = 7.5$ minutes of training time. In practice it will be closer to $8$ minutes due to overhead.
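The back-of-the-envelope arithmetic above can be written out directly (the fit time and processor count are the assumed numbers from the example, not measured values):

```python
# Estimate total GridSearchCV time from the grid size, fold count,
# parallelism and a measured single-fit time (all numbers assumed).
n_combos = 5 * 4 * 3     # parameter combinations in the grid -> 60
n_folds = 3              # 3-fold cross-validation
n_fits = n_combos * n_folds   # -> 180 fits in total
n_procs = 4              # processors fitting in parallel
fit_time = 10            # seconds per single fit, timed beforehand

est_minutes = n_fits / n_procs * fit_time / 60
print(est_minutes)       # -> 7.5 minutes, before overhead
```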

Finally, because some parameters heavily affect the algorithm's training time, I would suggest setting the max_iter argument whenever it is available, so that your estimate doesn't fall too far off.
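For instance, on an estimator that supports it (a sketch; LogisticRegression and the particular values are assumptions, not from the original answer):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

params = {'C': [0.1, 1, 10]}

# Capping max_iter bounds the cost of every fit, so the single-fit
# timing used for the estimate stays representative across the grid.
cv = GridSearchCV(LogisticRegression(max_iter=100), params)
```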

Please note: as of July 2021, the default number of folds is 5.

From the sklearn documentation: "Changed in version 0.22: cv default value if None changed from 3-fold to 5-fold."

Djib2011

Let the search complete; then you can use the cv_results_ attribute to compute the total elapsed time, as shown below.

import numpy as np
import pandas as pd

mean_fit_time = search_cv.cv_results_['mean_fit_time']
mean_score_time = search_cv.cv_results_['mean_score_time']
n_splits = search_cv.n_splits_  # number of cross-validation splits
n_candidates = pd.DataFrame(search_cv.cv_results_).shape[0]  # parameter combinations tried

print(np.mean(mean_fit_time + mean_score_time) * n_splits * n_candidates)
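On a tiny end-to-end example (a sketch assuming scikit-learn is installed; the iris dataset and LogisticRegression are stand-ins for your own search), the same computation looks like:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
search_cv = GridSearchCV(LogisticRegression(max_iter=500),
                         {'C': [0.1, 1.0]}, cv=3).fit(X, y)

mean_fit_time = search_cv.cv_results_['mean_fit_time']
mean_score_time = search_cv.cv_results_['mean_score_time']
n_splits = search_cv.n_splits_                               # folds per candidate
n_candidates = pd.DataFrame(search_cv.cv_results_).shape[0]  # grid size

total = np.mean(mean_fit_time + mean_score_time) * n_splits * n_candidates
print(f"approximate total search time: {total:.3f} s")
```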

Stephen Rauch

Here is a little function I made to estimate the GridSearchCV time, though I think it could be improved along the lines of @Naveen Vuppula's comment.

import timeit

import numpy as np

def estimate_gridsearch_time(model, param_grid: dict, cv: int = 5, processors: int = 6):
    # Time a handful of single fits (plus scoring) to get an average.
    # Assumes X_train and y_train are defined in the enclosing scope.
    times = []
    for _ in range(5):
        start = timeit.default_timer()
        model.fit(X_train, y_train)
        model.score(X_train, y_train)
        times.append(timeit.default_timer() - start)

    single_train_time = np.array(times).mean()  # seconds

    # Number of parameter combinations in the grid
    combos = 1
    for vals in param_grid.values():
        combos *= len(vals)

    num_models = combos * cv / processors
    seconds = num_models * single_train_time
    minutes = seconds / 60
    hours = minutes / 60

    print(hours, minutes, seconds)

ACB_prgm