
I already referred to this post here, but there is no answer.

I am working on a binary classification problem using a random forest classifier. My dataset has shape (977, 8) with a 77:23 class proportion. My system has 4 cores and 8 logical processors.

As my dataset is imbalanced, I used BalancedBaggingClassifier (with a random forest as the base estimator).

I then used GridSearchCV to identify the best parameters of the BalancedBaggingClassifier, fit the model, and predict.

My code looks like this:

from sklearn.model_selection import StratifiedKFold, GridSearchCV

# 10-fold stratified CV over the balanced-bagging hyperparameters, scored by F1
n_estimators = [100, 300, 500, 800, 1200]
max_samples = [5, 10, 25, 50, 100]
max_features = [1, 2, 5, 10, 13]
hyperbag = dict(n_estimators=n_estimators, max_samples=max_samples,
                max_features=max_features)
skf = StratifiedKFold(n_splits=10, shuffle=False)
gridbag = GridSearchCV(rf_boruta, hyperbag, cv=skf, scoring='f1',
                       verbose=3, n_jobs=-1)
gridbag.fit(ord_train_t, y_train)
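
Here rf_boruta is the BalancedBaggingClassifier built around the random forest. A rough sketch of that setup (the values are illustrative; note that the bagging argument is named estimator in recent imbalanced-learn releases and base_estimator in older ones):

from sklearn.ensemble import RandomForestClassifier
from imblearn.ensemble import BalancedBaggingClassifier

# Random forest (300 trees) used as the base estimator inside the balanced bagging
rf = RandomForestClassifier(n_estimators=300, random_state=42)
rf_boruta = BalancedBaggingClassifier(estimator=rf, random_state=42)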

However, the logs generated in the Jupyter console show that the GridSearchCV score is nan for some CV executions, as shown below.

You can see that for some of the CV executions the grid score is nan. Can you help me, please? It also keeps running for more than half an hour with no output yet.

Why does GridSearchCV return nan?

[CV 10/10] END max_features=1, max_samples=25, n_estimators=500;, score=nan total time= 4.5min
[CV 4/10] END max_features=1, max_samples=25, n_estimators=500;, score=0.596 total time=10.4min
[CV 5/10] END max_features=1, max_samples=25, n_estimators=500;, score=0.622 total time=10.4min
[CV 6/10] END max_features=1, max_samples=25, n_estimators=500;, score=0.456 total time=10.5min
[CV 9/10] END max_features=1, max_samples=25, n_estimators=500;, score=0.519 total time=10.5min
[CV 5/10] END max_features=1, max_samples=25, n_estimators=800;, score=nan total time= 3.3min
[CV 4/10] END max_features=1, max_samples=25, n_estimators=800;, score=nan total time= 9.9min
[CV 8/10] END max_features=1, max_samples=25, n_estimators=800;, score=nan total time= 7.0min
[CV 6/10] END max_features=1, max_samples=25, n_estimators=800;, score=nan total time=10.7min
[CV 1/10] END max_features=1, max_samples=25, n_estimators=800;, score=0.652 total time=16.4min
[CV 9/10] END max_features=1, max_samples=25, n_estimators=800;, score=nan total time= 7.6min
[CV 2/10] END max_features=1, max_samples=25, n_estimators=800;, score=0.528 total time=16.6min
[CV 3/10] END max_features=1, max_samples=25, n_estimators=800;, score=0.571 total time=16.4min
[CV 7/10] END max_features=1, max_samples=25, n_estimators=800;, score=0.553 total time=16.1min
[CV 4/10] END max_features=1, max_samples=25, n_estimators=1200;, score=nan total time= 6.7min
[CV 8/10] END max_features=1, max_samples=25, n_estimators=1200;, score=nan total time= 1.7min
[CV 10/10] END max_features=1, max_samples=25, n_estimators=800;, score=0.489 total time=16.0min
[CV 3/10] END max_features=1, max_samples=25, n_estimators=1200;, score=nan total time=18.6min
[CV 1/10] END max_features=1, max_samples=50, n_estimators=100;, score=0.652 total time= 2.4min

Update: error traceback showing the reason the fit fails

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
<timed exec> in <module>

~\AppData\Roaming\Python\Python39\site-packages\sklearn\model_selection\_search.py in fit(self, X, y, groups, **fit_params)
    889                 return results
    890
--> 891         self._run_search(evaluate_candidates)
    892
    893         # multimetric is determined here because in the case of a callable

~\AppData\Roaming\Python\Python39\site-packages\sklearn\model_selection\_search.py in _run_search(self, evaluate_candidates)
   1390     def _run_search(self, evaluate_candidates):
   1391         """Search all candidates in param_grid"""
-> 1392         evaluate_candidates(ParameterGrid(self.param_grid))
   1393
   1394

~\AppData\Roaming\Python\Python39\site-packages\sklearn\model_selection\_search.py in evaluate_candidates(candidate_params, cv, more_results)
    836                 )
    837
--> 838                 out = parallel(
    839                     delayed(_fit_and_score)(
    840                         clone(base_estimator),

~\AppData\Roaming\Python\Python39\site-packages\joblib\parallel.py in __call__(self, iterable)
   1052
   1053             with self._backend.retrieval_context():
-> 1054                 self.retrieve()
   1055             # Make sure that we get a last message telling us we are done
   1056             elapsed_time = time.time() - self._start_time

~\AppData\Roaming\Python\Python39\site-packages\joblib\parallel.py in retrieve(self)
    931         try:
    932             if getattr(self._backend, 'supports_timeout', False):
--> 933                 self._output.extend(job.get(timeout=self.timeout))
    934             else:
    935                 self._output.extend(job.get())

~\AppData\Roaming\Python\Python39\site-packages\joblib\_parallel_backends.py in wrap_future_result(future, timeout)
    540         AsyncResults.get from multiprocessing."""
    541         try:
--> 542             return future.result(timeout=timeout)
    543         except CfTimeoutError as e:
    544             raise TimeoutError from e

~\Anaconda3\lib\concurrent\futures\_base.py in result(self, timeout)
    443                 raise CancelledError()
    444             elif self._state == FINISHED:
--> 445                 return self.__get_result()
    446             else:
    447                 raise TimeoutError()

~\Anaconda3\lib\concurrent\futures\_base.py in __get_result(self)
    388         if self._exception:
    389             try:
--> 390                 raise self._exception
    391             finally:
    392                 # Break a reference cycle with the exception in self._exception

ValueError: The target 'y' needs to have more than 1 class. Got 1 class instead

The Great

1 Answer


First I want to make sure you know what you're building here. You're doing (balanced) bagging with between 100 and 1200 estimators, each of which is a random forest of 300 trees. So each model builds between $100\cdot300=30k$ and $1200\cdot300=360k$ trees. Your grid search has $5^3=125$ hyperparameter combinations, and 10 folds. So you're fitting on the order of $10^8$ individual trees.
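
Spelling that out: $125 \text{ combinations} \times 10 \text{ folds} \times (3\times10^{4}\text{ to }3.6\times10^{5}\text{ trees per fit}) \approx 4\times10^{7}\text{ to }4.5\times10^{8}$ trees in total.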

The grid search splits your data into 10 pieces, stratified so that the class balance should be the same as in the whole dataset. Now the balanced bagging is set to use only 25 rows, but it's also using the default "not minority" method, which means it tries to downsample only the majority class. Those two together are impossible, so I'm not really sure what ends up happening (if I have some time I'll look into that later). Since not all your scores are nan, it obviously sometimes works. But now those scarce 25 rows are used to train a random forest, so conceivably one of the trees there sometimes selects a bag with no examples from one of the classes. I suspect that's the issue.
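
As a rough sanity check on that suspicion (assuming each bag of 25 rows is drawn more or less independently from a training fold with about 23% minority class): the probability that a single bag contains no minority examples is about $0.77^{25} \approx 0.0015$, so with 800 bags per fit the chance that at least one bag ends up single-class is roughly $1-(1-0.0015)^{800} \approx 0.7$. That would also fit the pattern of nan scores appearing more often at larger n_estimators.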

The BalancedBaggingClassifier with a single decision tree as the base estimator acts as a fancier random forest, so that'd be my recommendation. You also wouldn't need to set class_weight in the tree, since the balanced bags will already be equally divided. I would expect better performance with larger max_samples, but even without changing that you can now expect ~12.5 rows of each class for each tree to build on. If you really want to balanced-bag random forests, then definitely increase the number of rows reaching each tree.
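
Something along these lines would be a starting point (a rough sketch with untuned, illustrative settings):

from sklearn.tree import DecisionTreeClassifier
from imblearn.ensemble import BalancedBaggingClassifier

# One decision tree per balanced bag -- effectively a balanced random forest.
# max_features='sqrt' in the tree gives the per-split feature subsampling
# that a random forest would use.
balanced_forest = BalancedBaggingClassifier(
    estimator=DecisionTreeClassifier(max_features='sqrt'),
    n_estimators=500,   # illustrative; tune with the grid search
    max_samples=1.0,    # each bag bootstraps the full training fold
    random_state=0,
)
balanced_forest.fit(ord_train_t, y_train)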

Ben Reiniger