
I already referred to this post here, but there is no answer.

I am working on a binary classification problem using a random forest classifier. My dataset has shape (977, 8) with a 77:23 class proportion. My system has 4 cores and 8 logical processors.

As my dataset is imbalanced, I used BalancedBaggingClassifier (with a random forest as the base estimator).

I then used GridSearchCV to identify the best parameters of the BalancedBaggingClassifier, fit the model, and predict.

My code looks like this:

from sklearn.model_selection import StratifiedKFold, GridSearchCV

# 10-fold stratified CV over the balanced-bagging hyperparameters, scored by F1
n_estimators = [100, 300, 500, 800, 1200]
max_samples = [5, 10, 25, 50, 100]
max_features = [1, 2, 5, 10, 13]
hyperbag = dict(n_estimators=n_estimators, max_samples=max_samples,
                max_features=max_features)
skf = StratifiedKFold(n_splits=10, shuffle=False)
gridbag = GridSearchCV(rf_boruta, hyperbag, cv=skf, scoring='f1',
                       verbose=3, n_jobs=-1)
gridbag.fit(ord_train_t, y_train)
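
Here rf_boruta is the BalancedBaggingClassifier built around the random forest. A rough sketch of that setup (the values are illustrative; note that the bagging argument is named estimator in recent imbalanced-learn releases and base_estimator in older ones):

from sklearn.ensemble import RandomForestClassifier
from imblearn.ensemble import BalancedBaggingClassifier

# Random forest (300 trees) used as the base estimator inside the balanced bagging
rf = RandomForestClassifier(n_estimators=300, random_state=42)
rf_boruta = BalancedBaggingClassifier(estimator=rf, random_state=42)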

However, the logs generated in the Jupyter console show that the GridSearchCV score is nan for some CV executions, as shown below.

You can see that for some of the CV executions the grid score is nan. Can you help me, please? It also keeps running for more than half an hour with no output yet.

Why does GridSearchCV return nan?

[CV 10/10] END max_features=1, max_samples=25, n_estimators=500;, score=nan total time= 4.5min
[CV 4/10] END max_features=1, max_samples=25, n_estimators=500;, score=0.596 total time=10.4min
[CV 5/10] END max_features=1, max_samples=25, n_estimators=500;, score=0.622 total time=10.4min
[CV 6/10] END max_features=1, max_samples=25, n_estimators=500;, score=0.456 total time=10.5min
[CV 9/10] END max_features=1, max_samples=25, n_estimators=500;, score=0.519 total time=10.5min
[CV 5/10] END max_features=1, max_samples=25, n_estimators=800;, score=nan total time= 3.3min
[CV 4/10] END max_features=1, max_samples=25, n_estimators=800;, score=nan total time= 9.9min
[CV 8/10] END max_features=1, max_samples=25, n_estimators=800;, score=nan total time= 7.0min
[CV 6/10] END max_features=1, max_samples=25, n_estimators=800;, score=nan total time=10.7min
[CV 1/10] END max_features=1, max_samples=25, n_estimators=800;, score=0.652 total time=16.4min
[CV 9/10] END max_features=1, max_samples=25, n_estimators=800;, score=nan total time= 7.6min
[CV 2/10] END max_features=1, max_samples=25, n_estimators=800;, score=0.528 total time=16.6min
[CV 3/10] END max_features=1, max_samples=25, n_estimators=800;, score=0.571 total time=16.4min
[CV 7/10] END max_features=1, max_samples=25, n_estimators=800;, score=0.553 total time=16.1min
[CV 4/10] END max_features=1, max_samples=25, n_estimators=1200;, score=nan total time= 6.7min
[CV 8/10] END max_features=1, max_samples=25, n_estimators=1200;, score=nan total time= 1.7min
[CV 10/10] END max_features=1, max_samples=25, n_estimators=800;, score=0.489 total time=16.0min
[CV 3/10] END max_features=1, max_samples=25, n_estimators=1200;, score=nan total time=18.6min
[CV 1/10] END max_features=1, max_samples=50, n_estimators=100;, score=0.652 total time= 2.4min

Update: error traceback showing the reason the fit fails

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
<timed exec> in <module>

~\AppData\Roaming\Python\Python39\site-packages\sklearn\model_selection\_search.py in fit(self, X, y, groups, **fit_params)
    889                 return results
    890
--> 891         self._run_search(evaluate_candidates)
    892
    893         # multimetric is determined here because in the case of a callable

~\AppData\Roaming\Python\Python39\site-packages\sklearn\model_selection\_search.py in _run_search(self, evaluate_candidates)
   1390     def _run_search(self, evaluate_candidates):
   1391         """Search all candidates in param_grid"""
-> 1392         evaluate_candidates(ParameterGrid(self.param_grid))
   1393
   1394

~\AppData\Roaming\Python\Python39\site-packages\sklearn\model_selection\_search.py in evaluate_candidates(candidate_params, cv, more_results)
    836                 )
    837
--> 838                 out = parallel(
    839                     delayed(_fit_and_score)(
    840                         clone(base_estimator),

~\AppData\Roaming\Python\Python39\site-packages\joblib\parallel.py in __call__(self, iterable)
   1052
   1053             with self._backend.retrieval_context():
-> 1054                 self.retrieve()
   1055             # Make sure that we get a last message telling us we are done
   1056             elapsed_time = time.time() - self._start_time

~\AppData\Roaming\Python\Python39\site-packages\joblib\parallel.py in retrieve(self)
    931         try:
    932             if getattr(self._backend, 'supports_timeout', False):
--> 933                 self._output.extend(job.get(timeout=self.timeout))
    934             else:
    935                 self._output.extend(job.get())

~\AppData\Roaming\Python\Python39\site-packages\joblib\_parallel_backends.py in wrap_future_result(future, timeout)
    540         AsyncResults.get from multiprocessing."""
    541         try:
--> 542             return future.result(timeout=timeout)
    543         except CfTimeoutError as e:
    544             raise TimeoutError from e

~\Anaconda3\lib\concurrent\futures\_base.py in result(self, timeout)
    443                 raise CancelledError()
    444             elif self._state == FINISHED:
--> 445                 return self.__get_result()
    446             else:
    447                 raise TimeoutError()

~\Anaconda3\lib\concurrent\futures\_base.py in __get_result(self)
    388         if self._exception:
    389             try:
--> 390                 raise self._exception
    391             finally:
    392                 # Break a reference cycle with the exception in self._exception

ValueError: The target 'y' needs to have more than 1 class. Got 1 class instead

The Great

1 Answer


First I want to make sure you know what you're building here. You're doing (balanced) bagging with between 100 and 1200 estimators, each of which is a random forest of 300 trees. So each model builds between $100\cdot300=30k$ and $1200\cdot300=360k$ trees. Your grid search has $5^3=125$ hyperparameter combinations, and 10 folds. So you're fitting on the order of $10^8$ individual trees.
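
Spelling that out: $125 \text{ combinations} \times 10 \text{ folds} \times (3\times10^{4}\text{ to }3.6\times10^{5}\text{ trees per fit}) \approx 4\times10^{7}\text{ to }4.5\times10^{8}$ trees in total.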

The grid search splits your data into 10 pieces, stratified so that the class balance should be the same as in the whole dataset. Now the balanced bagging is set to use only 25 rows, but it's also using the default "not minority" method, which means it tries to downsample only the majority class. Those two together are impossible, so I'm not really sure what ends up happening (if I have some time I'll look into that later). Since not all your scores are nan, it obviously sometimes works. But now those scarce 25 rows are used to train a random forest, so conceivably one of the trees there sometimes selects a bag with no examples from one of the classes. I suspect that's the issue.
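
As a rough sanity check on that suspicion (assuming each bag of 25 rows is drawn more or less independently from a training fold with about 23% minority class): the probability that a single bag contains no minority examples is about $0.77^{25} \approx 0.0015$, so with 800 bags per fit the chance that at least one bag ends up single-class is roughly $1-(1-0.0015)^{800} \approx 0.7$. That would also fit the pattern of nan scores appearing more often at larger n_estimators.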

The BalancedBaggingClassifier with a single decision tree as the base estimator acts as a fancier random forest, so that'd be my recommendation. You also wouldn't need to set class_weight in the tree, since the balanced bags will already be equally divided. I would expect better performance with larger max_samples, but even without changing that you can now expect ~12.5 rows of each class for each tree to build on. If you really want to balanced-bag random forests, then definitely increase the number of rows reaching each tree.
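
Something along these lines would be a starting point (a rough sketch with untuned, illustrative settings):

from sklearn.tree import DecisionTreeClassifier
from imblearn.ensemble import BalancedBaggingClassifier

# One decision tree per balanced bag -- effectively a balanced random forest.
# max_features='sqrt' in the tree gives the per-split feature subsampling
# that a random forest would use.
balanced_forest = BalancedBaggingClassifier(
    estimator=DecisionTreeClassifier(max_features='sqrt'),
    n_estimators=500,   # illustrative; tune with the grid search
    max_samples=1.0,    # each bag bootstraps the full training fold
    random_state=0,
)
balanced_forest.fit(ord_train_t, y_train)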

Ben Reiniger