Initial value space for Random Forest hyperparameter tuning

Question

I'm building a Random Forest classifier using scikit-learn.

My problem is a 4-class classification task. After splitting my data into a training set and a test set (80%/20%), the classes are distributed as follows:

y_train values
cautious_turn       386  # label and number of elements
aggressive_brake    356
cautious_brake      245
aggressive_turn     204
y_test values
cautious_turn       104
aggressive_brake     90
aggressive_turn      53
cautious_brake       51

The full dataset consists of 1489 samples, of which 1191 are in the training set.
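As a quick sanity check on the numbers above, the class counts can be turned into proportions. This is only a sketch that re-enters the counts from the listing by hand; the variable names are illustrative and not part of my actual code.

```python
# Class counts copied by hand from the y_train / y_test listings above.
y_train_counts = {
    "cautious_turn": 386,
    "aggressive_brake": 356,
    "cautious_brake": 245,
    "aggressive_turn": 204,
}
y_test_counts = {
    "cautious_turn": 104,
    "aggressive_brake": 90,
    "aggressive_turn": 53,
    "cautious_brake": 51,
}

n_train = sum(y_train_counts.values())
n_test = sum(y_test_counts.values())
train_fraction = n_train / (n_train + n_test)

# Share of each class inside the training set, to see how imbalanced it is.
train_share = {label: count / n_train for label, count in y_train_counts.items()}

print(f"train: {n_train}, test: {n_test}, train fraction: {train_fraction:.3f}")
for label, share in train_share.items():
    print(f"{label:<18}{share:.3f}")
```

The smallest class still holds roughly a sixth of the training samples, so the imbalance is mild.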

I'm trying to optimize my random forest hyperparameters, using RandomizedSearchCV from sklearn.

My code is the following (just an example):

from sklearn.model_selection import RandomizedSearchCV
import numpy as np
from pprint import pprint

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start=1, stop=150, num=15)]
# Number of features to consider at every split
# (for classifiers 'auto' is an alias of 'sqrt'; it was deprecated in
# scikit-learn 1.1 and removed in 1.3)
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
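For context, here is a minimal sketch of how a grid like this is passed to RandomizedSearchCV. The data is a synthetic stand-in generated with make_classification (an assumption; my real features are not shown here), and the values of n_iter, cv and scoring are illustrative choices rather than settings I have validated. It uses 'log2' in place of 'auto' so that it also runs on recent scikit-learn releases.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the real training data (assumption):
# 4 classes and the same number of training samples as in the question.
X_train, y_train = make_classification(
    n_samples=1191, n_features=20, n_informative=8, n_classes=4, random_state=0
)

random_grid = {
    "n_estimators": [int(x) for x in np.linspace(start=1, stop=150, num=15)],
    "max_features": ["sqrt", "log2"],
    "max_depth": [int(x) for x in np.linspace(10, 110, num=11)] + [None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=random_grid,
    n_iter=5,            # number of sampled combinations (illustrative)
    cv=3,                # 3-fold cross-validation (illustrative)
    scoring="f1_macro",  # macro-averaged F1 weights the 4 classes equally
    random_state=0,
)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.best_score_)
```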

So far, my code works perfectly and I have no problem.

My question is: is there an empirical approach for choosing a good initial search space for my hyperparameter values?

Right now I have just copied those values from a tutorial. Is there a way to decide, by looking at my data, what a good range of values would be for (for example) min_samples_split? Is there a methodology that lets me reduce the "exploratory" space?

For example: I decided to search min_samples_leaf = [1, 2, 4] instead of min_samples_leaf = [10, 15, 20] because ... (possible motivation here)
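To make the size of the "exploratory" space concrete, the number of distinct combinations in the grid can be counted. This is a sketch in plain Python that rebuilds the same candidate lists as the np.linspace calls in my code; the n_iter value is only an assumed example (it is the scikit-learn default for RandomizedSearchCV).

```python
import math

# Same candidate values as in the random grid, rebuilt without NumPy.
random_grid = {
    # 15 evenly spaced values from 1 to 150, truncated to int
    "n_estimators": [int(1 + k * (150 - 1) / 14) for k in range(15)],
    "max_features": ["auto", "sqrt"],
    # 10, 20, ..., 110 plus None (no depth limit)
    "max_depth": list(range(10, 111, 10)) + [None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

# Total number of distinct combinations: the product of the list lengths.
grid_size = math.prod(len(values) for values in random_grid.values())

n_iter = 10  # assumed number of sampled combinations
coverage = n_iter / grid_size

print(f"combinations in the grid: {grid_size}")
print(f"fraction explored with n_iter={n_iter}: {coverage:.4%}")
```

With these lists the grid has 6480 combinations, so a search with a small n_iter only touches a tiny fraction of it, which is why I would like a principled way to shrink the ranges.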
