I'm building a random forest classifier using scikit-learn.

My problem is a 4-class classification task. The classes are distributed as follows (after splitting my data into a training set and a test set with an 80%/20% proportion):

y_train values
cautious_turn       386  # label and number of elements
aggressive_brake    356
cautious_brake      245
aggressive_turn     204

y_test values
cautious_turn       104
aggressive_brake     90
aggressive_turn      53
cautious_brake       51

The full dataset consists of 1489 samples. The training set is composed of 1191 samples and the test set of the remaining 298.
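The sizes above are consistent with an 80%/20% split of 1489 samples. Here is a minimal sketch that reproduces them, assuming scikit-learn's `train_test_split` with `test_size=0.2`. The actual split call is not shown in this question, so that is an assumption, and the feature matrix is only a placeholder. The per-class totals are the sums of the train and test counts listed above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Per-class totals (train + test counts from the tables above).
class_totals = {
    'cautious_turn': 490,     # 386 + 104
    'aggressive_brake': 446,  # 356 + 90
    'cautious_brake': 296,    # 245 + 51
    'aggressive_turn': 257,   # 204 + 53
}

# Build a label vector with those counts; X is a placeholder, not the real features.
y = np.repeat(list(class_totals), list(class_totals.values()))
X = np.zeros((len(y), 1))

# Assumed split call: 20% of the samples go to the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(len(y), len(y_train), len(y_test))
```

The per-class counts in each subset depend on the random seed, so they will not necessarily match the tables above; only the subset sizes are expected to agree.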

I'm trying to optimize my random forest hyperparameters, using RandomizedSearchCV from sklearn.

My code is the following (just an example):

from sklearn.model_selection import RandomizedSearchCV
import numpy as np
from pprint import pprint

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 1, stop = 150, num = 15)]

# Number of features to consider at every split
max_features = ['auto', 'sqrt']

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)

# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]

# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
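For reference, a grid like this is typically passed to `RandomizedSearchCV` roughly as follows. This is only a sketch under stated assumptions, not the actual pipeline: the data comes from `make_classification` as a stand-in for the real dataset, `n_iter=5` and `cv=3` are arbitrary small values chosen to keep it fast, and `'auto'` is left out of `max_features` because recent scikit-learn releases no longer accept it (for classifiers it used to mean the same as `'sqrt'`); `'log2'` is used here as a second option instead.

```python
import numpy as np
from pprint import pprint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Placeholder data: 4 classes and 1191 samples, standing in for the real training set.
X, y = make_classification(
    n_samples=1191, n_features=10, n_informative=6,
    n_classes=4, random_state=0,
)

# Same search space as above, except for max_features (see the note on 'auto').
random_grid = {
    'n_estimators': [int(x) for x in np.linspace(start=1, stop=150, num=15)],
    'max_features': ['sqrt', 'log2'],
    'max_depth': [int(x) for x in np.linspace(10, 110, num=11)] + [None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False],
}

# Sample n_iter random combinations from the grid and score each with 3-fold CV.
search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=random_grid,
    n_iter=5,        # assumption: small value for a quick run
    cv=3,            # assumption: 3-fold cross-validation
    random_state=0,
)
search.fit(X, y)

pprint(search.best_params_)
print(search.best_score_)
```

The full grid has 15 × 2 × 12 × 3 × 3 × 2 = 6480 combinations, so with a small `n_iter` only a tiny fraction of it is ever evaluated.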

So far, my code works perfectly and I have no problem.

My question is: is there an empirical approach for choosing a good initial search space for my hyperparameter values?

Right now I have just copied those values from a tutorial. Is there a way to decide, by looking at my data, what a good range of values would be for (for example) min_samples_split? Is there a methodology that would let me reduce the search space?

For example: I decided to search min_samples_leaf = [1, 2, 4] instead of min_samples_leaf = [10, 15, 20] because... (possible motivation here)

Mattia Surricchio