X = all features from the dataset

y = all targets from the dataset

X_train = training features produced by train_test_split

y_train = training targets produced by train_test_split
So my question is: which one should I use for hyperparameter tuning? I have imbalanced data, so I want to build a pipeline that contains SMOTE and the algorithm. I read that oversampling should be done on each fold of cross-validation. Since RandomizedSearchCV already cross-validates internally, I decided to run SMOTE inside the pipeline. But I am unsure which data I should fit after running the code:

fit(X, y) or fit(X_train, y_train)?
Ethan

2 Answers


The recommended approach is to use cross-validation on the training dataset (X_train, y_train) for hyperparameter tuning, with oversampling applied inside each fold of the cross-validation.

The code would look something like this:

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV, StratifiedKFold

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

pipeline = Pipeline([("smote", SMOTE()), ("rf", RandomForestClassifier())])

kf = StratifiedKFold()

# RandomizedSearchCV needs a search space; the values here are just examples
param_distributions = {"rf__n_estimators": [50, 100, 200], "rf__max_depth": [None, 5, 10]}

rscv = RandomizedSearchCV(estimator=pipeline, param_distributions=param_distributions, cv=kf)
rscv.fit(X_train, y_train)

Brian Spiering

Neither, strictly speaking:

  • Certainly not the whole dataset (X,y) because this would cause data leakage and invalidate the evaluation.
  • The training set (X_train, y_train) should be used only for training.

The solution is to split the training set into a training set and a validation set. Equivalently, you can use cross-validation on the full training set, since the CV process takes care of splitting the data.
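As a minimal sketch of that workflow (the classifier and synthetic data here are placeholders): cross-validation is run only on the training portion, and the test split stays untouched until the end.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# CV internally splits X_train into train/validation folds,
# so no separate validation set needs to be carved out by hand;
# X_test is reserved for the final evaluation only
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         X_train, y_train, cv=5)
```

Each entry of `scores` is the validation score of one fold; comparing their means across hyperparameter candidates is what a tool like RandomizedSearchCV automates.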

If interested, this is another explanation about using a validation set for parameter tuning.

If you resample (I don't recommend it, at least not without a good reason), you should never do it on the validation set or the test set.

Erwan