X = all features from the dataset

y = all targets from the dataset

X_train = training features produced by train_test_split

y_train = training targets produced by train_test_split
So my question is: which one should I use for hyperparameter tuning? I have imbalanced data, so I want to build a pipeline that contains SMOTE and the algorithm. I read that oversampling should be done on each fold of cross-validation. Since RandomizedSearchCV already cross-validates internally, I decided to run SMOTE inside the pipeline. But I am unsure which data I should fit after running the code:

fit(X, y) or fit(X_train, y_train)?
Ethan

2 Answers


The recommended approach is to use cross-validation on the training dataset (X_train, y_train) for hyperparameter tuning, with oversampling applied inside each fold of the cross-validation.

The code would look something like this:

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV, StratifiedKFold

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

pipeline = Pipeline([("smote", SMOTE()), ("rf", RandomForestClassifier())])

kf = StratifiedKFold()

# RandomizedSearchCV needs a search space; the values here are just examples
param_distributions = {"rf__n_estimators": [50, 100, 200], "rf__max_depth": [None, 5, 10]}

rscv = RandomizedSearchCV(estimator=pipeline, param_distributions=param_distributions, cv=kf)
rscv.fit(X_train, y_train)

Brian Spiering

Neither, strictly speaking:

  • Certainly not the whole dataset (X,y) because this would cause data leakage and invalidate the evaluation.
  • The training set (X_train, y_train) should be used only for training.

The solution is to split the training set into a training set and a validation set. Equivalently, you can use cross-validation on the full training set, since the CV process takes care of splitting the data.
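As a minimal sketch of that workflow (the classifier and synthetic data here are placeholders): cross-validation is run only on the training portion, and the test split stays untouched until the end.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# CV internally splits X_train into train/validation folds,
# so no separate validation set needs to be carved out by hand;
# X_test is reserved for the final evaluation only
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         X_train, y_train, cv=5)
```

Each entry of `scores` is the validation score of one fold; comparing their means across hyperparameter candidates is what a tool like RandomizedSearchCV automates.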

If interested, this is another explanation about using a validation set for parameter tuning.

If you resample (I don't recommend it, at least not without a good reason), you should never do it on the validation set or the test set.

Erwan