I would like to ask how to use SMOTE correctly. My dataset is imbalanced and the problem is multiclass. As I have read in many posts, SMOTE should be applied only to the training set (X_train and y_train), not to the test set (X_test and y_test). I also have validation data. How do you handle SMOTE with validation data?

import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

df = pd.read_excel...

X = df.drop('column1', axis=1)
y = df.column1

# Training part
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=42)

smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Validation part
X_train_smote, X_val, y_train_smote, y_val = train_test_split(X_train_smote, y_train_smote, test_size=0.5, random_state=42)

Is this correct?

And is it right that the validation sets (X_val and y_val) both contain SMOTE samples? Or should I create them from the normal train/test split instead: X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.5, random_state=42)? I'm confused.

martin

1 Answer

The problem with applying SMOTE to the data and then splitting (whether for the test or the validation set does not matter) is that you can suffer from data leakage. SMOTE builds synthetic samples by interpolating between existing points, so information from the training data can spill over into the held-out data and give falsely good predictions. I would advise keeping the SMOTE data generation process separate for all three data sets. So do the splits first, then do the data generation.

Noah Weber