SMOTE train test split with validation data

Question

How should I use SMOTE? My dataset is imbalanced and the problem is multiclass. From what I have read in many posts, SMOTE should be applied only to the training set (X_train and y_train), not to the test set (X_test and y_test). I also use validation data. How do you handle SMOTE with validation data?

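import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

df = pd.read_excel...
X = df.drop('column1', axis=1)
y = df.column1
# Training part: hold out the test set, then oversample the training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=42)
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
# Validation part: split the already oversampled training data
X_train_smote, X_val, y_train_smote, y_val = train_test_split(X_train_smote, y_train_smote, test_size=0.5, random_state=42)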

Is this correct?

And is it right that the validation sets (X_val and y_val) both contain SMOTE-generated samples? Or should I create them from the normal train/test split instead: X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.5, random_state=42)? I'm confused.

score 1 · Accepted Answer · answered Oct 31 '20 at 13:09

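The problem with applying SMOTE to the data and then splitting (test or validation, it does not matter) is that you can suffer from data leakage: information from the training data spills over into the held-out data and falsely gives good predictions. SMOTE creates each synthetic sample by interpolating between existing minority-class samples, so a split made afterwards can put a synthetic point in one set and the real points it was built from in another. I would advise keeping the SMOTE data generation process separate for all three datasets. So do the splits first, then do the data generation.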
