Difference between train, test split before preprocessing and after preprocessing

Question

I am new to machine learning and a bit confused about preprocessing. Generally, I see two approaches:

Scenario-1: Split the dataset into train, test and validation sets first, then apply the transformations, i.e. fit_transform on train and transform on test.

Scenario-2: Apply the transformations to the entire dataset first, then split it into train, test and validation sets. I am unsure which to choose: should I split the data before or after preprocessing and feature engineering? I am looking for a good answer that explains the causes and effects.

score 1 · Accepted Answer · answered Apr 06 '19 at 18:45

You should absolutely adopt the first scenario. The transformers you use have parameters (e.g. the mean and standard deviation in the case of StandardScaler), and these parameters are learned from the data, just like the parameters of your machine learning model. You should not use the validation and test data to learn the model parameters, and for the same reason you should not use them to learn the transformer parameters. Otherwise, information about the held-out data leaks into training, and the evaluation scores can become optimistically biased. So, in a realistic machine learning workflow, fit your transformers on the training samples only, and then apply the fitted transformers to the validation and test data.
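As a minimal sketch of the first scenario, assuming scikit-learn's StandardScaler and train_test_split (the synthetic data, the 80/20 split and the variable names are illustrative assumptions, not something from the question), it could look like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely for illustration (assumed, not from the question).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y = rng.integers(0, 2, size=100)

# Scenario 1: split first ...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# ... then learn the scaler parameters from the training rows only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fits mean_ and scale_ on train
X_test_scaled = scaler.transform(X_test)        # reuses the train statistics

# Scenario 2, shown only for contrast: fitting on all rows lets the
# test rows influence the learned mean and standard deviation.
leaky_scaler = StandardScaler().fit(X)

print("mean learned from train only:", scaler.mean_)
print("mean learned from all data:  ", leaky_scaler.mean_)
```

A validation set would be handled the same way as the test set here: call transform on it with the scaler that was fitted on the training data, never fit or fit_transform.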
