
Which is the right approach to data normalization: before or after the train-test split?

Normalization before split

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import pandas as pd

normalized_X_features = pd.DataFrame(
    StandardScaler().fit_transform(X_features),
    columns=X_features.columns
)
x_train, x_test, y_train, y_test = train_test_split(
    normalized_X_features,
    Y_feature,
    test_size=0.20,
    random_state=4
)
LR = LogisticRegression(
    C=0.01,
    solver='liblinear'
).fit(x_train, y_train)

y_test_pred = LR.predict(x_test)

Normalization after split

x_train, x_test, y_train, y_test = train_test_split(
    X_features,
    Y_feature,
    test_size=0.20,
    random_state=4
)
normalized_x_train = pd.DataFrame(
    StandardScaler().fit_transform(x_train),
    columns = x_train.columns
)
LR = LogisticRegression(
    C=0.01,
    solver='liblinear'
).fit(normalized_x_train, y_train)

normalized_x_test = pd.DataFrame(
    StandardScaler().fit_transform(x_test),
    columns=x_test.columns
)
y_test_pred = LR.predict(normalized_x_test)

So far I have seen both approaches.

Tauno

5 Answers


Normalization across instances should be done after splitting the data between training and test set, using only the data from the training set.

This is because the test set plays the role of fresh unseen data, so it is not supposed to be accessible at the training stage. Using any information from the test set before or during training introduces a potential bias into the evaluation of performance.

[Precision thanks to Neil's comment] When normalizing the test set, apply the normalization parameters previously obtained from the training set as-is. Do not recalculate them on the test set: they would be inconsistent with the model, and this would produce wrong predictions.
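
To make those "parameters" concrete: a fitted StandardScaler stores the training mean and standard deviation in its mean_ and scale_ attributes, and transform simply reuses them. A minimal sketch (the toy arrays are mine, for illustration only):

import numpy as np
from sklearn.preprocessing import StandardScaler

x_train = np.array([[1.0], [2.0], [3.0], [4.0]])  # toy training data
x_test = np.array([[10.0]])                       # toy unseen data

scaler = StandardScaler().fit(x_train)  # learns mean_ and scale_ from the training set
print(scaler.mean_, scaler.scale_)      # [2.5] [1.11803399]

# transform() reuses the training statistics: (x - mean_) / scale_
print(scaler.transform(x_test))                  # [[6.70820393]]
print((x_test - scaler.mean_) / scaler.scale_)   # identical result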

Erwan

As @Erwan said, you should fit the normalization on the training set and then apply the same transformation to the test set. So your code should look like:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import pandas as pd

x_train, x_test, y_train, y_test = train_test_split(
    X_features,
    Y_feature,
    test_size=0.20,
    random_state=4
)

scaler = StandardScaler()

# Fit on the training set and transform it
normalized_x_train = pd.DataFrame(
    scaler.fit_transform(x_train),
    columns=x_train.columns
)

LR = LogisticRegression(
    C=0.01,
    solver='liblinear'
).fit(normalized_x_train, y_train)

# Reuse the training-set statistics on the test set: transform, not fit_transform
normalized_x_test = pd.DataFrame(
    scaler.transform(x_test),
    columns=x_test.columns
)
y_test_pred = LR.predict(normalized_x_test)

Jack

Answer to your question: do normalization after splitting into train and test/validation sets. The reason is to avoid any data leakage.

Data Leakage:

Data leakage is when information from outside the training dataset is used to create the model. This additional information can allow the model to learn or know something that it otherwise would not know, and in turn invalidate the estimated performance of the model being constructed.

You can read about it here: https://machinelearningmastery.com/data-leakage-machine-learning/
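
To make the leak concrete, here is a minimal sketch of the two orderings, using the question's X_features (nothing else is assumed):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Leaky: the scaler is fitted on all rows, so test-set statistics
# influence how the training data is transformed.
X_all_scaled = StandardScaler().fit_transform(X_features)
x_train_leaky, x_test_leaky = train_test_split(X_all_scaled, random_state=4)

# Correct: split first, then fit the scaler on the training rows only.
x_train, x_test = train_test_split(X_features, random_state=4)
scaler = StandardScaler().fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)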

Aman Mathur

Sklearn's own documentation indicates that:

"Note We here choose to illustrate data leakage with a feature selection step. This risk of leakage is however relevant with almost all transformations in scikit-learn, including (but not limited to) StandardScaler, SimpleImputer, and PCA."

"As with any other type of preprocessing, feature selection should only use the training data. Including the test data in feature selection will optimistically bias your model."

"10.2.2. How to avoid data leakage Below are some tips on avoiding data leakage:

Always split the data into train and test subsets first, particularly before any preprocessing steps."

https://scikit-learn.org/stable/common_pitfalls.html
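
That same page recommends wrapping the preprocessing and the model in a Pipeline, so the correct ordering is enforced automatically: fit only ever sees training data, and in cross-validation the scaler is refitted on each training fold. A minimal sketch along those lines (the breast-cancer loader is just an example dataset):

from sklearn.datasets import load_breast_cancer  # example dataset
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=4)

# The pipeline fits the scaler on the training data only; predict/score
# then apply the stored statistics to new data automatically.
pipe = make_pipeline(StandardScaler(), LogisticRegression(C=0.01, solver='liblinear'))
pipe.fit(x_train, y_train)
print(pipe.score(x_test, y_test))

# In cross-validation the scaler is refitted on each training fold,
# so no fold ever leaks into its own validation split.
print(cross_val_score(pipe, x_train, y_train, cv=5))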


I think it doesn't matter whether you do it before or after, since data leakage is only possible when classification results or other output information somehow flow into the model's inputs.

But since you are applying normalization to the input features and not to the output, no leakage could possibly happen.

Ethan