StandardScaler before or after splitting data - which is better?

Question

When I was reading about using StandardScaler, most of the recommendations said that you should use StandardScaler before splitting the data into train/test, but when I checked some of the code posted online (using sklearn), I found two major uses.

Case 1: Using StandardScaler on all the data. E.g.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_fit = sc.fit(X)
X_std = X_fit.transform(X)

Or

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X)  # fit() returns the scaler itself, so do not assign the result to X
X = sc.transform(X)

Or simply

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_std = sc.fit_transform(X)

Case 2: Using StandardScaler on split data.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

I would like to standardize my data, but I am confused about which approach is best.

score 48 · Accepted Answer · answered Sep 18 '18 at 17:06

To prevent information about the distribution of the test set from leaking into your model, go for Case 2: fit the scaler on your training data only, then standardise both the training and test sets with that scaler. If you fit the scaler on the full dataset before splitting (Case 1), information about the test set is used to transform the training set, and that information is then passed downstream.

As an example, knowing the distribution of the whole dataset might influence how you detect and process outliers, as well as how you parameterise your model. Although the data itself is not exposed, information about the distribution of the data is. As a result, your test set performance is not a true estimate of performance on unseen data. Some further discussion you might find useful is on Cross Validated.
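A minimal sketch of the split-then-scale order described above, assuming a synthetic dataset (the array `X`, the labels `y`, the split ratio and the random seeds are made up for illustration and are not from the question):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely for illustration (not from the question).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

# Split first, so the test rows never influence the scaler.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)  # learn mean/std from training rows only
X_test_std = sc.transform(X_test)        # reuse the training mean/std

# The scaler's statistics come from the training set alone.
print(np.allclose(sc.mean_, X_train.mean(axis=0)))

# Training columns have mean 0 after scaling; test columns are only close to 0.
print(X_train_std.mean(axis=0))
print(X_test_std.mean(axis=0))
```

The test set means will generally not be exactly zero, which is expected: the test rows are scaled with statistics they did not contribute to.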

score 9 · Answer 2 · edited Feb 08 '21 at 02:12

You shouldn't call fit_transform(X_test) on the test data.
The scaler was already fitted on the training data, so the test data only needs transform:

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
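To illustrate why the difference matters, here is a small sketch with toy arrays (the values are invented for this example). It compares `transform` with a second `fit_transform` on the test data; the second call re-estimates the mean and standard deviation from the test rows and overwrites what was learned from the training rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration: the test rows are shifted upwards.
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[10.0], [11.0], [12.0]])

sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
train_mean = sc.mean_.copy()            # mean learned from X_train

X_test_ok = sc.transform(X_test)        # scaled with the training mean/std

# Re-fitting on the test set overwrites the learned statistics.
X_test_refit = sc.fit_transform(X_test)
test_mean = sc.mean_.copy()

print(train_mean, test_mean)            # the two fits learned different means
print(X_test_ok.ravel())                # the shift relative to training is preserved
print(X_test_refit.ravel())             # re-centred around 0, the shift is hidden
```

With `transform`, the test values stay on the same scale as the training values, so a model sees that they are unusually large. With the re-fit, train and test are scaled by different parameters and are no longer comparable.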

answered Oct 10 '19 at 06:24 by starsini · edited Feb 08 '21 at 02:12 by Ethan

score -2 · Answer 3 · edited Jun 07 '19 at 04:21

How about the following:

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.fit_transform(X_test)

Because if I use X_test = sc.transform(X_test), it raises an error saying it is not fitted yet. Or did I miss something here?
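One plausible cause of such an error (an assumption, since the failing code is not shown) is calling `transform` on a scaler instance that was never fitted, for example a freshly created `StandardScaler()`. The sketch below, using made-up arrays, shows that case raising scikit-learn's `NotFittedError`, while the same call works on a scaler that was fitted on the training data first.

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import StandardScaler

# Toy arrays, invented for illustration.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0], [5.0]])

# Case A: a brand-new scaler has no learned mean/std, so transform fails.
unfitted = StandardScaler()
try:
    unfitted.transform(X_test)
    raised = False
except NotFittedError:
    raised = True
print(raised)

# Case B: fit on the training data first, then transform works on the test data.
sc = StandardScaler()
sc.fit(X_train)
X_test_std = sc.transform(X_test)
print(X_test_std.shape)
```

If the scaler used for `X_test` is the same object that was fitted on `X_train`, `transform` alone should not raise this error.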

answered Jun 06 '19 at 21:54 by user253546 · edited Jun 07 '19 at 04:21 by Ethan
