51

When I was reading about StandardScaler, most of the recommendations said that you should apply StandardScaler before splitting the data into train/test, but when I checked code posted online (using sklearn) I found two main usages.

Case 1: Using StandardScaler on all the data. E.g.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_fit = sc.fit(X)
X_std = X_fit.transform(X)

Or

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X)  # fit() returns the scaler itself, so its result must not overwrite X
X = sc.transform(X)

Or simply

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_std = sc.fit_transform(X)

Case 2: Using StandardScaler on split data.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

I would like to standardize my data, but I am confused about which approach is best.

blackraven
tsumaranaina

3 Answers

48

In the interest of preventing information about the distribution of the test set leaking into your model, you should go for option #2 and fit the scaler on your training data only, then standardise both training and test sets with that scaler. By fitting the scaler on the full dataset prior to splitting (option #1), information about the test set is used to transform the training set, which in turn is passed downstream.

As an example, knowing the distribution of the whole dataset might influence how you detect and process outliers, as well as how you parameterise your model. Although the data itself is not exposed, information about the distribution of the data is. As a result, your test set performance is not a true estimate of performance on unseen data. Some further discussion you might find useful is on Cross Validated.
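To make option #2 concrete, here is a minimal sketch of splitting first and fitting the scaler on the training rows only. The synthetic data, the split ratio, and the `_std` variable names are assumptions for illustration and do not come from the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely for illustration (not from the question).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

# Split first, so the test rows never influence the scaler.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)  # learn mean/std from the training rows only
X_test_std = sc.transform(X_test)        # reuse the training mean/std on the test rows

# Training columns are centred by construction; test columns are only roughly centred.
print(X_train_std.mean(axis=0))
print(X_test_std.mean(axis=0))
```

The test set columns will usually not have exactly zero mean and unit variance after this, and that is expected: they are scaled with the statistics the model saw during training, just as genuinely unseen data would be.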

redhqs
9

You shouldn't be doing fit_transform(X_test) on the test data.
The fit already occurred above.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
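As a rough illustration of why no second fit is needed: after fitting, the scaler stores the training statistics in its `mean_` and `scale_` attributes, and `transform` simply applies them. The tiny arrays below are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy arrays, assumed only for this illustration.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[4.0, 40.0]])

sc = StandardScaler()
sc.fit(X_train)  # stores the per-column training mean and standard deviation

# transform() is equivalent to applying the stored training statistics by hand.
manual = (X_test - sc.mean_) / sc.scale_
print(np.allclose(manual, sc.transform(X_test)))
```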
Ethan
starsini
-2

How about the following:

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

X_train = sc.fit_transform(X_train)

X_test = sc.fit_transform(X_test)

Because if I use X_test = sc.transform(X_test), it raises an error saying that X_test is not fitted yet. Or did I miss something here?
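For what it is worth, a small check along these lines suggests that the "not fitted" error appears only when `transform` is called on a scaler that was never fitted. Once `fit_transform(X_train)` has run on the same `sc` object, `sc.transform(X_test)` should work, whereas calling `fit_transform` on the test data replaces the training statistics. The arrays and the `unfitted`, `train_mean`, and `refit_mean` names are assumptions for this sketch.

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import StandardScaler

# Toy arrays, assumed only for this illustration.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[10.0], [20.0]])

# A scaler that was never fitted raises NotFittedError on transform().
unfitted = StandardScaler()
try:
    unfitted.transform(X_test)
    raised = False
except NotFittedError:
    raised = True
print("unfitted scaler raised:", raised)

# After fitting on the training data, transform() on the test data works.
sc = StandardScaler()
sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)
train_mean = sc.mean_.copy()

# Refitting on the test data overwrites the statistics learned from training.
sc.fit_transform(X_test)
refit_mean = sc.mean_.copy()
print(train_mean, refit_mean)
```

If the error still shows up in practice, one likely cause is that a new `StandardScaler()` was created between the two calls, so the object used for the test data had never seen the training data.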

Ethan