9

I have a Lending Club dataset from Kaggle. It contains many different columns, for example dummy variables, years, the loan amount, etc. I want to normalize the data in the training and test sets, but I have to use the min and max of the training set to prevent data leakage from the test set. My question is: if the test set, or a new data point I want to predict on, contains a value greater than the training max or lower than the training min, and I normalize it using the values from the training set, is that correct? Can the model process this value normally?

This is the code I use to normalize:

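    from sklearn.preprocessing import MinMaxScaler

    scaler = MinMaxScaler()
    # Fit on the training set only, so test-set statistics do not leak in
    X_train = scaler.fit_transform(X_train)
    # Reuse the training min and max (X_test is assumed to hold the test features)
    X_test = scaler.transform(X_test)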

2 Answers

5

The minimum and maximum are just fixed constants in the formula that rescales the data, so if a new value is larger than the previously seen maximum (or smaller than the minimum), the feature scaling (normalization) is still appropriate; the result simply falls outside the [0, 1] range.
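To make this concrete, here is a minimal sketch of the min-max formula applied with training-set limits. The helper `min_max_scale` and the loan amounts are hypothetical, chosen only for illustration; this is not the scikit-learn implementation.

```python
def min_max_scale(value, train_min, train_max):
    """Scale with the training-set limits: (x - min) / (max - min)."""
    return (value - train_min) / (train_max - train_min)


# Hypothetical loan amounts seen in the training set.
train_amounts = [1000.0, 5000.0, 20000.0]
train_min, train_max = min(train_amounts), max(train_amounts)

inside = min_max_scale(5000.0, train_min, train_max)   # within the training range
above = min_max_scale(25000.0, train_min, train_max)   # larger than the training max
below = min_max_scale(500.0, train_min, train_max)     # smaller than the training min

# inside lands in [0, 1]; above is greater than 1; below is negative.
print(inside, above, below)
```

If values outside [0, 1] are a problem for a particular model, recent scikit-learn versions let you pass `clip=True` to `MinMaxScaler` to cap transformed values at the feature range, at the cost of losing the information that the value was out of range.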

An alternative is z-scores if you don't feel like using minimum and maximum values.

$x' = (x - \bar{x}) / \sigma$, where $x$ is the original feature vector, $\bar{x}$ is the mean of that feature vector, and $\sigma$ is its standard deviation.
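As a rough sketch of the z-score alternative, the mean and standard deviation can be estimated on the training values only and then reused for any new value. The function names and numbers below are hypothetical, and the population standard deviation is an assumption (the sample version would work similarly).

```python
import statistics


def fit_standardizer(train_values):
    """Estimate mean and (population) standard deviation on training data only."""
    mean = statistics.mean(train_values)
    std = statistics.pstdev(train_values)
    return mean, std


def standardize(value, mean, std):
    """Apply x' = (x - mean) / std using the training statistics."""
    return (value - mean) / std


# Hypothetical training values for a single feature.
train_values = [2.0, 4.0, 6.0, 8.0]
mean, std = fit_standardizer(train_values)

# A new value beyond the training maximum just gets a larger z-score;
# there is no fixed upper bound such as 1 to exceed.
z_new = standardize(12.0, mean, std)
print(mean, std, z_new)
```

In scikit-learn the equivalent pattern would be `StandardScaler` with `fit_transform` on the training set and `transform` on the test set.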

ebrahimi
wacax
5

In machine learning, you assume that the training and test sets follow the same distribution. If this assumption does not hold, your model won't be able to generalize properly.

Having said that, there is obviously a chance that a test-set feature has a value slightly larger than the max of that same feature in the training set. In that case, most ML models will handle that sample, whose normalized value is slightly higher than $1$, without any trouble.

What I want to emphasize, however, is that if the training set and the test set have significantly different distributions (most commonly due to a small dataset size), then no model will be able to generalize properly and it won't be a problem of normalization.
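One crude way to tell the two situations apart is to count how many test values fall outside the training range for a feature. The sketch below is only an illustrative heuristic under made-up data; `out_of_range_fraction` is a hypothetical helper, and a low fraction does not prove that the distributions match.

```python
def out_of_range_fraction(train_values, test_values):
    """Fraction of test values below the training min or above the training max."""
    lo, hi = min(train_values), max(train_values)
    outside = sum(1 for v in test_values if v < lo or v > hi)
    return outside / len(test_values)


# Hypothetical values for one feature.
train = [1000.0, 5000.0, 20000.0, 12000.0]
similar_test = [3000.0, 21000.0, 8000.0, 15000.0]    # one value slightly above the max
shifted_test = [30000.0, 45000.0, 18000.0, 60000.0]  # mostly beyond the training range

print(out_of_range_fraction(train, similar_test))   # 0.25
print(out_of_range_fraction(train, shifted_test))   # 0.75
```

A handful of out-of-range values is the harmless case described above; a large fraction suggests the sets differ in distribution, which normalization cannot fix.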

Djib2011