
Most real-world datasets have features with missing values. Replacing a missing value with an appropriate statistic, such as the feature's mean, is considered a good step in feature engineering. Sometimes we also standardize/normalize feature columns before feeding them in to train a model.

Before modelling we also split our dataset into training and testing sets.

My first question is: how do we do feature engineering on this split dataset?

Do we use the global mean of the unsplit feature to replace missing values of that feature in both the training and testing sets, or should we use each set's local mean?

Similar to the question above, how do we normalize the training and test datasets?

The last but an important question: in production we mostly get feature values one at a time (think a single row of features). How do we feature engineer such data rows?

Eka

1 Answer


The principle in supervised ML is quite simple: the "method" that is going to be used to predict the response variable must be fully determined from the training set, and only from the training set. In other words, anything that doesn't belong to the training set cannot be used.

As a consequence, feature engineering, i.e. choosing how to prepare/represent/normalize features must be done using only the training set. This includes any feature selection/extraction step.

Note that once the final data preparation process is fully determined, it can and should be applied in exactly the same way to the test set or in production. This means that, for instance, normalization does not involve recomputing any parameters: it reuses the ones calculated on the training set.
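The idea can be sketched in a few lines. This is a minimal illustration with made-up data, using pure Python rather than any particular library: the imputation value and the scaling parameters are learned from the training set only, then reused verbatim on the test set and on a single production row.

```python
# Hypothetical single feature column; None marks a missing value.
train = [1.0, 2.0, None, 4.0, 3.0]
test = [None, 5.0]

# 1. Learn the preparation parameters from the TRAINING set only.
observed = [x for x in train if x is not None]
train_mean = sum(observed) / len(observed)            # imputation value
train_std = (sum((x - train_mean) ** 2 for x in observed)
             / len(observed)) ** 0.5                  # scaling value

def prepare(column):
    """Impute with the training mean, then standardize with training stats."""
    imputed = [train_mean if x is None else x for x in column]
    return [(x - train_mean) / train_std for x in imputed]

train_prepared = prepare(train)
test_prepared = prepare(test)   # no parameters recomputed on the test set

# In production, a single incoming row goes through the same function:
row_prepared = prepare([2.5])
```

Using the test set's own mean instead would leak information about the test distribution into the preparation step and make the evaluation optimistic, which is exactly what the training-set-only rule prevents.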


Erwan