46

I am doing a project on an author identification problem. I applied tf-idf normalization to the training data and then trained an SVM on that data.

Now, when using the classifier, should I normalize the test data as well? I feel that the basic aim of normalization is to make the learning algorithm give more weight to the more important features while learning. So once the model has been trained, it already knows which features are important and which are not. Is there any need to apply normalization to the test data as well?

I am new to this field, so please excuse me if the question appears silly.

Alaa
  • 63
  • 7
Kishan Kumar
  • 685
  • 2
  • 7
  • 11

2 Answers

73

Yes, you need to apply normalisation to the test data, if your algorithm works with or needs normalised training data*.

That is because your model works on the representation given by its input vectors. The scale of those numbers is part of the representation. This is a bit like converting between feet and metres: a model or formula would normally work with just one unit.

Not only do you need normalisation, but you should apply the exact same scaling as for your training data. That means storing the scale and offset used with your training data and applying them again. A common beginner mistake is to normalise the train and test data separately.

In Python and SKLearn, you might normalise your input/X values using the Standard Scaler like this:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
train_X = scaler.fit_transform(train_X)  # learns mean/std from the training data, then scales it
test_X = scaler.transform(test_X)        # reuses those learned parameters

Note how the conversion of train_X uses a function which fits (figures out the parameters) and then normalises, whilst the test_X conversion just transforms, using the same parameters it learned from the training data.

The tf-idf normalisation you are applying should work similarly, since it learns some parameters from the data set as a whole (the frequency of words across all documents), as well as using ratios found within each document.
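To make that concrete, here is a minimal sketch of the same fit/transform pattern with scikit-learn's TfidfVectorizer; the document lists are hypothetical placeholders for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpora standing in for your author-identification data
train_docs = ["the cat sat on the mat", "the dog barked at the cat"]
test_docs = ["the cat barked"]

vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and IDF weights from the training corpus
train_X = vectorizer.fit_transform(train_docs)
# transform reuses those learned weights for the test documents
test_X = vectorizer.transform(test_docs)

# Both matrices share the same columns: one per term in the training vocabulary
print(train_X.shape[1] == test_X.shape[1])
```

If you instead called fit_transform on the test documents, the columns (and IDF weights) would no longer line up with what the SVM was trained on.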


* Some algorithms (such as those based on decision trees) do not need normalised inputs, and can cope with features that have different inherent scales.

Neil Slater
  • 29,388
  • 5
  • 82
  • 101
13

You should definitely normalize your test data. You normalize the data for the following reasons:

  • To put different features on the same scale, which accelerates the learning process.

  • To treat different features fairly, regardless of their original scale.

After training, your learning algorithm has learned to deal with data in scaled form, so you have to normalize your test data using the normalization parameters that were computed from the training data.

Green Falcon
  • 14,308
  • 10
  • 59
  • 98