
I am using tf-idf to build text representations. The dataset is large, and it quickly becomes too much for my RAM if I convert the matrix to a DataFrame.

What is the best way to reduce the number of features/columns while retaining as much information as possible?

The vectorizer lets you set max_features to a number, but that keeps the features with the highest term frequency, which kind of defeats the purpose of tf-idf. You can also set stop words, but in my case that doesn't reduce the dimensionality much.

Borut Flis

2 Answers


Generally speaking, the right tf-idf representation is a hyperparameter to be optimized.

As suggested in the other answer, you can use the min_df parameter, which acts like regularization: it sets the minimum number of documents a term must appear in to be kept in the term-document matrix.

A reasonable approach is to tune both min_df and ngram_range together, again via hyperparameter search using the cross-validation score to select the best combination. It is also highly recommended to preprocess/clean the data by removing stopwords.
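A minimal sketch of that tuning loop (assuming scikit-learn and hypothetical `texts`/`labels` for a toy classification task): min_df and ngram_range are treated as hyperparameters of a Pipeline and selected by cross-validated score.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data; in practice use the real corpus and labels.
texts = [
    "great movie loved it", "loved the acting great plot",
    "great fun loved every minute", "great loved wonderful story",
    "terrible film waste of time", "awful boring terrible",
    "boring waste awful film", "terrible awful boring waste",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# min_df and ngram_range tuned jointly, as suggested above.
param_grid = {
    "tfidf__min_df": [1, 2],
    "tfidf__ngram_range": [(1, 1), (1, 2)],
}

search = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)
print(search.best_params_)
```

Because the vectorizer sits inside the Pipeline, each cross-validation fold refits the vocabulary on its own training split, so the score honestly reflects the chosen min_df and ngram_range.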

Multivac

You should almost always use the min_df parameter (minimum document frequency) to limit the size of the vocabulary, because:

  • Rare words don't help the model, and they often cause overfitting.
  • Thanks to Zipf's law, this reduces the vocabulary size drastically, much more than max_df or stop words do. The result is a better model and lower memory requirements.
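A quick illustration of the Zipf effect (assuming scikit-learn's CountVectorizer and a hypothetical toy corpus): even a small min_df cuts the vocabulary sharply, because most terms occur in only one or two documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; on real data the drop-off is far steeper.
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the quick red fox runs past the sleepy cat",
    "a lazy dog and a sleepy cat nap in the sun",
    "the brown dog jumps over a quick cat",
]

sizes = []
for min_df in (1, 2, 3):
    # Vocabulary size after dropping terms below the document-frequency cutoff.
    vocab = CountVectorizer(min_df=min_df).fit(corpus).vocabulary_
    sizes.append(len(vocab))
    print(f"min_df={min_df}: {len(vocab)} terms")
```

Even here the vocabulary shrinks from 17 terms to 4 as min_df goes from 1 to 3; on a real corpus with millions of one-off tokens the reduction is dramatically larger.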
Erwan