
I am using tf-idf to build text representations. The dataset is large, and it quickly becomes too much for my RAM if I convert the matrix to a DataFrame.

What is the best way to reduce the number of features/columns while retaining as much information as possible?

The vectorizer lets you set max_features to a number, but that keeps the features with the highest term frequency, which kind of defeats the purpose of tf-idf. You can also set stop words, but in my case that doesn't reduce the dimensionality much.

Borut Flis

2 Answers


Generally speaking, the right tf-idf representation is a hyperparameter to be optimized.

As suggested in the other answer, you can use the min_df parameter, which acts like regularization: it sets the minimum number of documents a term must appear in to be kept in the term-document matrix.

A reasonable approach is to tune both min_df and ngram_range together, again via hyperparameter search using the cross-validation score to select the best combination. It is also highly recommended to preprocess/clean the data by removing stopwords.
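A minimal sketch of that tuning loop (assuming scikit-learn and hypothetical `texts`/`labels` for a toy classification task): min_df and ngram_range are treated as hyperparameters of a Pipeline and selected by cross-validated score.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data; in practice use the real corpus and labels.
texts = [
    "great movie loved it", "loved the acting great plot",
    "great fun loved every minute", "great loved wonderful story",
    "terrible film waste of time", "awful boring terrible",
    "boring waste awful film", "terrible awful boring waste",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# min_df and ngram_range tuned jointly, as suggested above.
param_grid = {
    "tfidf__min_df": [1, 2],
    "tfidf__ngram_range": [(1, 1), (1, 2)],
}

search = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)
print(search.best_params_)
```

Because the vectorizer sits inside the Pipeline, each cross-validation fold refits the vocabulary on its own training split, so the score honestly reflects the chosen min_df and ngram_range.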

Multivac

You should almost always use the min_df parameter (minimum document frequency) to limit the size of the vocabulary, because:

  • Rare words don't help the model, and they often cause overfitting.
  • Thanks to Zipf's law, this reduces the vocabulary size drastically, much more than max_df or stop words do. The result is a better model and lower memory requirements.
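A quick illustration of the Zipf effect (assuming scikit-learn's CountVectorizer and a hypothetical toy corpus): even a small min_df cuts the vocabulary sharply, because most terms occur in only one or two documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; on real data the drop-off is far steeper.
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the quick red fox runs past the sleepy cat",
    "a lazy dog and a sleepy cat nap in the sun",
    "the brown dog jumps over a quick cat",
]

sizes = []
for min_df in (1, 2, 3):
    # Vocabulary size after dropping terms below the document-frequency cutoff.
    vocab = CountVectorizer(min_df=min_df).fit(corpus).vocabulary_
    sizes.append(len(vocab))
    print(f"min_df={min_df}: {len(vocab)} terms")
```

Even here the vocabulary shrinks from 17 terms to 4 as min_df goes from 1 to 3; on a real corpus with millions of one-off tokens the reduction is dramatically larger.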
Erwan