3

I want to build my own stop-word list, and I have computed tf-idf scores for my terms.

Can I consider the words highlighted in red to be stop words? What threshold should I use if I select stop words based on tf-idf? And should I treat the words with high tf-idf values as the most important ones, which I need to keep?

enter image description here

@Erwan answered this question. Check their answer to the question they linked too; it is very informative.

Maxi
  • 89
  • 8

1 Answer

8
  • There's no standard definition of a stop word, but in general stop words are very frequent words which don't contribute to the meaning of the text, like determiners, pronouns, etc. Importantly, being a stop word is a property of the unique words in the vocabulary. For example, if the word $w$ is considered a stop word, then this applies to all the occurrences of $w$ in the text, not only to some of them.
  • In contrast, TFIDF applies to the words within a particular sentence/document, so the same word $w$ may have a different TFIDF value in different sentences/documents:
    • IDF is a property at the vocabulary level, i.e. all the occurrences of $w$ have the same IDF.
    • TF is specific to the sentence/document. If $w$ appears 3 times more often in document A than in document B, then it has 3 times higher TFIDF value in A than in B.
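
To make the two points above concrete, here is a minimal sketch. It assumes raw counts for TF and the plain $\log(N/\mathrm{df})$ form of IDF (libraries often use smoothed variants), and the toy corpus and helper names are made up for illustration:

```python
import math
from collections import Counter

# Toy corpus, invented for this illustration only.
docs = {
    "A": "the cat chased the cat and the cat".split(),
    "B": "the dog saw a cat".split(),
    "C": "the dog barked".split(),
}

def idf(word, docs):
    """Unsmoothed IDF, log(N / df): one value per vocabulary word."""
    df = sum(1 for tokens in docs.values() if word in tokens)
    return math.log(len(docs) / df)

def tfidf(word, doc_id, docs):
    """Raw count of the word in one document times the word's IDF."""
    tf = Counter(docs[doc_id])[word]
    return tf * idf(word, docs)

cat_idf = idf("cat", docs)        # same value wherever "cat" occurs
cat_a = tfidf("cat", "A", docs)   # "cat" occurs 3 times in A
cat_b = tfidf("cat", "B", docs)   # "cat" occurs once in B

print(cat_idf, cat_a, cat_b)
```

With these assumptions, "cat" has a single IDF but its TFIDF in A is three times its TFIDF in B, while "the", which occurs in every document, gets an IDF of zero.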

This is why it doesn't really make sense to use the TFIDF value to select stop words: a TFIDF value is specific to a sentence/document, whereas being a stop word is not. You could use the IDF part only, but IDF is just a decreasing function of the document frequency, so it ranks words exactly as the document frequency does, and in practice that gives much the same results as using the overall frequency.
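
As a rough sketch of the document-frequency approach, the snippet below picks stop-word candidates from a toy corpus. The corpus and the `MAX_DF_RATIO` cutoff are assumptions chosen for illustration, not recommended values; in practice the cutoff has to be tuned by inspecting the resulting list:

```python
import math
from collections import Counter

# Toy corpus, invented for this illustration only.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "a bird sang on the roof".split(),
    "the dog barked at a bird".split(),
]
n_docs = len(docs)

# Document frequency: number of documents that contain each word.
df = Counter(word for tokens in docs for word in set(tokens))

# Hypothetical cutoff: words found in at least 75% of the documents.
MAX_DF_RATIO = 0.75
stop_candidates = sorted(w for w, c in df.items() if c / n_docs >= MAX_DF_RATIO)

# Unsmoothed IDF is a decreasing function of df, so sorting by
# ascending IDF gives the same order as sorting by descending df.
idf = {w: math.log(n_docs / c) for w, c in df.items()}
by_df = sorted(df, key=lambda w: (-df[w], w))
by_idf = sorted(idf, key=lambda w: (idf[w], w))

print(stop_candidates)
print(by_df == by_idf)
```

Here only "the" passes the cutoff, and the two rankings are identical, which is why IDF adds nothing over the document frequency for this purpose.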

Erwan
  • 26,519
  • 3
  • 16
  • 39