Very interesting question.
- The easy, but probably lazy, answer
When using a pre-trained model, it is generally advised to feed it data similar to what it was trained on. Basically, if keeping stopwords matters, don't remove them, and if it doesn't matter, it doesn't hurt to keep them in. Obviously, if you can, try both with and without stopwords and see what works best for your problem.
You actually have two ways to "remove" your stopwords: either you drop them from the input sequence altogether, or you replace them with a mask token (e.g. <#UNKNOWN> or <#MASK>).
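For concreteness, here is a minimal sketch of both options. The NLTK English stopword list and the generic `"[MASK]"` placeholder are my own assumptions, not something dictated by your model; substitute whatever stopword list and mask token your tokenizer actually uses (e.g. `tokenizer.mask_token` in Hugging Face).

```python
# Minimal sketch: drop stopwords vs. replace them with a mask token.
# Assumes the NLTK stopword corpus is available (nltk.download("stopwords")).
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words("english"))

def remove_stopwords(text):
    """Option 1: remove stopwords from the input sequence altogether."""
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)

def mask_stopwords(text, mask_token="[MASK]"):
    """Option 2: replace stopwords with the model's mask token."""
    return " ".join(mask_token if w.lower() in STOPWORDS else w
                    for w in text.split())

print(remove_stopwords("I like basketball with an audience"))
# with the standard NLTK list: "like basketball audience"
print(mask_stopwords("I like basketball with an audience"))
# with the standard NLTK list: "[MASK] like basketball [MASK] [MASK] audience"
```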
In the latter case, the transformer will implicitly guess what these masks are, and you will have achieved the original goal of stopword removal: making sure they don't affect the predictive outcome. Indeed, take the sentences:
"I like basketball with an audience" & "I like basketball without an audience"
These two sentences are both about basketball, and you wouldn't want "with"/"without" to make your model think they are about different topics. By masking "with" and "without", you "remove" the stopwords while still feeding the model sequences shaped like the ones it was pre-trained on (which did contain stopwords).
Now, what would happen if you fed an "incomplete" sentence to a transformer? The positional encodings would still capture which words come before or after others, which is what you want, but removing stopwords means some words end up "closer" to each other than they were in the original sentence. Does that matter? I do not think so.
If the vector your transformer outputs for "word1 <#MASK> word2" is wildly different from the one for "word1 word2", it must mean the masked token is crucial to the overall meaning of the sentence, which would also suggest it shouldn't be treated as a stopword to begin with.
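If you want to sanity-check that intuition on your own data, you can compare the two output vectors directly. This is only a sketch under my own assumptions: I picked `bert-base-uncased` and mean pooling over the last hidden state as the sentence vector, neither of which is prescribed by your setup.

```python
# Rough check: how far apart are the sentence vectors with and without a mask?
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(text):
    """Mean-pool the last hidden state into a single sentence vector."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

full = embed("I like basketball with an audience")
masked = embed("I like basketball [MASK] an audience")

similarity = torch.cosine_similarity(full, masked, dim=0)
print(f"cosine similarity: {similarity.item():.3f}")
# A high similarity suggests the masked token carried little sentence-level
# meaning; a low one suggests it was doing real work and maybe shouldn't be
# treated as a stopword.
```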
I would suggest masking the stopwords instead of removing them. That said, if performance is so critical that you need to feed shorter sequences, I think you will be alright removing them.