Very interesting question.
- The easy, but probably lazy, answer
When using a pre-trained model, it is generally advised to feed it data similar to what it was trained on. Basically, if keeping stopwords matters, don't remove them, and if it doesn't matter, it doesn't hurt to keep them in. Obviously, if you can, try both with and without stopwords and see what works best for your problem.
You actually have two ways to "remove" your stopwords: either you drop them from the input sequence altogether, or you replace them with a mask token (e.g. <#UNKNOWN> or <#MASK>).
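For concreteness, here is a minimal sketch of both options. The NLTK English stopword list and the generic `"[MASK]"` placeholder are my own assumptions, not something dictated by your model; substitute whatever stopword list and mask token your tokenizer actually uses (e.g. `tokenizer.mask_token` in Hugging Face).

```python
# Minimal sketch: drop stopwords vs. replace them with a mask token.
# Assumes the NLTK stopword corpus is available (nltk.download("stopwords")).
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words("english"))

def remove_stopwords(text):
    """Option 1: remove stopwords from the input sequence altogether."""
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)

def mask_stopwords(text, mask_token="[MASK]"):
    """Option 2: replace stopwords with the model's mask token."""
    return " ".join(mask_token if w.lower() in STOPWORDS else w
                    for w in text.split())

print(remove_stopwords("I like basketball with an audience"))
# with the standard NLTK list: "like basketball audience"
print(mask_stopwords("I like basketball with an audience"))
# with the standard NLTK list: "[MASK] like basketball [MASK] [MASK] audience"
```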
In the latter case, the transformer will implicitly guess what these masks are, and you will have achieved the original goal of stopword removal: making sure they don't affect the predictive outcome. Indeed, take the sentences:
"I like basketball with an audience" & "I like basketball without an audience"
These two sentences are both about basketball, and you wouldn't want "with"/"without" to make your model think they are about different topics. By masking "with" and "without", you "remove" the stopwords while still feeding the model sequences shaped like the ones it was pre-trained on (which did contain stopwords).
Now, what would happen if you fed an "incomplete" sentence to a transformer? The positional encodings would still capture which words come before or after others, which is what you want, but removing stopwords means some words end up "closer" to each other than they were in the original sentence. Does that matter? I do not think so.
If the vector your transformer outputs for "word1 <#MASK> word2" is wildly different from the one for "word1 word2", it must mean the masked token is crucial to the overall meaning of the sentence, which would also suggest it shouldn't be treated as a stopword to begin with.
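If you want to sanity-check that intuition on your own data, you can compare the two output vectors directly. This is only a sketch under my own assumptions: I picked `bert-base-uncased` and mean pooling over the last hidden state as the sentence vector, neither of which is prescribed by your setup.

```python
# Rough check: how far apart are the sentence vectors with and without a mask?
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(text):
    """Mean-pool the last hidden state into a single sentence vector."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

full = embed("I like basketball with an audience")
masked = embed("I like basketball [MASK] an audience")

similarity = torch.cosine_similarity(full, masked, dim=0)
print(f"cosine similarity: {similarity.item():.3f}")
# A high similarity suggests the masked token carried little sentence-level
# meaning; a low one suggests it was doing real work and maybe shouldn't be
# treated as a stopword.
```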
I would suggest masking the stopwords instead of removing them. That said, if performance is so critical that you need to feed shorter sequences, I think you will be alright removing them.