I have been reading the early paper on pre-training in NLP (https://arxiv.org/abs/1511.01432) and I can't understand what random word dropout means. The authors don't explain this method at all, as if it were a standard thing. Can someone explain what they actually do and what the purpose of it is?
1 Answer
It is not uncommon that we can make sense of a sentence without reading every word of it. Likewise, when you skim a document, you skip over some words and still understand the main point. This is the intuition behind word dropout.
Generally this is done by randomly dropping each word in a sequence, for example with a Bernoulli mask:

$X \leftarrow X \odot \vec{e}, \quad e_i \sim \mathrm{Bernoulli}(1 - p)$

where $X$ is the sequence of word tokens (or their embeddings), $n$ is the length of the sequence, $\vec{e} \in \{0, 1\}^n$ is the dropout mask, and $p$ is the probability of dropping each word ($e_i = 1$ keeps the word, $e_i = 0$ drops it).
This is usually done after the word embeddings are computed, and the embeddings of the words selected to be dropped are normally replaced with the embedding of the <UNK> (unknown) token.
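To make that concrete, here is a minimal sketch in PyTorch (my choice of framework, not necessarily what the paper used); the function name `word_dropout`, the `unk_id`, and the tensor shapes are hypothetical. It replaces the token id with the <UNK> index *before* the embedding lookup, which is equivalent to swapping in the <UNK> embedding afterwards.

```python
# Minimal word-dropout sketch (not the authors' code): corrupt token ids
# before the embedding lookup, replacing dropped positions with <UNK>.
import torch

def word_dropout(token_ids: torch.Tensor, p: float, unk_id: int,
                 training: bool = True) -> torch.Tensor:
    """Randomly replace each token id with `unk_id` with probability `p`.

    token_ids : LongTensor of shape (batch, seq_len)  -- assumed layout
    p         : dropout rate (probability of dropping a word)
    unk_id    : vocabulary index of the <UNK> token   -- assumed to exist
    """
    if not training or p <= 0.0:
        return token_ids
    # e_i ~ Bernoulli(1 - p): True = keep the word, False = drop it
    keep_mask = torch.rand_like(token_ids, dtype=torch.float) >= p
    return torch.where(keep_mask, token_ids,
                       torch.full_like(token_ids, unk_id))

# Usage: corrupt the input ids, then embed as usual.
vocab_size, emb_dim, unk_id = 10_000, 128, 1      # hypothetical values
embedding = torch.nn.Embedding(vocab_size, emb_dim)
ids = torch.randint(2, vocab_size, (4, 12))       # fake batch of token ids
dropped = word_dropout(ids, p=0.25, unk_id=unk_id)
vectors = embedding(dropped)                      # shape (4, 12, 128)
```

At test time (`training=False`) no words are dropped, mirroring how standard dropout is disabled at inference.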
By doing this, we allow our model to learn more flexible ways of conveying meaning and make it more robust to missing or unseen words, much like regular dropout does for hidden units.