
Currently, I am reading BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. I want to understand how a pre-trained BERT model generates word embeddings for out-of-vocabulary words. Models like ELMo process their input at the character level and can therefore generate word embeddings for out-of-vocabulary words. Can BERT do something similar?

Sayali Sonawane

1 Answer


BERT does not provide word-level representations, but subword representations. You may want to combine the vectors of all subwords of the same word (e.g. by averaging them), but that is up to you; BERT itself only gives you the subword vectors.
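For example, here is a minimal sketch of that pooling, assuming the Hugging Face `transformers` library (not something the answer itself specifies, just a common way to run BERT):

```python
# Sketch: average BERT's subword vectors back into one vector per word,
# assuming the Hugging Face `transformers` library and bert-base-uncased.
import torch
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

words = ["The", "embeddings", "are", "contextualised"]
# is_split_into_words=True keeps the mapping between words and subwords
encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**encoding).last_hidden_state[0]  # (num_subwords, 768)

word_vectors = []
for idx in range(len(words)):
    # word_ids() maps each subword position to the index of its source word
    # (special tokens map to None, so they are skipped automatically)
    positions = [i for i, w in enumerate(encoding.word_ids()) if w == idx]
    word_vectors.append(hidden[positions].mean(dim=0))  # average the subwords

print(len(word_vectors), word_vectors[0].shape)  # 4 word vectors of size 768
```

Averaging is only one choice; taking the first subword's vector, or max-pooling, is just as valid, and which works best depends on your downstream task.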

Subwords are used to represent both the input text and the output tokens. When an unseen word is presented to BERT, its WordPiece tokenizer slices it into multiple subwords, falling back to single-character subwords if needed. That is how it deals with unseen words.
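A small illustration of that slicing, again assuming the Hugging Face tokenizer for bert-base-uncased (the exact splits depend on the vocabulary, so the printed pieces below are indicative):

```python
# Sketch: how an out-of-vocabulary word is sliced into known subword pieces.
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# A rare word is split into known subwords; "##" marks a piece that
# continues the previous piece rather than starting a new word.
print(tokenizer.tokenize("electroencephalography"))
# e.g. ['electro', '##ence', '##pha', '##log', '##raphy']

# In the worst case the tokenizer falls back to character-level pieces.
print(tokenizer.tokenize("qzxv"))
# e.g. ['q', '##z', '##x', '##v']
```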

ELMo is very different: it ingests characters and generates word-level representations. The fact that it ingests the characters of each word, rather than a single token representing the whole word, is what gives ELMo the ability to handle unseen words.
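For contrast, a sketch of ELMo producing one vector per word directly, assuming the `ElmoEmbedder` interface from older `allennlp` releases (an illustrative assumption, not part of the original answer):

```python
# Sketch: ELMo builds word-level vectors from characters, so an unseen word
# still gets its own vector (no subword splitting involved).
from allennlp.commands.elmo import ElmoEmbedder

elmo = ElmoEmbedder()  # downloads the default pretrained weights

vectors = elmo.embed_sentence(["an", "unseen", "blorptastic", "word"])
print(vectors.shape)  # (3 layers, 4 words, 1024 dimensions)
```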

noe