
Text generation is perhaps one of the most fun things to do with old n-gram or new BERT/ELMo models. I am wondering whether BERT can be used to generate text from the end of a sentence, or better, in both directions. That is, instead of giving some starting words, we give some ending words.

user185597

1 Answer


Neither BERT nor ELMo can be used as-is for next-word (or previous-word) prediction. BERT is trained on a masked language model (LM) task and can therefore only be used to guess masked words, thereby obtaining a contextual representation of them. There have been some attempts to use BERT for text generation, but they have been unsuccessful so far. ELMo is a bidirectional LM, so you need the whole sentence to use it as input.
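To make the masked-LM point concrete, here is a minimal sketch using the Hugging Face transformers library with the bert-base-uncased checkpoint (both are my own choices for illustration, not part of the question or answer): BERT fills in a [MASK] position using context from both sides, which is not the same as generating the next word.

```python
# Minimal sketch of BERT's masked-word guessing (illustrative only).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT can only predict tokens at [MASK] positions, given both left and
# right context; it has no notion of "next word" generation.
for candidate in fill_mask("The capital of France is [MASK]."):
    print(candidate["token_str"], candidate["score"])
```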

Both BERT and ELMo use the whole sentence (except masked tokens for BERT) by construction, so they can hardly be used for "causal" LM tasks without modifying the models:

  • BERT would need a self-attention mask added to force causality (see the sketch after this list). This would be equivalent to using BERT's pre-trained weights as initial values for a normal (causal) Transformer language model, or for any text-generation model such as a neural machine translation (NMT) one; in fact, this has been attempted for NMT here and here.

  • ELMo would need to be stripped of one of its directional LSTMs. This would be equivalent to using that direction's parameters to initialize an LSTM for text generation.
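To illustrate what "forcing causality" means in the first bullet, here is a small sketch of a causal (lower-triangular) self-attention mask in PyTorch; the tensor names and sizes are just for demonstration and are not taken from the answer.

```python
# Illustrative sketch of a causal self-attention mask:
# position i may only attend to positions <= i.
import torch

seq_len = 5
# Lower-triangular boolean mask: True means "allowed to attend".
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Applied to raw attention scores, disallowed positions are set to -inf
# so they receive zero weight after the softmax.
scores = torch.randn(seq_len, seq_len)
masked_scores = scores.masked_fill(~causal_mask, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)
print(weights)
```

Each row i of the resulting attention weights only puts mass on positions up to i, which is exactly the constraint a causal LM needs and which BERT's bidirectional attention lacks.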

Another option would be to use them for knowledge distillation. This has been done with BERT.
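For reference, a generic sketch of the core of knowledge distillation (not a specific BERT recipe; the temperature, weighting, and function are illustrative assumptions): a student model is trained to match the teacher's softened output distribution in addition to the usual hard-label loss.

```python
# Generic distillation loss: soft targets from the teacher plus
# ordinary cross-entropy on the true labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-scaled distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```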

noe