I am new to deep-learning-based NLP and I have a question. I am trying to build a NER model, and I found several papers where people rely on a BERT-BiLSTM-CRF model for it. As far as I know, BERT is a language model that reads the context in both directions and embeds each word according to that context. My question is: if context is already captured in the word embeddings produced by BERT, why do we need another BiLSTM layer on top?
1 Answer
Indeed, that layer isn't strictly required, since it also encodes the sequence, just in a different way than BERT does.
My assumption is that in a BERT-BiLSTM-CRF setup, the BERT layer is either frozen or too expensive to fine-tune due to its sheer size, which is likely why the BiLSTM layer has been added on top.
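To make that concrete, here is a minimal sketch of what such a model could look like in PyTorch, assuming the Hugging Face `transformers` package and the `pytorch-crf` package for the CRF layer. The BERT encoder is frozen so that only the BiLSTM and CRF are trained, which matches the setup described above; names like `num_tags` and `lstm_hidden` are illustrative, not taken from any particular paper.

```python
import torch
import torch.nn as nn
from transformers import AutoModel
from torchcrf import CRF  # pip install pytorch-crf


class BertBiLstmCrf(nn.Module):
    def __init__(self, num_tags: int, lstm_hidden: int = 256):
        super().__init__()
        self.bert = AutoModel.from_pretrained("bert-base-cased")
        # Freeze BERT: its contextual embeddings are used as fixed features,
        # so only the layers below are updated during training.
        for p in self.bert.parameters():
            p.requires_grad = False
        self.bilstm = nn.LSTM(
            input_size=self.bert.config.hidden_size,
            hidden_size=lstm_hidden,
            batch_first=True,
            bidirectional=True,
        )
        self.emissions = nn.Linear(2 * lstm_hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        # Contextual embeddings from the frozen BERT encoder:
        # shape (batch, seq_len, hidden_size)
        hidden = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.bilstm(hidden)   # trainable re-encoding of the sequence
        scores = self.emissions(lstm_out)   # per-token tag scores for the CRF
        mask = attention_mask.bool()
        if tags is not None:
            # Training: negative log-likelihood of the gold tag sequence
            return -self.crf(scores, tags, mask=mask)
        # Inference: Viterbi decoding of the best tag sequence
        return self.crf.decode(scores, mask=mask)
```

With BERT frozen, the BiLSTM is the only sequence encoder whose weights adapt to the NER task, which is the role the answer attributes to it.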
Valentin Calomme