
I am new to deep learning-based NLP and I have a doubt. I am trying to build an NER model, and I found some papers where people rely on a BERT-BiLSTM-CRF model for it. As far as I know, BERT is a language model that scans the context in both directions and embeds words according to that context. Now my question is: if context is already captured during word embedding with BERT, why do we need another BiLSTM layer?

Ethan

1 Answer


Indeed, that layer isn't strictly required: the BiLSTM also encodes the sequence, just in a different way than BERT does.

What I assume is that in a BERT-BiLSTM-CRF setup, the BERT layer is either frozen or too expensive to fine-tune due to its sheer size. That is likely why the BiLSTM layer has been added: it gives the model a trainable sequence encoder on top of the fixed BERT embeddings, before the CRF decodes the tag sequence.
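
To make that concrete, here is a minimal sketch (in PyTorch, using the Hugging Face transformers library) of what such a model can look like when BERT is frozen. The class name, hidden sizes, and the choice of bert-base-cased are illustrative only, and the CRF layer is just indicated by a comment rather than implemented:

```python
import torch
import torch.nn as nn
from transformers import AutoModel


class BertBiLstmTagger(nn.Module):
    """Illustrative BERT-BiLSTM tagger with a frozen BERT encoder."""

    def __init__(self, num_tags: int, lstm_hidden: int = 256):
        super().__init__()
        self.bert = AutoModel.from_pretrained("bert-base-cased")
        # Freeze BERT so only the BiLSTM (and a CRF, if added) are trained,
        # which is the scenario described above.
        for p in self.bert.parameters():
            p.requires_grad = False
        self.bilstm = nn.LSTM(
            input_size=self.bert.config.hidden_size,
            hidden_size=lstm_hidden,
            batch_first=True,
            bidirectional=True,
        )
        # Per-token emission scores; a CRF layer (e.g. from the pytorch-crf
        # package) would sit on top of these to decode the best tag sequence.
        self.emissions = nn.Linear(2 * lstm_hidden, num_tags)

    def forward(self, input_ids, attention_mask):
        # (batch, seq_len, hidden) contextual embeddings from the frozen BERT
        hidden = self.bert(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        lstm_out, _ = self.bilstm(hidden)          # (batch, seq_len, 2*lstm_hidden)
        return self.emissions(lstm_out)            # (batch, seq_len, num_tags)
```

In this sketch only the BiLSTM and the linear emission layer receive gradients, so the BiLSTM is what adapts the general-purpose BERT representations to the NER task.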

Valentin Calomme