I am new to deep-learning-based NLP and I have a question. I am trying to build a NER model, and I found several papers where people rely on a BERT-BiLSTM-CRF model for it. As far as I know, BERT is a language model that reads the context in both directions and embeds each word according to that context. My question is: if context is already captured in the word embeddings produced by BERT, why do we need another BiLSTM layer on top?
1 Answer
Indeed, that layer isn't strictly required, since it also encodes the sequence, just in a different way than BERT does.
My assumption is that in a BERT-BiLSTM-CRF setup, the BERT layer is either frozen or too expensive to fine-tune due to its sheer size, which is likely why the BiLSTM layer has been added on top.
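To make that concrete, here is a minimal sketch of what such a model could look like in PyTorch, assuming the Hugging Face `transformers` package and the `pytorch-crf` package for the CRF layer. The BERT encoder is frozen so that only the BiLSTM and CRF are trained, which matches the setup described above; names like `num_tags` and `lstm_hidden` are illustrative, not taken from any particular paper.

```python
import torch
import torch.nn as nn
from transformers import AutoModel
from torchcrf import CRF  # pip install pytorch-crf


class BertBiLstmCrf(nn.Module):
    def __init__(self, num_tags: int, lstm_hidden: int = 256):
        super().__init__()
        self.bert = AutoModel.from_pretrained("bert-base-cased")
        # Freeze BERT: its contextual embeddings are used as fixed features,
        # so only the layers below are updated during training.
        for p in self.bert.parameters():
            p.requires_grad = False
        self.bilstm = nn.LSTM(
            input_size=self.bert.config.hidden_size,
            hidden_size=lstm_hidden,
            batch_first=True,
            bidirectional=True,
        )
        self.emissions = nn.Linear(2 * lstm_hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        # Contextual embeddings from the frozen BERT encoder:
        # shape (batch, seq_len, hidden_size)
        hidden = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.bilstm(hidden)   # trainable re-encoding of the sequence
        scores = self.emissions(lstm_out)   # per-token tag scores for the CRF
        mask = attention_mask.bool()
        if tags is not None:
            # Training: negative log-likelihood of the gold tag sequence
            return -self.crf(scores, tags, mask=mask)
        # Inference: Viterbi decoding of the best tag sequence
        return self.crf.decode(scores, mask=mask)
```

With BERT frozen, the BiLSTM is the only sequence encoder whose weights adapt to the NER task, which is the role the answer attributes to it.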
Valentin Calomme