Questions tagged [bert]

BERT stands for Bidirectional Encoder Representations from Transformers. It is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers.

343 questions
77 votes • 4 answers

What is the purpose of the [CLS] token and why is its encoding output important?

I am reading this article on how to use BERT by Jay Alammar and I understand things up until: For sentence classification, we’re only interested in BERT’s output for the [CLS] token, so we select that slice of the cube and discard everything…
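A minimal sketch of what "selecting the [CLS] slice" means in practice, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (neither is specified in the question):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("This movie was great.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch, sequence_length, hidden_size).
# Position 0 is always the [CLS] token, so this slice is the single
# per-sentence vector that a classification head is usually fed.
cls_vector = outputs.last_hidden_state[:, 0, :]   # shape (1, 768)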
44 votes • 2 answers

What is GELU activation?

I was going through the BERT paper, which uses GELU (Gaussian Error Linear Unit) and states the equation as $$\mathrm{GELU}(x) = x P(X \le x) = x\Phi(x),$$ which in turn is approximated by $$0.5x\left(1 + \tanh\!\left[\sqrt{2/\pi}\,\bigl(x + 0.044715x^3\bigr)\right]\right).$$ Could you simplify the equation…
thanatoz • 2,495 • 4 • 20 • 41
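A small numeric sketch (my own, not from the question) comparing the exact GELU, x·Φ(x), with the tanh approximation quoted above, using NumPy and SciPy:

import numpy as np
from scipy.special import erf

def gelu_exact(x):
    # Phi(x) = P(X <= x) for X ~ N(0, 1), written via the error function
    return x * 0.5 * (1.0 + erf(x / np.sqrt(2.0)))

def gelu_tanh(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-4.0, 4.0, 801)
print(np.max(np.abs(gelu_exact(x) - gelu_tanh(x))))  # prints a very small gap

The two curves are nearly indistinguishable; the tanh form simply avoids evaluating the normal CDF directly.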
39 votes • 7 answers

How to get sentence embedding using BERT?

How to get sentence embedding using BERT? from transformers import BertTokenizer tokenizer=BertTokenizer.from_pretrained('bert-base-uncased') sentence='I really enjoyed this movie a lot.' #1.Tokenize the…
star • 1,521 • 7 • 20 • 31
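One common recipe, as a hedged sketch rather than the only answer to this question: run the sentence through BertModel and mean-pool the last hidden states over non-padding tokens to get a single fixed-size vector.

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("I really enjoyed this movie a lot.", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state       # (1, seq_len, 768)

mask = inputs["attention_mask"].unsqueeze(-1)        # (1, seq_len, 1)
sentence_embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (1, 768)

Libraries such as sentence-transformers wrap essentially this pooling, with models further fine-tuned so the vectors behave well for similarity search.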
29 votes • 7 answers

Why is the decoder not a part of BERT architecture?

I can't see how BERT makes predictions without using a decoder unit, which was a part of all models before it, including transformers and standard RNNs. How are output predictions made in the BERT architecture without using a decoder? How does it do…
hathalye7 • 445 • 1 • 5 • 7
27 votes • 5 answers

BERT vs Word2Vec: Is BERT disambiguating the meaning of the word vector?

Word2vec: Word2vec provides a vector for each token/word, and those vectors encode the meaning of the word. Although those vectors are not human interpretable, the meaning of the vectors is understandable/interpretable by comparing them with other…
sovon • 521 • 1 • 5 • 8
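A hedged illustration of the contextual-versus-static point: the same surface word gets different BERT vectors in different sentences, whereas a Word2Vec lookup table returns one fixed vector per word. The sentences and helper below are mine, assuming bert-base-uncased:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]   # works when `word` is a single WordPiece

v_river = word_vector("He sat on the bank of the river.", "bank")
v_money = word_vector("She deposited the money at the bank.", "bank")
print(torch.cosine_similarity(v_river, v_money, dim=0))  # noticeably below 1.0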
21 votes • 1 answer

Can BERT do the next-word-predict task?

As BERT is bidirectional (it uses a bidirectional transformer), is it possible to use it for the next-word prediction task? If yes, what needs to be tweaked?
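BERT is trained as a masked language model rather than a left-to-right one, so a common workaround (sketched here with the Hugging Face transformers library; treat it as an approximation, not true next-word prediction) is to append a [MASK] token and ask the masked-LM head to fill it:

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                  # (1, seq_len, vocab_size)

mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
top_ids = logits[0, mask_pos].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))      # candidate fillers for the mask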
18 votes • 2 answers

What is the use of [SEP] in the BERT paper?

I know that [CLS] marks the start of a sequence and [SEP] lets BERT know where the second sentence begins. However, I have a question. Suppose I have two sentences, s1 and s2, and our fine-tuning task is the same. In one way, I add special tokens…
xiangqing shen • 181 • 1 • 1 • 3
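For concreteness, a hedged sketch of how the tokenizer lays out a pair (the actual s1 and s2 in the question are not given, so these are placeholders): one [CLS] at the start, a [SEP] separating the sentences, a final [SEP], and token_type_ids marking which segment each token belongs to.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

enc = tokenizer("The movie was long.", "I still enjoyed it.")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'the', 'movie', 'was', 'long', '.', '[SEP]', 'i', 'still', 'enjoyed', 'it', '.', '[SEP]']
print(enc["token_type_ids"])
# 0 for the first segment (through the first [SEP]), 1 for the second segment and final [SEP]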
16 votes • 2 answers

What are good ranges for BERT hyperparameters when fine-tuning it on a very small dataset?

I need to fine-tune a BERT model (from the Hugging Face repository) on a sentence classification task. However, my dataset is really small: I have 12K sentences and only 10% of them are from the positive class. Does anyone here have any experience on…
zwlayer • 279 • 1 • 2 • 8
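For reference, the BERT paper's appendix suggests a small grid for fine-tuning: batch size 16 or 32, learning rate 5e-5, 3e-5, or 2e-5, and 2-4 epochs. A hedged sketch of that starting point using Hugging Face TrainingArguments (the values are the paper's generic grid, not tuned for a small, imbalanced 12K-sentence dataset):

from transformers import TrainingArguments

search_space = {                        # the paper's suggested grid, worth cross-validating over
    "learning_rate": [5e-5, 3e-5, 2e-5],
    "per_device_train_batch_size": [16, 32],
    "num_train_epochs": [2, 3, 4],
}

args = TrainingArguments(
    output_dir="bert-small-dataset",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,                  # a common default, not prescribed by the paper
)

With heavy class imbalance, stratified splits and a class-weighted loss (or oversampling) usually matter at least as much as these ranges.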
15 votes • 2 answers

Preprocessing for Text Classification in Transformer Models (BERT variants)

This might be silly to ask, but I am wondering whether one should carry out the conventional text preprocessing steps when training one of the transformer models. I remember that for training Word2Vec or GloVe, we needed to perform extensive text cleaning…
TwinPenguins • 4,429 • 3 • 22 • 54
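The usual practice, sketched below as an illustration rather than a rule: BERT's WordPiece tokenizer is applied to raw text, so the aggressive cleaning used in Word2Vec/GloVe pipelines (stop-word removal, stemming, stripping punctuation) is generally skipped; the uncased checkpoint even handles lowercasing itself.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

raw = "Honestly, the plot wasn't great... but the ACTING was superb!!"
print(tokenizer.tokenize(raw))
# The tokenizer lowercases, splits off punctuation, and breaks rare words into
# '##' sub-pieces on its own, so the model sees text close to what it was pre-trained on.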
14 votes • 3 answers

Why does everyone use BERT in research instead of LLAMA or GPT or PaLM, etc.?

It could be that I'm misunderstanding the problem space and the iterations of LLAMA, GPT, and PaLM are all based on BERT like many language models are, but every time I see a new paper on improving language models, it takes BERT as a base and adds…
Ethan • 243 • 1 • 2 • 6
12 votes • 1 answer

What is whole word masking in the recent BERT model?

I was checking the BERT GitHub page and noticed that there are new models built with a new training technique called "whole word masking". Here is a snippet describing it: In the original pre-processing code, we randomly select WordPiece tokens to…
kee • 223 • 2 • 6
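A hedged illustration of the difference (the grouping helper is mine): WordPiece may split one word into several '##' pieces, and whole word masking masks all pieces of a chosen word together instead of selecting sub-pieces independently.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("the unaffable weather today")
print(tokens)   # multi-piece words appear as a head token followed by '##' pieces

# Group token indices per whole word, so each word can be masked as one unit.
word_groups = []
for i, tok in enumerate(tokens):
    if tok.startswith("##") and word_groups:
        word_groups[-1].append(i)
    else:
        word_groups.append([i])
print(word_groups)  # all pieces of "unaffable" land in a single group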
12 votes • 1 answer

What is the first input to the decoder in a transformer model?

The image is from Jay Alammar's post on transformers. K_encdec and V_encdec are calculated in a matrix multiplication with the encoder outputs and sent to the encoder-decoder attention layer of each decoder layer in the decoder. The previous output is…
mLstudent33 • 604 • 1 • 7 • 20
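A hedged sketch of the standard convention (not specific to the linked post): the decoder's first input is a start-of-sequence token; during training the target is shifted right (teacher forcing), and at inference tokens are appended one at a time. The token names are illustrative.

target = ["<sos>", "ich", "bin", "ein", "student", "<eos>"]

# Training (teacher forcing): decoder inputs vs. the labels predicted at each position.
decoder_inputs = target[:-1]   # ['<sos>', 'ich', 'bin', 'ein', 'student']
labels         = target[1:]    # ['ich', 'bin', 'ein', 'student', '<eos>']

# Inference: start from <sos> only; feed `generated` plus the encoder outputs to
# the decoder, take the prediction for the last position, append it, and repeat
# until <eos> is produced.
generated = ["<sos>"]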
11 votes • 2 answers

What is the difference between BERT and RoBERTa?

I want to understand the difference between BERT and RoBERTa. I saw the article below. https://towardsdatascience.com/bert-roberta-distilbert-xlnet-which-one-to-use-3d5ab82ba5f8 It mentions that RoBERTa was trained on 10x more data, but I don't…
Noman Tanveer • 213 • 1 • 2 • 8
11 votes • 2 answers

Does BERT have any advantage over GPT-3?

I have read a couple of documents that explain in detail the edge that GPT-3 (Generative Pre-trained Transformer 3) has over BERT (Bidirectional Encoder Representations from Transformers). So I am curious to know whether BERT scores better…
Bipin • 213 • 1 • 2 • 8
11 votes • 2 answers

Why should I understand AI architectures?

Why should I understand what is happening deep down in some AI architecture? For example LSTM, BERT, partial convolutions... architectures like this. Why should I understand what is going on when I can find any model on the Internet, or any implementation…