Questions tagged [bert]

BERT stands for Bidirectional Encoder Representations from Transformers. It is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers.

343 questions
77 votes • 4 answers

What is the purpose of the [CLS] token and why is its encoding output important?

I am reading this article on how to use BERT by Jay Alammar and I understand things up until: For sentence classification, we’re only interested in BERT’s output for the [CLS] token, so we select that slice of the cube and discard everything…
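A minimal sketch of what "selecting the [CLS] slice" means in practice, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (neither is specified in the question):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("This movie was great.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch, sequence_length, hidden_size).
# Position 0 is always the [CLS] token, so this slice is the single
# per-sentence vector that a classification head is usually fed.
cls_vector = outputs.last_hidden_state[:, 0, :]   # shape (1, 768)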
44 votes • 2 answers

What is GELU activation?

I was going through the BERT paper, which uses GELU (Gaussian Error Linear Unit) and states the equation as $$\mathrm{GELU}(x) = x P(X \le x) = x\Phi(x),$$ which in turn is approximated by $$0.5x\left(1 + \tanh\!\left[\sqrt{2/\pi}\,\bigl(x + 0.044715x^3\bigr)\right]\right).$$ Could you simplify the equation…
thanatoz • 2,495 • 4 • 20 • 41
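A small numeric sketch (my own, not from the question) comparing the exact GELU, x·Φ(x), with the tanh approximation quoted above, using NumPy and SciPy:

import numpy as np
from scipy.special import erf

def gelu_exact(x):
    # Phi(x) = P(X <= x) for X ~ N(0, 1), written via the error function
    return x * 0.5 * (1.0 + erf(x / np.sqrt(2.0)))

def gelu_tanh(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-4.0, 4.0, 801)
print(np.max(np.abs(gelu_exact(x) - gelu_tanh(x))))  # prints a very small gap

The two curves are nearly indistinguishable; the tanh form simply avoids evaluating the normal CDF directly.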
39 votes • 7 answers

How to get sentence embedding using BERT?

How to get sentence embedding using BERT? from transformers import BertTokenizer tokenizer=BertTokenizer.from_pretrained('bert-base-uncased') sentence='I really enjoyed this movie a lot.' #1.Tokenize the…
star • 1,521 • 7 • 20 • 31
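One common recipe, as a hedged sketch rather than the only answer to this question: run the sentence through BertModel and mean-pool the last hidden states over non-padding tokens to get a single fixed-size vector.

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("I really enjoyed this movie a lot.", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state       # (1, seq_len, 768)

mask = inputs["attention_mask"].unsqueeze(-1)        # (1, seq_len, 1)
sentence_embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (1, 768)

Libraries such as sentence-transformers wrap essentially this pooling, with models further fine-tuned so the vectors behave well for similarity search.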
29 votes • 7 answers

Why is the decoder not a part of BERT architecture?

I can't see how BERT makes predictions without using a decoder unit, which was a part of all models before it, including transformers and standard RNNs. How are output predictions made in the BERT architecture without using a decoder? How does it do…
hathalye7 • 445 • 1 • 5 • 7
27 votes • 5 answers

BERT vs Word2Vec: Is BERT disambiguating the meaning of the word vector?

Word2vec: Word2vec provides a vector for each token/word, and those vectors encode the meaning of the word. Although those vectors are not human interpretable, the meaning of the vectors is understandable/interpretable by comparing them with other…
sovon • 521 • 1 • 5 • 8
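A hedged illustration of the contextual-versus-static point: the same surface word gets different BERT vectors in different sentences, whereas a Word2Vec lookup table returns one fixed vector per word. The sentences and helper below are mine, assuming bert-base-uncased:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]   # works when `word` is a single WordPiece

v_river = word_vector("He sat on the bank of the river.", "bank")
v_money = word_vector("She deposited the money at the bank.", "bank")
print(torch.cosine_similarity(v_river, v_money, dim=0))  # noticeably below 1.0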
21 votes • 1 answer

Can BERT do the next-word-predict task?

As BERT is bidirectional (it uses a bidirectional transformer), is it possible to use it for the next-word prediction task? If yes, what needs to be tweaked?
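BERT is trained as a masked language model rather than a left-to-right one, so a common workaround (sketched here with the Hugging Face transformers library; treat it as an approximation, not true next-word prediction) is to append a [MASK] token and ask the masked-LM head to fill it:

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                  # (1, seq_len, vocab_size)

mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
top_ids = logits[0, mask_pos].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))      # candidate fillers for the mask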
18 votes • 2 answers

What is the use of [SEP] in the BERT paper?

I know that [CLS] marks the start of a sequence and [SEP] lets BERT know where the second sentence begins. However, I have a question. Suppose I have two sentences, s1 and s2, and our fine-tuning task is the same. In one way, I add special tokens…
xiangqing shen • 181 • 1 • 1 • 3
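For concreteness, a hedged sketch of how the tokenizer lays out a pair (the actual s1 and s2 in the question are not given, so these are placeholders): one [CLS] at the start, a [SEP] separating the sentences, a final [SEP], and token_type_ids marking which segment each token belongs to.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

enc = tokenizer("The movie was long.", "I still enjoyed it.")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'the', 'movie', 'was', 'long', '.', '[SEP]', 'i', 'still', 'enjoyed', 'it', '.', '[SEP]']
print(enc["token_type_ids"])
# 0 for the first segment (through the first [SEP]), 1 for the second segment and final [SEP]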
16 votes • 2 answers

What are good ranges for BERT hyperparameters when fine-tuning it on a very small dataset?

I need to fine-tune a BERT model (from the Hugging Face repository) on a sentence classification task. However, my dataset is really small: I have 12K sentences and only 10% of them are from the positive class. Does anyone here have any experience on…
zwlayer • 279 • 1 • 2 • 8
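For reference, the BERT paper's appendix suggests a small grid for fine-tuning: batch size 16 or 32, learning rate 5e-5, 3e-5, or 2e-5, and 2-4 epochs. A hedged sketch of that starting point using Hugging Face TrainingArguments (the values are the paper's generic grid, not tuned for a small, imbalanced 12K-sentence dataset):

from transformers import TrainingArguments

search_space = {                        # the paper's suggested grid, worth cross-validating over
    "learning_rate": [5e-5, 3e-5, 2e-5],
    "per_device_train_batch_size": [16, 32],
    "num_train_epochs": [2, 3, 4],
}

args = TrainingArguments(
    output_dir="bert-small-dataset",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,                  # a common default, not prescribed by the paper
)

With heavy class imbalance, stratified splits and a class-weighted loss (or oversampling) usually matter at least as much as these ranges.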
15 votes • 2 answers

Preprocessing for Text Classification in Transformer Models (BERT variants)

This might be silly to ask, but I am wondering whether one should carry out the conventional text preprocessing steps when training one of the transformer models. I remember that for training Word2Vec or GloVe, we needed to perform extensive text cleaning…
TwinPenguins • 4,429 • 3 • 22 • 54
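The usual practice, sketched below as an illustration rather than a rule: BERT's WordPiece tokenizer is applied to raw text, so the aggressive cleaning used in Word2Vec/GloVe pipelines (stop-word removal, stemming, stripping punctuation) is generally skipped; the uncased checkpoint even handles lowercasing itself.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

raw = "Honestly, the plot wasn't great... but the ACTING was superb!!"
print(tokenizer.tokenize(raw))
# The tokenizer lowercases, splits off punctuation, and breaks rare words into
# '##' sub-pieces on its own, so the model sees text close to what it was pre-trained on.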
14 votes • 3 answers

Why does everyone use BERT in research instead of LLAMA or GPT or PaLM, etc.?

It could be that I'm misunderstanding the problem space and the iterations of LLAMA, GPT, and PaLM are all based on BERT like many language models are, but every time I see a new paper on improving language models, it takes BERT as a base and adds…
Ethan • 243 • 1 • 2 • 6
12 votes • 1 answer

What is whole word masking in the recent BERT model?

I was checking the BERT GitHub page and noticed that there are new models built with a new training technique called "whole word masking". Here is a snippet describing it: In the original pre-processing code, we randomly select WordPiece tokens to…
kee • 223 • 2 • 6
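A hedged illustration of the difference (the grouping helper is mine): WordPiece may split one word into several '##' pieces, and whole word masking masks all pieces of a chosen word together instead of selecting sub-pieces independently.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("the unaffable weather today")
print(tokens)   # multi-piece words appear as a head token followed by '##' pieces

# Group token indices per whole word, so each word can be masked as one unit.
word_groups = []
for i, tok in enumerate(tokens):
    if tok.startswith("##") and word_groups:
        word_groups[-1].append(i)
    else:
        word_groups.append([i])
print(word_groups)  # all pieces of "unaffable" land in a single group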
12 votes • 1 answer

What is the first input to the decoder in a transformer model?

The image is from Jay Alammar's post on transformers. K_encdec and V_encdec are calculated in a matrix multiplication with the encoder outputs and sent to the encoder-decoder attention layer of each decoder layer in the decoder. The previous output is…
mLstudent33 • 604 • 1 • 7 • 20
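A hedged sketch of the standard convention (not specific to the linked post): the decoder's first input is a start-of-sequence token; during training the target is shifted right (teacher forcing), and at inference tokens are appended one at a time. The token names are illustrative.

target = ["<sos>", "ich", "bin", "ein", "student", "<eos>"]

# Training (teacher forcing): decoder inputs vs. the labels predicted at each position.
decoder_inputs = target[:-1]   # ['<sos>', 'ich', 'bin', 'ein', 'student']
labels         = target[1:]    # ['ich', 'bin', 'ein', 'student', '<eos>']

# Inference: start from <sos> only; feed `generated` plus the encoder outputs to
# the decoder, take the prediction for the last position, append it, and repeat
# until <eos> is produced.
generated = ["<sos>"]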
11 votes • 2 answers

What is the difference between BERT and RoBERTa?

I want to understand the difference between BERT and RoBERTa. I saw the article below. https://towardsdatascience.com/bert-roberta-distilbert-xlnet-which-one-to-use-3d5ab82ba5f8 It mentions that RoBERTa was trained on 10x more data, but I don't…
Noman Tanveer • 213 • 1 • 2 • 8
11 votes • 2 answers

Does BERT have any advantage over GPT-3?

I have read a couple of documents that explain in detail the edge that GPT-3 (Generative Pre-trained Transformer 3) has over BERT (Bidirectional Encoder Representations from Transformers). So I am curious to know whether BERT scores better…
Bipin • 213 • 1 • 2 • 8
11 votes • 2 answers

Why should I understand AI architectures?

Why should I understand what is happening deep down in some AI architecture? For example LSTM, BERT, partial convolutions... architectures like this. Why should I understand what is going on when I can find any model on the Internet, or any implementation…