
As I understand it, GPT-2 and BERT use Byte-Pair Encoding, which is a subword encoding. Since special start/end tokens such as <|startoftext|> and <|endoftext|> are used, I imagine the tokenizer should encode such a token as one single piece.

However, when I use the PyTorch BertTokenizer, it seems the tokenizer also separates these tokens into pieces. Is this the correct behaviour?

from pytorch_pretrained_bert import BertTokenizer, cached_path
tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False) 
tokenizer.tokenize('<s> This is a sentence <|endoftext|>')

The results are:

['<',
 's',
 '>',
 'This',
 'is',
 'a',
 'sentence',
 '<',
 '|',
 'end',
 '##oft',
 '##ex',
 '##t',
 '|',
 '>']
Kevin Ling

1 Answer


BERT is not trained with these special tokens, so the tokenizer does not expect them; it splits them like any other piece of normal text, and they will probably harm the obtained representations if you keep them. You should remove these special tokens from the input text.
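
To illustrate, here is a minimal sketch using the same pytorch_pretrained_bert BertTokenizer as in the question. BERT's own markers ([CLS], [SEP]) are in its vocabulary and survive as single pieces, while foreign markers like <s> or <|endoftext|> get split; the expected outputs in the comments are an assumption on my part and may vary slightly across library versions.

from pytorch_pretrained_bert import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)

# BERT's own special tokens are in the vocabulary, so they stay intact
print(tokenizer.tokenize('[CLS] This is a sentence [SEP]'))
# expected: ['[CLS]', 'This', 'is', 'a', 'sentence', '[SEP]']

# With the foreign markers removed, the sentence tokenizes cleanly
print(tokenizer.tokenize('This is a sentence'))
# expected: ['This', 'is', 'a', 'sentence']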

In the case of GPT-2, OpenAI trained it only with <|endoftext|>, and it has to be added after tokenization, not written into the raw text. Some people mistakenly add it before tokenization, which leads to problems. <|startoftext|> is not part of GPT-2 itself; it is specific to the gpt-2-simple library.
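
As a sketch of how to add <|endoftext|> after tokenization, the snippet below uses the newer transformers library (rather than pytorch_pretrained_bert, which the question uses); the idea is to tokenize the plain text first and then append the end-of-text token ID.

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Tokenize the plain text without any special markers written into it
ids = tokenizer.encode('This is a sentence')

# Append the <|endoftext|> token ID afterwards instead of putting the
# marker into the raw text before tokenization
ids.append(tokenizer.eos_token_id)  # 50256 for GPT-2

print(ids)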

noe