
The typical default for neural networks in natural language processing has been to take words as tokens.

OpenAI Codex is based on GPT-3, but it also deals with source code. For source code in general, there is no comparably obvious choice of tokens, because each programming language has its own tokenization rules, and I don't get the impression Codex uses a separate tokenizer for each language.

What does it take as tokens?

rwallace

1 Answer

NLP neural networks don't use word tokens any more. For some time now, the norm has been to use subwords. The usual approaches to defining the subword vocabulary are byte-pair encoding (BPE), WordPiece, and unigram tokenization.
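To make the subword idea concrete, here is a minimal sketch using OpenAI's open-source tiktoken library (my choice for illustration, not something this answer relies on): common words map to a single token, while rarer words are split into several subword pieces instead of being replaced by an unknown-word token.

```python
# Illustration only: tiktoken is not mentioned above; it simply exposes
# the BPE vocabularies so the subword splitting can be inspected.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")  # the BPE vocabulary used by GPT-3

for word in ["language", "tokenization", "untokenizable"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    # Rare words come back as several subword pieces rather than one token.
    print(word, "->", pieces)
```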

GPT-3 uses BPE tokenization. According to OpenAI's tokenizer tool website:

Codex models use a different set of encodings that handle whitespace more efficiently

From this, I understand that they use BPE but with a different vocabulary. This is supported by this JavaScript tokenizer, which was created by extracting the BPE vocabulary from OpenAI's own online tokenizer tool.
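If you want to check this yourself, the sketch below again assumes tiktoken; the encoding names r50k_base and p50k_base are the vocabularies tiktoken ships for GPT-3 and the Codex models, respectively. The Codex vocabulary adds dedicated tokens for runs of spaces, which is presumably what "handle whitespace more efficiently" refers to.

```python
# Sketch comparing the GPT-3 and Codex BPE vocabularies via tiktoken
# (the library and encoding names are my assumption, not from the answer).
import tiktoken

gpt3 = tiktoken.get_encoding("r50k_base")   # vocabulary used by GPT-3
codex = tiktoken.get_encoding("p50k_base")  # vocabulary used by the Codex models

snippet = "def add(x, y):\n        return x + y\n"

# The Codex vocabulary includes tokens for runs of spaces, so indented
# source code typically encodes to fewer tokens.
print("GPT-3:", len(gpt3.encode(snippet)), "tokens")
print("Codex:", len(codex.encode(snippet)), "tokens")
```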

noe