How is the embedding model trained? Are the embeddings simply extracted from ChatGPT (GPT-4), or are they trained separately from scratch (pre-training stage)?
OpenAI embeddings are not extracted from ChatGPT; they are trained independently.
According to the original article in which OpenAI presented their embeddings, the model used to obtain them is a Transformer encoder (like BERT or RoBERTa) trained contrastively on the task of distinguishing whether two pieces of text were consecutive in the original source text/code.
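As a rough illustration of that kind of objective (not OpenAI's actual training code), here is a minimal sketch of a contrastive loss with in-batch negatives: each text embedding should be most similar to the embedding of its true neighboring passage, and dissimilar to the other passages in the batch. The function name and the temperature value are my own choices for the example.

```python
import numpy as np

def contrastive_loss(text_emb, pos_emb, temperature=0.05):
    """InfoNCE-style loss: row i of text_emb should match row i of pos_emb."""
    # L2-normalize so dot products become cosine similarities
    a = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    b = pos_emb / np.linalg.norm(pos_emb, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # (batch, batch) similarity matrix
    # Row i's positive is column i; every other column acts as a negative
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

When the paired embeddings match (consecutive passages), the diagonal of the similarity matrix dominates and the loss is near zero; for unrelated pairs it approaches `log(batch_size)`.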
Also, as described in their latest announcement, their embeddings are now trained with a technique called "Matryoshka representation learning" (see the article), which lets their dimensionality be reduced to save storage space if needed.
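In practice, shortening a Matryoshka-trained embedding just means keeping the first coordinates and re-normalizing (with the `text-embedding-3` models, the API also exposes this directly via a `dimensions` parameter). A minimal sketch of the manual version:

```python
import numpy as np

def shorten_embedding(vec, dim):
    """Truncate a Matryoshka-style embedding to its first `dim` coordinates
    and re-normalize to unit length so cosine similarity still works."""
    truncated = np.asarray(vec, dtype=float)[:dim]
    return truncated / np.linalg.norm(truncated)
```

For example, a 3072-dimensional `text-embedding-3-large` vector can be cut down to 256 dimensions this way, trading some retrieval quality for a 12x reduction in storage.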
noe