
While going over a TensorFlow tutorial for the Transformer model, I realized that their implementation of the Encoder layer (and the Decoder) scales the word embeddings by the square root of the embedding dimension before adding the positional encodings. Note that this is different from the scaling inside the scaled dot-product attention.

I'm referring to the 3rd line of the call method of the Encoder class here: https://www.tensorflow.org/tutorials/text/transformer#encoder

def call(self, x, training, mask):
    seq_len = tf.shape(x)[1]

    # adding embedding and position encoding.
    x = self.embedding(x)  # (batch_size, input_seq_len, d_model)
    x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
    x += self.pos_encoding[:, :seq_len, :]

    x = self.dropout(x, training=training)

    for i in range(self.num_layers):
        x = self.enc_layers[i](x, training, mask)

    return x  # (batch_size, input_seq_len, d_model)

I could not find any mention of this scaling in the papers I've read so far. People always show the input to the encoder as WE + PE, that is, the word embedding plus the positional encoding. But this implementation seems to use sqrt(d_model) * WE + PE.

My questions:

  1. Have you ever seen this extra scaling step mentioned in a paper? I didn't find it in "Attention Is All You Need" (Vaswani et al.).
  2. What is this additional scaling trying to achieve?
Milad Shahidi

4 Answers


This is specified in the original Transformer paper, at the end of section 3.4:

Quoting that section:

3.4 Embeddings and Softmax

Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension d_model. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to [24]. In the embedding layers, we multiply those weights by √d_model.

This choice is not justified by the authors, either in the paper or anywhere else. It was specifically asked about in an issue on Google's original implementation, with no response.

The authors of other Transformer implementations have also wondered whether it is actually needed (see this, this and this).

Some hypothesized arguments (source) are:

  • It is needed because of the weight sharing between the decoder embedding and the decoder pre-softmax linear weights.
  • It is not actually needed.
  • It is to make the positional encoding relatively smaller. This means the original meaning in the embedding vector won’t be lost when we add them together. (A rough numerical sketch of this follows below.)
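
To put rough numbers on that third point, here is a minimal sketch of my own (not taken from any of the linked sources). It assumes the embedding matrix is initialized with standard deviation d_model ** -0.5, as another answer on this page reports for fairseq, together with the sinusoidal encoding from the paper:

    import numpy as np

    d_model = 512
    pos = 10  # an arbitrary position

    # Word embedding row, assuming initialization ~ N(0, d_model ** -0.5)
    rng = np.random.default_rng(0)
    we = rng.normal(0.0, d_model ** -0.5, size=d_model)

    # Sinusoidal positional encoding for this position, as defined in "Attention Is All You Need"
    i = np.arange(d_model // 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.empty(d_model)
    pe[0::2] = np.sin(angles)
    pe[1::2] = np.cos(angles)

    print(np.linalg.norm(we))                     # ~1.0
    print(np.linalg.norm(pe))                     # sqrt(d_model / 2) = 16.0
    print(np.linalg.norm(we * np.sqrt(d_model)))  # ~sqrt(d_model) ~= 22.6

Under that initialization, without the √d_model factor the positional encoding is roughly 16 times larger than the embedding; with it, the embedding becomes the larger of the two.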

For reference, there are other StackExchange questions discussing this (see this and this).

noe

Thank you! I'd also missed that multiplication in my study of the fairseq Transformer code, and it helps clear up a mystery I'd noted: the (sinusoidal, non-learned) positional embeddings take values in the range -1.0 to +1.0, but the word embeddings are initialized with a mean of 0.0 and a standard deviation of embedding_dim ** -0.5 (0.044 for 512, 0.03125 for 1024).

So, on the face of it, the positional embeddings would overwhelm any signal coming from the word embeddings.

But now that I can see the word embeddings are scaled by math.sqrt(embed_dim) (22.6 for 512, 32 for 1024), it makes sense again.

Following the links in the other answer, it seems it is done this way because the same embedding weights are reused in other parts of the Transformer model, and that reuse determined the initialization values.
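
To make that reuse concrete, here is a small sketch of my own (illustrative names, not fairseq's actual classes): with tied weights, one matrix plays two roles, and the initialization appears to be chosen for the projection role, while the sqrt(embed_dim) factor rescales it for the embedding role.

    import numpy as np

    vocab_size, d_model = 1000, 512
    rng = np.random.default_rng(0)

    # One shared weight matrix, initialized the way fairseq initializes embeddings: N(0, d_model ** -0.5)
    E = rng.normal(0.0, d_model ** -0.5, size=(vocab_size, d_model))

    token_ids = np.array([3, 17, 42])

    # Role 1: input embedding lookup, rescaled by sqrt(d_model) before positional encodings are added
    x = E[token_ids] * np.sqrt(d_model)          # shape (3, d_model)

    # Role 2: pre-softmax projection, multiplying the decoder output by E^T (no rescaling here);
    # with roughly unit-variance decoder outputs, this init gives each logit roughly unit variance
    decoder_out = rng.normal(size=(3, d_model))  # stand-in for a real decoder output
    logits = decoder_out @ E.T                   # shape (3, vocab_size)

This is only one plausible reading; as the accepted answer notes, the authors never spelled out the reason.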

Darren Cook

I agree with noe's third explanation: "It is to make the positional encoding relatively smaller. This means the original meaning in the embedding vector won’t be lost when we add them together."

A reference book (Dive into Deep Learning, Section 11.7.4) mentions this: "In the following Transformer encoder implementation, we stack num_blks instances of the above TransformerEncoderBlock classes. Since we use the fixed positional encoding whose values are always between -1 and 1, we multiply values of the learnable input embeddings by the square root of the embedding dimension to rescale before summing up the input embedding and the positional encoding."

AaronWeng

Multiplying Weights by √d_model

In the embedding layers, the weights are multiplied by the square root of d_model. This is done to scale the weights appropriately and to keep the dot product between the input representations and the weights within a reasonable range. Scaling the weights helps prevent the gradients from becoming too large or too small during training, which could lead to unstable learning dynamics.

dreamg