I have just tried using LoRA on Llama 3 8B, and I found that even without doing any fine-tuning it performed pretty well on my dataset. But then I realized that surely the LoRA parameters are randomly initialized, right? If that's the case, shouldn't the model's outputs initially be degraded by the LoRA parameters, since they're just adding random values on top of the regular parameters?
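To make my concern concrete, here's roughly how I picture a LoRA-wrapped linear layer. This is just a sketch of my mental model (names like LoRALinear and rank are made up, and both factors are drawn randomly, which is exactly what makes me expect the outputs to be perturbed) and not any library's actual implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base                          # frozen pretrained weight W
        self.base.weight.requires_grad_(False)
        # My assumption: both low-rank factors start out random,
        # so the update (B A) is non-zero before any training.
        self.A = nn.Parameter(torch.randn(rank, base.in_features))
        self.B = nn.Parameter(torch.randn(base.out_features, rank))

    def forward(self, x):
        # y = Wx + (B A)x  -- the second term is what I'd expect to
        # hurt the outputs at initialization if it's random.
        return self.base(x) + x @ self.A.T @ self.B.T
```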
I also have a somewhat related question, if you don't mind answering it as well. I keep reading that the alpha parameter in LoRA is the scaling factor on the low-rank update, e.g. y = Wx + alpha * (L1 L2)x, but I often see alpha values of 256, for example, which seems way too large, because that would set a 1 : 256 ratio between the influence of the regular parameters and the LoRA parameters on the output.
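Concretely, this is the scaling I think that formula implies. The dimensions and init scales here are made up purely to illustrate the magnitudes I'm worried about, not taken from any real checkpoint:

```python
import torch

torch.manual_seed(0)
d, r, alpha = 1024, 8, 256

W = torch.randn(d, d) / d**0.5      # stand-in for a pretrained weight
A = torch.randn(r, d) * 0.01        # stand-in LoRA factors
B = torch.randn(d, r) * 0.01
x = torch.randn(d)

base = W @ x
lora = alpha * (B @ (A @ x))        # my reading: alpha multiplies the whole LoRA branch

# With alpha = 256 the LoRA term can easily dwarf the base term,
# which is why the large alpha values I see seem off to me.
print(base.norm(), lora.norm())
```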