
In LoRA fine-tuning we write Wfinetuned = Wpretrained + ΔW, set ΔW = A @ B, and are happy because |ΔW| >> |A| + |B|, i.e. far fewer trainable parameters. But why don't we use the same method during training from scratch? That is, Wtrained = Winitialized + ΔW, with |ΔW| >> |A| + |B| still holding? And going further, why not also factor the initialization: Wtrained = Winitialized + ΔW = C @ D + A @ B, with |Winitialized + ΔW| >> |C| + |D| + |A| + |B|?
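A minimal sketch of the parameter-count argument behind |ΔW| >> |A| + |B| (the hidden size d and rank r below are assumed for illustration, not taken from the question):

```python
# Parameter counts for a full update ΔW versus a rank-r factorization A @ B.
# d and r are hypothetical values; the point is only that r << d.
d, r = 4096, 8

full_params = d * d              # parameters in ΔW (d x d)
lora_params = d * r + r * d      # parameters in A (d x r) plus B (r x d)

print(full_params)               # 16777216
print(lora_params)               # 65536
print(full_params / lora_params) # 256.0 -> |ΔW| >> |A| + |B| when r << d
```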

Тима

1 Answer


The point of LoRA is to avoid the compute cost of training the full model. If you are already training the full model, there's no point using LoRA.

Furthermore, training the full-rank matrix at the same time as LoRA just adds two linear transforms, i.e. going from y = x @ W to y = x @ W + x @ A @ B. A sum of linear transforms is itself a linear transform, so there exists a single matrix D = W + A @ B such that x @ D = x @ W + x @ A @ B. If you are training the full model, there's no advantage to adding the LoRA weights.
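A small numerical check of that collapse (shapes chosen arbitrarily): the full-rank weight plus the LoRA term is exactly the single matrix D = W + A @ B.

```python
# Verify that x @ W + x @ A @ B equals x @ (W + A @ B) for random matrices.
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 2
x = rng.standard_normal((3, d))   # a batch of inputs
W = rng.standard_normal((d, d))   # full-rank weight
A = rng.standard_normal((d, r))   # LoRA factors
B = rng.standard_normal((r, d))

D = W + A @ B
assert np.allclose(x @ W + x @ A @ B, x @ D)  # same linear map, one matrix
```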

Karl