"LoRA: Low-Rank Adaptation of Large Language Models." Freezes pre-trained weights and injects trainable low-rank decomposition matrices into each Transformer layer, reducing trainable parameters by 10,000x and GPU memory by 3x compared to full fine-tuning with no inference latency penalty.

LoRA became the dominant parameter-efficient fine-tuning method, adopted by virtually every open-source model community (Alpaca-LoRA, QLoRA, etc.) and enabling fine-tuning of large models on consumer hardware. ICLR 2022. By Hu, Shen, Wallis et al. at Microsoft Research.

Paper

arXiv: 2106.09685

Venue: ICLR 2022

foundational, open-source