What is LoRA?
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that adapts large pre-trained models by injecting trainable low-rank decomposition matrices into transformer layers, dramatically reducing the number of trainable parameters while maintaining model performance.
Quick Facts
| Full Name | Low-Rank Adaptation |
|---|---|
| Created | 2021 by Microsoft Research |
| Paper | "LoRA: Low-Rank Adaptation of Large Language Models" (arXiv:2106.09685) |
How It Works
LoRA was introduced by Microsoft researchers in 2021 as a solution to the prohibitive cost of fine-tuning large language models. Instead of updating all model weights, LoRA freezes the pre-trained weights and injects a pair of trainable rank-decomposition matrices into selected layers: the weight update ΔW is represented as the product BA, where B and A are far smaller than the original weight matrix W. In the original paper, this reduced the trainable parameters for GPT-3 175B by roughly 10,000x and GPU memory requirements by about 3x, while achieving comparable or better quality than full fine-tuning. LoRA has since become one of the standard approaches for customizing large models.
Key Characteristics
- Reduces trainable parameters by orders of magnitude
- Maintains original model weights frozen during training
- Low-rank matrices capture task-specific adaptations
- Multiple LoRA adapters can be swapped at inference time
- Significantly lower GPU memory requirements
- No additional inference latency when merged
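The last point follows directly from the arithmetic of the update. A minimal NumPy sketch (hypothetical layer sizes, with random stand-ins for trained adapter values) shows that folding BA into the frozen weight gives numerically identical outputs from a single matmul:

```python
import numpy as np

# Hypothetical sizes for one adapted layer; r is the LoRA rank.
rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 64, 8, 16

W = rng.normal(size=(d_out, d_in))      # frozen pre-trained weight
A = rng.normal(size=(r, d_in)) * 0.1    # low-rank factor (stand-in for trained values)
B = rng.normal(size=(d_out, r)) * 0.1   # low-rank factor (stand-in for trained values)
x = rng.normal(size=(d_in,))
scale = alpha / r                       # standard LoRA scaling factor

# Serving with a separate adapter: base path plus low-rank path.
y_adapter = W @ x + scale * (B @ (A @ x))

# After merging: one ordinary matmul, so no extra inference latency.
W_merged = W + scale * (B @ A)
y_merged = W_merged @ x

assert np.allclose(y_adapter, y_merged)
print("trainable:", A.size + B.size, "vs full:", W.size)
```

Because the merge is exact, adapters can also be un-merged (subtract the scaled BA) to recover the original base weights.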
Common Use Cases
- Fine-tuning LLMs for specific domains or tasks
- Customizing image generation models like Stable Diffusion
- Adapting models for multiple tasks with separate adapters
- Training on consumer hardware with limited VRAM
- Creating personalized AI assistants
Example
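A self-contained toy sketch of the idea in plain NumPy (hypothetical sizes; real implementations wrap transformer projection layers rather than a bare matrix): the frozen weight W never changes, and gradient descent updates only the low-rank factors A and B so that W + BA imitates a shifted target map.

```python
import numpy as np

rng = np.random.default_rng(42)
d, r, lr, steps = 64, 4, 0.005, 600

W = rng.normal(size=(d, d))                      # frozen "pre-trained" weight
W_target = W + rng.normal(size=(d, d)) * 0.1     # behavior we want to imitate

A = rng.normal(size=(r, d))                      # trainable, random init
B = np.zeros((d, r))                             # trainable, zero init: BA starts at 0

X = rng.normal(size=(256, d))                    # toy input batch
Y = X @ W_target.T                               # desired outputs

init_loss = np.mean((X @ W.T - Y) ** 2)
for _ in range(steps):
    err = X @ (W + B @ A).T - Y
    # Gradient w.r.t. the effective weight; W itself stays frozen.
    grad = err.T @ X / len(X)
    B_grad, A_grad = grad @ A.T, B.T @ grad      # chain rule through B @ A
    B -= lr * B_grad
    A -= lr * A_grad

final_loss = np.mean((X @ (W + B @ A).T - Y) ** 2)
print(f"trainable: {A.size + B.size} vs full: {W.size}")
print(f"MSE: {init_loss:.4f} -> {final_loss:.4f}")
```

Here only 512 of 4,096 parameters are trained, yet the loss drops substantially; the rank-r factors capture the dominant directions of the weight shift, which is exactly the low-rank hypothesis behind LoRA.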
Frequently Asked Questions
What is the rank (r) parameter in LoRA and how do I choose it?
The rank parameter determines the size of the low-rank matrices injected into the model. Lower ranks (4-8) use less memory and train faster but may capture less task-specific information. Higher ranks (16-64) can capture more complex adaptations but require more resources. Start with r=16 for most tasks and adjust based on performance. Complex tasks may benefit from higher ranks.
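The memory trade-off is easy to see with back-of-the-envelope arithmetic. For a single adapted weight matrix (a hypothetical 4096x4096 attention projection, typical of ~7B-parameter models):

```python
# LoRA adds A (r x d_in) and B (d_out x r) per adapted matrix.
d_out, d_in = 4096, 4096
full_params = d_out * d_in

for r in (4, 8, 16, 64):
    lora_params = r * (d_in + d_out)
    print(f"r={r:2d}: {lora_params:>9,} trainable vs {full_params:,} "
          f"({100 * lora_params / full_params:.2f}%)")
```

Even at r=64 the adapter is a small fraction of the full matrix, which is why raising the rank is usually a cheap experiment.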
Can I combine multiple LoRA adapters for different tasks?
Yes, multiple LoRA adapters can be used together through techniques like LoRA merging or switching. You can train separate adapters for different tasks and swap them at inference time without reloading the base model. Some implementations support weighted combinations of multiple adapters for multi-task scenarios.
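Under the hood, a weighted combination is just a sum of scaled low-rank updates applied to the same frozen weight. A NumPy sketch with two random stand-in adapters and hypothetical mixing weights (libraries such as PEFT expose similar merging utilities; this only shows the arithmetic):

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 32, 4
W = rng.normal(size=(d, d))                 # shared frozen base weight

# Two independently "trained" adapters (random stand-ins here).
B1, A1 = rng.normal(size=(d, r)), rng.normal(size=(r, d))
B2, A2 = rng.normal(size=(d, r)), rng.normal(size=(r, d))

w1, w2 = 0.7, 0.3                           # hypothetical mixing weights
W_combined = W + w1 * (B1 @ A1) + w2 * (B2 @ A2)

x = rng.normal(size=(d,))
y = W_combined @ x                          # one forward pass through the blend
assert np.allclose(y, W @ x + w1 * (B1 @ (A1 @ x)) + w2 * (B2 @ (A2 @ x)))
```

Swapping tasks amounts to replacing which (B, A) pair is active; the base W is loaded once.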
How does LoRA compare to full fine-tuning in terms of performance?
LoRA typically achieves 90-100% of full fine-tuning performance while using only 0.1-1% of the trainable parameters. For many tasks, LoRA matches or exceeds full fine-tuning results. However, for highly specialized domains requiring significant knowledge adaptation, full fine-tuning may still have an edge. The performance gap narrows with higher rank values.
What is QLoRA and when should I use it?
QLoRA (Quantized LoRA) combines 4-bit quantization with LoRA, enabling fine-tuning of large models on consumer GPUs. It loads the base model in 4-bit precision while training LoRA adapters in higher precision. Use QLoRA when you have limited GPU memory (8-24GB) and want to fine-tune models with 7B+ parameters.
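A typical QLoRA setup with Hugging Face `transformers` and `peft` looks roughly like the following sketch (the model id and hyperparameters are placeholders, and exact APIs can shift across library versions):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # base weights stored in 4-bit
    bnb_4bit_quant_type="nf4",              # NF4 quantization from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # adapters/compute in higher precision
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "your-org/your-7b-model",               # placeholder model id
    quantization_config=bnb_config,
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],    # module names vary by architecture
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the adapters are trainable
```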
Which layers should I apply LoRA to for best results?
The most common approach is to apply LoRA to the attention layers, specifically the query (q_proj) and value (v_proj) projection matrices. For better performance, you can also include key projections (k_proj) and output projections (o_proj). Some tasks benefit from applying LoRA to MLP layers as well. Experiment with target_modules to find the optimal configuration for your use case.
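With Hugging Face `peft`, this layer selection is expressed through `target_modules`. A sketch for LLaMA-style module naming (the names are architecture-specific and may differ in your model; inspect `model.named_modules()` to find them):

```python
from peft import LoraConfig

# Attention-only targeting: the common starting point.
attn_only = LoraConfig(
    r=16, lora_alpha=32, task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],
)

# Wider coverage: all attention projections plus the MLP layers.
wider = LoraConfig(
    r=16, lora_alpha=32, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # LLaMA-style MLP names
)
```

Wider coverage trains more parameters and memory, so it is worth comparing both configurations on a held-out set before committing.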