Understanding LLM finetuning and QLoRA
# Introduction to Finetuning
Finetuning is a crucial technique in adapting pre-trained language models for specific tasks or domains. Here’s a visual representation of the finetuning process:
# Key Concepts
- Pre-trained models come with general knowledge and capabilities
- Finetuning adapts these capabilities to specific use cases
- The process requires significant computational resources
# Memory Challenges in Finetuning
The primary challenge in finetuning Large Language Models lies in their enormous memory requirements. Let’s break down the memory needs for a typical 13B parameter model:
Base Model Storage
- Each parameter: 32-bit floating-point (4 bytes)
- Total size: 13B × 4 bytes = 52GB
Training Requirements
- Model parameters: 52GB
- Gradients: 52GB
- Optimiser states: 104GB (two values per parameter, e.g. Adam's two moment estimates, 2 × 52GB)
- Total: 208GB of GPU memory, before counting activations
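The breakdown above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not a profiler: the function name `full_finetune_memory_gb` is hypothetical, it assumes decimal gigabytes (1 GB = 10^9 bytes), an Adam-style optimiser with two states per parameter, and it ignores activations and framework overhead.

```python
def full_finetune_memory_gb(n_params, bytes_per_param=4, optimiser_states=2):
    """Rough GPU memory (GB) for full finetuning, ignoring activations."""
    gb = 1e9  # decimal gigabytes, matching the figures in the text
    weights = n_params * bytes_per_param / gb
    gradients = weights                      # one gradient per parameter
    optimiser = optimiser_states * weights   # assumed: two states per parameter
    return {
        "weights": weights,
        "gradients": gradients,
        "optimiser": optimiser,
        "total": weights + gradients + optimiser,
    }


breakdown = full_finetune_memory_gb(13e9)
for name, size in breakdown.items():
    print(f"{name:>10}: {size:.0f} GB")
```

Changing `bytes_per_param` to 2 gives a rough picture of 16-bit training under the same assumptions.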
# Quantisation Overview
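Quantisation addresses these memory requirements by storing each weight in fewer bits, trading a small amount of precision for a much smaller footprint. Here’s a visualisation of quantisation and memory optimisation techniques: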
# Advanced Quantisation Techniques
# 1. 4-bit NormalFloat
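- Stores each weight in 4 bits instead of 32
- Places its 16 quantisation levels at quantiles of a normal distribution, matching how pre-trained weights are typically distributed
- Memory reduction: 87.5% relative to 32-bit, before the small overhead of per-block scaling constants
- Illustrative example: -0.765432 (32-bit) → -0.75 (4-bit); the actual NF4 levels are unevenly spaced, so the real rounded value differs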
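To make the idea of normally distributed quantisation levels concrete, here is a minimal sketch using only the Python standard library. It is a simplification and not the exact NF4 codebook: the levels are plain normal quantiles rescaled to [-1, 1], and the function names (`normal_quantile_levels`, `quantise_block`, `dequantise_block`) are hypothetical. Each block of weights is scaled by its absolute maximum before rounding to the nearest level.

```python
from statistics import NormalDist


def normal_quantile_levels(bits=4):
    """Return 2**bits levels at evenly spaced normal quantiles, scaled to [-1, 1]."""
    n = 2 ** bits
    dist = NormalDist()
    # Offset by 0.5 so the probabilities never reach 0 or 1.
    quantiles = [dist.inv_cdf((i + 0.5) / n) for i in range(n)]
    largest = max(abs(q) for q in quantiles)
    return [q / largest for q in quantiles]


def quantise_block(weights, levels):
    """Map each weight to the index of its nearest level after absmax scaling."""
    absmax = max(abs(w) for w in weights) or 1.0  # avoid division by zero
    codes = [
        min(range(len(levels)), key=lambda i: abs(levels[i] - w / absmax))
        for w in weights
    ]
    return codes, absmax


def dequantise_block(codes, absmax, levels):
    """Recover approximate weights from level indices and the block's scale."""
    return [levels[c] * absmax for c in codes]


levels = normal_quantile_levels()
block = [-0.765432, 0.12, 0.5, -0.03]
codes, absmax = quantise_block(block, levels)
restored = dequantise_block(codes, absmax, levels)
for original, approx in zip(block, restored):
    print(f"{original:+.6f} -> {approx:+.6f}")
```

Because the levels are denser near zero, small weights (the most common ones under a normal distribution) are reconstructed more accurately than they would be with evenly spaced levels.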
# 2. Double Quantisation
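- Two-step quantisation process:
  - Quantise the weights to 4-bit, storing one scaling factor per block of weights
  - Quantise the scaling factors themselves to a lower precision
- Saves roughly 0.37 bits per parameter with the block sizes used in the QLoRA paper, about 8% of the 4-bit model's footprint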
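The saving from double quantisation comes entirely from the scaling constants, which a short calculation makes visible. The sketch below assumes the settings reported in the QLoRA paper (one 32-bit constant per block of 64 weights, then 8-bit constants with a second-level block of 256); the function names are hypothetical and the result is an estimate of overhead only.

```python
def constant_overhead_bits(block_size=64, constant_bits=32):
    """Bits per weight spent on one scaling constant per block."""
    return constant_bits / block_size


def double_quant_overhead_bits(block_size=64, second_block_size=256,
                               quantised_constant_bits=8,
                               second_constant_bits=32):
    """Bits per weight once the scaling constants are themselves quantised."""
    first_level = quantised_constant_bits / block_size
    second_level = second_constant_bits / (block_size * second_block_size)
    return first_level + second_level


before = constant_overhead_bits()
after = double_quant_overhead_bits()
saved_bits = before - after
saved_gb_13b = 13e9 * saved_bits / 8 / 1e9  # bits -> bytes -> GB

print(f"overhead before: {before:.3f} bits/weight")
print(f"overhead after:  {after:.3f} bits/weight")
print(f"saved on a 13B model: {saved_gb_13b:.2f} GB")
```

Under these assumptions the overhead drops from 0.5 to about 0.127 bits per weight, which for a 13B model is a saving of roughly 0.6 GB.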
# 3. Paged Optimiser
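- Uses CPU RAM as overflow space for optimiser states, so short memory spikes do not crash training
- Relies on NVIDIA unified memory, which pages data between GPU and CPU automatically
- Step-by-step process:
  - Allocate the optimiser states in paged memory
  - When the GPU runs out of memory, evict optimiser-state pages to CPU RAM
  - When the optimiser update needs them, page the states back onto the GPU
  - Continue training without an out-of-memory error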
# 4. LoRA (Low-Rank Adaptation)
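- Freezes the pre-trained weight matrix and trains a low-rank update, expressed as the product of two small matrices
- Example:
  - Original: 1000×1000 matrix (1,000,000 parameters)
  - Low-rank update: 1000×8 and 8×1000 matrices (16,000 parameters)
  - Trainable-parameter reduction: 98.4%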
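The parameter count and the forward pass of a low-rank update can be sketched in plain Python. This is an illustration rather than a library implementation: the names `lora_param_counts`, `matvec` and `lora_forward` are hypothetical, matrices are nested lists, and the `scale` argument stands in for the scaling factor that LoRA implementations usually apply to the update.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, low_rank, reduction) parameter counts for one matrix."""
    full = d_out * d_in
    low_rank = d_out * rank + rank * d_in
    return full, low_rank, 1 - low_rank / full


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scale=1.0):
    """Compute y = W x + scale * B (A x); W stays frozen, A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


full, low_rank, reduction = lora_param_counts(1000, 1000, 8)
print(f"{full:,} -> {low_rank:,} trainable ({reduction:.1%} fewer)")

# Tiny rank-1 example: W is 2x2, A is 1x2, B is 2x1.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B = [[1.0], [2.0]]
y = lora_forward(W, A, B, [1.0, 2.0])
print(y)
```

If `B` starts as all zeros, the update term vanishes and the adapted layer initially behaves exactly like the frozen one.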
# Comparison of Finetuning Approaches
Here’s a visual comparison of different finetuning methods:
[Diagram: full finetuning (high memory usage, best performance, 200+ GB); LoRA (low-rank adaptation, few trainable parameters, moderate performance, 20-40 GB); QLoRA (4-bit quantisation + LoRA, minimal memory usage, near-SOTA performance, 10-20 GB).]
# Detailed Comparison
Full Finetuning
- Updates all model parameters
- Memory requirement: 200GB+ (13B model)
- Best possible performance
- Limited by GPU availability
LoRA
- Trains only rank decomposition matrices
- Memory requirement: 20-40GB
- Good performance
- More accessible hardware requirements
QLoRA
- Combines 4-bit quantisation with LoRA
- Memory requirement: 10-20GB
- Near-SOTA performance
- Can run on single consumer GPU
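A rough estimate shows where the gap between the three approaches comes from. The sketch below is built on stated assumptions, not measurements: a 16-bit frozen base model for LoRA, a 4-bit base for QLoRA, a hypothetical adapter size of 65 million parameters (0.5% of a 13B model), 32-bit adapter weights, and 12 bytes of gradient plus optimiser state per trainable parameter. Activations, quantisation constants and framework overhead are excluded, so real usage is higher.

```python
GB = 1e9  # decimal gigabytes


def estimate_gb(n_params, base_bits, adapter_params=0):
    """Estimate weights plus training state in GB, excluding activations."""
    base = n_params * base_bits / 8           # bytes for the base weights
    if adapter_params == 0:
        trainable = n_params                  # full finetuning trains everything
        adapter_weights = 0
    else:
        trainable = adapter_params            # only the adapters are trained
        adapter_weights = adapter_params * 4  # assumed 32-bit adapter weights
    # Assumed: 32-bit gradient plus two 32-bit optimiser states per trainable parameter.
    training_state = trainable * 12
    return (base + adapter_weights + training_state) / GB


n_params = 13e9
adapter_params = 65e6  # hypothetical adapter size, 0.5% of the model

estimates = {
    "Full finetuning": estimate_gb(n_params, base_bits=32),
    "LoRA": estimate_gb(n_params, base_bits=16, adapter_params=adapter_params),
    "QLoRA": estimate_gb(n_params, base_bits=4, adapter_params=adapter_params),
}
for method, size in estimates.items():
    print(f"{method:<16} {size:6.1f} GB")
```

Under these assumptions almost all of the saving comes from two places: not keeping gradients and optimiser states for the frozen weights, and shrinking the frozen weights themselves from 16 to 4 bits.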
# Practical Implications
Hardware Requirements
- Full Finetuning: Multiple high-end GPUs
- LoRA: Single high-end GPU
- QLoRA: Consumer-grade GPU (e.g., a 24GB RTX 4090)
Training Time
- Full Finetuning: Longest training time
- LoRA: Moderate training time
- QLoRA: Similar to LoRA, typically somewhat slower because the 4-bit weights are dequantised on the fly
Performance vs Resource Trade-off
- Full Finetuning: Highest performance, highest cost
- LoRA: Good performance, moderate cost
- QLoRA: Near-optimal performance, lowest cost
# Conclusion
QLoRA represents a significant advancement in making LLM finetuning more accessible while maintaining high performance. By combining quantisation techniques with LoRA, it enables training of large models on consumer hardware, democratising access to LLM customisation.