Understanding LLM finetuning and QLoRA

Last updated Dec 1, 2024


# Introduction to Finetuning

Finetuning is a crucial technique for adapting pre-trained language models to specific tasks or domains. Here’s a visual representation of the finetuning process:

```mermaid
graph TD
    A[Pre-trained LLM] --> B[General Knowledge]
    A --> C[Generic Capabilities]
    B --> D[Finetuning]
    C --> D
    D --> E[Task-specific Model]
    D --> F[Domain-specific Knowledge]
    subgraph Memory Requirements
        G[Full Model Parameters]
        H[GPU Memory]
        I[Training Data]
    end
    G --> J[Memory Challenge]
    H --> J
    I --> J
    style A fill:#f9f,stroke:#333
    style D fill:#bbf,stroke:#333
    style J fill:#f99,stroke:#333
```

# Key Concepts


# Memory Challenges in Finetuning

The primary challenge in finetuning large language models lies in their enormous memory requirements. Let’s break down the memory needs for a typical 13B-parameter model:

1. Base Model Storage
   - Each parameter: 32-bit floating point (4 bytes)
   - Total size: 13B × 4 bytes = 52GB
2. Training Requirements
   - Model parameters: 52GB
   - Gradients: 52GB
   - Optimiser states: 104GB (Adam keeps two states per parameter: momentum and variance)
   - Total: ~208GB of GPU memory (verified in the sketch below)
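
These numbers are easy to verify. Here is a minimal back-of-the-envelope sketch in Python, assuming fp32 weights and a standard Adam optimiser; real usage also needs memory for activations, so treat this as a lower bound:

```python
# Back-of-the-envelope memory arithmetic for full finetuning of a 13B model.
params = 13e9        # 13B parameters
bytes_fp32 = 4       # 32-bit floats

weights = params * bytes_fp32   # 52 GB of model parameters
gradients = weights             # 52 GB, one gradient per parameter
optimiser = 2 * weights         # 104 GB, Adam's momentum + variance states

total_gb = (weights + gradients + optimiser) / 1e9
print(f"~{total_gb:.0f} GB of GPU memory")  # ~208 GB
```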

# Quantisation Overview

Quantisation offers a solution to these memory challenges by storing parameters at lower precision. Here’s a visualisation of quantisation and memory-optimisation techniques:

```mermaid
graph TD
    A[Original Model Parameters] --> B[32-bit Floating Point]
    B --> C[Quantization]
    C --> D[4-bit NormalFloat]
    C --> E[Double Quantization]
    D --> F[Reduced Memory Usage]
    E --> F
    subgraph Memory Optimization
        G[Paged Optimizer]
        H[CPU Memory Offloading]
        I[GPU Memory Management]
    end
    F --> G
    G --> H
    G --> I
    style C fill:#bbf,stroke:#333
    style F fill:#9f9,stroke:#333
    style G fill:#ff9,stroke:#333
```

# Core Techniques Behind QLoRA

# 1. 4-bit NormalFloat

4-bit NormalFloat (NF4) is a data type designed for normally distributed values, which pre-trained network weights empirically are. Rather than spacing its 16 quantisation levels uniformly, NF4 places them at quantiles of a standard normal distribution, so each level is used equally often and quantisation error stays low for the typical weight distribution.

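As a toy illustration of the quantile idea (not the exact NF4 lookup table used by bitsandbytes, which is asymmetric so that zero is represented exactly), here is a sketch that builds 16 quantile-based levels and snaps weights to them:

```python
import numpy as np
from scipy.stats import norm

k = 16  # 4 bits => 16 quantisation levels
# Evenly spaced probabilities, offset so we avoid the infinite 0/1 quantiles.
probs = (np.arange(k) + 0.5) / k
levels = norm.ppf(probs)           # quantiles of N(0, 1)
levels /= np.abs(levels).max()     # normalise into [-1, 1]

def quantise(w):
    """Map each weight to the index of its nearest quantile level."""
    return np.abs(w[:, None] - levels[None, :]).argmin(axis=1)

w = np.random.randn(8)
idx = quantise(w)                  # 4-bit codes
print(levels[idx])                 # dequantised reconstruction of w
```
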
# 2. Double Quantisation

Double quantisation saves further memory by quantising the quantisation constants themselves. Block-wise quantisation stores one 32-bit scaling constant per block of weights (about 0.5 bits per parameter at block size 64); quantising those constants to 8 bits cuts the overhead to roughly 0.127 bits per parameter.

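With the Hugging Face transformers and bitsandbytes libraries, NF4 and double quantisation are both switched on through a single config object; a minimal sketch:

```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4 bits
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat data type
    bnb_4bit_use_double_quant=True,         # also quantise the quantisation constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantise to bf16 for matmuls
)
```
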
# 3. Paged Optimiser

Paged optimisers use NVIDIA unified memory to page optimiser states between CPU and GPU memory on demand, preventing out-of-memory crashes during the memory spikes that occur when processing mini-batches with long sequences.

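With the Hugging Face Trainer, switching to a paged optimiser is a one-line change; the other arguments shown here are illustrative:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="qlora-out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-4,
    optim="paged_adamw_8bit",  # paged AdamW backed by bitsandbytes
)
```
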
# 4. LoRA (Low-Rank Adaptation)

LoRA freezes the pre-trained weights and injects small trainable rank-decomposition matrices alongside selected weight matrices. Each update is factored as ΔW = BA, where B is d×r and A is r×k with rank r much smaller than d or k, so only a tiny fraction of the total parameters is ever trained.

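A typical LoRA configuration with the peft library looks like this; the rank and target modules are illustrative choices that vary by model and task:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                                 # rank of the update matrices
    lora_alpha=32,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
```
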
# Comparison of Finetuning Approaches

Here’s a visual comparison of different finetuning methods:

```mermaid
graph TD
    A[Finetuning Approaches]
    A --> B[Full Finetuning]
    A --> C[LoRA]
    A --> D[QLoRA]
    B --> E[Updates all parameters<br/>High memory usage<br/>Best performance]
    C --> F[Low-rank adaptation<br/>Few trainable parameters<br/>Moderate performance]
    D --> G[4-bit quantization + LoRA<br/>Minimal memory usage<br/>Near SOTA performance]
    subgraph Memory Requirements
        H[200+ GB]
        I[20-40 GB]
        J[10-20 GB]
    end
    E --> H
    F --> I
    G --> J
    style A fill:#f9f,stroke:#333
    style B fill:#bbf,stroke:#333
    style C fill:#bbf,stroke:#333
    style D fill:#bbf,stroke:#333
```

# Detailed Comparison

1. Full Finetuning
   - Updates all model parameters
   - Memory requirement: 200GB+ (13B model)
   - Best possible performance
   - Limited by GPU availability
2. LoRA
   - Trains only the rank-decomposition matrices
   - Memory requirement: 20-40GB
   - Good performance
   - More accessible hardware requirements
3. QLoRA
   - Combines 4-bit quantisation with LoRA (wired up in the sketch below)
   - Memory requirement: 10-20GB
   - Near-SOTA performance
   - Can run on a single consumer GPU
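
Putting the pieces together: given the bnb_config and lora_config sketches from the sections above, a QLoRA model is wired up in a few lines (the model name here is just an example):

```python
from transformers import AutoModelForCausalLM
from peft import get_peft_model, prepare_model_for_kbit_training

# bnb_config and lora_config are the sketches defined in the sections above.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",     # example 13B checkpoint
    quantization_config=bnb_config,  # NF4 + double quantisation
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # gradient checkpointing, dtype fixes
model = get_peft_model(model, lora_config)      # attach trainable LoRA adapters
model.print_trainable_parameters()              # typically well under 1% trainable
```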

# Practical Implications

1. Hardware Requirements
   - Full Finetuning: Multiple high-end GPUs
   - LoRA: Single high-end GPU
   - QLoRA: Consumer-grade GPU (e.g., RTX 4090)
2. Training Time
   - Full Finetuning: Longest training time
   - LoRA: Moderate training time
   - QLoRA: Similar to LoRA
3. Performance vs Resource Trade-off
   - Full Finetuning: Highest performance, highest cost
   - LoRA: Good performance, moderate cost
   - QLoRA: Near-optimal performance, lowest cost

# Conclusion

QLoRA represents a significant advancement in making LLM finetuning more accessible while maintaining high performance. By combining quantisation techniques with LoRA, it enables training of large models on consumer hardware, democratising access to LLM customisation.