Efficient Fine-Tuning with LoRA and Quantization

Part 2 showed fine-tuning in its simplest form: load pre-trained BERT (110M parameters), attach a classification head, and update every weight with a low learning rate. That works because BERT is small enough to fit comfortably in memory and train on a single GPU. Modern LLMs have 7 billion to 400 billion parameters. Fully fine-tuning a 7B model means holding the weights, the gradients, and the Adam optimizer states in memory at once: roughly 4× the size of the weights, or about 112 GB in FP32. Most practitioners do not have that kind of hardware.

This tutorial covers the techniques that make LLM training and fine-tuning practical on modest hardware: LoRA for parameter-efficient fine-tuning, quantization for reducing the memory footprint, and Flash Attention for making the attention computation itself faster. Together, these techniques let you fine-tune on a single consumer GPU a model that would otherwise require a cluster.

All code is in the companion repository under 04-training-and-efficiency/.

cd stanford-transformers-llms-labs
pip install -e ".[hf,training]"
# Optional, but required for the INT8 and NF4 quantization paths on CUDA:
# pip install bitsandbytes

Note

Library versions matter here

The PEFT, TRL, and bitsandbytes APIs have all churned over the past two years. The snippets in this tutorial were written against:

transformers >= 4.46

peft >= 0.13 (with the unified LoraConfig / get_peft_model flow)

bitsandbytes >= 0.43 (current BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4") API; the older load_in_4bit=True keyword argument on from_pretrained still works but is being phased out)

accelerate >= 0.34

If you are on older versions, expect renames (prepare_model_for_kbit_training was once prepare_model_for_int8_training); on newer versions, expect the keyword arguments to drift again. Pin the versions you test against in your pyproject.toml.
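One way to catch version drift early is to compare the installed packages against the minimums listed above. The sketch below uses only the standard library. The MINIMUMS table and the check_versions helper are illustrative names, not part of the companion repository, and the comparison looks only at the major and minor components.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum (major, minor) versions taken from the list above.
MINIMUMS = {
    "transformers": (4, 46),
    "peft": (0, 13),
    "bitsandbytes": (0, 43),
    "accelerate": (0, 34),
}


def major_minor(version_string):
    """Extract (major, minor) from strings such as '4.46.1' or '0.13.0.dev0'."""
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognized version: {version_string!r}")
    return int(match.group(1)), int(match.group(2))


def check_versions(minimums=MINIMUMS):
    """Return {package: status}, where status is 'ok', 'too old', or 'missing'."""
    report = {}
    for package, minimum in minimums.items():
        try:
            installed = major_minor(version(package))
        except PackageNotFoundError:
            report[package] = "missing"
            continue
        report[package] = "ok" if installed >= minimum else "too old"
    return report


if __name__ == "__main__":
    for package, status in check_versions().items():
        print(f"{package:>14}: {status}")
```

A package reported as "missing" is not necessarily a problem: bitsandbytes, for instance, is only needed for the INT8 and NF4 paths.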

Pre-training at scale

Before fine-tuning, there is pre-training: the phase that gives the model its general language understanding. Pre-training is conceptually simple: predict the next token on a massive dataset. The complexity is in scale.

Modern LLMs are pre-trained on trillions of tokens drawn from Common Crawl (filtered web text), Wikipedia, books, code repositories, and curated instruction datasets. GPT-3 was trained on roughly 300 billion tokens. Llama 2 used 2 trillion tokens. The compute cost is enormous: a widely cited third-party estimate put GPT-3 at about $4.6 million in compute alone.

Data quality matters more than quantity. Deduplication removes repeated web pages. Filtering removes low-quality or harmful content. Domain mixing ratios (how much code vs. books vs. web) determine what the model is good at. Training data contamination, where benchmark test sets accidentally leak into training data, can inflate evaluation scores.
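Exact deduplication, the simplest of these steps, can be sketched by hashing a normalized form of each document. This is only an illustration: the normalization rule and the function names are assumptions, and real pipelines typically add near-duplicate detection (for example MinHash) on top of exact matching.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies hash alike."""
    return " ".join(text.lower().split())


def deduplicate(documents):
    """Keep the first occurrence of each document, comparing normalized hashes."""
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


pages = [
    "The capital of France is Paris.",
    "the capital of  France is Paris.",   # same page, different casing and spacing
    "Berlin is the capital of Germany.",
]
unique_pages = deduplicate(pages)
print(len(pages), "->", len(unique_pages))
```

The same hashing idea can be applied to benchmark test items to look for exact contamination, although paraphrased leaks need fuzzier matching.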

The training_loss_visualizer.py script generates synthetic pre-training loss curves and visualizes the characteristic pattern: rapid initial decrease as the model learns basic patterns (common words, sentence structure), followed by slower improvement on harder linguistic structures (rare words, long-range dependencies, factual knowledge).
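The shape described above is often approximated by a power law in the training step plus an irreducible floor. The sketch below is an assumption about how such a synthetic curve could be generated, not the contents of training_loss_visualizer.py, and every constant in it is arbitrary.

```python
import random


def synthetic_loss(step, floor=1.8, scale=8.0, exponent=0.35):
    """Power-law decay toward a floor (all constants are illustrative)."""
    return floor + scale * (step + 1) ** (-exponent)


def noisy_curve(num_steps, noise=0.02, seed=0):
    """Add small Gaussian noise to mimic batch-to-batch variation."""
    rng = random.Random(seed)
    return [synthetic_loss(s) + rng.gauss(0.0, noise) for s in range(num_steps)]


curve = noisy_curve(10_000)

# Compare how much the noise-free loss falls early versus late in training.
early_drop = synthetic_loss(0) - synthetic_loss(100)
late_drop = synthetic_loss(5_000) - synthetic_loss(10_000)
print(f"loss drop over steps 0-100:      {early_drop:.3f}")
print(f"loss drop over steps 5000-10000: {late_drop:.3f}")
```

With these constants, the first hundred steps remove far more loss than the last five thousand, which is the "fast, then slow" pattern the plot shows.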

python 04-training-and-efficiency/training_loss_visualizer.py   # saves training_loss.png

This tutorial does not attempt pre-training: the cost is prohibitive. Instead, we start from a pre-trained checkpoint and focus on efficient adaptation.

The fine-tuning problem

After pre-training, the model needs to be adapted to specific tasks or behaviors. Part 2 demonstrated full fine-tuning with BERT: all 110M parameters update during training. For a 7B-parameter model, full fine-tuning requires:

Component	Memory (FP32)
Model weights	~28 GB
Gradients	~28 GB
Optimizer states (Adam)	~56 GB
Total	~112 GB

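Mixed-precision training does not help much. The weights and gradients used in the forward and backward passes shrink to FP16, but the optimizer still stores FP32 master weights, momentum, and variance: about 84 GB for a 7B model before the FP16 copies are counted. Either way, the total far exceeds the 24 GB of a consumer RTX 4090 and even the 80 GB of a single data-center A100.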
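The arithmetic behind these figures is easy to reproduce. The sketch below assumes Adam with two states per parameter and counts only weights, gradients, and optimizer states; activations are ignored. The function names are illustrative.

```python
GB = 1e9  # decimal gigabytes, matching the table above


def full_finetune_fp32(num_params):
    """Bytes per parameter: 4 (weights) + 4 (gradients) + 8 (Adam momentum and variance)."""
    return {
        "weights": 4 * num_params / GB,
        "gradients": 4 * num_params / GB,
        "adam_states": 8 * num_params / GB,
    }


def mixed_precision_fp32_side(num_params):
    """FP32 master weights, momentum, and variance: 4 + 4 + 4 bytes per parameter."""
    return 12 * num_params / GB


budget = full_finetune_fp32(7e9)
total_fp32 = sum(budget.values())
fp32_side = mixed_precision_fp32_side(7e9)

for name, gigabytes in budget.items():
    print(f"{name:>12}: {gigabytes:.0f} GB")
print(f"{'total':>12}: {total_fp32:.0f} GB")
print(f"FP32 state kept under mixed precision: {fp32_side:.0f} GB")
```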

The question is: do we really need to update all parameters? The answer from research is no: the weight updates during fine-tuning occupy a much lower-dimensional subspace than the full parameter space. This is the insight behind LoRA.

LoRA: Low-Rank Adaptation

The lora_finetune.py script demonstrates LoRA end-to-end. The key idea, from Hu et al. (2021): instead of updating all parameters in a weight matrix W, decompose the update into two small matrices:

$$W + \Delta W = W + BA$$

where B is (d × r) and A is (r × d) with r ≪ d. For a weight matrix with d = 4096, full fine-tuning updates d² = 16.7M parameters. With LoRA at rank r = 8, we update only 2 × d × r = 65,536 parameters: a 256× reduction.
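The same count can be checked for a few ranks. For a square matrix the reduction works out to d / (2r), so it halves each time the rank doubles. The helper name below is illustrative.

```python
def lora_param_counts(d, r):
    """Return (full, lora, reduction) for a square d x d weight matrix."""
    full = d * d          # every entry of W is trainable
    lora = 2 * d * r      # B is d x r, A is r x d
    return full, lora, full / lora


for rank in (4, 8, 16, 64):
    full, lora, reduction = lora_param_counts(4096, rank)
    print(f"r={rank:>2}: full={full:,}  lora={lora:,}  reduction={reduction:.0f}x")
```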

The pre-trained weights W are frozen. Only the low-rank matrices A and B are trainable. During inference, the LoRA update can be merged back into the original weights (W’ = W + BA) with no additional latency.
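The merge claim can be checked numerically on a toy example. The sketch below uses plain Python lists and tiny dimensions so that it runs anywhere; the helper functions are illustrative and the scaling factor is left out. Real LoRA initializes B to zero so that training starts from the unmodified model; random values are used here only to make the comparison non-trivial.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, r = 6, 2                      # toy sizes; real layers use d in the thousands
W = random_matrix(d, d, rng)     # frozen pre-trained weight
B = random_matrix(d, r, rng)     # trainable, d x r
A = random_matrix(r, d, rng)     # trainable, r x d
x = random_matrix(d, 1, rng)     # one input column vector

# Training-time path: frozen branch plus low-rank branch.
adapter_out = add(matmul(W, x), matmul(B, matmul(A, x)))

# Inference-time path: fold the update into a single matrix once.
W_merged = add(W, matmul(B, A))
merged_out = matmul(W_merged, x)

max_diff = max(abs(a[0] - m[0]) for a, m in zip(adapter_out, merged_out))
print(f"max difference between the two paths: {max_diff:.2e}")
```

The two paths agree up to floating-point rounding, which is why a merged model carries no extra inference cost.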

python 04-training-and-efficiency/lora_finetune.py   # ~10 min on GPU, ~30 min on CPU

The script fine-tunes distilgpt2 (82M parameters) on a synthetic instruction-following dataset about capital cities. The LoRA configuration:

from peft import LoraConfig, TaskType, get_peft_model

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                    # Rank — controls capacity vs efficiency
    lora_alpha=16,          # Scaling factor
    lora_dropout=0.1,
    target_modules=["c_attn", "c_proj"],  # Which layers get LoRA
)
model = get_peft_model(model, lora_config)

What the parameters mean

Rank (r) controls the tradeoff between expressiveness and efficiency. Higher rank means more trainable parameters and more capacity to capture task-specific patterns. Lower rank means fewer parameters and faster training. For most instruction-tuning tasks, r = 4–16 works well. Research has shown diminishing returns beyond r = 64.

lora_alpha is a scaling factor applied to the LoRA update: the effective update is (lora_alpha / r) × BA. A common heuristic is lora_alpha = 2 × r. This scaling prevents the LoRA update from being too large or too small relative to the frozen weights.
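A few values make the effect of this ratio concrete. The sketch below only evaluates lora_alpha / r as described above; the function name is illustrative.

```python
def lora_scaling(lora_alpha, r):
    """Multiplier applied to BA, as described above."""
    return lora_alpha / r


ranks = (4, 8, 16, 32)

# Fixed alpha: the multiplier shrinks as the rank grows.
fixed_alpha = {r: lora_scaling(16, r) for r in ranks}

# alpha = 2 * r: the multiplier stays the same for every rank.
tied_alpha = {r: lora_scaling(2 * r, r) for r in ranks}

print("alpha = 16 :", fixed_alpha)
print("alpha = 2r :", tied_alpha)
```

Tying lora_alpha to the rank means a rank sweep changes capacity without also changing the effective size of the update, which makes experiments easier to compare.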

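target_modules specifies which weight matrices receive LoRA adapters. Typically these are the attention projection matrices (Q, K, V, and output). Some configurations also target the feed-forward layers. In GPT-2-style models such as distilgpt2, c_attn is the fused Q/K/V projection, and c_proj is the name of both the attention output projection and the feed-forward output projection, so the configuration above adapts all three.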
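When target_modules is a list, PEFT selects modules whose dotted name equals an entry or ends with it. The sketch below imitates that rule on a hand-written subset of GPT-2-style module names. Both the matching helper and the name list are simplifications for illustration, not PEFT's actual code.

```python
def matches_target(module_name, target_modules):
    """Simplified suffix rule: exact match, or the name ends with '.<target>'."""
    return any(
        module_name == target or module_name.endswith("." + target)
        for target in target_modules
    )


# Hand-written subset of module names from one GPT-2-style block (illustrative).
module_names = [
    "transformer.h.0.attn.c_attn",   # fused Q, K, V projection
    "transformer.h.0.attn.c_proj",   # attention output projection
    "transformer.h.0.mlp.c_fc",      # feed-forward up-projection
    "transformer.h.0.mlp.c_proj",    # feed-forward down-projection
]

selected = [name for name in module_names if matches_target(name, ["c_attn", "c_proj"])]
for name in selected:
    print(name)
```

On a real model, printing the names from model.named_modules() before choosing target_modules is a quick way to see what a given entry will match.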

The script reports the parameter comparison: only a small fraction of parameters are trainable compared with full fine-tuning. After training, it runs inference on test prompts to show the fine-tuned model’s behavior.

Tip

LoRA in practice

LoRA adapters are small files (from a few MB to tens of MB, depending on the base model and rank) that sit alongside the frozen base model. You can train multiple LoRA adapters for different tasks and swap them at runtime. This is how services like Replicate and Together AI offer fine-tuned model variants without duplicating the full base model for each user.

Quantization: reducing numerical precision

Quantization reduces the precision of model weights from 32-bit floats to smaller representations. The quantization_comparison.py script compares four precision levels:

python 04-training-and-efficiency/quantization_comparison.py

Precision	Bits per weight	Memory for 7B model	Notes
FP32	32	~28 GB	Full precision baseline
FP16	16	~14 GB	Half precision, ~2× savings
INT8	8	~7 GB	4× savings, requires bitsandbytes + CUDA
NF4	4	~3.5 GB	8× savings, QLoRA-style quantization
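The memory column follows directly from the bit widths. The sketch below counts the weights alone; real quantized checkpoints also store per-block scale factors, so actual footprints are slightly larger. The names are illustrative.

```python
BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "INT8": 8, "NF4": 4}


def weight_memory_gb(num_params, bits):
    """Memory for the weights alone, in decimal GB (ignores quantization metadata)."""
    return num_params * bits / 8 / 1e9


footprint = {name: weight_memory_gb(7e9, bits) for name, bits in BITS_PER_WEIGHT.items()}
for name, gigabytes in footprint.items():
    savings = footprint["FP32"] / gigabytes
    print(f"{name}: {gigabytes:.1f} GB ({savings:.0f}x smaller than FP32)")
```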

Why quantization works

Neural network weights are approximately normally distributed; most values cluster near zero with a long tail. We can map these values onto a smaller set of discrete levels with minimal information loss because the high-precision differences between nearby weight values contribute little to the model’s output.
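A round trip through the simplest scheme, absmax quantization to 8-bit integers, shows how small the reconstruction error is for bell-shaped weights. This is a sketch on synthetic data with illustrative function names; libraries such as bitsandbytes use more elaborate schemes.

```python
import random


def quantize_absmax_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the shared scale."""
    return [q * scale for q in quantized]


rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(4096)]   # synthetic, clustered near zero

quantized, scale = quantize_absmax_int8(weights)
restored = dequantize(quantized, scale)

mean_abs_error = sum(abs(w - r) for w, r in zip(weights, restored)) / len(weights)
mean_abs_weight = sum(abs(w) for w in weights) / len(weights)
print(f"mean |weight| = {mean_abs_weight:.5f}")
print(f"mean |error|  = {mean_abs_error:.5f}")
```

Each restored value is within half a quantization step of the original, so the average error is a small fraction of a typical weight's magnitude.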

NF4: NormalFloat 4-bit

NF4, introduced as part of QLoRA (Dettmers et al., 2023), is particularly clever. Standard INT4 quantization spaces its 16 levels uniformly across the weight range. NF4 instead places its levels according to the normal distribution’s density: more levels where weights are common (near zero) and fewer where they are rare (in the tails). This means most weights map to a nearby quantization level, minimizing the reconstruction error.
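The benefit of density-aware spacing can be seen with a simplified stand-in for NF4. The sketch below places 16 levels at evenly spaced quantiles of a standard normal and compares the rounding error against 16 uniform levels on synthetic Gaussian weights. It is not the actual NF4 codebook, which is asymmetric, contains an exact zero, and is applied per block of weights; the function names are illustrative.

```python
import random
from statistics import NormalDist


def uniform_levels(k=16):
    """k evenly spaced levels on [-1, 1], as in plain integer quantization."""
    return [-1.0 + 2.0 * i / (k - 1) for i in range(k)]


def normal_quantile_levels(k=16):
    """k levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    normal = NormalDist()
    raw = [normal.inv_cdf((i + 0.5) / k) for i in range(k)]
    return [z / raw[-1] for z in raw]


def mean_abs_error(values, levels):
    """Round each value to its nearest level and average the absolute error."""
    return sum(min(abs(v - q) for q in levels) for v in values) / len(values)


rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
absmax = max(abs(w) for w in weights)
normalized = [w / absmax for w in weights]      # now inside [-1, 1]

err_uniform = mean_abs_error(normalized, uniform_levels())
err_quantile = mean_abs_error(normalized, normal_quantile_levels())
print(f"uniform levels:  {err_uniform:.4f}")
print(f"quantile levels: {err_quantile:.4f}")
```

The quantile-based levels give a lower average error because they spend most of their resolution where most of the weights are.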

The script uses the BitsAndBytesConfig from the transformers library:

import torch
from transformers import BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

The script compares inference speed, memory usage, and output quality across precision levels. On CPU, only FP32 is fully supported. The FP16/INT8/NF4 comparisons are most meaningful on CUDA hardware.

QLoRA: combining quantization and LoRA

QLoRA (Quantized LoRA) loads the base model in NF4 (~3.5 GB for a 7B model), freezes it, and attaches LoRA adapters that train in 16-bit precision (the QLoRA paper uses BF16). The base model contributes near-zero memory cost for gradients and optimizer states because it is frozen. Only the tiny LoRA matrices need optimizer states. This makes it possible to fine-tune a 7B model on a single 24 GB GPU, or even a 65B model on a single 48 GB GPU.
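A rough budget shows why the frozen base dominates. The sketch below assumes a Llama-7B-like shape (32 layers, hidden size 4096), adapters on two square projections per layer, and adapter weights, gradients, and Adam states all kept in FP32 as a conservative upper bound. All of these are assumptions for illustration, and activations are not counted.

```python
def qlora_static_memory(
    base_params=7e9,
    num_layers=32,          # assumption: Llama-7B-like depth
    hidden=4096,            # assumption: square d x d projections
    adapted_per_layer=2,    # assumption: two adapted matrices per layer
    rank=8,
):
    """Weights, gradients, and Adam states only; activations are not counted."""
    base_gb = base_params * 0.5 / 1e9      # NF4: 4 bits = 0.5 bytes per weight
    lora_params = num_layers * adapted_per_layer * 2 * hidden * rank
    # FP32 adapter weights (4) + gradients (4) + Adam states (8) = 16 bytes each.
    lora_gb = lora_params * 16 / 1e9
    return base_gb, lora_params, lora_gb


base_gb, lora_params, lora_gb = qlora_static_memory()
print(f"frozen NF4 base:        {base_gb:.2f} GB")
print(f"trainable LoRA params:  {lora_params:,}")
print(f"adapter training state: {lora_gb:.3f} GB")
```

Under these assumptions the adapter's entire training state is well under 0.1 GB, so the remaining GPU memory goes to activations, which is where batch size, sequence length, and gradient checkpointing matter.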

Note

Quantization tradeoffs

Quantization is not free. INT8 and NF4 models can produce slightly different outputs than FP32, especially on tasks requiring precise numerical computation. The quality degradation is usually small, often imperceptible in practice, but it is not zero. Always validate quantized model quality on your specific task.

Flash Attention: faster attention computation

Standard attention materializes the full n × n attention matrix in GPU High Bandwidth Memory (HBM). For a 4,096-token sequence, that is a 4096 × 4096 matrix of FP32 values, about 64 MB per attention head per layer. At 32 heads and 32 layers, the attention matrices alone come to about 64 GB per sequence if all of them are kept for the backward pass. This is the O(n²) memory bottleneck of transformers.
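These numbers, and the quadratic growth behind them, can be checked in a few lines. The sketch counts one FP32 score matrix per head and per layer for a single sequence, in binary units; the function name is illustrative.

```python
def attention_matrix_bytes(seq_len, bytes_per_value=4):
    """One n x n score matrix for a single head, single layer, single sequence."""
    return seq_len * seq_len * bytes_per_value


MIB = 2**20
GIB = 2**30

per_head = attention_matrix_bytes(4096)
all_heads_layers = per_head * 32 * 32
print(f"per head, per layer:  {per_head / MIB:.0f} MiB")
print(f"32 heads x 32 layers: {all_heads_layers / GIB:.0f} GiB")

# Quadratic growth: doubling the sequence length quadruples the matrix.
ratio = attention_matrix_bytes(8192) / attention_matrix_bytes(4096)
print(f"8192 vs 4096 tokens:  {ratio:.0f}x")
```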

Flash Attention (Dao et al., 2022) reorganizes the computation to never materialize the full attention matrix. Instead, it tiles the computation into blocks that fit in GPU SRAM (the fast on-chip memory: about 20 MB in total on an A100, split into roughly 192 KB per streaming multiprocessor) and accumulates results incrementally. The key insight: the softmax normalization can be computed in a streaming fashion using the online softmax trick, so each tile’s contribution can be added to a running sum without needing the full matrix.
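The online softmax trick can be shown for a single query in plain Python. The sketch below processes the scores block by block, keeping only a running maximum, a running normalizer, and a running weighted sum, and then compares the result with the ordinary all-at-once computation. It uses scalar values for brevity and illustrative function names; the real kernel does the same bookkeeping on tiles of queries, keys, and value vectors inside a fused GPU kernel.

```python
import math


def attention_naive(scores, values):
    """Softmax over all scores at once, then a weighted sum of the values."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def attention_streaming(scores, values, block_size=4):
    """Process one block at a time, keeping only a running max, sum, and output."""
    running_max = float("-inf")
    running_sum = 0.0      # sum of exp(score - running_max) seen so far
    running_out = 0.0      # sum of exp(score - running_max) * value seen so far
    for start in range(0, len(scores), block_size):
        block_s = scores[start:start + block_size]
        block_v = values[start:start + block_size]
        new_max = max(running_max, max(block_s))
        # Rescale what was accumulated under the old max to the new max.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        running_out *= correction
        for s, v in zip(block_s, block_v):
            w = math.exp(s - new_max)
            running_sum += w
            running_out += w * v
        running_max = new_max
    return running_out / running_sum


scores = [0.3, 2.1, -1.0, 4.2, 0.0, 3.3, -2.5, 1.7, 5.0, 0.9]
values = [1.0, -2.0, 0.5, 3.0, 4.0, -1.0, 2.0, 0.0, 1.5, -0.5]
diff = abs(attention_naive(scores, values) - attention_streaming(scores, values))
print(f"difference between naive and streaming results: {diff:.2e}")
```

The streaming version never holds more than one block of scores at a time, yet produces the same output up to floating-point rounding.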

The result: peak memory during attention drops from O(n²) to O(n). The algorithm still produces an O(n) output of attention values; it just never has to hold the full n × n score matrix in HBM at once. Wall-clock time improves by 2-4× for long sequences (FlashAttention-2 typically lands around that range; FlashAttention-3 on Hopper-class GPUs can push further), because far fewer reads and writes go through HBM, whose bandwidth is roughly an order of magnitude lower than that of on-chip SRAM.

python 04-training-and-efficiency/flash_attention_bench.py   # requires CUDA GPU

The script benchmarks standard vs Flash Attention at several sequence lengths. The speedup grows with sequence length: at short sequences, the overhead of tiling is not worth it, but at 2048+ tokens the gains are substantial. On CPU-only machines, the script prints an explanation of the mechanism without running the actual benchmark.

Flash Attention is now the default in most training frameworks and is supported natively in PyTorch 2.0+ via torch.nn.functional.scaled_dot_product_attention.

How the pieces compose

These techniques work together:

Technique	What it reduces	When to use
LoRA	Trainable parameters	Fine-tuning on modest hardware
Quantization	Model memory footprint	Inference or QLoRA training
Flash Attention	Attention memory and latency	Any training or inference with long sequences
QLoRA (combined)	Both memory and parameters	Fine-tuning large models on consumer GPUs

A practical QLoRA setup: load a 7B model in NF4 (~3.5 GB), attach LoRA adapters at rank 8 (a few million trainable parameters, on the order of 10 MB), use Flash Attention for the forward pass, and train with gradient checkpointing. Total GPU memory is on the order of 8 GB, depending on batch size and sequence length. Before QLoRA, fine-tuning a model of this size took multiple data-center GPUs. Now it runs on a laptop with a decent GPU.

What to notice

LoRA rank has diminishing returns. Going from r=4 to r=8 is often meaningful. Going from r=32 to r=64 rarely helps. The sweet spot depends on the task and dataset, but r=8–16 is a good starting point.
Quantization memory savings are real and dramatic. NF4 cuts weight memory by 8× relative to FP32 with surprisingly little quality loss. The trick is that weight distributions are well-suited to non-uniform quantization.
Flash Attention’s advantage grows with sequence length. At 512 tokens, the speedup is modest. At 4096+, it is essential. This is why it became the default so quickly.
The loss curve reveals what the model learned when. Basic patterns (common words, simple grammar) are learned in the first few percent of training. Harder patterns (rare words, factual knowledge, long-range dependencies) take much longer. This is why pre-training at scale is so expensive: the easy stuff is cheap, but the last 20% of quality requires 80% of the compute.
Parameter efficiency enables iteration speed. The practical value of LoRA is not just memory savings; it is that you can try 10 fine-tuning experiments in the time it would take to run one full fine-tune. Faster iteration means better models.

Next

Part 5: Preference Tuning with DPO moves beyond supervised fine-tuning to alignment. You will train a reward model on human preferences, use DPO to optimize directly from preference pairs without reinforcement learning, and see why the three-stage pipeline (pre-train → SFT → preference tune) has become the standard recipe.
