BERT Fine-Tuning and Position Embeddings

Part 1 built the transformer’s core components from scratch in NumPy. This tutorial moves to real models. BERT is the canonical encoder-only transformer; it processes the full input bidirectionally and remains the foundation for classification, entity recognition, and question answering in production. Understanding BERT means understanding the pre-train-then-fine-tune paradigm that changed how models are built.

This tutorial also picks up where Part 1’s positional encoding section left off. Lecture 2 spends significant time on why position embeddings evolved from the original sinusoidal formula through learned embeddings to modern approaches like RoPE, and how the tradeoffs between them matter.

All code is in the companion repository under 02-bert-and-encoders/. You need the HuggingFace dependencies:

cd stanford-transformers-llms-labs
pip install -e ".[hf]"

Position embeddings: learned vs sinusoidal vs modern

Part 1 implemented sinusoidal positional encodings and showed that their dot product is a function of relative distance: nearby tokens get similar encodings, distant tokens get dissimilar ones. But the original transformer added these encodings to the input embeddings, which means the positional signal is indirect by the time it reaches the attention layers where similarity actually matters.

Lecture 2 traces the evolution of position embeddings through three generations.

Learned position embeddings

BERT and GPT-2 use learned position embeddings: a trainable matrix of shape (max_positions, hidden_dim) where each position gets its own vector, optimized via gradient descent alongside everything else.

The advantage is flexibility: the model can learn whatever positional patterns the data requires. The limitations are:

Bounded by training data. You can only learn embeddings for positions you saw during training. BERT was trained with a maximum sequence length of 512, so position 513 has no embedding.
Data-dependent biases. If your training data has systematic patterns at certain positions (e.g., certain tokens always appear early in documents), the learned embeddings absorb those biases.
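The length bound is easy to see if the embedding table is modelled as a plain lookup. The following is a minimal sketch in dependency-free Python, not BERT's actual code: the hidden size is a toy value, the rows are random stand-ins for trained vectors, and `position_vector` is a hypothetical helper name.

```python
import random

MAX_POSITIONS = 512   # BERT's maximum sequence length
HIDDEN_DIM = 8        # toy size for illustration; BERT-base uses 768

rng = random.Random(0)
# One row per position. In a real model these rows are trained by gradient
# descent; here they are random stand-ins.
position_table = [
    [rng.gauss(0.0, 0.02) for _ in range(HIDDEN_DIM)]
    for _ in range(MAX_POSITIONS)
]


def position_vector(pos: int) -> list[float]:
    """Look up the embedding for a position; fails past the end of the table."""
    if not 0 <= pos < MAX_POSITIONS:
        raise IndexError(f"no learned embedding for position {pos}")
    return position_table[pos]


print(len(position_vector(511)))   # last valid position
try:
    position_vector(512)           # one past the end
except IndexError as err:
    print("lookup failed:", err)
```

Nothing in the table can be interpolated or extended without further training, which is the practical meaning of "bounded by training data".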

Sinusoidal position encodings

The original transformer’s sinusoidal encodings (Part 1) avoid both problems: they are deterministic, so they carry no data-dependent biases, and they are defined for any sequence length. In practice, though, quality degrades once inputs run past the lengths the model actually saw during training; the often-quoted “they generalize to longer sequences” is a property of the formula, not of model behaviour. Length-extrapolation studies (the Press et al. ALiBi paper, the Su et al. RoPE paper, and follow-on work on YaRN) report a loss in quality as sequences extend beyond the training length. Sinusoidal encodings are also added at the input layer rather than at the attention computation, where positional similarity actually needs to be expressed.
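Both properties of the formula can be checked in a few lines. This is a minimal sketch in plain Python, assuming the standard sin/cos layout from the original transformer paper; the function names and the 64-dimensional size are illustrative choices. It says nothing about how a trained model behaves at unseen lengths.

```python
import math


def sinusoidal_encoding(pos: int, dim: int = 64, base: float = 10000.0) -> list[float]:
    """Encoding for one position: sin/cos pairs at geometrically spaced frequencies."""
    vec = []
    for i in range(dim // 2):
        freq = base ** (-2 * i / dim)
        vec.append(math.sin(pos * freq))
        vec.append(math.cos(pos * freq))
    return vec


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# Defined for any position, including ones far beyond a 512-token training length.
far = sinusoidal_encoding(100_000)

# The dot product depends only on the offset (here 4), not on absolute position.
near_pair = dot(sinusoidal_encoding(3), sinusoidal_encoding(7))
far_pair = dot(sinusoidal_encoding(1003), sinusoidal_encoding(1007))
print(round(near_pair, 6), round(far_pair, 6))
```

The two printed values agree because each sin/cos pair contributes $\cos((a - b)\,\omega_i)$ to the dot product, which contains only the offset $a - b$.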

Why the field moved to RoPE

The lecture makes a key point: what we actually care about is the similarity computation inside the attention layer: the $QK^T$ dot product. Adding positional information at the input is indirect. What if we could inject positional information directly into the attention scores?

This is the motivation behind Rotary Position Embeddings (RoPE), used by Llama, Mistral, and most modern LLMs. RoPE applies a frequency-dependent 2D rotation to consecutive pairs of dimensions in each query and key vector. This is not a single rotation by one angle but a block-diagonal rotation matrix in which the $i$-th 2D block uses its own angular frequency $\theta_i = 10000^{-2i/d}$. The position scales the angle, so a query $q$ at position $m$ becomes $R(m\theta)\,q$ and a key $k$ at position $n$ becomes $R(n\theta)\,k$. Because rotations compose by adding angles, $R(m\theta)^\top R(n\theta) = R((n - m)\theta)$, and the dot product $\langle R(m\theta) q, R(n\theta) k\rangle = q^\top R((n - m)\theta)\,k$ depends only on $m - n$, the relative distance between the two positions. The rotation is applied inside the attention mechanism, not at the input, so positional information is exactly where it needs to be.
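The relative-position property can be verified numerically ahead of the fuller treatment later in the series. This is a minimal sketch in plain Python under the definition above (consecutive pairs, $\theta_i = 10000^{-2i/d}$); the helper names `rope` and `score` and the example vectors are illustrative, and production implementations are vectorized and may pair dimensions differently.

```python
import math


def rope(vec: list[float], pos: int, base: float = 10000.0) -> list[float]:
    """Rotate each consecutive pair (vec[2i], vec[2i+1]) by the angle pos * theta_i."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def score(q: list[float], k: list[float], m: int, n: int) -> float:
    """Attention logit between a query at position m and a key at position n."""
    return sum(a * b for a, b in zip(rope(q, m), rope(k, n)))


q = [0.3, -1.2, 0.8, 0.5, -0.7, 1.1, 0.2, -0.4]
k = [1.0, 0.4, -0.6, 0.9, 0.1, -0.3, 0.7, 0.5]

# Same offset (m - n = 3) at very different absolute positions.
print(score(q, k, 5, 2), score(q, k, 105, 102))
# A different offset (m - n = 2) gives a different score.
print(score(q, k, 5, 3))
```

The first two values match up to floating-point error, while the third differs, which is the behaviour the rotation identity predicts.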

RoPE is covered in detail in Part 9 of this series, where we implement it from scratch. For now, the takeaway is the progression:

Method	Where applied	Generalizes to unseen lengths?	Used by
Sinusoidal	Added to input	In principle (defined for any length, but quality degrades past the training length)	Original Transformer
Learned	Added to input	No (fixed max length)	BERT, GPT-2
RoPE	Inside attention (Q, K)	Yes (with some degradation)	Llama, Mistral, most modern LLMs

Comparing learned and sinusoidal visually

The position_embeddings.py script extracts BERT’s learned position embeddings, generates matching sinusoidal embeddings, and plots both side by side.

python 02-bert-and-encoders/position_embeddings.py

This saves a figure with four subplots to position_embeddings.png:

BERT’s learned embeddings (heatmap): smooth, gradually changing patterns. The model has learned a structured positional code during pre-training: not random noise, but organized patterns that encode position.
Sinusoidal embeddings (heatmap): clear periodic banding. Low-index dimensions oscillate rapidly; high-index dimensions change slowly. This multi-scale encoding gives each position a unique signature.
Learned similarity matrix (cosine similarity between position pairs): strong diagonal bands confirm that nearby positions have similar embeddings. Similarity falls off smoothly with distance, with some long-range structure the model discovered on its own.
Sinusoidal similarity matrix: similarity is strictly a function of relative distance, producing the symmetric banding pattern predicted by the trigonometric identities from Part 1.

Both strategies produce the critical locality property: nearby positions are more similar than distant ones. They achieve it differently, one through optimization, the other through mathematics, but the resulting structure is remarkably similar.

BERT: the encoder-only transformer

The original transformer has both an encoder and a decoder. BERT (Devlin et al., 2018) takes just the encoder side and asks: what if we pre-trained a deep bidirectional model on a massive unlabeled corpus, then fine-tuned it on small labeled datasets for specific tasks?

This was the key insight. Before BERT, getting good results on NLP tasks required task-specific architectures and large amounts of labeled data. BERT demonstrated that a single pre-trained model could be adapted to dozens of tasks with minimal modification.

Architecture

BERT-base has 12 transformer encoder layers, 12 attention heads, and a hidden dimension of 768, roughly 110 million parameters. It uses three types of embeddings that are summed together for each token:

Token embeddings: a learned lookup table mapping each WordPiece token ID to a 768-dimensional vector. WordPiece is a subword tokenizer similar to BPE (Part 1) but uses a likelihood-based merge criterion instead of frequency. BERT’s vocabulary is roughly 30,000 tokens.
Position embeddings: a learned embedding for each position (0 through 511), as discussed above.
Segment embeddings: a learned embedding indicating whether a token belongs to “sentence A” or “sentence B.” This supports the next sentence prediction task described below. Every token in the first sentence gets the same segment A vector; every token in the second sentence gets segment B.
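The "roughly 110 million" figure can be reconstructed from the shapes above. The arithmetic below is a sketch under stated assumptions: the vocabulary size (30,522) and feed-forward width (3,072) come from the published bert-base-uncased configuration rather than from this tutorial, and the layout assumes a LayerNorm after the embeddings and after each sublayer plus a pooling layer on top.

```python
# Assumed values from the published bert-base-uncased configuration;
# the text above only states "roughly 30,000" for the vocabulary.
VOCAB, MAX_POS, SEGMENTS = 30_522, 512, 2
HIDDEN, FFN, LAYERS = 768, 3_072, 12


def linear(n_in: int, n_out: int) -> int:
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


layer_norm = 2 * HIDDEN  # scale and shift vectors

# Token, position, and segment tables, followed by one LayerNorm.
embeddings = (VOCAB + MAX_POS + SEGMENTS) * HIDDEN + layer_norm

encoder_layer = (
    4 * linear(HIDDEN, HIDDEN)    # query, key, value, and output projections
    + layer_norm
    + linear(HIDDEN, FFN)         # feed-forward expansion
    + linear(FFN, HIDDEN)         # feed-forward contraction
    + layer_norm
)
pooler = linear(HIDDEN, HIDDEN)   # dense layer applied to the [CLS] output

total = embeddings + LAYERS * encoder_layer + pooler
print(f"embeddings:    {embeddings:>12,}")
print(f"encoder stack: {LAYERS * encoder_layer:>12,}")
print(f"total:         {total:>12,}")
```

Under these assumptions the total comes to 109,482,240. The number of attention heads does not appear in the count, because the heads split the same 768-dimensional projections rather than adding new ones.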

Two special tokens structure the input:

[CLS] (classification): prepended to every input. After passing through all encoder layers, the [CLS] token’s output embedding aggregates information from the entire sequence through attention. This single vector becomes the input for classification tasks.
[SEP] (separator): placed between sentences and at the end. Marks sentence boundaries for the segment embedding and the next sentence prediction task.
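Putting the special tokens and the three embedding types together, a sentence pair is laid out as shown in the sketch below. It is illustrative only: `build_pair_input` is a hypothetical helper, and the word-level tokens stand in for real WordPiece output.

```python
def build_pair_input(tokens_a: list[str], tokens_b: list[str]):
    """Lay out a sentence pair as [CLS] A [SEP] B [SEP] with matching ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A, and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


tokens, segment_ids, position_ids = build_pair_input(
    ["the", "cat", "sat"], ["it", "purred"]
)
for tok, seg, pos in zip(tokens, segment_ids, position_ids):
    print(f"{pos:>2}  seg={seg}  {tok}")
```

Each row corresponds to one input position: the model looks up a token vector, a position vector, and a segment vector for it and sums the three.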

Pre-training: MLM and NSP

BERT’s pre-training uses two self-supervised objectives on a large unlabeled corpus (BookCorpus + English Wikipedia):

Masked Language Modeling (MLM) randomly selects 15% of tokens in each input for prediction. Of those selected tokens:

80% are replaced with the special [MASK] token
10% are replaced with a random token
10% are left unchanged

The model must predict the original token at each selected position using the surrounding bidirectional context. This forces BERT to build deep representations of language structure: to predict a masked word, the model must draw on syntax, semantics, and factual knowledge from the tokens on both sides.

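The 80/10/10 split is deliberate. The [MASK] token never appears during fine-tuning, so a model that only ever had to predict at [MASK] positions would face a mismatch between pre-training and fine-tuning inputs. The random-replacement and unchanged cases prevent the model from relying on [MASK] as a special signal: it has to keep a useful representation of every input token, because any of them might need to be predicted.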
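The selection and replacement rules can be sketched in a few lines of plain Python. This is a simplified illustration, not BERT's preprocessing code: it samples each token independently with probability 0.15 and ignores special tokens, and `mask_tokens` and the toy vocabulary are hypothetical.

```python
import random


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Apply BERT-style masking; returns (corrupted tokens, labels).

    labels[i] holds the original token at selected positions and None elsewhere,
    so the loss is computed only where a prediction is required.
    """
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue                          # not selected: leave untouched
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"           # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return corrupted, labels


rng = random.Random(0)
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
tokens = [rng.choice(vocab) for _ in range(10_000)]
corrupted, labels = mask_tokens(tokens, vocab, rng)

selected = [i for i, lab in enumerate(labels) if lab is not None]
masked = sum(corrupted[i] == "[MASK]" for i in selected)
print(f"selected {len(selected) / len(tokens):.1%} of tokens")
print(f"[MASK] share among selected: {masked / len(selected):.1%}")
```

Note that the label is recorded for all three cases, including the unchanged one, so the model cannot tell from the input alone which positions it will be scored on.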

Next Sentence Prediction (NSP) takes two sentences and predicts whether B actually follows A in the original text (50% of the time it does, 50% it is a random sentence). The prediction is made from the [CLS] token’s output. The hypothesis was that this would help BERT learn inter-sentence relationships useful for tasks like question answering and natural language inference.

Note

NSP turned out to be unnecessary. RoBERTa (Liu et al., 2019) later showed that removing NSP entirely matches or slightly improves downstream performance. The MLM objective alone is sufficient for learning strong representations. This is one of the clearest examples in the field of a seemingly reasonable design choice being overturned by subsequent experiments.

Seeing MLM in action

The bert_mlm_demo.py script loads pre-trained BERT and shows what it predicts for masked tokens:

python 02-bert-and-encoders/bert_mlm_demo.py

The script feeds sentences with manually placed [MASK] tokens and displays the top-5 predictions. For the transformer example, you should see self near the top; for the factual example, you should see paris near the top. The exact probabilities depend on the installed model revision and runtime environment, so treat the ranking as the stable signal and the decimals as incidental.

BERT correctly identifies “self” as the likely completion of “self attention”; it has learned enough about transformer architecture from its training corpus to recover the phrase from bidirectional context. The factual example behaves similarly: “The capital of France is [MASK]” typically ranks paris highest, with other French cities appearing lower in the list. This soft distribution is richer than a hard label, a property that becomes important when we discuss distillation later.

BERT sees both “The capital of France is” on the left and “.” on the right. Both sides contribute to the prediction. A left-to-right model (like GPT) would only see the prefix; it could not use any rightward context. This bidirectional attention is what makes BERT effective for understanding tasks where the full context matters.
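The difference in visible context can be made concrete with a toy visibility rule. The sketch below is a simplification: `visible_positions` is a hypothetical helper, the tokens are word-level stand-ins, and a real GPT-style model has no [MASK] token; the point is only which positions index `i` may attend to.

```python
def visible_positions(i: int, seq_len: int, causal: bool) -> list[int]:
    """Positions that token i may attend to under each masking scheme."""
    return list(range(i + 1)) if causal else list(range(seq_len))


tokens = ["the", "capital", "of", "france", "is", "[MASK]", "."]
target = tokens.index("[MASK]")

for name, causal in [("bidirectional (BERT)", False), ("causal (GPT-style)", True)]:
    context = [tokens[j] for j in visible_positions(target, len(tokens), causal)]
    print(f"{name:<22} {context}")
```

Only the bidirectional row includes the trailing period, which is the rightward context a left-to-right model never sees.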

Fine-tuning: adapting BERT to a task

Pre-training gives BERT general language understanding. Fine-tuning adapts it to a specific task by attaching a small task-specific head, typically a single linear layer, and training end-to-end on labeled data.

The pattern:

Take pre-trained BERT (110M parameters of language understanding)
Add a classification layer on top of the [CLS] token’s output
Train on a small labeled dataset with a low learning rate (typically 2e-5)
All parameters update, but the pre-trained weights shift gently

Because BERT already understands language structure, fine-tuning needs far less data than training from scratch. A few thousand labeled examples and a few epochs are often enough.

Sentiment classification on SST-2

The bert_sentiment_finetune.py script demonstrates the full workflow: load pre-trained BERT, attach a binary classification head, fine-tune on the Stanford Sentiment Treebank (SST-2), and run inference on custom sentences.

python 02-bert-and-encoders/bert_sentiment_finetune.py

The script uses a small subset (1,000 training / 200 validation) for demo speed. The key pieces:

from transformers import (
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# Load BERT with a classification head: one linear layer maps the
# 768-dim pooled [CLS] output to 2 logits (negative / positive)
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Standard HuggingFace Trainer handles training loop, evaluation,
# checkpointing, and metric computation
training_args = TrainingArguments(
    output_dir="bert-sst2-demo",  # Illustrative path; required by older transformers releases
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,           # Low LR to preserve pretrained knowledge
    weight_decay=0.01,
    eval_strategy="epoch",        # Named evaluation_strategy in older transformers releases
)

# train_dataset, val_dataset, and compute_metrics are defined elsewhere in the script
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,  # Reports accuracy
)

trainer.train()
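
The excerpt passes a `compute_metrics` callable without showing it. The script's own version is not reproduced here; the following is a hedged sketch of what an accuracy function of that shape might look like, assuming the argument unpacks into a pair of logits and integer labels. It is written with plain Python sequences so it can be run on its own.

```python
def compute_metrics(eval_pred):
    """Accuracy from (logits, labels); each row of logits scores one example."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row.
    predictions = [max(range(len(row)), key=lambda c: row[c]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}


# Stand-alone check with hand-written logits for four examples.
fake_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
fake_labels = [0, 1, 0, 0]
print(compute_metrics((fake_logits, fake_labels)))  # three of four correct
```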

After training, the script runs inference on custom sentences. In practice, the clear positive and negative examples come back with high confidence, while the mixed sentence about decent acting and a predictable plot tends to be much less certain. That qualitative pattern is the point; the exact confidences vary by hardware, library versions, and the randomly sampled SST-2 subset used in the demo. A full SST-2 fine-tune (~67k examples) typically reaches 92–93% accuracy.

The learning rate of 2e-5 is critical. Pre-trained weights encode language understanding from massive corpora, and a high learning rate can overwrite much of that knowledge within the first few gradient steps. The low rate lets the representations adapt gently toward the sentiment task without catastrophic forgetting.

Tip

Fine-tuning beyond classification. The [CLS] token works for sentence-level tasks (sentiment, topic, entailment). For token-level tasks like named entity recognition or question answering, you use the output embeddings of individual tokens instead. For QA, BERT predicts a start and end position within the input: the answer is the span between them. The same pre-trained model supports both patterns; only the task head changes.
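For the QA case, turning per-token start and end scores into an answer can be sketched as a small search. This is an illustrative sketch, not a library routine: `best_span` and its `max_answer_len` parameter are hypothetical, and the logits are hand-written stand-ins for the outputs of a QA head.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Pick the (start, end) pair with the highest combined score, start <= end."""
    best, best_score = None, float("-inf")
    for i, s in enumerate(start_logits):
        for j in range(i, min(i + max_answer_len, len(end_logits))):
            if s + end_logits[j] > best_score:
                best, best_score = (i, j), s + end_logits[j]
    return best


tokens = ["[CLS]", "where", "is", "paris", "[SEP]",
          "paris", "is", "in", "france", "[SEP]"]
# One score per token for "answer starts here" and "answer ends here".
start_logits = [0.1, 0.0, 0.0, 0.2, 0.0, 0.3, 0.1, 2.5, 1.0, 0.0]
end_logits = [0.0, 0.0, 0.0, 0.1, 0.0, 0.2, 0.0, 0.4, 3.0, 0.1]

start, end = best_span(start_logits, end_logits)
print(tokens[start:end + 1])
```

With these scores the selected span is "in france": the start score peaks at "in" and the end score at "france". The `start <= end` constraint is what rules out spans that run backwards.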

Making BERT smaller: knowledge distillation

BERT-base has 110 million parameters. For many production use cases (real-time classification, mobile deployment, high-throughput APIs), that is too large. Knowledge distillation compresses the model while preserving most of its capability.

The idea, introduced by Hinton, Vinyals, and Dean (2015), is to train a smaller “student” model to mimic the larger “teacher” model’s output distributions rather than the hard labels in the training data. The teacher’s soft outputs, the full probability distribution over classes, carry more information than a hard 0/1 label. If BERT assigns, say, probability 0.87 to “paris” and 0.02 to “lyon” for a masked token, the relative ranking of those alternatives encodes learned relationships (both are French cities) that a hard label discards.

DistilBERT (Sanh et al., 2019) applies this to BERT:

6 transformer layers instead of 12 (every other layer removed)
Same hidden dimension (768) and attention heads (12)
~40% fewer parameters
~60% faster inference
Retains ~97% of BERT’s language understanding on GLUE benchmarks

The distillation training combines three losses:

KL-divergence between student and teacher output distributions (soft labels)
Masked language model loss on the training corpus (hard labels)
Cosine embedding loss aligning student and teacher hidden states
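The three terms listed above can be combined as in the sketch below, written in plain Python for a single prediction position. It is a hedged illustration rather than DistilBERT's training code: the temperature and the weights `w_kl`, `w_mlm`, and `w_cos` are placeholder values, and the logits and hidden states are made up.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract the max for stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student): penalizes the student for missing teacher mass."""
    return sum(t * math.log(t / s)
               for t, s in zip(teacher_probs, student_probs) if t > 0)


def cross_entropy(probs, target_index):
    return -math.log(probs[target_index])


def cosine_loss(u, v):
    """1 - cosine similarity; zero when the hidden states point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def distillation_loss(teacher_logits, student_logits, target_index,
                      teacher_hidden, student_hidden,
                      temperature=2.0, w_kl=1.0, w_mlm=1.0, w_cos=1.0):
    # Soft labels: match the teacher's temperature-softened distribution.
    soft = kl_divergence(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    # Hard labels: ordinary MLM loss against the true token.
    hard = cross_entropy(softmax(student_logits), target_index)
    # Alignment: keep student hidden states pointing the same way as the teacher's.
    align = cosine_loss(teacher_hidden, student_hidden)
    return w_kl * soft + w_mlm * hard + w_cos * align


teacher_logits = [4.0, 1.5, 0.2]   # e.g. scores for "paris", "lyon", "london"
student_logits = [3.0, 1.0, 0.5]
loss = distillation_loss(teacher_logits, student_logits, target_index=0,
                         teacher_hidden=[0.2, 0.9, -0.4],
                         student_hidden=[0.1, 1.0, -0.3])
print(round(loss, 4))
```

A temperature above 1 flattens both distributions so that low-probability alternatives contribute more to the KL term. Some formulations also scale that term by the squared temperature, which this sketch omits.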

Benchmarking the tradeoff

python 02-bert-and-encoders/distillbert_comparison.py

The script loads both models, compares parameter counts, benchmarks inference speed over 50 runs, and compares their MLM predictions side by side. The parameter counts are stable, but the latency numbers are inherently machine-dependent:

Model Size Comparison
  bert-base-uncased       109,514,298     100%
  distilbert-base-uncased  66,985,530      61.2%

Inference Speed Comparison
  bert-base-uncased       <machine-dependent>    1.00x (baseline)
  distilbert-base-uncased <machine-dependent>    faster on most setups
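
A timing harness of the kind described can be sketched with the standard library alone. This is an assumption about the general approach, not the script's actual code: `benchmark` is a hypothetical helper, the warm-up count is arbitrary, and the two workloads are stand-ins for a forward pass of each model on the same input.

```python
import statistics
import time


def benchmark(fn, runs: int = 50, warmup: int = 5) -> float:
    """Median wall-clock seconds per call, after a few untimed warm-up calls."""
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def heavy():
    return sum(i * i for i in range(20_000))


def light():
    return sum(i * i for i in range(10_000))


baseline, candidate = benchmark(heavy), benchmark(light)
print(f"speedup: {baseline / candidate:.2f}x")
```

Using the median rather than the mean keeps a single slow run (a cache miss, a background process) from skewing the comparison, and the warm-up calls keep one-time setup costs out of the measurement.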

The MLM prediction comparison is where distillation’s value becomes clear. For many masked tokens, BERT and DistilBERT agree on the top prediction. Where they disagree, DistilBERT’s alternatives are usually still semantically reasonable; it learned the teacher’s general distribution, not just the argmax.

This is the practical tradeoff: you give up a small amount of accuracy for meaningful improvements in speed and memory. For applications where latency matters and perfect accuracy does not, DistilBERT is often the better choice.

RoBERTa: questioning BERT’s design choices

RoBERTa (Liu et al., 2019) took BERT’s architecture and systematically tested which design decisions actually mattered. The findings:

Removing NSP helped. The next sentence prediction objective, which seemed intuitively useful, turned out to add no value. Dropping it simplified pre-training and slightly improved performance.

Dynamic masking improved results. BERT applies its random masking once during data preprocessing: the same tokens are masked every time the model sees a given input. RoBERTa applies masking dynamically: different tokens are masked each epoch for the same text. This exposes the model to more masking patterns and improves generalization.

More data and longer training helped significantly. The original BERT was trained on 16 GB of text. RoBERTa used 160 GB. The model was also trained for much longer. On the same architecture, this alone produced large gains, suggesting the original BERT was substantially undertrained.

RoBERTa’s lesson is methodological: before adding complexity, make sure you have exhausted the simpler levers of data scale, training duration, and objective simplification.
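
The difference between static and dynamic masking described above comes down to when the mask positions are sampled. The sketch below is a toy illustration in plain Python; `sample_mask` is a hypothetical helper that uses independent per-position sampling, and the sequence length and epoch count are arbitrary.

```python
import random


def sample_mask(seq_len: int, rng: random.Random, prob: float = 0.15) -> set[int]:
    """Choose which positions to predict for one pass over a sequence."""
    return {i for i in range(seq_len) if rng.random() < prob}


SEQ_LEN, EPOCHS = 40, 4
rng = random.Random(0)

# Static: sample once during preprocessing, reuse the same positions every epoch.
static_mask = sample_mask(SEQ_LEN, rng)
static_epochs = [static_mask for _ in range(EPOCHS)]

# Dynamic: draw fresh positions each time the sequence is served.
dynamic_epochs = [sample_mask(SEQ_LEN, rng) for _ in range(EPOCHS)]

print("static  covers", len(set().union(*static_epochs)), "positions")
print("dynamic covers", len(set().union(*dynamic_epochs)), "positions")
```

Over several epochs the dynamic scheme asks the model to predict at a wider set of positions for the same text, which is the additional variety the RoBERTa result attributes the gain to.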

How the pieces connect

This tutorial covered the four scripts in 02-bert-and-encoders/:

Script	What it demonstrates
`position_embeddings.py`	Visual comparison of learned vs sinusoidal strategies
`bert_mlm_demo.py`	BERT’s bidirectional predictions on masked tokens
`bert_sentiment_finetune.py`	Full fine-tuning workflow: pre-trained model → classification task
`distillbert_comparison.py`	Knowledge distillation: speed/accuracy tradeoff in practice

Together they illustrate the lifecycle of an encoder-only transformer: position embeddings inject order, pre-training builds general language understanding, fine-tuning adapts to a task, and distillation compresses for deployment.

What to notice

Learned position embeddings have visible structure. BERT didn’t learn random vectors for each position: the heatmap shows smooth, organized patterns. Gradient descent discovered a useful positional code without being told what it should look like.
MLM predictions reflect the training distribution. BERT’s top predictions for factual questions (capital of France) come from patterns in Wikipedia and BookCorpus. It didn’t learn facts as facts; it learned token co-occurrence patterns that happen to encode factual knowledge.
Fine-tuning is remarkably data-efficient. 1,000 examples and 3 epochs produce a usable sentiment classifier. This efficiency comes entirely from pre-training: the model already understands language; it just needs a nudge toward the specific task.
Distillation preserves the distribution, not just the top answer. DistilBERT often agrees with BERT on top-1 predictions, but even when it disagrees, its alternatives are semantically plausible. The KL-divergence objective transfers the shape of the teacher’s probability distribution, not just the argmax.

Next up is Part 3: LLM Decoding and Prompt Strategies, which shifts from encoder-only models to decoder-only generation. It compares greedy, beam search, top-k, and nucleus decoding; visualizes Mixture-of-Experts routing; and tests zero-shot, few-shot, and chain-of-thought prompting on reasoning tasks.