Tokenization and Attention from Scratch

A transformer takes text in and produces text out. Between input and output, four mechanisms do the heavy lifting: tokenization splits raw text into units the model can process, embeddings represent those units as vectors in a continuous space, self-attention lets each token look at every other token to build contextual understanding, and positional encoding injects sequence order so the model knows which token comes first.

This tutorial builds all four from scratch using NumPy. No frameworks, no pretrained models, no GPU. By the end you will have a working intuition for why each component exists, what happens when you change its parameters, and how they compose into the transformer architecture described in “Attention Is All You Need”.

All code is in the companion repository under 01-transformer-fundamentals/. You can run each script independently or follow along below.

# If you haven't already
cd stanford-transformers-llms-labs
pip install -e .

Tokenization: splitting text into units

Before a model can process text, it needs to convert characters into discrete units called tokens. The choice of tokenization strategy has downstream consequences; it determines vocabulary size, how the model handles rare words, and how many tokens a given input consumes from the context window.

Three strategies span the design space:

Word-level tokenization splits on whitespace and punctuation. Simple and interpretable, but the vocabulary explodes with large corpora and any word not in the vocabulary is unrepresentable.

Character-level tokenization uses individual characters as tokens. Tiny vocabulary, no out-of-vocabulary problem, but sequences become very long and the model must learn word structure from scratch.

Byte-Pair Encoding (BPE) starts with characters and iteratively merges the most frequent adjacent pair into a new token. This is what GPT-2, GPT-3, and most modern LLMs use. It balances vocabulary size against sequence length, common words become single tokens while rare words decompose into recognizable subwords.

Building a BPE tokenizer

The BPE algorithm is straightforward:

Represent each word in the corpus as a sequence of characters plus an end-of-word marker </w>
Count every adjacent pair across the entire corpus
Merge the most frequent pair into a single new symbol
Repeat for a fixed number of merges

Here is the core of the implementation from tokenizer_comparison.py:

class BPETokenizer:
    def __init__(self, num_merges: int = 50) -> None:
        self.num_merges = num_merges
        self.merges: list[tuple[str, str]] = []

    def fit(self, text: str) -> "BPETokenizer":
        words = text.lower().split()
        word_freq: dict[tuple[str, ...], int] = Counter()
        for w in words:
            symbols = tuple(w) + ("</w>",)
            word_freq[symbols] += 1

        for _ in range(self.num_merges):
            pairs = self._count_pairs(word_freq)
            if not pairs:
                break
            best_pair = max(pairs, key=pairs.get)
            self.merges.append(best_pair)
            word_freq = self._merge_pair(best_pair, word_freq)
        return self

Each merge operation takes the most common adjacent pair, say ('t', 'h'), and replaces every occurrence with the merged symbol 'th'. After 50 merges on a corpus about transformers, you see patterns like 'the', 'tion', and 'att' emerging as single tokens. These are the subword units the model will operate on.

Comparing all three

Run the comparison:

python 01-transformer-fundamentals/tokenizer_comparison.py

The script reports:

Tokenizer	Token Count	Vocab Size
Word-level	140	98
Character-level	862	35
BPE (50 merges)	517	72

BPE sits between the two extremes. It produces fewer tokens than character-level (shorter sequences for the model to process) while keeping a smaller vocabulary than word-level (better coverage of rare words). The 50-merge BPE is intentionally small here. GPT-2 uses ~50,000 merges and has a vocabulary of ~50,257 tokens. The principle is the same; the scale is different.

The script also prints the first 10 merge operations, which reveal what the algorithm learned about English text structure from the corpus:

First 10 BPE merge operations:
   1. 'e' + '</w>' -> 'e</w>'
   2. 'i' + 'n' -> 'in'
   3. 's' + '</w>' -> 's</w>'
   ...

These are deterministic: the same corpus always produces the same merge sequence. Frequent English fragments merge early and become reusable subword units.

Tip

Why this matters for LLMs Tokenization determines the “resolution” at which the model sees text. A BPE tokenizer trained on English will produce more tokens for non-English text; every non-Latin character may become multiple tokens. This is why multilingual models need larger vocabularies and why token counts vary dramatically across languages for the same semantic content. It also explains why LLMs are worse at character-level tasks like counting letters: the model never sees individual characters as tokens.

Word embeddings: from tokens to vectors

Tokens are discrete symbols. The simplest way to represent them numerically is one-hot encoding, a vector of zeros with a single 1 at the token’s index. If your vocabulary has 50,000 words, each token becomes a 50,000-dimensional vector that is mostly zeros. This has two problems. First, the vectors are enormous and sparse. Second, every pair of words is equally distant from every other, “cat” is no more similar to “dog” than it is to “refrigerator.” One-hot encoding captures identity but nothing about meaning.

Embeddings fix this by mapping each token to a dense vector in a much lower-dimensional space (typically 50–12,288 dimensions) where geometric distance encodes semantic similarity. The question is how to learn these vectors.

Word2Vec (Mikolov et al., 2013) introduced two approaches. Continuous Bag of Words (CBOW) predicts a center word from its surrounding context. Skip-gram does the reverse, given a center word, it predicts which words appear nearby. Both produce useful embeddings; skip-gram tends to work better for rare words and is the variant we implement here.

The skip-gram training objective is simple: words that frequently appear in similar contexts end up with similar vectors, not because we told the model about grammar or meaning, but because the distributional structure of language forces it.

The skip-gram objective

For every (center, context) pair in the training data, we want:

$$P(\text{context} | \text{center}) \approx \sigma(\mathbf{v}{\text{context}} \cdot \mathbf{v}{\text{center}})$$

where $\sigma$ is the sigmoid function and $\mathbf{v}$ are learned embedding vectors. Training uses negative sampling, for each real pair, we sample random words as negative examples and push their vectors apart:

$$L = -\log \sigma(\mathbf{v}c \cdot \mathbf{v}w) - \sum{k} \log \sigma(-\mathbf{v}{n_k} \cdot \mathbf{v}_w)$$

The full implementation in word2vec_from_scratch.py is about 150 lines of NumPy. The key insight is in the training loop:

def train_pair(self, center, context, negatives, lr):
    v_c = self.W_center[center]

    # Positive pair: push vectors together
    v_pos = self.W_context[context]
    dot_pos = np.dot(v_pos, v_c)
    sig_pos = sigmoid(dot_pos)
    loss = -np.log(sig_pos + 1e-12)

    # Negative pairs: push vectors apart
    v_negs = self.W_context[negatives]
    dots_neg = v_negs @ v_c
    sigs_neg = sigmoid(dots_neg)
    loss -= np.sum(np.log(1.0 - sigs_neg + 1e-12))

    # Update embeddings via SGD
    # ... (gradient computation and weight updates)
    return float(loss)

Training and inspecting the results

python 01-transformer-fundamentals/word2vec_from_scratch.py

The script trains on 15 sentences about transformers and NLP for 100 epochs, then queries nearest neighbors by cosine similarity. The corpus and seed are fixed, so you should see the same output each run:

Training …
  Epoch   1/100  —  avg loss: 4.1589
  Epoch  20/100  —  avg loss: 4.1554
  Epoch  40/100  —  avg loss: 3.8536
  Epoch  60/100  —  avg loss: 3.1286
  Epoch  80/100  —  avg loss: 2.9184
  Epoch 100/100  —  avg loss: 2.6726

Nearest Neighbours (cosine similarity):
  'transformer':
      architecture         0.9965
      processes            0.9899
      ...
  'attention':
      encoder              0.9631
      decoder              0.9628
      ...

Even on this tiny corpus, semantically related words cluster together. “Transformer” and “attention” end up close because they consistently co-occur. “Loss” and “gradient” cluster because they appear in sentences about optimization. The nearest neighbors are noisy, 15 sentences don’t provide enough co-occurrence signal for precise embeddings, but the structure is already there.

The embeddings are 50-dimensional, but the principle scales. GPT-2 uses 768 dimensions; GPT-3 uses 12,288. More dimensions mean finer-grained semantic distinctions, at the cost of more parameters to learn.

Note

From Word2Vec to transformer embeddings Modern transformers don’t use Word2Vec. They learn token embeddings as part of end-to-end training: the embedding layer is just the first matrix multiplication in the model. But the geometric intuition is the same: tokens with similar roles in language end up in similar regions of the embedding space. Word2Vec is worth implementing because it makes this intuition concrete before the added complexity of attention and position.

Why attention? The RNN bottleneck

Before transformers, the dominant approach to sequence modeling was recurrent neural networks. An RNN processes tokens one at a time, maintaining a hidden state vector that accumulates information as it reads through the sequence. After reading the entire input, the final hidden state is supposed to encode everything the model needs to know about the sentence.

This creates a bottleneck. The entire meaning of a 500-word paragraph must be compressed into a single fixed-size vector. Information from early tokens fades as new tokens overwrite the hidden state: a problem called the long-range dependency problem. The sentence “The author, who grew up in Paris and later moved to Berlin before eventually settling in Tokyo, speaks French” requires connecting “speaks French” back to “Paris,” but by the time the RNN reaches “French,” the signal from “Paris” has been diluted through dozens of hidden state updates.

LSTMs (Long Short-Term Memory networks) improved on this by adding a separate cell state that acts as a longer-term memory lane. Gating mechanisms control what gets written to, read from, and erased in this cell state. LSTMs handle longer dependencies than vanilla RNNs, but they still process tokens sequentially; each token can only see what was accumulated in the hidden state before it, not the raw representations of earlier tokens.

The fundamental limitation is architectural: in an RNN or LSTM, token 50 cannot directly examine token 3. It can only see whatever trace of token 3 survived through 47 sequential hidden state updates. The lecture frames this as the motivation for the transformer’s key innovation: what if every token could directly attend to every other token in the sequence, regardless of distance?

That is exactly what self-attention does.

Note

No RNN code in this tutorial The companion repository does not include an RNN implementation: the focus is on building transformer components. If you want to see the sequential bottleneck firsthand, the conceptual model is simple: imagine compressing a paragraph into a single 256-dimensional vector, then trying to answer detailed questions about the first sentence. That compression loss is what attention eliminates.

Self-attention: letting tokens talk to each other

Embeddings give each token a representation, but that representation is static: “bank” has the same vector whether it appears in “river bank” or “bank account.” Self-attention solves this by letting each token’s representation be influenced by every other token in the sequence.

The mechanism works through three linear projections of the input:

Query (Q), what this token is looking for
Key (K), what this token offers to be found
Value (V), the information this token carries

The attention computation is:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

Each element of $QK^T$ is a dot product between one token’s query and another’s key: a measure of relevance. Softmax normalizes these scores into a probability distribution. The result is a weighted sum of the value vectors, where the weights reflect how much each token attends to every other.

Implementation

From self_attention.py:

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]

    # Step 1: Raw attention scores — Q K^T → (seq_len, seq_len)
    scores = Q @ K.T

    # Step 2: Scale by sqrt(d_k) to prevent softmax saturation
    scores = scores / np.sqrt(d_k)

    # Step 3: Softmax row-wise → attention weights
    attention_weights = softmax(scores)

    # Step 4: Weighted sum of value vectors
    output = attention_weights @ V

    return output, attention_weights

Four operations. Matrix multiply, scale, softmax, matrix multiply. That is the entire attention mechanism.

Why the scaling factor matters

Without the $\sqrt{d_k}$ scaling, the dot products grow proportionally to the embedding dimension. For $d_k = 64$, unscaled dot products can easily reach magnitudes of 8–10. At those magnitudes, softmax saturates; it pushes nearly all the weight to the largest score and produces gradients close to zero for everything else. The model stops learning from most of the attention pairs. Dividing by $\sqrt{d_k}$ keeps the variance of the dot products at roughly 1 regardless of dimension, keeping the softmax in a useful range.

Running the demo

python 01-transformer-fundamentals/self_attention.py

The script creates random embeddings for four tokens, ["The", "cat", "sat", "down"], and computes attention weights. The RNG is seeded, so you should see the same matrix:

Single-Head Attention Weights
           The     cat     sat    down
   The   0.366   0.251   0.233   0.149
   cat   0.349   0.241   0.223   0.187
   sat   0.200   0.283   0.277   0.240
  down   0.315   0.199   0.296   0.190

Each row sums to 1 (softmax). Each row shows how much that token attends to every other token, including itself. With random weights the distribution is roughly uniform. In a trained model, the patterns become meaningful: “sat” might attend strongly to “cat” (who sat?) and “down” (how?).

Multi-head attention

A single attention head captures one type of relationship. Multi-head attention runs several heads in parallel, each with its own learned projection matrices, and concatenates their outputs:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) \cdot W^O$$

where each $\text{head}_i = \text{Attention}(XW_i^Q, XW_i^K, XW_i^V)$.

The implementation splits the embedding dimension across heads. With $d_{model} = 8$ and 2 heads, each head operates on 4-dimensional query, key, and value vectors:

def multi_head_attention(X, num_heads=2, seed=42):
    seq_len, d_model = X.shape
    d_k = d_model // num_heads

    W_Q = rng.standard_normal((num_heads, d_model, d_k)) * 0.3
    W_K = rng.standard_normal((num_heads, d_model, d_k)) * 0.3
    W_V = rng.standard_normal((num_heads, d_model, d_k)) * 0.3

    head_outputs = []
    for h in range(num_heads):
        Q_h = X @ W_Q[h]
        K_h = X @ W_K[h]
        V_h = X @ W_V[h]
        out_h, _ = scaled_dot_product_attention(Q_h, K_h, V_h)
        head_outputs.append(out_h)

    # Concatenate and project
    concat = np.concatenate(head_outputs, axis=-1)
    W_O = rng.standard_normal((d_model, d_model)) * 0.3
    return concat @ W_O

The script prints both heads’ attention weights and confirms they differ:

Mean absolute difference between head-1 and head-2 weights: 0.0365
(A non-zero value confirms the heads capture different patterns.)

In a trained transformer, different heads specialize. One head might track syntactic dependencies (subject-verb agreement), another might track coreference (what does “it” refer to?), and a third might capture positional proximity. This specialization is emergent, nothing in the architecture forces it. The diversity of learned projections is enough.

Tip

The computational cost of attention The $QK^T$ matrix has shape $(n, n)$ where $n$ is the sequence length. This is why attention is $O(n^2)$ in both time and memory; every token compares against every other token. For a 4,096-token input, that is a 4096 × 4096 matrix computed for every head in every layer. This quadratic cost is the fundamental scalability bottleneck of the transformer architecture and the motivation for techniques like Flash Attention, sparse attention, and linear attention variants covered in Part 4.

Positional encoding: injecting sequence order

Self-attention is permutation-invariant. If you shuffle the input tokens, the attention weights change but the mechanism itself has no concept of “first” or “second” or “adjacent.” The sentence “cat sat the” produces different attention scores than “the cat sat,” but only because the embedding values differ: the model has no structural notion of position.

Positional encoding fixes this by adding a position-dependent signal to each token’s embedding before it enters the attention layers. The original transformer uses sinusoidal positional encodings:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$ $$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

Each dimension of the encoding corresponds to a sinusoid with a different wavelength, ranging from $2\pi$ to $10000 \cdot 2\pi$. The low-numbered dimensions oscillate rapidly (capturing fine-grained position differences), while high-numbered dimensions change slowly (capturing coarse position).

Implementation

def sinusoidal_positional_encoding(max_len, d_model):
    PE = np.zeros((max_len, d_model))
    position = np.arange(max_len)[:, np.newaxis]

    # 10000^(2i/d) = exp(2i * log(10000) / d) — log-space for stability
    div_term = np.exp(
        np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model)
    )

    PE[:, 0::2] = np.sin(position * div_term)  # even dimensions
    PE[:, 1::2] = np.cos(position * div_term)  # odd dimensions
    return PE

Visualizing the encoding

python 01-transformer-fundamentals/positional_encoding_viz.py

The script generates a 100-position, 64-dimension encoding matrix and produces three visualizations saved to positional_encoding.png:

Heatmap, the full PE matrix. Lower dimensions (left) oscillate at high frequency; higher dimensions (right) change slowly. This gradient is what gives the model both fine and coarse position sensitivity.
Line plot, four individual dimensions across all positions, showing the different sinusoidal frequencies. Dimension 0 completes many cycles; dimension 20 barely curves.
Dot-product similarity, the dot product between the positional encoding at every pair of positions. The diagonal is brightest (each position is most similar to itself). Nearby positions are more similar than distant ones. The falloff is smooth, not abrupt, the model gets a gradient of positional information, not a hard boundary.

The script also confirms the similarity gradient numerically. Since sinusoidal PE is a deterministic formula, these values are exact:

Dot product  pos 0 · pos 1  = 30.9168
Dot product  pos 0 · pos 50 = 15.6738
(Nearby positions should have higher similarity.)

Why sinusoidal and not learned?

The original “Attention Is All You Need” paper tested both sinusoidal and learned positional encodings and found no significant difference in performance. The advantage of sinusoidal is generalization: because the encoding is a deterministic function of position, the model can handle sequence lengths longer than anything it saw during training. A learned positional embedding has a fixed number of positions and cannot extrapolate.

In practice, most modern models use learned positional embeddings (BERT, GPT-2) or rotary position embeddings (RoPE, used in Llama and most recent models). Part 9 of this series implements RoPE and compares it to sinusoidal.

How the pieces compose

The original transformer is an encoder-decoder architecture. The encoder reads the input sequence and produces contextual representations. The decoder generates the output sequence one token at a time, attending both to its own previous outputs (self-attention) and to the encoder’s representations (cross-attention).

Here is one encoder layer:

Input text
    │
    ▼
┌──────────────┐
│ Tokenization │  → token IDs
└──────┬───────┘
       │
┌──────▼───────┐
│  Embedding   │  → (seq_len, d_model)
│   lookup     │
└──────┬───────┘
       │
┌──────▼────────────────┐
│  + Positional Encoding│  → position-aware embeddings
└──────┬────────────────┘
       │
┌──────▼───────┐
│ Multi-Head   │  → contextual representations
│ Self-Attention│
└──────┬───────┘
       │
┌──────▼───────┐
│ Feed-Forward │  → per-token transformation
│   Network    │
└──────┬───────┘
       │
       ▼
  Output embeddings

The decoder layer looks similar but adds a cross-attention sublayer between self-attention and the feed-forward network. In cross-attention, the queries come from the decoder’s current representations while the keys and values come from the encoder’s output. This is how the decoder “looks at” the input: the same Q/K/V mechanism, but Q comes from one sequence and K/V from another. The decoder also uses masked self-attention to prevent each position from attending to future tokens during generation.

Tokenization converts text to indices. The embedding layer maps indices to vectors. Positional encoding adds order information. Self-attention mixes information across the sequence. Cross-attention (decoder only) connects the output to the input. The feed-forward network (a simple two-layer MLP) transforms each position independently. Layer normalization and residual connections stabilize training.

The original transformer stacks 6 encoder layers and 6 decoder layers. Modern models go much deeper. GPT-3 has 96 layers. Each layer’s attention captures increasingly abstract relationships because it operates on the contextual representations produced by the previous layer, not the raw embeddings.

Note

Encoder-only vs decoder-only The original transformer uses both encoder and decoder. In practice, most modern models use only one side. BERT (Part 2) is encoder-only, it processes the full input with bidirectional attention and is used for classification and understanding tasks. GPT and most LLMs (Part 3 onward) are decoder-only, they generate tokens autoregressively using masked self-attention. The encoder-decoder pattern survives in models like T5 and in machine translation.

What to notice

After running the four scripts, reflect on these observations:

BPE merges are greedy. The algorithm always picks the globally most frequent pair. This means common English patterns like “th”, “the”, “tion” merge early. Domain-specific corpora produce different merge orders: a medical tokenizer might merge “pat” + “ient” early.
Word2Vec loss decreases smoothly but neighbors are noisy. On a 15-sentence corpus, the embedding space is underspecified. Some neighbors are semantically plausible; others reflect co-occurrence artifacts. This is expected, real Word2Vec models train on billions of words.
Attention weights with random projections are near-uniform. Without training, Q/K projections produce roughly random dot products, and softmax smooths them into near-equal weights. The structure you see in trained attention patterns is entirely learned.
Positional encoding similarity decays smoothly. There is no hard cutoff, position 10 is slightly more similar to position 15 than to position 50. This smooth decay is why transformers can capture both local syntax and long-range dependencies.

Part 2: BERT Fine-Tuning and Position Embeddings, move from building components from scratch to using pretrained models. Fine-tune BERT on a real classification task, compare position embedding strategies, and benchmark the speed-accuracy tradeoff of knowledge distillation.

Tokenization: splitting text into units

Building a BPE tokenizer

Comparing all three

Word embeddings: from tokens to vectors

The skip-gram objective

Training and inspecting the results

Why attention? The RNN bottleneck

Self-attention: letting tokens talk to each other

Implementation

Why the scaling factor matters

Running the demo

Multi-head attention

Positional encoding: injecting sequence order

Implementation

Visualizing the encoding

Why sinusoidal and not learned?

How the pieces compose

What to notice

Next