Transformers & LLMs: Series Introduction and Environment Setup

Stanford released their CME295: Transformers & LLMs lectures publicly in late 2025. The course covers the full arc of modern transformer-based models, from the attention mechanism through fine-tuning, alignment, reasoning, agents, evaluation, and current research frontiers. Nine lectures, each dense enough to keep a graduate student busy for a week.

This series is the companion I wished existed while working through the material. Each tutorial maps to one lecture, pairs explanation with runnable code, and builds toward a working understanding of the entire LLM pipeline. By the end you will have implemented core transformer components from scratch, fine-tuned models with LoRA and DPO, built a tool-calling agent with RAG, and evaluated LLM outputs with metrics that go beyond gut feel.

Who this is for

You should be comfortable writing Python and reading basic linear algebra notation. Experience with NumPy helps but is not required: the early tutorials build up from first principles.

The Stanford lectures themselves assume some machine learning background: how a model is trained, what a neural network is, and basic linear algebra. This written series is a more guided companion, so you can start lighter than the original course and fill in gaps as you go. Familiarity with gradient descent and loss functions will still make the training-focused tutorials (Parts 4–6) easier to follow.

If you have worked through the LLM Red Teaming series on this site, you already have the right mental model for how tokenization, context windows, and inference work. This series goes deeper into the implementation side.

What the series covers

Part	Title	Lecture	What you build
1	Tokenization and Attention from Scratch	1	BPE tokenizer, Word2Vec, single/multi-head attention, positional encodings; all in NumPy
2	BERT Fine-Tuning and Position Embeddings	2	Masked language model predictions, SST-2 sentiment fine-tuning, BERT vs DistilBERT benchmark
3	LLM Decoding and Prompt Strategies	3	Greedy/beam/top-k/nucleus decoding, MoE routing visualization, prompt engineering comparison
4	Efficient Fine-Tuning with LoRA and Quantization	4	LoRA fine-tuning with PEFT, FP32/FP16/INT8/NF4 quantization comparison, Flash Attention benchmark
5	Preference Tuning with DPO	5	Reward model training, DPO fine-tuning with TRL, Best-of-N sampling
6	Chain-of-Thought and Reasoning Evaluation	6	CoT vs vanilla prompting on math problems, self-consistency voting, Pass@K evaluation
7	Building a Tool-Calling Agent with RAG	7	Minimal RAG pipeline, ReAct-style agent with tool use, retrieval quality metrics
8	Evaluating LLM Outputs Beyond Vibes	8	BLEU/ROUGE/METEOR from scratch, LLM-as-judge pipeline, bias detection, Cohen’s Kappa
9	Vision Transformers and Current Frontiers	9	ViT image classification, RoPE embeddings, grouped query attention, vision-language model inference

The tutorials move through four kinds of workload: pure computation (no model downloads, no GPU), pretrained model usage, local LLM inference, and GPU-accelerated training. You can start with Part 1 using nothing more than Python and NumPy.

The companion code repository

All code lives in a single repository: stanford-transformers-llms-labs. Each lecture maps to one directory containing self-contained Python scripts that run independently with no shared state between them.

stanford-transformers-llms-labs/
├── 01-transformer-fundamentals/    # Tokenization, Word2Vec, attention, positional encoding
├── 02-bert-and-encoders/           # BERT MLM, fine-tuning, position embeddings, distillation
├── 03-llm-inference/               # Decoding, MoE, prompt engineering, context limits
├── 04-training-and-efficiency/     # LoRA, quantization, Flash Attention, loss curves
├── 05-preference-tuning/           # Reward models, DPO, Best-of-N, preference data
├── 06-reasoning/                   # CoT, self-consistency, Pass@K, cost tracking
├── 07-agentic-llms/                # RAG, tool calling, ReAct agent, retrieval eval
├── 08-evaluation/                  # BLEU/ROUGE/METEOR, LLM-as-judge, bias, agreement
└── 09-frontiers/                   # ViT, RoPE, GQA, masked diffusion, VLMs

Scripts are designed to be read top-to-bottom as literate programs. Comments explain the why, not just the what. Each directory has its own README with what it teaches, which lecture it maps to, how to run it, and what to expect.

Setting up your environment

Clone the repository and create a virtual environment:

git clone https://gitlab.com/sfoerster/stanford-transformers-llms-labs.git
cd stanford-transformers-llms-labs
python -m venv .venv
source .venv/bin/activate
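If you want to confirm the virtual environment is actually active before installing anything, a quick check along these lines can help. This is a minimal sketch and not part of the repository; `in_virtualenv` is a hypothetical helper name. It relies on the standard-library behavior that `sys.prefix` differs from `sys.base_prefix` inside a `venv`.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv.

    Inside a virtual environment created with `python -m venv`,
    sys.prefix points at the environment while sys.base_prefix
    still points at the base Python installation.
    """
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
    print("Virtual environment active:", in_virtualenv())
    print("Interpreter prefix:", sys.prefix)
```

If the second line prints `False`, re-run `source .venv/bin/activate` in the current shell before installing.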

The repository uses optional dependency groups so you only install what you need. Start lightweight:

# Base install — enough for lecture 1 and all pure-computation labs
pip install -e .

This gives you NumPy, Matplotlib, and Rich (for formatted terminal output), which is enough to run every script in 01-transformer-fundamentals/ and several visualization scripts in later lectures.

As you progress through the series, install additional groups:

# Hugging Face models — needed from Part 2 onward
pip install -e ".[hf]"

# Training stack (LoRA, DPO) — Parts 4 and 5
pip install -e ".[hf,training]"

# RAG pipeline — Part 7
pip install -e ".[retrieval]"

# Evaluation metrics — Part 8
pip install -e ".[metrics]"

# Vision labs — Part 9
pip install -e ".[vision]"

# Everything at once
pip install -e ".[full]"

Each tutorial specifies which install profile it requires at the top.
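To see whether a given profile is already installed, you can check that its packages are importable. The sketch below is an illustration, not something shipped with the repository: `missing_packages` is a hypothetical helper, and only the base list (NumPy, Matplotlib, Rich) comes from the text above. The contents of the other groups are not listed here, so pass in whatever import names the tutorial you are on asks for.

```python
import importlib.util

# Import names for the base install described above.
BASE_PACKAGES = ["numpy", "matplotlib", "rich"]


def missing_packages(names):
    """Return the subset of top-level import names that cannot be found.

    find_spec() looks a module up without importing it, so this check is
    cheap and has no side effects.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(BASE_PACKAGES)
    if missing:
        print("Missing from the base install:", ", ".join(missing))
    else:
        print("Base install looks complete.")
```

The same function works for later parts: for example, `missing_packages(["peft", "trl"])` before Part 4, assuming those are the import names the training group provides.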

Ollama for local inference

Parts 3, 6, 7, and 8 use Ollama to run LLMs locally. To follow those tutorials, install Ollama and pull a small model:

curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.2:1b

Ollama is not needed for Parts 1 or 2, so you can postpone this step until you reach Part 3.
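Once Ollama is installed, you can confirm from Python that the server is reachable and that the model was pulled. The following is a rough sketch using only the standard library. It assumes Ollama is listening on its default address (`http://localhost:11434`) and uses the `/api/tags` endpoint, which lists locally available models. The helper names `model_names` and `list_local_models` are made up for this example.

```python
import json
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default; adjust if you changed it


def model_names(payload: dict) -> list:
    """Extract model names from an /api/tags response body."""
    return [model["name"] for model in payload.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL, timeout: float = 2.0):
    """Return the names of locally pulled models, or None if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return model_names(json.load(resp))
    except (urllib.error.URLError, OSError, ValueError):
        # Server not running, connection refused, timeout, or unexpected body.
        return None


if __name__ == "__main__":
    models = list_local_models()
    if models is None:
        print("Ollama is not reachable. Is the server running?")
    elif "llama3.2:1b" in models:
        print("llama3.2:1b is available.")
    else:
        print("Ollama is running, but llama3.2:1b is not pulled. Found:", models)
```

Returning `None` for an unreachable server, rather than raising, keeps the check usable as a simple precondition at the top of a script.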

Hardware requirements

Setup	What it runs	Minimum
CPU only	Parts 1–3, 6–8 (most scripts)	8 GB RAM
CPU + Ollama	Local LLM inference labs	16 GB RAM
CUDA GPU	Parts 4, 5, 9 (training/vision)	16 GB+ VRAM recommended

Part 1 runs comfortably on any modern laptop. The tutorials are designed so that you can go deep on the computational concepts before you need heavier hardware.
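To find out which row of the table applies to your machine, one option is to ask PyTorch what it can see. This sketch assumes the training labs run on PyTorch (PEFT and TRL are built on it), although the exact contents of the install groups are not listed here. `describe_accelerator` is a hypothetical helper, and it falls back gracefully when PyTorch is not installed.

```python
import importlib.util


def describe_accelerator() -> str:
    """Return a one-line description of the GPU PyTorch can use, if any."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch not installed: CPU-only labs"

    import torch  # imported lazily so the script runs without it

    if not torch.cuda.is_available():
        return "No CUDA device visible: CPU-only labs"

    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3  # bytes to GiB
    return f"{props.name}: {vram_gb:.1f} GB VRAM"


if __name__ == "__main__":
    print(describe_accelerator())
```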

Verify your setup

Run the first script from the repository to confirm everything is working:

python 01-transformer-fundamentals/tokenizer_comparison.py

You should see a comparison table showing word-level, character-level, and BPE tokenization applied to the same text, along with token counts and vocabulary sizes. If you have Rich installed (included in the base dependencies), the output will be a formatted table.

If that runs without errors, you are ready for Part 1.
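For a sense of what the comparison measures, here is a standard-library sketch of the word-level and character-level halves. It is not the repository's script, the function names are invented for illustration, and BPE is left out because it needs a trained merge table (Part 1 builds one).

```python
def tokenize_words(text: str) -> list:
    """Word-level tokenization: split on whitespace."""
    return text.split()


def tokenize_chars(text: str) -> list:
    """Character-level tokenization: every character, spaces included."""
    return list(text)


def summarize(text: str) -> dict:
    """Map each scheme to (token count, vocabulary size) for the given text."""
    summary = {}
    for name, tokenizer in (("word", tokenize_words), ("char", tokenize_chars)):
        tokens = tokenizer(text)
        summary[name] = (len(tokens), len(set(tokens)))
    return summary


if __name__ == "__main__":
    for scheme, (n_tokens, vocab) in summarize("the cat sat on the mat").items():
        print(f"{scheme:>5}: {n_tokens} tokens, vocabulary of {vocab}")
```

On this sentence the word-level scheme yields 6 tokens over a vocabulary of 5, while the character-level scheme yields 22 tokens over a vocabulary of 10. That is the basic trade-off between sequence length and vocabulary size, and subword methods like BPE sit between the two.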

How to follow along

Each tutorial in this series follows the same structure:

Concept: what the mechanism does and why it matters, grounded in the corresponding Stanford lecture
Implementation: building it from scratch or working with real libraries, with code you can run locally
Observation: inspecting the output, understanding what the numbers mean, noticing where things break
Connection: how this piece fits into the larger transformer pipeline

The tutorials reference specific scripts from the repository. You can either read the tutorial and run the scripts as you go, or read the scripts directly; they are self-documenting. Both paths converge on the same understanding.

Next up is Part 1: Tokenization and Attention from Scratch, where you build the four foundational components of the transformer architecture using nothing but Python and NumPy.