Transformers & LLMs: Series Introduction and Environment Setup

An overview of the Transformers and LLMs series: what it covers, who it is for, how the companion code is structured, and how to set up your environment.

Prerequisites

  • Basic Python knowledge
  • Familiarity with the command line

Part 1 of 8 in Transformers & LLMs

Stanford released their CME295: Transformers & LLMs lectures publicly in late 2025. The course covers the full arc of modern transformer-based models, from the attention mechanism through fine-tuning, alignment, reasoning, agents, evaluation, and current research frontiers. Nine lectures, each dense enough to keep a graduate student busy for a week.

This series is the companion I wished existed while working through the material. Each tutorial maps to one lecture, pairs explanation with runnable code, and builds toward a working understanding of the entire LLM pipeline. By the end you will have implemented core transformer components from scratch, fine-tuned models with LoRA and DPO, built a tool-calling agent with RAG, and evaluated LLM outputs with metrics that go beyond gut feel.

Who this is for

You should be comfortable writing Python and reading basic linear algebra notation. Experience with NumPy helps but is not required: the early tutorials build up from first principles.

The Stanford lectures themselves assume some machine learning background: how a model is trained, what a neural network is, and basic linear algebra. This written series is a more guided companion, so you can start lighter than the original course and fill in gaps as you go, but familiarity with gradient descent and loss functions will still make the training-focused tutorials (Parts 4–6) easier to follow.

If you have worked through the LLM Red Teaming series on this site, you already have the right mental model for how tokenization, context windows, and inference work. This series goes deeper into the implementation side.

What the series covers

| Part | Title | Lecture | What you build |
|------|-------|---------|----------------|
| 1 | Tokenization and Attention from Scratch | 1 | BPE tokenizer, Word2Vec, single/multi-head attention, positional encodings, all in NumPy |
| 2 | BERT Fine-Tuning and Position Embeddings | 2 | Masked language model predictions, SST-2 sentiment fine-tuning, BERT vs DistilBERT benchmark |
| 3 | LLM Decoding and Prompt Strategies | 3 | Greedy/beam/top-k/nucleus decoding, MoE routing visualization, prompt engineering comparison |
| 4 | Efficient Fine-Tuning with LoRA and Quantization | 4 | LoRA fine-tuning with PEFT, FP32/FP16/INT8/NF4 quantization comparison, Flash Attention benchmark |
| 5 | Preference Tuning with DPO | 5 | Reward model training, DPO fine-tuning with TRL, Best-of-N sampling |
| 6 | Chain-of-Thought and Reasoning Evaluation | 6 | CoT vs vanilla prompting on math problems, self-consistency voting, Pass@K evaluation |
| 7 | Building a Tool-Calling Agent with RAG | 7 | Minimal RAG pipeline, ReAct-style agent with tool use, retrieval quality metrics |
| 8 | Evaluating LLM Outputs Beyond Vibes | 8 | BLEU/ROUGE/METEOR from scratch, LLM-as-judge pipeline, bias detection, Cohen’s Kappa |
| 9 | Vision Transformers and Current Frontiers | 9 | ViT image classification, RoPE embeddings, grouped query attention, vision-language model inference |

The tutorials progress in four stages: pure computation (no model downloads, no GPU), then pretrained model usage, then local LLM inference, and finally GPU-accelerated training. You can start with Part 1 using nothing more than Python and NumPy.

The companion code repository

All code lives in a single repository: stanford-transformers-llms-labs. Each lecture maps to one directory containing self-contained Python scripts that run independently with no shared state between them.

stanford-transformers-llms-labs/
├── 01-transformer-fundamentals/    # Tokenization, Word2Vec, attention, positional encoding
├── 02-bert-and-encoders/           # BERT MLM, fine-tuning, position embeddings, distillation
├── 03-llm-inference/               # Decoding, MoE, prompt engineering, context limits
├── 04-training-and-efficiency/     # LoRA, quantization, Flash Attention, loss curves
├── 05-preference-tuning/           # Reward models, DPO, Best-of-N, preference data
├── 06-reasoning/                   # CoT, self-consistency, Pass@K, cost tracking
├── 07-agentic-llms/                # RAG, tool calling, ReAct agent, retrieval eval
├── 08-evaluation/                  # BLEU/ROUGE/METEOR, LLM-as-judge, bias, agreement
└── 09-frontiers/                   # ViT, RoPE, GQA, masked diffusion, VLMs

Scripts are designed to be read top-to-bottom as literate programs. Comments explain the why, not just the what. Each directory has its own README with what it teaches, which lecture it maps to, how to run it, and what to expect.
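To get an overview of what each lecture directory contains after cloning, a small helper like the one below can list the scripts per directory. This is a sketch, not part of the repository: the function name `list_lab_scripts` is hypothetical, and it assumes lecture directories follow the two-digit prefix pattern shown in the tree above.

```python
import re
from pathlib import Path


def list_lab_scripts(repo_root):
    """Map each lecture directory (e.g. '01-...') to its sorted .py file names.

    Hypothetical helper: assumes lecture directories start with two digits
    and a hyphen, as in the directory tree shown in this article.
    """
    root = Path(repo_root)
    labs = {}
    for entry in sorted(root.iterdir()):
        if entry.is_dir() and re.match(r"\d{2}-", entry.name):
            labs[entry.name] = sorted(p.name for p in entry.glob("*.py"))
    return labs


if __name__ == "__main__":
    # Run from the repository root; prints nothing if no lecture dirs are found.
    for lab, scripts in list_lab_scripts(".").items():
        print(f"{lab}: {len(scripts)} script(s)")
        for name in scripts:
            print(f"  {name}")
```

Run it from the repository root. Directories that do not match the numbering pattern are skipped.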

Setting up your environment

Clone the repository and create a virtual environment:

git clone https://gitlab.com/sfoerster/stanford-transformers-llms-labs.git
cd stanford-transformers-llms-labs
python -m venv .venv
source .venv/bin/activate

The repository uses optional dependency groups so you only install what you need. Start lightweight:

# Base install — enough for lecture 1 and all pure-computation labs
pip install -e .

This gives you NumPy, Matplotlib, and Rich (for formatted terminal output), which is enough to run every script in 01-transformer-fundamentals/ and several visualization scripts in later lectures.
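If you want to confirm that the base install worked before running any lab, a quick import check is one option. The sketch below uses only the standard library. The package list comes from the paragraph above, while the helper name `missing_packages` is made up for illustration.

```python
from importlib.util import find_spec

# Import names of the base dependencies named in this article.
BASE_PACKAGES = ("numpy", "matplotlib", "rich")


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(BASE_PACKAGES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("Base install looks complete.")
```

The same function can be reused for later install profiles once you know which packages a given tutorial imports.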

As you progress through the series, install additional groups:

# Hugging Face models — needed from Part 2 onward
pip install -e ".[hf]"

# Training stack (LoRA, DPO) — Parts 4 and 5
pip install -e ".[hf,training]"

# RAG pipeline — Part 7
pip install -e ".[retrieval]"

# Evaluation metrics — Part 8
pip install -e ".[metrics]"

# Vision labs — Part 9
pip install -e ".[vision]"

# Everything at once
pip install -e ".[full]"

Each tutorial specifies which install profile it requires at the top.

Ollama for local inference

Parts 3, 6, 7, and 8 use Ollama to run LLMs locally. If you want to follow those tutorials when you reach them:

curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.2:1b

This is not needed for Parts 1 or 2. Install it when you get there.
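Once Ollama is installed, you can check from Python that the server is reachable and that the model has been pulled. The sketch below queries Ollama's local `/api/tags` endpoint, which lists installed models. It assumes the server runs on the default address `http://localhost:11434`, and the helper names are illustrative only.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address and port.
OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"


def fetch_tags(url=OLLAMA_TAGS_URL, timeout=2.0):
    """Return the parsed /api/tags response, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except (OSError, ValueError):
        # OSError covers connection failures and timeouts; ValueError covers bad JSON.
        return None


def installed_models(tags_payload):
    """Extract model names from a /api/tags payload."""
    return [model["name"] for model in tags_payload.get("models", [])]


def has_model(tags_payload, wanted):
    """True if the wanted model name (including its tag) is installed."""
    return wanted in installed_models(tags_payload)


if __name__ == "__main__":
    payload = fetch_tags()
    if payload is None:
        print("Ollama server not reachable; is it installed and running?")
    elif has_model(payload, "llama3.2:1b"):
        print("llama3.2:1b is available.")
    else:
        print("Server is up, but llama3.2:1b has not been pulled yet.")
```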

Hardware requirements

| Setup | What it runs | Minimum |
|-------|--------------|---------|
| CPU only | Parts 1–3, 6–8 (most scripts) | 8 GB RAM |
| CPU + Ollama | Local LLM inference labs | 16 GB RAM |
| CUDA GPU | Parts 4, 5, 9 (training/vision) | 16 GB+ VRAM recommended |

Part 1 runs comfortably on any modern laptop. The tutorials are designed so that you can go deep on the computational concepts before you need heavier hardware.
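To compare your machine against the RAM figures in the table, a rough standard-library check might look like the following. It is a sketch under several assumptions: it reads physical memory through `os.sysconf`, which is unavailable on Windows (the function then returns None), it applies a 10% tolerance because operating systems usually report slightly less than the nominal RAM size, and it does not inspect GPU memory at all. The names are hypothetical.

```python
import os

# Minimum RAM per setup, taken from the hardware table in this article.
MIN_RAM_GB = {"cpu_only": 8, "cpu_plus_ollama": 16}

# Assumption: allow 10% slack, since reported RAM is usually below the nominal size.
TOLERANCE = 0.9


def total_ram_gb():
    """Return physical RAM in GiB, or None if the platform does not expose it."""
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        return None
    if pages <= 0 or page_size <= 0:
        return None
    return pages * page_size / 1024**3


def setups_supported(ram_gb):
    """List the setups whose RAM minimum is met, within the tolerance."""
    return [name for name, need in MIN_RAM_GB.items() if ram_gb >= need * TOLERANCE]


if __name__ == "__main__":
    ram = total_ram_gb()
    if ram is None:
        print("Could not determine RAM on this platform.")
    else:
        print(f"Detected about {ram:.1f} GiB RAM; supported setups: {setups_supported(ram)}")
```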

Verify your setup

Run the first script from the repository to confirm everything is working:

python 01-transformer-fundamentals/tokenizer_comparison.py

You should see a comparison table showing word-level, character-level, and BPE tokenization applied to the same text, along with token counts and vocabulary sizes. If you have Rich installed (included in the base dependencies), the output will be a formatted table.

If that runs without errors, you are ready for Part 1.
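For a sense of what such a comparison computes, here is a toy sketch of the three tokenization schemes applied to one string. It is not the repository's tokenizer_comparison.py: the function names are invented for illustration, the BPE learner is a minimal character-pair merger with an end-of-word marker, and ties between equally frequent pairs go to the pair seen first.

```python
import re
from collections import Counter


def word_tokens(text):
    """Word-level: split into words and standalone punctuation."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def char_tokens(text):
    """Character-level: every character, including spaces, is a token."""
    return list(text.lower())


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(text, num_merges):
    """Learn up to `num_merges` merges by repeatedly fusing the most frequent pair."""
    words = Counter(tuple(w) + ("</w>",) for w in text.lower().split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # first-seen pair wins ties
        merges.append(best)
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[merge_pair(symbols, best)] += freq
        words = merged_words
    return merges


def bpe_tokens(text, merges):
    """Encode text by applying the learned merges, in order, to each word."""
    tokens = []
    for word in text.lower().split():
        symbols = tuple(word) + ("</w>",)
        for pair in merges:
            symbols = merge_pair(symbols, pair)
        tokens.extend(symbols)
    return tokens


if __name__ == "__main__":
    sample = "low lower lowest low low"
    merges = learn_bpe(sample, num_merges=3)
    for name, toks in [
        ("word", word_tokens(sample)),
        ("char", char_tokens(sample)),
        ("bpe", bpe_tokens(sample, merges)),
    ]:
        print(f"{name:>4}: {len(toks):2d} tokens, {len(set(toks)):2d} unique")
```

On this sample, word-level tokenization yields the fewest tokens, character-level the most, and BPE lands in between, which is the trade-off the real script illustrates on its own text.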

How to follow along

Each tutorial in this series follows the same structure:

  1. Concept: what the mechanism does and why it matters, grounded in the corresponding Stanford lecture
  2. Implementation: building it from scratch or working with real libraries, with code you can run locally
  3. Observation: inspecting the output, understanding what the numbers mean, noticing where things break
  4. Connection: how this piece fits into the larger transformer pipeline

The tutorials reference specific scripts from the repository. You can either read the tutorial and run the scripts as you go, or read the scripts directly; they are self-documenting. Both paths converge on the same understanding.

Next

Part 1: Tokenization and Attention from Scratch. Build the four foundational components of the transformer architecture using nothing but Python and NumPy.