Chain-of-Thought and Reasoning Evaluation

Language models trained to predict the next token are remarkably capable: they can summarize, translate, write code, and answer factual questions. But they struggle with multi-step reasoning. Asking “If a store had 22 oranges, threw away 4, and got 12 new ones, how many do they have now?” requires two arithmetic steps (22 − 4 = 18, then 18 + 12 = 30), and a model that tries to produce the answer in a single forward pass often gets it wrong. The model has no scratchpad; it must solve the problem in the same fixed number of computation steps regardless of difficulty. Chain-of-thought prompting addresses this by giving the model permission to think out loud, generating intermediate reasoning tokens that effectively expand its computation budget.

This tutorial covers chain-of-thought prompting and its variants, implements self-consistency (majority voting over multiple reasoning chains), explores the Pass@K metric for code generation, and measures the practical cost of reasoning in tokens and compute.

All code is in the companion repository under 06-reasoning/.

cd stanford-transformers-llms-labs
pip install -e .

Why vanilla LLMs fail at multi-step reasoning

A transformer processes its entire input in a fixed number of layers, typically 32–96. Each layer performs one round of attention and one feed-forward transformation. This means the model has the same computational budget for “What is 2 + 2?” and “What is 47 × 23?” The first problem can be solved by pattern matching; the second requires carrying digits across multiple positions.

When forced to produce an answer immediately (direct prompting), the model effectively attempts to compute the final result in a single “thought.” For simple problems, this works. For problems requiring multiple sequential steps (multi-digit arithmetic, logical deduction, word problems with several quantities), the model’s fixed-depth computation is insufficient.

The key insight behind chain-of-thought prompting: if we let the model generate intermediate tokens (the reasoning steps), each generated token conditions the next. The autoregressive generation process turns the model into a serial computer; each step of reasoning is a separate forward pass that can build on the previous step. The model’s “thinking” happens in its output, not in its hidden layers.
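To make “each step builds on the previous one” concrete, here is a minimal sketch that writes out 47 × 23 as the sequence of intermediate results a chain of thought would spell out. The function name `long_multiply_steps` is made up for this illustration and is not part of the companion repository; it assumes non-negative integers.

```python
def long_multiply_steps(a: int, b: int) -> list[str]:
    """Spell out a * b as one partial product per digit of b, plus a running total.

    Illustrative only: each line depends on the total left by the line before,
    which is the serial dependency a chain of thought writes down as tokens.
    """
    steps = []
    total = 0
    # Walk the digits of b from least to most significant.
    for place, digit_char in enumerate(reversed(str(b))):
        scaled_digit = int(digit_char) * 10 ** place
        partial = a * scaled_digit
        total += partial
        steps.append(f"{a} x {scaled_digit} = {partial} (running total {total})")
    steps.append(f"The answer is {total}.")
    return steps


for line in long_multiply_steps(47, 23):
    print(line)
```

No single line is hard, but the last line cannot be written without the ones before it.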

Chain-of-thought prompting

The cot_prompting.py script compares direct prompting against chain-of-thought on GSM8K-style grade school math problems, using a local Ollama model.

python 06-reasoning/cot_prompting.py   # requires Ollama running locally

Direct prompting

Ask the model to answer immediately:

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
   Each can has 3 tennis balls. How many tennis balls does he have now?
A: Answer with just the number.

The model must jump from the problem statement to the final number in one step. On multi-step problems, it often produces the wrong answer, not because it lacks the knowledge, but because it lacks the computation.

Chain-of-thought prompting

Add “Let’s solve this step by step, then give the final answer”:

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
   Each can has 3 tennis balls. How many tennis balls does he have now?
A: Let's solve this step by step.
   Roger starts with 5 tennis balls.
   He buys 2 cans × 3 balls per can = 6 new balls.
   Total: 5 + 6 = 11.
   The answer is 11.

By producing the intermediate steps as tokens, the model decomposes the hard problem into three easy sub-problems: “how many balls per can?”, “how many new balls total?”, “what is the sum?” Each sub-problem is trivial. The chain connects them.
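The two prompt styles, and the step of pulling a final number out of a reasoning chain, can be sketched as follows. This is an assumption-laden illustration, not the code in cot_prompting.py: the helper names `build_direct_prompt`, `build_cot_prompt`, and `extract_final_answer` are hypothetical, and the extraction rule (prefer “answer is N”, else fall back to the last number) is one simple heuristic among many.

```python
import re
from typing import Optional

NUMBER = r"-?\d+(?:\.\d+)?"


def build_direct_prompt(question: str) -> str:
    """Direct prompting: ask for the number with no intermediate steps."""
    return f"Q: {question}\nA: Answer with just the number."


def build_cot_prompt(question: str) -> str:
    """Chain-of-thought prompting: invite intermediate reasoning first."""
    return f"Q: {question}\nA: Let's solve this step by step, then give the final answer."


def extract_final_answer(completion: str) -> Optional[str]:
    """Heuristic: take the number after the last 'answer is', else the last number."""
    text = completion.replace(",", "")  # so "1,000" parses as 1000
    stated = re.findall(rf"answer is\s*({NUMBER})", text, flags=re.IGNORECASE)
    if stated:
        return stated[-1]
    numbers = re.findall(NUMBER, text)
    return numbers[-1] if numbers else None


chain = (
    "Let's solve this step by step.\n"
    "Roger starts with 5 tennis balls.\n"
    "He buys 2 cans x 3 balls per can = 6 new balls.\n"
    "Total: 5 + 6 = 11.\n"
    "The answer is 11."
)
print(extract_final_answer(chain))
```

Answer extraction is worth testing on its own: a correct chain scored with a brittle parser looks like a wrong answer.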

This technique was introduced by Wei et al. (2022) in “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” and is one of the simplest yet most effective techniques for improving LLM reasoning. The key finding: CoT helps most on problems requiring multiple reasoning steps, and the benefit scales with model size. Very small models may not produce coherent reasoning chains.

Tip

Why “Let’s think step by step” works

The phrase itself matters less than what it triggers. Any instruction that causes the model to generate intermediate reasoning tokens improves accuracy. “Show your work,” “Break this into steps,” and “Explain your reasoning” all achieve similar effects. The mechanism is the same: more output tokens mean more computation, and more computation means fewer errors on multi-step problems.

Self-consistency: majority voting over reasoning chains

Chain-of-thought is not deterministic. With temperature > 0, the model can produce different reasoning chains for the same problem, some correct and some not. Self-consistency (Wang et al., 2022) exploits this: sample multiple chains, extract the final answer from each, and take the majority vote.

The self_consistency.py script demonstrates this on the same math problems.

python 06-reasoning/self_consistency.py   # requires Ollama running locally

The intuition: different reasoning paths may make different errors, but the correct answer tends to appear most often. If you sample 5 chains and get answers [11, 11, 13, 11, 12], the majority vote is 11, the correct answer. The two erroneous chains made different mistakes that did not reinforce each other.

Self-consistency is an ensemble technique applied to reasoning. The cost is linear: N chains means N× the inference compute. In practice, N=5–10 provides substantial accuracy gains over a single chain. The improvement has diminishing returns. N=20 is rarely much better than N=10.
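The voting step itself is small. The sketch below assumes the final answer has already been extracted from each chain as a string (or `None` when extraction failed); `majority_vote` and `self_consistency` are illustrative names, and `sample_answer` stands in for whatever call samples one chain from the model, so nothing here reflects the actual contents of self_consistency.py.

```python
from collections import Counter
from typing import Callable, Optional


def majority_vote(answers: list[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-None answer; ties go to the answer seen first."""
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


def self_consistency(sample_answer: Callable[[], Optional[str]], n_chains: int = 5) -> Optional[str]:
    """Sample n_chains answers (one model call each) and vote.

    Cost is linear in n_chains: every chain is a full generation.
    """
    return majority_vote([sample_answer() for _ in range(n_chains)])


# The five answers from the example above, replayed through a stub sampler.
canned = iter(["11", "11", "13", "11", "12"])
print(self_consistency(lambda: next(canned), n_chains=5))
```

Dropping `None` answers before voting is a design choice: it stops unparseable chains from winning, at the price of voting over fewer samples.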

Pass@K: evaluating code generation

For code generation, the evaluation is binary: the generated code either passes the test suite or it does not. Pass@K measures the probability that at least one of K generated solutions is correct.

The pass_at_k.py script implements the unbiased estimator from the Codex/HumanEval paper (Chen et al., 2021). No LLM is needed: the script simulates model outputs with varying correctness rates.

python 06-reasoning/pass_at_k.py

The estimator

Given n total samples and c correct samples for a problem:

$$\text{pass@k} = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$$

This computes the probability that at least one of k randomly chosen samples (without replacement) from the n total is correct. The implementation:

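from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Fewer than k incorrect samples: any draw of k must include a correct one.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)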
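The ratio of binomials can also be written as a running product, 1 − ∏ (1 − k/i) for i from n − c + 1 to n, which avoids forming large binomial coefficients. Python's `math.comb` works on exact integers, so the closed form above is safe here; the product form matters mainly if the estimator is ported to fixed-width numeric types. The sketch below is an illustration under those assumptions: `pass_at_k_stable` and `mean_pass_at_k` are hypothetical names, not functions from pass_at_k.py.

```python
from math import comb, isclose


def pass_at_k_stable(n: int, c: int, k: int) -> float:
    """Unbiased pass@k via a running product instead of binomial coefficients."""
    if n - c < k:
        return 1.0
    all_wrong = 1.0  # probability that all k drawn samples are incorrect
    for i in range(n - c + 1, n + 1):
        all_wrong *= 1.0 - k / i
    return 1.0 - all_wrong


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Benchmark-level score: average the per-problem estimates."""
    return sum(pass_at_k_stable(n, c, k) for c in correct_counts) / len(correct_counts)


# Cross-check against the closed form for one case (n=20, c=3, k=5).
closed_form = 1.0 - comb(17, 5) / comb(20, 5)
assert isclose(pass_at_k_stable(20, 3, 5), closed_form)

# Four problems with 18, 10, 3, and 2 correct samples out of 20.
for k in (1, 5, 10, 20):
    print(k, round(mean_pass_at_k([18, 10, 3, 2], n=20, k=k), 3))
```

The estimate is computed per problem and then averaged, rather than pooling all samples across problems.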

The script computes pass@k for k = 1, 5, 10, 20 across problems of varying difficulty. With n=20 samples per problem, representative results look like:

Correct samples	pass@1	pass@5	pass@10	pass@20
18/20	0.90	1.00	1.00	1.00
10/20	0.50	0.98	1.00	1.00
3/20	0.15	0.60	0.89	1.00
2/20	0.10	0.45	0.76	1.00

Pass@20 is 1.00 whenever at least one of the 20 samples is correct because every generated sample is included; it is more informative when n is larger than k. Pass@1 is the hardest bar: the model must get it right on the first try. More samples forgive more. This is why reporting only pass@1 can be misleading: on a problem where only 10% of samples are correct (2 of 20), the model still has a 76% chance of producing at least one correct solution in 10 tries.
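As a worked check of one table entry, take the row with 3 correct samples out of 20 and k = 5. Substituting n = 20, c = 3 into the estimator gives:

$$\text{pass@5} = 1 - \frac{\binom{17}{5}}{\binom{20}{5}} = 1 - \frac{6188}{15504} \approx 1 - 0.399 = 0.601$$

which rounds to the 0.60 shown in the table.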

Note

Pass@K vs best-of-N

Pass@K assumes random selection from the generated samples. Best-of-N (Part 5) uses a reward model to pick the best sample. With an oracle selector, best-of-N upper-bounds random selection; with a learned reward model, the result depends on reward model quality. Pass@K measures raw model capability; best-of-N measures capability plus selection.
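The difference between the three selection rules can be shown on a toy set of samples. Everything in this sketch is hypothetical: the `Sample` type, the function names, and the reward scores are invented, with the scores chosen so that the reward model ranks an incorrect sample highest.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    passes_tests: bool
    reward_score: float  # hypothetical reward-model score, higher is "better"


def random_pick_accuracy(samples: list[Sample]) -> float:
    """Chance that one uniformly chosen sample is correct (the pass@1 estimate)."""
    return sum(s.passes_tests for s in samples) / len(samples)


def best_of_n_correct(samples: list[Sample]) -> bool:
    """Best-of-N: keep the sample the reward model scores highest."""
    return max(samples, key=lambda s: s.reward_score).passes_tests


def oracle_correct(samples: list[Sample]) -> bool:
    """Oracle selector: succeeds whenever any sample is correct."""
    return any(s.passes_tests for s in samples)


samples = [
    Sample(passes_tests=False, reward_score=0.9),  # confidently wrong
    Sample(passes_tests=True, reward_score=0.7),
    Sample(passes_tests=False, reward_score=0.2),
    Sample(passes_tests=False, reward_score=0.4),
]
print(random_pick_accuracy(samples), best_of_n_correct(samples), oracle_correct(samples))
```

In this toy case a correct sample exists, so the oracle succeeds, yet best-of-N fails because the selector prefers a wrong answer.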

Reasoning cost

Chain-of-thought uses significantly more tokens than direct answers. This matters: API providers charge per token, and on simple problems reasoning traces can be 5–20× the direct-answer length, with reasoning-trained models going higher still.

The reasoning_cost_tracker.py script measures the token difference on the same set of math problems.

python 06-reasoning/reasoning_cost_tracker.py   # requires Ollama running locally

A direct answer to “Roger has 5 tennis balls…” might be 2 tokens: “11.” The chain-of-thought answer might be 50–100 tokens including all the intermediate steps. At a price of $0.01 per 1,000 output tokens, this raises the output cost from $0.00002 to between $0.0005 and $0.001, a 25–50× difference per query. At scale, this adds up.
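The arithmetic is easy to reproduce. The sketch below uses the token counts and the $0.01 per 1,000 tokens price from the example above; `output_cost` is an illustrative helper, it counts output tokens only, and real pricing usually bills input tokens separately.

```python
def output_cost(tokens: int, price_per_1k: float) -> float:
    """Dollar cost of `tokens` output tokens at `price_per_1k` dollars per 1,000."""
    return tokens / 1000 * price_per_1k


PRICE = 0.01  # dollars per 1,000 output tokens, from the example in the text

direct = output_cost(2, PRICE)
cot_short = output_cost(50, PRICE)
cot_long = output_cost(100, PRICE)

print(f"direct:    ${direct:.5f}")
print(f"CoT (50):  ${cot_short:.5f}  ({cot_short / direct:.0f}x)")
print(f"CoT (100): ${cot_long:.5f}  ({cot_long / direct:.0f}x)")
```

Because the price cancels out, the multiplier depends only on the token counts: 50 or 100 tokens against a 2-token answer.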

Modern reasoning models

Models like OpenAI’s o1 / o3 (and o3-mini, released late 2024 / early 2025), DeepSeek-R1, and Claude’s extended thinking mode go beyond prompt-based CoT. They train the reasoning process directly with reinforcement learning. Of these, only DeepSeek’s training recipe is public: the R1 paper details a multi-stage pipeline using GRPO (Group Relative Policy Optimization), an offshoot of PPO that drops the value model and uses group-relative advantages. The training procedures for OpenAI’s reasoning family and Claude’s extended thinking are not public; commentary that conflates them all under “they use GRPO” is speculation, not fact. What we do know is that all three families produce internal “thinking tokens” that are part of the generation but may not be shown to the user, and the user interface often displays a summary of the reasoning (“Thinking…”) rather than the full chain.

These reasoning tokens have real cost implications. An o1 / o3 response might use 5,000 tokens of internal reasoning to produce a 200-token answer. The API charges for all of them. The tradeoff is accuracy: on hard math and code problems, reasoning models dramatically outperform standard models, but at 10–50× the token cost.

The practical question is when reasoning is worth the cost. For simple factual queries, CoT adds cost without improving accuracy. For multi-step math, code generation, and complex analysis, the accuracy gains justify the expense. The reasoning_cost_tracker.py script helps quantify this tradeoff.
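One way to frame the tradeoff is cost per correct answer: the cost of a query divided by the probability that it is right. The sketch below is only a framing device. The token counts echo the 200-token answer and 5,000 reasoning tokens mentioned above, but the accuracy figures are made-up placeholders, not measurements, and `cost_per_correct` is a hypothetical helper rather than something the tracker script is known to provide.

```python
def cost_per_correct(tokens_per_query: int, price_per_1k: float, accuracy: float) -> float:
    """Expected dollars spent per correct answer (query cost divided by accuracy)."""
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")
    return tokens_per_query / 1000 * price_per_1k / accuracy


PRICE = 0.01  # dollars per 1,000 output tokens (illustrative)

# Placeholder accuracies for a hard problem set; substitute measured values.
direct = cost_per_correct(tokens_per_query=200, price_per_1k=PRICE, accuracy=0.40)
reasoning = cost_per_correct(tokens_per_query=5200, price_per_1k=PRICE, accuracy=0.90)

print(f"direct:    ${direct:.4f} per correct answer")
print(f"reasoning: ${reasoning:.4f} per correct answer")
```

This metric treats a wrong answer as merely wasted spend. When wrong answers carry their own cost, such as a failed build or a bad decision, that cost belongs in the comparison and can tip it toward the more accurate option.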

Benchmarks overview

The reasoning capabilities of LLMs are evaluated across several benchmark families:

Benchmark	Domain	Difficulty	What it tests
GSM8K	Grade school math	Moderate	Multi-step arithmetic word problems
HumanEval	Code generation	Moderate	Function-level Python from docstrings
Codeforces	Competitive programming	Hard	Algorithmic problem solving
SWE-Bench	Software engineering	Very hard	Fixing real GitHub issues in full codebases
AIME	Math olympiad	Very hard	Advanced mathematical reasoning

GSM8K is largely solved by frontier models; most achieve 90%+ accuracy with CoT. HumanEval has also seen rapid progress. Codeforces and AIME remain challenging because they require deep combinatorial reasoning that current models struggle with. SWE-Bench is the hardest because it requires understanding entire codebases, not just isolated problems: the model must navigate files, understand APIs, and produce patches that pass CI.

The trend is that easier benchmarks saturate quickly, forcing the field to create harder ones. A model that scores well on GSM8K may still fail on problems requiring genuine multi-step planning or novel problem decomposition.

What to notice

CoT helps most on multi-step problems. On single-step factual recall, direct prompting and CoT perform similarly. The gap widens as the number of reasoning steps increases.
Self-consistency trades compute for accuracy. Sampling 5 chains and voting is often more reliable than a single chain. The gains come from error diversity: different chains fail differently.
Pass@K reveals hidden capability. A model that looks weak at pass@1 may be strong at pass@10. Many correct solutions exist in the model’s distribution; the question is whether they surface on the first sample.
Reasoning tokens are expensive. Prompt-based CoT typically increases output tokens several-fold; reasoning-trained models like o1 can push that to 10–50×. The cost is justified for hard problems but wasteful for easy ones.
Benchmarks saturate. GSM8K went from a challenging benchmark to a nearly-solved one in two years. The field needs progressively harder evaluations to measure genuine progress.

Next

Part 7: Building a Tool-Calling Agent with RAG. Move from reasoning about what the model knows to accessing external knowledge. Build a retrieval-augmented generation pipeline, implement tool-calling, and construct a ReAct-style agent that interleaves reasoning with action.
