LLM Decoding and Prompt Strategies

Parts 1 and 2 covered the encoder side of the transformer: building attention from scratch, then using BERT for classification. Modern LLMs are decoder-only: they generate one token at a time, each conditioned on everything before it. The architecture is simpler than the original encoder-decoder transformer (no cross-attention), but the design space for how to select each next token is surprisingly rich. Greedy decoding, beam search, top-k sampling, and nucleus sampling all make different tradeoffs between coherence and diversity. On top of that, how you write the prompt (zero-shot, few-shot, or chain-of-thought) can dramatically change the quality of the output without touching a single model weight.

This tutorial implements and compares these decoding strategies and prompting techniques. It also covers Mixture-of-Experts, the scaling strategy that lets models like Mixtral expose much larger total parameter counts while only activating a fraction per token, and context degradation, the empirical reality that LLMs lose track of information buried in long prompts.

All code is in the companion repository under 03-llm-inference/.

cd stanford-transformers-llms-labs
pip install -e ".[hf]"

Decoder-only generation

The original transformer (Part 1) uses an encoder to read the input and a decoder to produce the output. BERT (Part 2) uses only the encoder; it reads the full input bidirectionally and outputs contextual representations for classification. GPT and most modern LLMs go the other direction: decoder-only.

A decoder-only model generates text autoregressively. Given a prompt, it produces one token, appends it to the input, then produces the next token conditioned on everything so far. The process repeats until a stopping condition is met (max length, end-of-sequence token, or a stop string).
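The loop described above can be sketched in a few lines. This is a minimal illustration, not the script's code: `generate`, `toy_next_token`, and the EOS id of 9 are hypothetical stand-ins, and a real model would replace `toy_next_token` with a forward pass plus a decoding strategy.

```python
from typing import Callable, List


def generate(
    prompt_ids: List[int],
    next_token: Callable[[List[int]], int],
    max_new_tokens: int,
    eos_id: int,
) -> List[int]:
    """Append one token at a time until EOS or the length limit is reached."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        token = next_token(ids)  # conditioned on everything generated so far
        ids.append(token)
        if token == eos_id:  # stopping condition: end-of-sequence token
            break
    return ids


# Hypothetical stand-in for a model: emits (last token + 1), capped at EOS (id 9).
def toy_next_token(ids: List[int]) -> int:
    return min(ids[-1] + 1, 9)


out = generate([3, 4], toy_next_token, max_new_tokens=20, eos_id=9)
print(out)  # [3, 4, 5, 6, 7, 8, 9]
```

The only state carried between steps is the growing list of token ids, which is why each new token can depend on the whole prefix.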

The key architectural difference from the encoder is masked self-attention: each token can only attend to tokens at its position or earlier. Token 5 sees tokens 0–5 but not 6 onward. This causal mask ensures the model can be trained on sequences in parallel (each position predicts the next token using only leftward context) while generating sequentially at inference time.
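To make the causal mask concrete, here is a small NumPy sketch. It assumes a single attention head with random scores; `causal_mask` and `masked_softmax` are illustrative names, not functions from the companion repository.

```python
import numpy as np


def causal_mask(seq_len: int) -> np.ndarray:
    # True where attention is allowed: key position <= query position.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))


def masked_softmax(scores: np.ndarray, mask: np.ndarray) -> np.ndarray:
    # Disallowed positions get -inf, so they receive exactly zero weight.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
scores = rng.normal(size=(6, 6))  # raw attention scores for 6 tokens
attn = masked_softmax(scores, causal_mask(6))

# Row 5 (token 5) has weights over tokens 0-5; row 0 can only attend to itself.
print(np.round(attn, 2))
```

Every row still sums to 1, but all entries above the diagonal are zero, which is what lets training process all positions in parallel without leaking future tokens.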

This left-to-right constraint is what makes decoding strategy selection matter. The model outputs a probability distribution over the entire vocabulary at each step. The question is: how do you select the next token from that distribution?

Decoding strategies

The decoding_strategies.py script loads GPT-2 (~124M parameters, runs on CPU) and generates 50 tokens from the same prompt using four base decoding strategies, then sweeps temperature separately.

python 03-llm-inference/decoding_strategies.py

The prompt is "The key innovation of the transformer architecture is" and the generation wrapper is straightforward:

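import torch

# SEED, PROMPT, and MAX_NEW_TOKENS are constants defined elsewhere in the script.
def generate_text(model, tokenizer, strategy_name, **generate_kwargs):
    # Reseed before every run so the sampling strategies are reproducible.
    torch.manual_seed(SEED)
    input_ids = tokenizer.encode(PROMPT, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            input_ids,
            max_new_tokens=MAX_NEW_TOKENS,
            **generate_kwargs,
        )
    # Drop the prompt tokens and decode only the generated continuation.
    new_tokens = output_ids[0][input_ids.shape[-1] :]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)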

The strategies differ only in what gets passed as generate_kwargs:

strategies = {
    "Greedy": dict(do_sample=False),
    "Beam search (B=5)": dict(do_sample=False, num_beams=5, early_stopping=True),
    "Top-k (k=50)": dict(do_sample=True, top_k=50),
    "Top-p / nucleus (p=0.9)": dict(do_sample=True, top_p=0.9, top_k=0),
}

Greedy decoding

Always pick the single highest-probability token. Fast and deterministic, but it cannot recover from a choice that is locally optimal but globally bad. If the model assigns high probability to a generic continuation early on, it gets locked into bland, repetitive text. Greedy decoding is what generate() does by default (do_sample=False with a single beam).

Beam search

Maintain the top-B candidate sequences at every step and return the one with the highest total log-probability. With B=5, beam search tracks five partial sequences in parallel, so it can find higher-likelihood sequences than greedy decoding, although it is still a heuristic and does not guarantee the globally best sequence. The cost is roughly B× in both memory and compute. Despite the improved likelihood, beam search outputs can still be bland: it optimizes raw probability, which often means safe, generic text.
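A toy example shows how greedy decoding can get locked in while beam search recovers. The `TOY_MODEL` table below is a hypothetical two-step "model" invented for illustration (it is not GPT-2 output), and this simplified beam search omits length penalties and early stopping.

```python
import math

# Hypothetical toy "model": next-token probabilities given the prefix so far.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.5, "y": 0.5},
    ("B",): {"x": 0.9, "y": 0.1},
}


def greedy(steps):
    seq, logp = (), 0.0
    for _ in range(steps):
        # Commit to the single most probable next token.
        token, p = max(TOY_MODEL[seq].items(), key=lambda kv: kv[1])
        seq, logp = seq + (token,), logp + math.log(p)
    return seq, logp


def beam_search(steps, num_beams):
    beams = [((), 0.0)]
    for _ in range(steps):
        # Expand every surviving beam by every possible next token...
        candidates = [
            (seq + (tok,), logp + math.log(p))
            for seq, logp in beams
            for tok, p in TOY_MODEL[seq].items()
        ]
        # ...and keep only the num_beams best by total log-probability.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams[0]


greedy_seq, greedy_logp = greedy(2)
beam_seq, beam_logp = beam_search(2, num_beams=2)
print(greedy_seq, round(math.exp(greedy_logp), 2))  # ('A', 'x') 0.3
print(beam_seq, round(math.exp(beam_logp), 2))      # ('B', 'x') 0.36
```

Greedy commits to "A" because 0.6 > 0.4, then can do no better than 0.6 × 0.5 = 0.30. Beam search keeps "B" alive and finds 0.4 × 0.9 = 0.36.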

Top-k sampling

Restrict the next-token distribution to the k most probable tokens, renormalize, then sample. With k=50, very unlikely tokens are eliminated but there is still room for diversity. The problem is that k is fixed: in a peaked distribution (the model is confident), k=50 includes many poor options. In a flat distribution (the model is uncertain), k=50 might exclude good alternatives.
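As a rough sketch of the filtering step, the function below keeps the k largest logits and renormalizes over them. The name `top_k_filter` and the example logits are assumptions for illustration; Hugging Face implements the same idea internally when you pass `top_k`.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()


def top_k_filter(logits, k):
    """Return a distribution that is zero outside the k highest-logit tokens."""
    probs = np.zeros_like(logits, dtype=float)
    keep = np.argsort(logits)[-k:]        # indices of the k largest logits
    probs[keep] = softmax(logits[keep])   # renormalize over the survivors
    return probs


logits = np.array([4.0, 3.0, 1.0, 0.5, -2.0])  # made-up logits for a 5-token vocabulary
probs = top_k_filter(logits, k=2)

rng = np.random.default_rng(0)
token = rng.choice(len(probs), p=probs)  # sample only among the kept tokens
print(np.round(probs, 3), token)
```

Note that k stays at 2 no matter how peaked or flat the logits are, which is exactly the limitation described above.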

Nucleus (top-p) sampling

Dynamically choose the smallest set of tokens whose cumulative probability reaches p. With p=0.9, the model considers fewer tokens when it is confident and more tokens when it is uncertain. This adapts to the shape of the distribution at every step, which is why top-p is often preferred over top-k as a default sampling strategy.
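The adaptive behaviour is easy to see on two hand-made distributions. This is a minimal sketch: `top_p_filter` is an illustrative name and the probabilities are invented, not model output.

```python
import numpy as np


def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]            # token indices, most probable first
    cumulative = np.cumsum(probs[order])
    # First position where the running total reaches p, converted to a count.
    cutoff = min(int(np.searchsorted(cumulative, p)) + 1, len(probs))
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep] / probs[keep].sum()  # renormalize the nucleus
    return filtered


confident = np.array([0.92, 0.05, 0.02, 0.01])  # peaked: the model is sure
uncertain = np.array([0.25, 0.25, 0.25, 0.25])  # flat: the model is unsure

print(np.count_nonzero(top_p_filter(confident, 0.9)))  # 1
print(np.count_nonzero(top_p_filter(uncertain, 0.9)))  # 4
```

With the same p, the nucleus shrinks to a single token for the peaked distribution and expands to all four for the flat one.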

Temperature

Temperature divides the logits by a scalar before softmax. The script sweeps T = 0.1, 0.5, 1.0, 1.5, all with top-p = 0.9:

T < 1 sharpens the distribution: the highest-probability token gets even more mass. At T = 0.1, sampling is nearly deterministic.
T = 1 leaves the model’s learned distribution unchanged.
T > 1 flattens the distribution: probability mass spreads to lower-ranked tokens. At T = 1.5, outputs become more creative but can turn incoherent.

In practice, top-p ≈ 0.9 with temperature ≈ 0.7–1.0 is a common default for open-ended generation.
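The effect of temperature can be checked directly on a small logit vector. The sketch below uses made-up logits and the same four temperatures as the sweep; `softmax_with_temperature` is an illustrative helper, not a function from the script.

```python
import numpy as np


def softmax_with_temperature(logits, temperature):
    scaled = logits / temperature          # divide logits by T before softmax
    e = np.exp(scaled - scaled.max())      # subtract max for numerical stability
    return e / e.sum()


logits = np.array([2.0, 1.0, 0.0])  # made-up logits for a 3-token vocabulary
temperatures = (0.1, 0.5, 1.0, 1.5)
top_probs = [softmax_with_temperature(logits, t)[0] for t in temperatures]

for t in temperatures:
    print(t, np.round(softmax_with_temperature(logits, t), 3))
```

The probability of the top token falls steadily as T rises: nearly all of the mass sits on it at T = 0.1, and it is spread much more evenly at T = 1.5.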

Tip

Decoding strategy and task type: For tasks with a constrained answer space (math, factual QA, classification), start with greedy or low-temperature sampling. For code, the right default depends on the workflow: deterministic decoding is useful for single-shot answers, while sampled decoding can be better when you plan to generate multiple candidates and rerank or test them. For creative writing, brainstorming, or diverse generation, use higher temperature with top-p. There is no universally best strategy: the right choice depends on whether you want precision, diversity, or candidate breadth.

Mixture of Experts: scaling without proportional compute

Large language models scale by adding parameters, but more parameters mean more compute per token. Mixture of Experts (MoE) breaks this relationship: the model has N “expert” sub-networks (typically feed-forward layers), but each token is routed to only a small subset of them.

The moe_simulator.py script simulates this with 8 experts and top-2 sparse routing on 20 random token embeddings. It does not load a real neural network: the point is to visualize the routing mechanics.

python 03-llm-inference/moe_simulator.py   # saves moe_routing.png

How routing works

A lightweight gating network maps each token embedding to a score per expert. In sparse routing, only the top-k scoring experts (typically k=2) are activated for each token:

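import numpy as np

# softmax is a helper defined elsewhere in the script.
def sparse_routing(logits, top_k=2):
    # logits has shape (num_tokens, num_experts): one gating score per expert.
    num_tokens, num_experts = logits.shape
    weights = np.zeros_like(logits)
    for t in range(num_tokens):
        # Indices of the top_k highest-scoring experts for this token.
        top_indices = np.argsort(logits[t])[-top_k:]
        selected_logits = logits[t, top_indices]
        # Renormalize over the selected experts only; the rest stay at zero.
        selected_weights = softmax(selected_logits)
        weights[t, top_indices] = selected_weights
    return weights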

In dense routing (what a non-MoE model effectively does), every expert receives a softmax weight for every token, so there are no compute savings.

The script produces a figure with three subplots: the sparse routing heatmap (mostly white, with a few bright cells per row), the dense routing heatmap (colour everywhere), and an expert load bar chart showing how many tokens each expert handles. The contrast is the point: with sparse routing the model can hold eight experts' worth of feed-forward parameters while spending only two experts' worth of compute per token.
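The dense-versus-sparse contrast can also be reproduced numerically. The sketch below is an approximation of the setup described above (8 experts, 20 tokens, random gating scores) with a vectorized top-2 selection; the seed, variable names, and helper functions are assumptions and may differ from moe_simulator.py.

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def dense_routing(logits):
    # Every expert gets a nonzero weight for every token.
    return softmax(logits, axis=1)


def active_experts_per_token(weights):
    return np.count_nonzero(weights, axis=1)


rng = np.random.default_rng(0)
logits = rng.normal(size=(20, 8))  # 20 tokens x 8 experts, random gating scores

dense = dense_routing(logits)

# Sparse top-2: keep only the two largest logits in each row, then renormalize.
top2 = np.argsort(logits, axis=1)[:, -2:]
rows = np.arange(logits.shape[0])[:, None]
sparse = np.zeros_like(logits)
sparse[rows, top2] = softmax(logits[rows, top2], axis=1)

print(active_experts_per_token(dense))   # 8 for every token
print(active_experts_per_token(sparse))  # 2 for every token
```

Both weight matrices sum to 1 per token; the difference is how many expert forward passes each token would trigger.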

Load balancing

The main challenge is uneven routing. If the gating network sends most tokens to the same few experts, the others sit idle and their parameters are wasted. Real MoE models (Switch Transformer, Mixtral) add an auxiliary load-balancing loss that encourages even expert utilization. The script reports the coefficient of variation of expert load so you can see the imbalance with random gating.
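Both quantities can be sketched in a few lines. The coefficient of variation is simply the standard deviation of expert load divided by its mean. The auxiliary loss below follows the top-1 form from the Switch Transformer paper (number of experts times the sum over experts of token fraction times mean router probability), without the scaling coefficient; the function names and the 4-expert, 8-token example are assumptions for illustration and are not taken from the script.

```python
import numpy as np


def coefficient_of_variation(load):
    # 0 means perfectly even load; larger values mean more imbalance.
    return load.std() / load.mean()


def switch_style_aux_loss(router_probs, assignments, num_experts):
    # f[i]: fraction of tokens assigned to expert i (top-1 routing).
    f = np.bincount(assignments, minlength=num_experts) / len(assignments)
    # P[i]: mean router probability given to expert i across tokens.
    P = router_probs.mean(axis=0)
    return num_experts * np.sum(f * P)


num_experts, num_tokens = 4, 8

# Perfectly balanced: uniform router probabilities, tokens spread evenly.
balanced_probs = np.full((num_tokens, num_experts), 0.25)
balanced_assign = np.array([0, 1, 2, 3, 0, 1, 2, 3])

# Collapsed: the router sends every token to expert 0 with full confidence.
collapsed_probs = np.zeros((num_tokens, num_experts))
collapsed_probs[:, 0] = 1.0
collapsed_assign = np.zeros(num_tokens, dtype=int)

balanced_load = np.bincount(balanced_assign, minlength=num_experts)
collapsed_load = np.bincount(collapsed_assign, minlength=num_experts)

print(coefficient_of_variation(balanced_load), coefficient_of_variation(collapsed_load))
print(switch_style_aux_loss(balanced_probs, balanced_assign, num_experts))    # 1.0
print(switch_style_aux_loss(collapsed_probs, collapsed_assign, num_experts))  # 4.0
```

The loss reaches its minimum of 1 under uniform routing and grows to the number of experts when routing collapses onto one expert, so minimizing it pushes the router toward even utilization.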

Note

MoE in practice: Mixtral 8x7B has about 47B total parameters but activates only ~13B per token (2 of 8 experts). This gives it the quality of a much larger dense model at the inference cost of a smaller one. The tradeoff is memory: all 47B parameters must be loaded even though only a fraction is used per token.

Prompt engineering: changing output without changing weights

Decoding strategies control how tokens are selected from the distribution. Prompt engineering controls the distribution itself by changing the input. The prompt_engineering.py script compares three strategies on arithmetic and logic problems via a local Ollama model.

To make the comparison meaningful, keep the model and decoding settings fixed across all prompt variants. A good baseline is llama3.2:1b with deterministic or near-deterministic decoding (temperature=0 or very low, fixed top_p, same max-token limit). Larger or reasoning-tuned models may narrow the gap or solve some of these questions correctly even in zero-shot form.
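One way to pin the decoding settings is to send every prompt variant through the same request builder. The sketch below targets Ollama's local `/api/generate` endpoint using only the standard library; the specific option values (`top_p`, `num_predict`, `seed`) and the helper names `build_payload` and `ask` are assumptions for illustration, and the script may call Ollama differently.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama address


def build_payload(prompt, model="llama3.2:1b"):
    """Same model and decoding options for every prompt variant."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a token stream
        "options": {
            "temperature": 0,    # deterministic or near-deterministic decoding
            "top_p": 0.9,        # assumed value; keep it fixed across variants
            "num_predict": 256,  # assumed max-token limit
            "seed": 0,           # assumed seed
        },
    }


def ask(prompt, model="llama3.2:1b"):
    """Send one prompt to a running Ollama server and return the generated text."""
    data = json.dumps(build_payload(prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["response"]


payload = build_payload("Q: What is 2 + 2?\nA:")
print(json.dumps(payload, indent=2))
```

Because only the `prompt` field changes between zero-shot, few-shot, and chain-of-thought runs, any difference in the answers can be attributed to the prompt rather than to the sampler.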

python 03-llm-inference/prompt_engineering.py   # requires Ollama running locally

Zero-shot

The bare question with no context:

def zero_shot_prompt(question):
    return f"Q: {question}\nA:"

The model relies entirely on its pre-training. For the classic bat-and-ball problem (“A bat and a ball cost $1.10 total. The bat costs $1.00 more than the ball. How much does the ball cost?”), small general-purpose instruct models often produce the intuitive-but-wrong answer of $0.10 because they pattern-match to surface-level cues. Larger or reasoning-tuned models may answer correctly even without additional prompting.
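For reference, the correct answer follows from one line of algebra: if the ball costs b, then b + (b + 1.00) = 1.10, so 2b = 0.10 and b = 0.05. The short check below works in integer cents to avoid floating-point noise; `total_cents` is a hypothetical helper written for this illustration.

```python
def total_cents(ball_cents: int) -> int:
    """Total price if the bat costs exactly 100 cents more than the ball."""
    bat_cents = ball_cents + 100
    return ball_cents + bat_cents


# Intuitive answer: ball = 10 cents, which makes the total 120 cents, not 110.
print(total_cents(10))  # 120

# Correct answer: ball = 5 cents, bat = 105 cents, total = 110 cents.
print(total_cents(5))   # 110

# Search all whole-cent prices to confirm the solution is unique.
solutions = [c for c in range(0, 111) if total_cents(c) == 110]
print(solutions)        # [5]
```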

Few-shot

Prepend solved examples before the actual question:

def few_shot_prompt(question, examples):
    return f"{examples}Q: {question}\nA:"

This primes the model to follow the demonstrated format and reasoning pattern. It works because of in-context learning: the model “learns” from the examples within its context window. Few-shot can improve accuracy by providing a template the model can follow.
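To show what the `examples` argument might look like, the sketch below formats a list of solved question-answer pairs in the same `Q:`/`A:` layout as the zero-shot prompt. The helper `format_examples` and the arithmetic examples are assumptions for illustration; the script's actual examples may differ.

```python
def few_shot_prompt(question, examples):
    return f"{examples}Q: {question}\nA:"


def format_examples(pairs):
    """Render solved (question, answer) pairs in the same Q:/A: layout."""
    return "".join(f"Q: {q}\nA: {a}\n\n" for q, a in pairs)


# Hypothetical solved examples; any task-relevant pairs would work.
pairs = [
    ("What is 12 + 30?", "42"),
    ("What is 7 * 6?", "42"),
]

prompt = few_shot_prompt("What is 9 * 9?", format_examples(pairs))
print(prompt)
```

The prompt ends with an open `A:`, so the most natural continuation for the model is an answer in the same short format as the demonstrations.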

Chain-of-thought

Append “Let’s think step by step.” after the question:

def chain_of_thought_prompt(question):
    return f"Q: {question}\nA: Let's think step by step."

Chain-of-thought (CoT) prompting works because it gets the model to produce intermediate reasoning tokens before the final answer. Each generated token conditions the next, so by generating “thinking” tokens the model effectively performs serial computation, breaking a hard multi-step problem into a sequence of easier single-step sub-problems. This can substantially improve accuracy on arithmetic, logic, and multi-step reasoning tasks, especially on non-reasoning-tuned models (Wei et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”, NeurIPS 2022; the zero-shot “Let’s think step by step” trigger used here comes from Kojima et al., “Large Language Models are Zero-Shot Reasoners”, NeurIPS 2022).

The results vary by model and run. Larger models benefit more from CoT; very small models may not have enough capacity to produce coherent reasoning chains. The key observation is that CoT helps most on problems requiring multiple reasoning steps.

Context degradation

Even within their stated context window, LLMs can struggle to retrieve information placed at the beginning or middle of a long prompt. The context_window_demo.py script demonstrates this directly.

python 03-llm-inference/context_window_demo.py   # requires Ollama running locally

The script plants a specific fact (“The secret code is BLUE-FALCON-42”) at the very beginning of the prompt, then pads with increasingly long filler text (topically unrelated paragraphs), and finally asks: “What is the secret code mentioned earlier?”

Testing at five context lengths (~500, ~1000, ~2000, ~4000, ~8000 estimated tokens), the script checks whether the model can retrieve the planted fact. At short contexts, retrieval is typically successful. As the context grows, retrieval becomes unreliable: the model may hallucinate an answer or claim no code was mentioned. This occurs even when the total context is well within the model’s maximum window.
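The prompt construction can be sketched as follows. This is an approximation of the procedure described above, not the script's code: the filler paragraph, the 4-characters-per-token estimate, and the helper names are all assumptions.

```python
FACT = "The secret code is BLUE-FALCON-42."
QUESTION = "What is the secret code mentioned earlier?"
# Hypothetical filler paragraph, topically unrelated to the planted fact.
FILLER = (
    "Gardeners often rotate crops between seasons to keep soil nutrients "
    "balanced and to reduce the buildup of pests."
)


def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumed): about 4 characters per token for English text.
    return len(text) // 4


def build_prompt(target_tokens: int) -> str:
    """Fact first, then filler until the estimate is reached, then the question."""
    parts = [FACT]
    while estimate_tokens("\n\n".join(parts)) < target_tokens:
        parts.append(FILLER)
    parts.append(QUESTION)
    return "\n\n".join(parts)


def retrieved(answer: str) -> bool:
    # Case-insensitive check for the planted code in the model's reply.
    return "BLUE-FALCON-42" in answer.upper()


for target in (500, 1000, 2000, 4000, 8000):
    prompt = build_prompt(target)
    print(target, estimate_tokens(prompt))
```

Each prompt would then be sent to the model with fixed decoding settings, and `retrieved` applied to the reply to record a hit or a miss at that length.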

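This degradation is often called “context rot” and is closely related to the “lost in the middle” effect (Liu et al., 2023), in which retrieval accuracy is highest for information at the start or end of the context and lowest in the middle. Attention mechanisms exhibit recency bias (recent tokens get more weight) and sometimes primacy bias (first tokens get more weight), which can leave the middle as a dead zone.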

Mitigation strategies include:

Placing critical information at the end of the prompt, near the question
Using retrieval-augmented generation (RAG) to select only relevant context (covered in Part 7)
Fine-tuning with long-context data
Architectural improvements like long-context positional methods, memory mechanisms, or attention sinks

For a deeper look at the security implications of context windows, see LLM Tokens, Context, and Attack Surface.

What to notice

After running the four scripts, reflect on these observations:

Greedy and beam search outputs are usually more repetitive than sampling. Likelihood optimization favors safe, generic continuations. Sampling introduces diversity at the cost of occasional incoherence.
Top-p adapts where top-k cannot. When the model is confident (peaked distribution), top-p automatically narrows the candidate set. When unsure (flat distribution), it widens. Top-k uses the same fixed window regardless.
Temperature is a continuous knob, not a binary switch. The difference between T=0.5 and T=0.7 is subtle but real. Tuning temperature is often more productive than switching between decoding strategies entirely.
Prompt engineering gains are strongest on multi-step problems. Zero-shot, few-shot, and CoT perform similarly on simple factual recall. The gap widens on problems requiring arithmetic or logical reasoning.
Context degradation happens well within the model’s stated limit. A model with a 128k context window may still struggle to retrieve a fact planted 4000 tokens ago if surrounded by noise. The context window is a ceiling, not a guarantee.

Next: Part 4, Efficient Fine-Tuning with LoRA and Quantization, moves from inference to training. You will fine-tune a language model with LoRA while training only a fraction of the parameters, compare precision levels from FP32 to 4-bit, and examine Flash Attention’s memory-compute tradeoff.
