Building a Tool-Calling Agent with RAG

Language models are trained on data collected up to a certain date. After that date, their knowledge is frozen in the weights. Ask about an event after the cutoff and the model either hallucinates a plausible-sounding answer or admits it does not know. This is not a failure of reasoning; it is a fundamental architectural constraint. The model’s parameters encode a static snapshot of the world. Three solutions exist: retrain the model (expensive and slow), fine-tune on new data (still expensive), or retrieve relevant information at query time and inject it into the prompt. The third option, retrieval-augmented generation, is the one that scales.

This tutorial builds a complete RAG pipeline from chunking through generation, implements tool-calling agents that can invoke external functions, and constructs a ReAct-style agent that interleaves reasoning with action. It also covers how to evaluate whether your retrieval system is actually finding the right documents.

All code is in the companion repository under 07-agentic-llms/.

cd stanford-transformers-llms-labs
pip install -e ".[retrieval]"

RAG: retrieval-augmented generation

The rag_pipeline.py script implements a minimal but complete RAG pipeline using sentence-transformers for embedding and ChromaDB for vector storage.

python 07-agentic-llms/rag_pipeline.py   # requires Ollama + chromadb + sentence-transformers

If you have your own corpus of lecture transcripts at data/transcripts/, the script uses them automatically. Otherwise it falls back to a built-in corpus of course topics.

The pipeline

The Local RAG tutorial walks chunking, embedding, storage, and retrieval from scratch. This tutorial assumes that foundation and focuses on what is different here:

Embedding model. Uses all-MiniLM-L6-v2 (sentence-transformers, 384-dim) rather than nomic-embed-text via Ollama. The retrieval-quality numbers below are model-specific.
Top-k. Defaults to k=3, the same as the Local RAG default, but the evaluation script in the next section sweeps k so you can see how the metrics move.
Storage backend. ChromaDB uses HNSW (an approximate nearest neighbor algorithm) under the hood. For corpora in the low thousands of documents you would not notice the difference from exact search; ANN starts to matter at millions of documents.
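
To make the exact-versus-approximate distinction concrete, here is a minimal sketch of exact (brute-force) search in plain Python. It is not how ChromaDB works internally; it is the baseline that an approximate index such as HNSW trades a little accuracy against. The function names, the toy three-dimensional vectors, and the choice of cosine similarity are all assumptions for illustration.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def exact_top_k(query: list[float], docs: dict[str, list[float]], k: int = 3) -> list[str]:
    """Score every document against the query and keep the k best ids."""
    ranked = sorted(docs, key=lambda doc_id: cosine(query, docs[doc_id]), reverse=True)
    return ranked[:k]

# Toy embeddings (hypothetical); real all-MiniLM-L6-v2 vectors have 384 dimensions.
docs = {
    "attention": [0.9, 0.1, 0.0],
    "tokenization": [0.1, 0.9, 0.0],
    "rlhf": [0.0, 0.2, 0.9],
}
print(exact_top_k([1.0, 0.2, 0.0], docs, k=2))
```

Exact search compares the query against every stored vector, so its cost grows linearly with the corpus. That is negligible for a few thousand chunks and is the cost an approximate index avoids at larger scale.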

The generation prompt follows the same grounded pattern:

Answer the question using ONLY the context provided below.
If the context doesn't contain enough information, say so.

Context:
[retrieved chunk 1]
[retrieved chunk 2]
[retrieved chunk 3]

Question: [user's question]

The critical design decision: the LLM never searches. The retrieval system finds relevant context; the LLM reads it and generates from it. By instructing the model to answer only from the provided context, we ground its output in the retrieved documents and reduce hallucination.
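
As a rough sketch of how retrieved chunks could be assembled into that grounded prompt: the template text is the one shown above, but the function name build_grounded_prompt and the choice to join chunks with newlines are assumptions, not necessarily what rag_pipeline.py does.

```python
PROMPT_TEMPLATE = """Answer the question using ONLY the context provided below.
If the context doesn't contain enough information, say so.

Context:
{context}

Question: {question}"""

def build_grounded_prompt(question: str, chunks: list[str]) -> str:
    """Place the retrieved chunks, one per line, into the grounded template."""
    context = "\n".join(chunk.strip() for chunk in chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)

print(build_grounded_prompt(
    "What does attention compute?",
    ["Attention computes a weighted sum of value vectors.",
     "The weights come from query-key similarity."],
))
```

Note that the LLM only ever sees the finished string. Retrieval has already happened by the time this function is called.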

Tip

RAG vs fine-tuning. RAG and fine-tuning solve different problems. Fine-tuning changes the model’s behavior (style, format, capabilities). RAG changes the model’s knowledge (what facts it can access). For dynamic, frequently-updated information, RAG is almost always the right choice because you can update the document store without retraining. For changing how the model responds, fine-tuning is necessary.

For a deeper dive into building RAG pipelines locally, see Build a Local RAG Pipeline with Ollama and ChromaDB. To visualize how sentence-transformers represent text in embedding space, try the Embedding Explorer.

Evaluating retrieval quality

The retrieval stage is frequently the weakest link in a RAG pipeline. If the wrong documents are retrieved, the LLM generates from irrelevant context, often confidently. The retrieval_evaluation.py script implements four metrics for measuring retrieval quality.

python 07-agentic-llms/retrieval_evaluation.py

Precision@K

Of the top K retrieved documents, how many are actually relevant?

$$\text{Precision@K} = \frac{\text{relevant documents in top K}}{K}$$

If you retrieve 5 documents and 3 are relevant, Precision@5 = 0.6. This answers: “How much noise is in my context?”
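
A minimal sketch of the formula, assuming documents are identified by string ids and relevance judgments are given as a set. The function name and the ids are hypothetical, not taken from retrieval_evaluation.py.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved ids that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    # Divide by k (not by the number returned), matching the formula above.
    return hits / k

retrieved = ["d1", "d7", "d3", "d9", "d4"]
relevant = {"d1", "d3", "d4"}
print(precision_at_k(retrieved, relevant, k=5))  # 3 of 5 are relevant
```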

Recall@K

Of all relevant documents in the collection, how many appear in the top K?

$$\text{Recall@K} = \frac{\text{relevant documents in top K}}{\text{total relevant documents}}$$

If there are 10 relevant documents and 3 appear in the top 5, Recall@5 = 0.3. This answers: “How much relevant information am I missing?”
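
The same example as a short sketch, again assuming string ids and a set of relevance judgments. The function name and ids are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant ids that appear in the top-k results."""
    if not relevant:
        return 0.0  # convention chosen here: no relevant documents means recall 0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

relevant = {f"r{i}" for i in range(10)}        # 10 relevant documents in total
retrieved = ["r0", "x1", "r1", "x2", "r2"]     # 3 of them show up in the top 5
print(recall_at_k(retrieved, relevant, k=5))
```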

MRR (Mean Reciprocal Rank)

For a single query, the reciprocal rank is 1 / rank_of_first_relevant_result. MRR averages reciprocal rank across a query set:

$$\text{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\text{rank}_q}$$

If the first relevant document for a query is at position 3, that query contributes 1/3. MRR rewards retrieval systems that put relevant results early. An MRR of 1.0 means the top-ranked result is relevant for every query.
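
A minimal sketch of MRR over a small query set. The function names are hypothetical, and scoring a query with no relevant result as 0 is a common convention assumed here rather than something the text specifies.

```python
def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)

runs = [
    (["a", "b", "c"], {"a"}),   # first relevant at rank 1 -> 1
    (["x", "y", "b"], {"b"}),   # first relevant at rank 3 -> 1/3
    (["x", "y", "z"], {"q"}),   # nothing relevant retrieved -> 0
]
print(round(mean_reciprocal_rank(runs), 4))
```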

nDCG (Normalized Discounted Cumulative Gain)

DCG sums the relevance of each retrieved document, discounted logarithmically by position:

$$\text{DCG@K} = \sum_{i=1}^{K} \frac{rel_i}{\log_2(i+1)}$$

nDCG normalizes DCG by the score of the ideal ranking, so the metric lands in [0, 1]:

$$\text{nDCG@K} = \frac{\text{DCG@K}}{\text{IDCG@K}}$$

A relevant document at position 1 contributes more than one at position 5. The implementation in the script is short enough to read end-to-end:

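import math

def dcg_at_k(relevances: list[int], k: int) -> float:
    # relevances[i] is the graded relevance of the result at rank i + 1.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances: list[int], k: int) -> float:
    # Ideal ranking: the same judgments sorted best-first. This assumes
    # `relevances` covers every judged document for the query.
    ideal = sorted(relevances, reverse=True)
    idcg = dcg_at_k(ideal, k)
    # No relevant documents means IDCG is 0; return 0.0 instead of dividing by zero.
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0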

The i + 2 (rather than i + 1) in the denominator accounts for the zero-indexed loop: position 1 gets log_2(2) = 1, position 2 gets log_2(3) ≈ 1.585, and so on.
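
As a worked example with made-up binary judgments, take three results with relevances [1, 0, 1] and K = 3. The ideal ordering of those same judgments is [1, 1, 0]:

$$\text{DCG@3} = \frac{1}{\log_2 2} + \frac{0}{\log_2 3} + \frac{1}{\log_2 4} = 1 + 0 + 0.5 = 1.5$$

$$\text{IDCG@3} = \frac{1}{\log_2 2} + \frac{1}{\log_2 3} + \frac{0}{\log_2 4} \approx 1 + 0.631 + 0 = 1.631$$

$$\text{nDCG@3} = \frac{1.5}{1.631} \approx 0.920$$

The only difference between the two rankings is whether the second relevant document sits at position 2 or position 3, and that alone costs about 8% of the ideal score.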

The key tradeoff visible in the script: as K increases, recall rises (you find more relevant documents) but precision falls (you also include more noise). MRR and nDCG reward early relevant hits; they penalize systems that bury good results below irrelevant ones.
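
The tradeoff is easy to reproduce on a toy ranking. The sketch below is not the sweep from retrieval_evaluation.py; the function name and the relevance flags are invented for illustration.

```python
def sweep_k(flags: list[bool], total_relevant: int, ks: list[int]) -> list[tuple[int, float, float]]:
    """Return (k, precision@k, recall@k) for each k, given per-rank relevance flags."""
    rows = []
    for k in ks:
        hits = sum(flags[:k])  # True counts as 1
        rows.append((k, hits / k, hits / total_relevant))
    return rows

# One ranked result list: relevant documents at ranks 1, 3, and 6 (hypothetical).
flags = [True, False, True, False, False, True, False, False, False, False]
rows = sweep_k(flags, total_relevant=3, ks=[1, 3, 5, 10])
for k, precision, recall in rows:
    print(f"k={k:<2} precision={precision:.2f} recall={recall:.2f}")
```

On this ranking precision drops from 1.00 at k=1 to 0.30 at k=10, while recall climbs from 0.33 to 1.00.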

Tool calling

Tool calling extends LLMs beyond text generation by letting them invoke external functions. The tool_calling_agent.py script implements this with a set of defined tools.

python 07-agentic-llms/tool_calling_agent.py   # requires Ollama

The pattern is:

Define tools with names, descriptions, and parameter schemas
Include tool definitions in the system prompt
The model outputs a structured tool call (typically JSON) when it needs external information
The system executes the tool and feeds the result back into the conversation
The model incorporates the tool result and continues generating

The script implements three tools: a calculator for arithmetic, a date lookup, and a web search stub. Each tool is declared with a JSON Schema describing its parameters:

TOOLS = [
    {
        "name": "calculator",
        "description": "Evaluate a Python arithmetic expression.",
        "parameters": {
            "type": "object",
            "properties": {
                "expression": {"type": "string"},
            },
            "required": ["expression"],
        },
    },
    {
        "name": "date_lookup",
        "description": "Return today's date in ISO 8601.",
        "parameters": {"type": "object", "properties": {}},
    },
    # ... web_search elided
]
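
One way the tool definitions might be rendered into the system prompt is sketched below. The prompt wording and the function name render_system_prompt are assumptions; the actual prompt in tool_calling_agent.py may look quite different.

```python
import json

# Two of the declarations from above, repeated so the sketch runs on its own.
TOOLS = [
    {
        "name": "calculator",
        "description": "Evaluate a Python arithmetic expression.",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    },
    {
        "name": "date_lookup",
        "description": "Return today's date in ISO 8601.",
        "parameters": {"type": "object", "properties": {}},
    },
]

def render_system_prompt(tools: list[dict]) -> str:
    """List each tool with its description and JSON Schema, then state the call format."""
    lines = ["You can call the following tools.", ""]
    for tool in tools:
        lines.append(f"- {tool['name']}: {tool['description']}")
        lines.append(f"  parameters: {json.dumps(tool['parameters'], sort_keys=True)}")
    lines.append("")
    lines.append('To call a tool, reply with one line of JSON: {"tool": <name>, "args": {...}}')
    return "\n".join(lines)

print(render_system_prompt(TOOLS))
```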

When the model decides a tool is needed, it emits a JSON object on its own line:

{"tool": "calculator", "args": {"expression": "(2024 - 1969) * 365"}}
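
Because the model may surround that line with ordinary text, something has to pick the JSON object out of the output. The sketch below shows one defensive way to do it; extract_tool_call is a hypothetical helper, not a function confirmed to exist in the script.

```python
import json
from typing import Optional

def extract_tool_call(model_output: str) -> Optional[dict]:
    """Return the first line that parses as a JSON object with a string "tool" key."""
    for line in model_output.splitlines():
        line = line.strip()
        if not (line.startswith("{") and line.endswith("}")):
            continue  # ordinary prose, skip it
        try:
            candidate = json.loads(line)
        except json.JSONDecodeError:
            continue  # looked like JSON but was malformed
        if isinstance(candidate, dict) and isinstance(candidate.get("tool"), str):
            return candidate
    return None  # no tool call: treat the output as a plain answer

output = 'I should compute this.\n{"tool": "calculator", "args": {"expression": "(2024 - 1969) * 365"}}\n'
print(extract_tool_call(output))
```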

The dispatcher parses that line, executes the matching Python function, and feeds the result back into the conversation as the next turn:

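import json

def dispatch(tool_call: str) -> str:
    # Malformed JSON or a missing "tool" key is reported as an error string,
    # not raised, so the model can see the problem and retry.
    try:
        call = json.loads(tool_call)
        name, args = call["tool"], call.get("args", {})
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError) as e:
        return f"ERROR: malformed tool call: {type(e).__name__}: {e}"
    handler = HANDLERS.get(name)  # HANDLERS maps tool names to Python functions
    if handler is None:
        return f"ERROR: unknown tool {name}"
    try:
        return str(handler(**args))
    except Exception as e:
        return f"ERROR: {type(e).__name__}: {e}"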
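
The HANDLERS table is not shown in the text. A minimal sketch of what it could contain follows, with the calculator restricted to arithmetic by walking the parsed syntax tree instead of calling eval. Every name here is an assumption, and exponentiation is deliberately left out so a model cannot request an enormous power.

```python
import ast
import operator
from datetime import date

_BINARY = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv, ast.Mod: operator.mod}
_UNARY = {ast.USub: operator.neg, ast.UAdd: operator.pos}

def _eval(node: ast.AST):
    """Evaluate numbers, + - * / %, and unary signs; reject everything else."""
    if (isinstance(node, ast.Constant) and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
        return _BINARY[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
        return _UNARY[type(node.op)](_eval(node.operand))
    raise ValueError("unsupported expression")

def calculator(expression: str):
    # mode="eval" accepts a single expression only, never statements.
    return _eval(ast.parse(expression, mode="eval").body)

def date_lookup() -> str:
    return date.today().isoformat()

# Hypothetical dispatch table: tool name -> Python function.
HANDLERS = {"calculator": calculator, "date_lookup": date_lookup}

print(HANDLERS["calculator"]("(2024 - 1969) * 365"))
```

Function calls, attribute access, and names all fall through to the ValueError, so an expression like `__import__('os')` is rejected rather than executed.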

The output shows the full trace: tool request → execution → result → final answer.

Tool calling is what turns an LLM from a text generator into a component that can interact with the world, querying databases, calling APIs, executing code, and controlling other systems. The model’s role shifts from “generate the answer” to “figure out what tool calls would produce the answer.”

Note

Tool calling reliability. Small models frequently misformat tool calls or call the wrong tool. Larger models are more reliable but not perfect. Production systems need robust parsing, validation, and error recovery. The script uses constrained tools and prompts so the traces are easy to inspect; real agents need significantly more defensive logic.
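
One piece of that defensive logic is checking the model's arguments against the declared schema before running anything. The sketch below covers only required keys, unexpected keys, and top-level types; it is a hypothetical helper and far from a full JSON Schema validator.

```python
# Partial mapping from JSON Schema type names to Python types (an assumption
# for this sketch; it ignores nesting, enums, and numeric ranges).
_JSON_TYPES = {"string": str, "number": (int, float), "integer": int,
               "boolean": bool, "object": dict, "array": list}

def validate_args(schema: dict, args: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the args pass."""
    errors = []
    properties = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required argument: {name}")
    for name, value in args.items():
        if name not in properties:
            errors.append(f"unexpected argument: {name}")
            continue
        type_name = properties[name].get("type")
        expected = _JSON_TYPES.get(type_name)
        if expected is not None and not isinstance(value, expected):
            errors.append(f"argument {name} should be of type {type_name}")
    return errors

calculator_schema = {
    "type": "object",
    "properties": {"expression": {"type": "string"}},
    "required": ["expression"],
}
print(validate_args(calculator_schema, {"expr": "1 + 1"}))
```

Returning the problems as text, rather than raising, lets the agent feed them back to the model as an observation so it can correct the call.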

ReAct: reasoning + acting

The react_agent.py script implements a ReAct-style agent (Yao et al., ICLR 2023) that interleaves chain-of-thought reasoning with tool actions in a structured loop.

python 07-agentic-llms/react_agent.py   # requires Ollama

The ReAct pattern

Question: What is the population of the capital of France?
Thought:  I need to find the capital of France first, then look up its population.
Action:   lookup("capital of france")
Observation: The capital of France is Paris.
Thought:  Now I need the population of Paris.
Action:   lookup("population of paris")
Observation: Paris has a population of approximately 2.16 million (city proper).
Thought:  I now have both pieces of information.
Answer:   The population of Paris, the capital of France, is approximately 2.16 million.

The agent alternates between Thought (reasoning about what to do next), Action (calling a tool), and Observation (reading the tool’s output). This loop continues until the agent has enough information to produce a final Answer. The loop itself is short; most of the complexity lives in the prompt and the tool implementations:

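def react_loop(question: str, max_steps: int = 8) -> str:
    # llm, REACT_PROMPT, and dispatch are assumed to be defined elsewhere.
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        # Re-send the whole transcript each step; the trailing "Thought:" cues the next step.
        completion = llm(REACT_PROMPT + "\n".join(transcript) + "\nThought:")
        transcript.append("Thought:" + completion.thought)

        if completion.kind == "answer":
            return completion.answer

        # Otherwise the step is an action: run the tool and record what came back.
        observation = dispatch(completion.action)
        transcript.append(f"Action: {completion.action}")
        transcript.append(f"Observation: {observation}")
    # Cap the loop so an agent that never answers cannot run forever.
    return "ERROR: max steps exceeded"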

The model is responsible for deciding whether the next step is an Action (call a tool) or an Answer (stop). The parser distinguishes the two by looking at which keyword the model emits first after Thought:.
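
A sketch of such a parser, assuming the raw completion is plain text that continues after "Thought:". The Step type, the regular expression, and the function name are hypothetical; the script's own parser may differ.

```python
import re
from typing import NamedTuple

class Step(NamedTuple):
    kind: str      # "action" or "answer"
    thought: str   # free text before the first keyword
    payload: str   # the tool call or the final answer

# First line that starts with Action: or Answer: decides the step type.
_KEYWORD = re.compile(r"^[ \t]*(Action|Answer):[ \t]*(.*)$", re.MULTILINE)

def parse_step(completion: str) -> Step:
    match = _KEYWORD.search(completion)
    if match is None:
        raise ValueError("completion contains neither Action: nor Answer:")
    thought = completion[:match.start()].strip()
    return Step(match.group(1).lower(), thought, match.group(2).strip())

step = parse_step(' I need the capital first.\nAction: lookup("capital of france")\nObservation: invented')
print(step)
```

Only the keyword line itself is kept, so anything the model writes after it, such as an invented Observation, is discarded and replaced by the real tool output.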

Why ReAct outperforms alternatives

Pure chain-of-thought (no actions) is limited to what the model already knows. It cannot look up facts, perform precise calculations, or access external systems. If the knowledge is not in the weights, CoT cannot help.

Pure action-only (no reasoning) executes tools without planning. The agent might call tools in the wrong order, call irrelevant tools, or fail to synthesize results from multiple tool calls. Without the Thought step, there is no place for the model to plan or adapt.

ReAct combines both: the model can plan (Thought), execute (Action), observe results (Observation), and adapt its plan based on what it learned. The Thought steps also make the agent’s reasoning transparent and debuggable; you can see why it made each decision.

The script implements this loop with a small knowledge base (facts about cities, countries, distances, scientific constants) and three tools: lookup, calculator, and compare. The Rich-formatted output colorizes each step type, making the reasoning trace easy to follow.

Multi-step composition

The power of ReAct shows in multi-step questions that require combining information from multiple tool calls. “Is the population of Tokyo larger than the population of Paris?” requires two lookups and a comparison. The agent handles this naturally: look up Tokyo, look up Paris, compare the numbers, state the answer.
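
The Tokyo and Paris question can be traced end to end without a model by scripting the completions. Everything in this sketch is a stand-in: the scripted "LLM" replays fixed strings, the knowledge base holds rough placeholder figures, and the tool and function names are hypothetical rather than taken from react_agent.py. The naive argument splitting would also break on commas inside quoted strings.

```python
import re

# Hypothetical knowledge base; the figures are rough placeholders.
FACTS = {"population of tokyo": 14_000_000, "population of paris": 2_160_000}

def lookup(query: str) -> str:
    return str(FACTS.get(query.lower(), "unknown"))

def compare(a: str, b: str) -> str:
    return "first is larger" if float(a) > float(b) else "second is larger or equal"

TOOLS = {"lookup": lookup, "compare": compare}
ACTION = re.compile(r"^(\w+)\((.*)\)$")

def run_action(action: str) -> str:
    """Parse a call such as lookup("x") and run the matching tool."""
    match = ACTION.match(action.strip())
    if match is None or match.group(1) not in TOOLS:
        return f"ERROR: cannot run {action!r}"
    raw_args = match.group(2).strip()
    args = [part.strip().strip('"') for part in raw_args.split(",")] if raw_args else []
    try:
        return TOOLS[match.group(1)](*args)
    except Exception as exc:
        return f"ERROR: {type(exc).__name__}: {exc}"

def react(question: str, llm, max_steps: int = 8):
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        completion = llm("\n".join(transcript) + "\nThought:")
        thought, _sep, rest = completion.partition("\n")
        transcript.append(f"Thought: {thought.strip()}")
        keyword, _sep, payload = rest.partition(":")
        if keyword.strip() == "Answer":
            return payload.strip(), transcript
        transcript.append(f"Action: {payload.strip()}")
        transcript.append(f"Observation: {run_action(payload)}")
    return "ERROR: max steps exceeded", transcript

def scripted_llm(completions: list[str]):
    """Stand-in for a model: returns the next canned completion on each call."""
    replies = iter(completions)
    return lambda prompt: next(replies)

llm = scripted_llm([
    'I need Tokyo\'s population first.\nAction: lookup("population of tokyo")',
    'Now I need Paris.\nAction: lookup("population of paris")',
    "Compare the two numbers.\nAction: compare(14000000, 2160000)",
    "I have the comparison.\nAnswer: Yes, Tokyo's population is larger than Paris's.",
])
answer, transcript = react("Is the population of Tokyo larger than the population of Paris?", llm)
print("\n".join(transcript))
print("Answer:", answer)
```

With a real model the third step would have to read the two earlier observations from the transcript to fill in the compare arguments, which is the composition the Thought steps make possible.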

What to notice

RAG separates knowledge from reasoning. The LLM provides the reasoning; the retrieval system provides the knowledge. This separation means you can update facts without retraining and switch LLMs without rebuilding your document store.
Retrieval quality bounds generation quality. If the wrong documents are retrieved, the best LLM in the world generates a wrong answer. Invest in retrieval evaluation before tuning the generation prompt.
Top-k is a critical hyperparameter. Too few documents (k=1) risks missing relevant context. Too many (k=10) dilutes the signal with noise and may exceed the model’s ability to attend to all the context. k=3–5 is a common sweet spot.
Tool calling turns generation into orchestration. The model stops being the source of answers and starts being the coordinator that figures out which tools to call and how to combine their results.
ReAct makes agent reasoning transparent. The Thought/Action/Observation structure is not just a prompt template; it is a debugging and auditing tool. When the agent fails, you can see exactly where its reasoning went wrong.

Next: Part 8: Evaluating LLM Outputs Beyond Vibes. Now that we can generate, retrieve, and reason, how do we know if the outputs are actually good? Implement BLEU, ROUGE, and METEOR from scratch, build an LLM-as-judge pipeline, and learn to detect the biases that make automated evaluation unreliable.
