Tutorial

Evaluating RAG Pipelines with RAGAS and TruLens

Build an eval set and measure faithfulness, context precision, context recall, and answer relevancy with RAGAS and TruLens to know if changes actually help.

Prerequisites

  • Tutorial: Build a Local RAG Pipeline with Ollama and ChromaDB
  • Tutorial: Connect Your RAG Pipeline to Live CVE Feeds
  • Comfortable with Python and pandas

Part 7 of 7 in Local RAG Pipeline

You changed the chunk size from 500 to 800. The five questions you typed into the chat UI came back with answers that read better than before. You ship. Two weeks later an analyst tells you the assistant has been missing the mitigation section on half their queries because the new chunk boundary lands in the middle of every “Mitigation:” paragraph. A small change moved a small visible sample in one direction and a large invisible population in the other. The fix is measurement.

This tutorial connects your existing local RAG pipeline (built across the base tutorial, the chat interface, and the live CVE feeds) to two eval systems: RAGAS for offline batch scoring, and TruLens for online per-query observability. Both run locally against an Ollama judge LLM.

Why “the answers look good” is not evaluation

The anecdote above is the most common pattern in RAG projects without an eval harness: five canonical questions held in the team’s head, run by hand after each change, ship when answers feel better. The problem is structural:

  • Five questions cannot cover three failure modes. Retrieval missed the docs, the model ignored retrieved docs, or the model addressed the wrong question. Independent failures need independent measurement.
  • The questions you remember are the ones the system already answers well. Confirmation bias picks the demo set; real user queries do not match it.
  • Small-sample improvement is statistically meaningless. 4/5 to 5/5 on a hand-picked set is not signal. 41/50 to 47/50 on a frozen eval set is.

A working eval setup separates the failure categories:

Failure   | Question it answers                                          | Where it lives in the pipeline
Retrieval | Did we even pull the right documents?                        | Embedding model, chunker, vector store
Grounding | Is every claim in the answer derivable from those documents? | LLM prompt, model size, system instructions
Relevance | Does the answer actually address the user’s question?        | LLM, prompt, query understanding

You need a metric for each. Optimizing one without watching the others is how you get a pipeline whose answers are perfectly faithful to perfectly wrong retrieved chunks.

The four metrics that matter

RAGAS (Retrieval-Augmented Generation Assessment) defines four core metrics that map cleanly onto the three failure categories above. The framing comes from the original RAGAS paper (Es et al., 2023, “RAGAS: Automated Evaluation of Retrieval Augmented Generation”); the formulas below are simplified.

Faithfulness

Of the claims the model made in its answer, what fraction are supported by the retrieved context?

faithfulness = |claims in answer that are entailed by context|
               ────────────────────────────────────────────
                          |claims in answer|

The hallucination detector. A judge LLM extracts atomic claims from the answer, then checks each against the retrieved chunks. Faithfulness drops when the model invents details, mis-attributes versions, or fabricates mitigations not in the source.
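As a minimal sketch of the arithmetic only, assume the judge has already returned one verdict per extracted claim. The `ClaimVerdict` type and `faithfulness_score` helper are hypothetical names used for illustration, not part of the RAGAS API; RAGAS performs claim extraction and entailment checking internally. The third claim is a deliberately fabricated mitigation.

```python
from dataclasses import dataclass


@dataclass
class ClaimVerdict:
    """One atomic claim from the answer plus the judge's entailment verdict."""
    claim: str
    supported: bool  # True if the judge found support in the retrieved context


def faithfulness_score(verdicts: list[ClaimVerdict]) -> float:
    """Fraction of answer claims that the retrieved context supports."""
    if not verdicts:
        # No claims were extracted, so the ratio is undefined.
        return float("nan")
    return sum(v.supported for v in verdicts) / len(verdicts)


verdicts = [
    ClaimVerdict("OpenSSH 8.5p1 through 9.7p1 are affected", True),
    ClaimVerdict("The fix is in OpenSSH 9.8p1", True),
    # Fabricated for this example: a mitigation that is not in the source.
    ClaimVerdict("Disabling IPv6 mitigates the issue", False),
]
score = faithfulness_score(verdicts)
print(round(score, 3))  # 0.667
```

Two supported claims out of three gives 0.667: a single invented detail in a short answer is enough to pull the score well below the usual acceptance bands.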

Answer relevancy

Does the answer actually address the question, regardless of whether it is correct?

answer_relevancy = mean cosine_sim(embed(q_i), embed(question))

RAGAS asks the judge LLM to generate n candidate questions q_i for which the answer would be a good response, embeds each of them along with the original question, and averages the cosine similarity between each q_i and the original. A technically true answer that addresses the wrong question gets a low score. The metric depends on the embedding model: weak embedders collapse semantically distinct questions and inflate the score.
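The averaging step can be sketched with toy vectors. This is an illustration of the formula above, not RAGAS code: `cosine_sim` and `answer_relevancy_score` are hypothetical helpers, and the two-dimensional "embeddings" are made up so the arithmetic is easy to follow.

```python
import math


def cosine_sim(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def answer_relevancy_score(question_vec, generated_vecs) -> float:
    """Mean similarity between the original question and each generated question."""
    if not generated_vecs:
        return float("nan")
    sims = [cosine_sim(question_vec, g) for g in generated_vecs]
    return sum(sims) / len(sims)


question = [1.0, 0.0]          # embedding of the user's question (toy values)
generated = [
    [2.0, 0.0],                # same direction as the question: similarity 1.0
    [3.0, 4.0],                # partly off-topic: similarity 3/5 = 0.6
]
score = answer_relevancy_score(question, generated)
print(round(score, 3))  # 0.8
```

If the embedder maps every question to nearly the same direction, every similarity approaches 1.0 regardless of the answer, which is the inflation effect described above.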

Context precision

Of the chunks we retrieved, are the relevant ones ranked at the top?

context_precision@K = sum_{k=1..K} (precision@k * relevant_k)
                      ─────────────────────────────────────
                            |relevant chunks in top-K|

This is average precision computed over the retrieved list (mean average precision once it is averaged across the eval set). It detects ranking failures: the right chunk is in the top 10 but buried at position 8 while loosely related chunks crowd the top. In a security RAG, the typical case is generic CVE chatter outranking the one chunk that contains the affected version range.
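A worked instance of the simplified formula, assuming the per-chunk relevance verdicts are already available as booleans (in RAGAS they come from the judge). The function name `context_precision_at_k` is hypothetical.

```python
def context_precision_at_k(relevant: list[bool]) -> float:
    """Average precision over a ranked list of relevance verdicts (rank 1 first)."""
    hits = 0
    weighted = 0.0
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            # precision@rank, counted only at ranks that hold a relevant chunk
            weighted += hits / rank
    return weighted / hits if hits else 0.0


buried = [False] * 7 + [True] + [False] * 2   # the one relevant chunk sits at rank 8
on_top = [True] + [False] * 9                 # the same chunk ranked first
mixed = [True, False, True]                   # relevant chunks at ranks 1 and 3

print(context_precision_at_k(buried))           # 0.125
print(context_precision_at_k(on_top))           # 1.0
print(round(context_precision_at_k(mixed), 3))  # 0.833
```

The buried and on-top lists retrieve exactly the same chunk, so recall is identical; only the ranking differs, and the score moves from 0.125 to 1.0.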

Context recall

Of the things the ground-truth answer requires, how many are present in the retrieved chunks?

context_recall = |sentences in ground_truth supported by context|
                 ──────────────────────────────────────────────
                              |sentences in ground_truth|

Low context recall means no amount of prompt engineering or model swap will save you; the information is not in the context window.

The four together form a diagnostic grid: low faithfulness with high others points at the model (tighten the prompt or upgrade); low relevancy points at query understanding; low precision points at ranking (add a reranker); low recall points at the chunker, embedder, or top-k.

Each metric attaches to a specific stage of the pipeline, which is what makes the grid diagnostic: a low score tells you where to look.

graph LR
    Q[Question] --> RET[Retrieval]
    RET -->|retrieved chunks| GEN[Generation]
    GEN --> ANS[Answer]
    GT[Ground-truth answer] -. compare .-> ANS

    CP["Context precision:<br/>are relevant chunks ranked high?"] -. measures .-> RET
    CR["Context recall:<br/>is the needed info retrieved at all?"] -. measures .-> RET
    FA["Faithfulness:<br/>is every claim grounded in the chunks?"] -. measures .-> GEN
    AR["Answer relevancy:<br/>does the answer address the question?"] -. measures .-> ANS

    style Q fill:#4a9eff,stroke:#2a7edf,color:#fff
    style RET fill:#868e96,stroke:#666e76,color:#fff
    style GEN fill:#868e96,stroke:#666e76,color:#fff
    style ANS fill:#51cf66,stroke:#31af46,color:#fff
    style GT fill:#adb5bd,stroke:#8d959d,color:#fff
    style CP fill:#ffa94d,stroke:#df894d,color:#fff
    style CR fill:#ffa94d,stroke:#df894d,color:#fff
    style FA fill:#ff6b6b,stroke:#df4b4b,color:#fff
    style AR fill:#ff6b6b,stroke:#df4b4b,color:#fff

Context precision and context recall are the same precision/recall tradeoff you see in any classifier, applied to the ranked list of retrieved chunks. If that tradeoff is not yet intuitive, the Classifier Threshold Lab lets you drag a decision threshold across score distributions and watch precision and recall move against each other in real time.

Building an eval dataset

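The eval set is a list of (question, ground_truth_answer, ground_truth_contexts) triples. The ground-truth answer anchors context recall and context precision, which the judge scores against it. Faithfulness and answer relevancy are reference-free: they are scored from the question, the retrieved chunks, and the generated answer. The ground-truth contexts record which chunks should have been retrieved, which makes retrieval failures easier to audit by hand.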

Three sources, used together:

  1. Real analyst questions. Pull the last 30 days of queries from your chat UI logs. Strip PII, then have a human write the ground-truth answer and mark which advisory chunks contain it.
  2. Synthetic questions from your corpus. RAGAS’ TestsetGenerator takes documents and produces Q/A/context triples. The questions tend to be shallower than real ones; use it for breadth, not depth.
  3. A small human-curated golden set. Twenty hand-written examples covering hard cases: cross-document reasoning, version ranges, “is X affected” questions.

Aim for 50-200 examples. Smaller and metric variance swamps any change you make; much larger and a local judge LLM becomes painfully slow per eval run.

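JSON Lines format, one example per line. The example below is pretty-printed for readability; in the file, each object must sit on a single line: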

{
  "question": "What versions of OpenSSH are vulnerable to regreSSHion?",
  "ground_truth_answer": "OpenSSH 8.5p1 through 9.7p1 on glibc-based Linux are vulnerable to CVE-2024-6387 (regreSSHion). The fix is in OpenSSH 9.8p1.",
  "ground_truth_contexts": [
    "CVE-2024-6387: regreSSHion. Affected versions: OpenSSH 8.5p1 through 9.7p1 (glibc-based). Fixed in: OpenSSH 9.8p1."
  ],
  "tags": ["versions", "openssh"]
}

Save this as eval/golden.jsonl next to your existing pipeline code.
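Hand-edited JSONL files break easily, for example when an object is pasted in pretty-printed form. A small check along these lines can catch that before an eval run. This is a sketch: `validate_jsonl_lines` and `REQUIRED_KEYS` are hypothetical names, and the required keys are simply the three fields used in the example above.

```python
import json

REQUIRED_KEYS = {"question", "ground_truth_answer", "ground_truth_contexts"}


def validate_jsonl_lines(lines):
    """Return (line_number, problem) tuples; an empty list means the file is usable."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are ignored
        try:
            example = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((lineno, f"not valid JSON: {exc.msg}"))
            continue
        if not isinstance(example, dict):
            problems.append((lineno, "not a JSON object"))
            continue
        missing = REQUIRED_KEYS - example.keys()
        if missing:
            problems.append((lineno, f"missing keys: {sorted(missing)}"))
        elif not isinstance(example["ground_truth_contexts"], list):
            problems.append((lineno, "ground_truth_contexts must be a list"))
    return problems


sample = [
    '{"question": "q1", "ground_truth_answer": "a1", "ground_truth_contexts": ["c1"]}',
    '{',                    # first line of a pretty-printed object
    '{"question": "q3"}',   # missing the ground-truth fields
]
problems = validate_jsonl_lines(sample)
for lineno, problem in problems:
    print(f"line {lineno}: {problem}")
```

To check the real file, pass an open file handle instead of the sample list, for example `validate_jsonl_lines(open("eval/golden.jsonl", encoding="utf-8"))`.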

A simple loader and a synthetic-generation entrypoint live in eval/build_dataset.py:

import json
from pathlib import Path

from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_ollama import ChatOllama, OllamaEmbeddings
from ragas.testset import TestsetGenerator

EVAL_DIR = Path("eval")
GOLDEN_PATH = EVAL_DIR / "golden.jsonl"
SYNTHETIC_PATH = EVAL_DIR / "synthetic.jsonl"


def load_jsonl(path):
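    # Skip blank lines; every other line must hold one complete JSON object.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]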


def write_jsonl(path, examples):
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")


def load_all():
    out = []
    for p in (GOLDEN_PATH, SYNTHETIC_PATH):
        if p.exists():
            out += load_jsonl(p)
    return out


def generate_synthetic(n=50):
    """Generate synthetic Q/A/context triples via RAGAS' TestsetGenerator.

    Tiny models (3B and below) produce poor questions; use llama3.1:8b+.
    """
    docs = DirectoryLoader("advisories", glob="*.txt",
                           loader_cls=TextLoader).load()

    llm = ChatOllama(model="llama3.1:8b", temperature=0)
    emb = OllamaEmbeddings(model="nomic-embed-text")

    generator = TestsetGenerator.from_langchain(llm=llm, embedding_model=emb)
    testset = generator.generate_with_langchain_docs(docs, testset_size=n)

    df = testset.to_pandas()
    examples = [{
        "question": row["user_input"],
        "ground_truth_answer": row["reference"],
        "ground_truth_contexts": list(row["reference_contexts"]),
        "tags": ["synthetic"],
    } for _, row in df.iterrows()]

    write_jsonl(SYNTHETIC_PATH, examples)
    print(f"Wrote {len(examples)} synthetic examples to {SYNTHETIC_PATH}")


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1 and sys.argv[1] == "synthetic":
        generate_synthetic(int(sys.argv[2]) if len(sys.argv) > 2 else 50)
    else:
        print(f"Loaded {len(load_all())} examples")

Run it:

python eval/build_dataset.py synthetic 50
python eval/build_dataset.py

Warning

Synthetic generation needs a capable judge. A 3B model will happily emit “What is CVE?” as a synthetic question and grade it as good. Use at least an 8B model for generation. Inspect the first 20 generated examples by hand before trusting the rest.

RAGAS setup with the local stack

RAGAS defaults to OpenAI. We configure it to use a local Ollama judge via the LangChain interface so nothing leaves the machine. langchain-ollama is the current package; the langchain_community.chat_models.ollama import was deprecated in 2024.

Pin the versions for this tutorial. RAGAS and TruLens both changed public APIs several times between 2024 and 2026; a loose pip install ragas trulens is likely to break imports or dataset schemas months after this is written.

cat > eval/requirements.txt <<'REQ'
ragas==0.2.10
langchain-ollama==0.2.3
langchain-community==0.3.16
datasets==3.2.0
chromadb==0.6.3
ollama==0.4.7
pandas==2.2.3
numpy==1.26.4
REQ

pip install -r eval/requirements.txt
ollama pull llama3.1:8b

The code below uses the RAGAS 0.2 column names: user_input, retrieved_contexts, response, reference, and reference_contexts. If you upgrade to RAGAS 0.3/0.4, check the migration notes before changing the pins; the metric names mostly survived, but testset generation and some schema helpers moved.

The 3B llama3.2 from the base tutorial is too small for a judge; claim extraction and entailment grading are unreliable below 7B. Create eval/run_ragas.py:

from pathlib import Path

import chromadb
import ollama
from datasets import Dataset
from langchain_ollama import ChatOllama, OllamaEmbeddings
from ragas import evaluate
from ragas.metrics import (faithfulness, answer_relevancy,
                           context_precision, context_recall)

from build_dataset import load_all

COLLECTION = "security_advisories"
EMBED_MODEL = "nomic-embed-text"
CHAT_MODEL = "llama3.2"
JUDGE_MODEL = "llama3.1:8b"
TOP_K = 3


def retrieve(question):
    coll = chromadb.PersistentClient(path="./chroma_db").get_collection(COLLECTION)
    q_emb = ollama.embed(model=EMBED_MODEL, input=question)["embeddings"][0]
    return coll.query(query_embeddings=[q_emb], n_results=TOP_K,
                      include=["documents"])["documents"][0]


def generate(question, contexts):
    prompt = (
        "You are a security analyst assistant. Answer using only the context.\n"
        "If the context doesn't contain enough information, say so.\n\n"
        f"Context:\n{chr(10).join(contexts)}\n\nQuestion: {question}"
    )
    return ollama.chat(model=CHAT_MODEL,
                       messages=[{"role": "user", "content": prompt}]
                       )["message"]["content"]


def build_row(ex):
    ctx = retrieve(ex["question"])
    return {
        "user_input": ex["question"],
        "retrieved_contexts": ctx,
        "response": generate(ex["question"], ctx),
        "reference": ex["ground_truth_answer"],
        "reference_contexts": ex["ground_truth_contexts"],
    }


def main(out_path="eval/results.json"):
    examples = load_all()
    print(f"Running {len(examples)} examples through the pipeline...")
    rows = [build_row(ex) for ex in examples]

    print("Scoring with RAGAS...")
    results = evaluate(
        dataset=Dataset.from_list(rows),
        metrics=[faithfulness, answer_relevancy,
                 context_precision, context_recall],
        llm=ChatOllama(model=JUDGE_MODEL, temperature=0),
        embeddings=OllamaEmbeddings(model=EMBED_MODEL),
    )

    df = results.to_pandas()
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    df.to_json(out_path, orient="records", indent=2)

    print("\n=== Aggregate scores ===")
    for m in ["faithfulness", "answer_relevancy",
              "context_precision", "context_recall"]:
        if m in df.columns:
            print(f"  {m:22s} {df[m].mean():.3f}")
    print(f"\nPer-example results saved to {out_path}")


if __name__ == "__main__":
    main()

Run it:

python eval/run_ragas.py

Expect this to take time. 100 examples through a local 8B judge typically lands at 30-60 minutes for a full eval. Not interactive, but acceptable as a CI step or daily batch job.

Note

Local judge LLMs are noisier than GPT-4. Published RAGAS scores in the literature use GPT-4-class judges. An 8B local judge produces noisier per-example scores and a small systematic bias. The aggregate is still useful for relative comparisons on the same dataset and judge (config A vs config B), which is what you need to make pipeline decisions. It is less reliable for absolute claims like “our faithfulness is 0.91”. Treat absolute numbers as rough bands.

Reading the results

A typical first-run output, on a 60-example dataset:

=== Aggregate scores ===
  faithfulness           0.812
  answer_relevancy       0.874
  context_precision      0.413
  context_recall         0.667

Per-example results saved to eval/results.json

A useful per-example view (truncated):

question                                      | faith. | rel.  | prec. | recall
What versions of OpenSSH have regreSSHion?    | 1.000  | 0.911 | 1.000 | 1.000
How is the XZ backdoor activated?             | 0.667  | 0.823 | 0.500 | 0.667
Which CVEs allow remote code execution?       | 0.500  | 0.799 | 0.250 | 0.333
Compare regreSSHion and XZ in terms of impact | 0.333  | 0.751 | 0.000 | 0.500

Concrete bands for security RAG, calibrated against typical local-stack scores:

  • Faithfulness > 0.85 is acceptable for production. > 0.95 is excellent.
  • Answer relevancy > 0.85 is acceptable. Below 0.75, the worst-scoring queries are usually vague (“tell me about Linux vulns”) where the model picked one slice and ran with it.
  • Context recall < 0.7 signals that the chunker is dropping needed information. Try larger chunks, more overlap, or paragraph-aware splitting. It can also indicate that top-k is too small.
  • Context precision < 0.5 means retrieval is finding the right chunks but ranking them poorly. The fix is a reranker; the hybrid-search tutorial covers BGE rerankers, which typically pull this metric from 0.4-0.5 up to 0.7-0.8.

The example output tells a clear story: relevancy clears its band and faithfulness sits just under 0.85, but precision is low and recall is mediocre. Retrieval is the bottleneck, not generation. Another week of prompt tuning would recover very little; a reranker would recover most of the available improvement.
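The bands can be turned into a small lookup that says where to look first. This is a sketch under stated assumptions: `BANDS` and `diagnose` are hypothetical names, and the cutoffs are the rough bands listed above, not values defined by RAGAS.

```python
# Cutoffs copied from the bands above; tune them for your own corpus and judge.
BANDS = {
    "faithfulness": (0.85, "generation: tighten the prompt or use a larger model"),
    "answer_relevancy": (0.85, "query understanding"),
    "context_precision": (0.50, "ranking: add a reranker"),
    "context_recall": (0.70, "chunker, embedder, or top-k"),
}


def diagnose(scores: dict[str, float]) -> list[tuple[str, float, str]]:
    """Return (metric, score, where to look) for every metric under its band."""
    findings = []
    for metric, (cutoff, where) in BANDS.items():
        value = scores.get(metric)
        if value is not None and value < cutoff:
            findings.append((metric, value, where))
    return findings


first_run = {
    "faithfulness": 0.812,
    "answer_relevancy": 0.874,
    "context_precision": 0.413,
    "context_recall": 0.667,
}
findings = diagnose(first_run)
for metric, value, where in findings:
    print(f"{metric:18s} {value:.3f}  -> {where}")
```

On the first-run numbers this flags faithfulness, context precision, and context recall. Precision has the largest gap to its band, which matches the reading that retrieval is the place to start.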

TruLens as an alternative

RAGAS is offline batch eval: collect a dataset, run it, score, repeat. The right tool for “did this change help?”

TruLens is online observability: it sits between the app and the LLM, captures every prompt and retrieval, attaches per-record metrics, and exposes them per query in a dashboard. The right tool for “what is happening in production right now?”

Note

TruLens 2.x is built on OpenTelemetry. TruLens used to ship as trulens-eval (import path trulens_eval) with a Feedback / .on_input_output() API. The current package is trulens (providers under trulens.providers.*), and 2.x is built on OpenTelemetry: you annotate your pipeline with @instrument, each retrieval and generation becomes a span, and Metric + Selector read their arguments straight from those spans rather than from a function’s input and output. The old Feedback class still works but is deprecated, and selectors are not subscriptable: you cannot pluck a field with select_record_output()["answer"]. Use @instrument attributes and Selector.select_context() instead. Set TRULENS_OTEL_TRACING=1 before importing trulens; it is the default in recent 2.x builds, but setting it explicitly keeps the code robust across versions.

Install:

cat >> eval/requirements.txt <<'REQ'
trulens==2.7.2
trulens-providers-litellm==2.7.2
streamlit==1.45.0
REQ

pip install -r eval/requirements.txt

Instead of wrapping a plain function, TruLens 2.x instruments the pipeline itself. Re-create the base pipeline from the chat interface tutorial as a small class and decorate retrieval, generation, and the top-level query with @instrument. Each call becomes an OpenTelemetry span, and the metrics read their inputs from those spans. Create chat_app_trulens.py:

import os

# Must be set before trulens is imported.
os.environ["TRULENS_OTEL_TRACING"] = "1"
os.environ["OLLAMA_API_BASE"] = "http://localhost:11434"

import numpy as np
import streamlit as st
import chromadb
import ollama

from trulens.core import TruSession, Metric, Selector
from trulens.core.otel.instrument import instrument
from trulens.otel.semconv.trace import SpanAttributes
from trulens.apps.app import TruApp
from trulens.providers.litellm import LiteLLM

COLLECTION = "security_advisories"
EMBED_MODEL = "nomic-embed-text"
CHAT_MODEL = "llama3.2"
JUDGE_MODEL = "ollama/llama3.1:8b"
TOP_K = 3


class RAG:
    """The base-tutorial pipeline, instrumented as OpenTelemetry spans."""

    @instrument(
        span_type=SpanAttributes.SpanType.RETRIEVAL,
        attributes={
            SpanAttributes.RETRIEVAL.QUERY_TEXT: "query",
            SpanAttributes.RETRIEVAL.RETRIEVED_CONTEXTS: "return",
        },
    )
    def retrieve(self, query: str) -> list[str]:
        coll = chromadb.PersistentClient(path="./chroma_db").get_collection(COLLECTION)
        q_emb = ollama.embed(model=EMBED_MODEL, input=query)["embeddings"][0]
        return coll.query(query_embeddings=[q_emb], n_results=TOP_K,
                          include=["documents"])["documents"][0]

    @instrument(span_type=SpanAttributes.SpanType.GENERATION)
    def generate(self, query: str, contexts: list[str]) -> str:
        ctx_text = "\n\n---\n\n".join(contexts)
        prompt = (
            "You are a security analyst assistant. Answer using only the context.\n"
            f"\nContext:\n{ctx_text}\n\nQuestion: {query}"
        )
        return ollama.chat(model=CHAT_MODEL,
                           messages=[{"role": "user", "content": prompt}]
                           )["message"]["content"]

    @instrument(
        span_type=SpanAttributes.SpanType.RECORD_ROOT,
        attributes={
            SpanAttributes.RECORD_ROOT.INPUT: "query",
            SpanAttributes.RECORD_ROOT.OUTPUT: "return",
        },
    )
    def query(self, query: str) -> str:
        contexts = self.retrieve(query)
        return self.generate(query, contexts)


@st.cache_resource
def get_app():
    """Build the instrumented app once and reuse it across Streamlit reruns."""
    TruSession()  # connects to the local sqlite trace database
    provider = LiteLLM(model_engine=JUDGE_MODEL)

    # Selectors read from the instrumented spans. select_context() pulls the
    # RETRIEVED_CONTEXTS attribute off the retrieval span; record input/output
    # come from the RECORD_ROOT span. No subscripting of the return value.
    f_groundedness = Metric(
        implementation=provider.groundedness_measure_with_cot_reasons,
        name="Groundedness",
        selectors={
            "source": Selector.select_context(collect_list=True),
            "statement": Selector.select_record_output(),
        },
    )
    f_answer_relevance = Metric(
        implementation=provider.relevance_with_cot_reasons,
        name="Answer Relevance",
        selectors={
            "prompt": Selector.select_record_input(),
            "response": Selector.select_record_output(),
        },
    )
    f_context_relevance = Metric(
        implementation=provider.context_relevance_with_cot_reasons,
        name="Context Relevance",
        selectors={
            "question": Selector.select_record_input(),
            # collect_list=False scores each chunk separately; agg averages them.
            "context": Selector.select_context(collect_list=False),
        },
        agg=np.mean,
    )

    rag = RAG()
    tru_app = TruApp(
        rag,
        app_name="rag-cve-assistant",
        app_version="v1",
        feedbacks=[f_groundedness, f_answer_relevance, f_context_relevance],
    )
    return rag, tru_app


rag, tru_app = get_app()

st.set_page_config(page_title="RAG Assistant (TruLens)", layout="centered")
st.title("Security Advisory Assistant (instrumented)")

if "messages" not in st.session_state:
    st.session_state.messages = []

for m in st.session_state.messages:
    with st.chat_message(m["role"]):
        st.markdown(m["content"])

if prompt := st.chat_input("Ask about a security advisory..."):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)

    with st.chat_message("assistant"):
        with tru_app as recording:
            answer = rag.query(prompt)
        st.markdown(answer)

    st.session_state.messages.append({"role": "assistant", "content": answer})

Calling rag.query() inside the with tru_app as recording: block is what produces a record. The contexts are not returned to the UI; TruLens captures them from the retrieval span, so the chat shows only the answer while the dashboard sees the full trace.

Start the chat app in one terminal:

streamlit run chat_app_trulens.py

Ask it a few questions, then launch the TruLens dashboard against the same trace database. Create dashboard.py:

import os

os.environ["TRULENS_OTEL_TRACING"] = "1"

from trulens.core import TruSession
from trulens.dashboard import run_dashboard

run_dashboard(TruSession())

Run it in a second terminal:

python dashboard.py

The dashboard opens at http://localhost:8484. Because the pipeline is instrumented span by span, each record shows the full trace: the input query, the retrieval span with its retrieved contexts, the generation span, and the three feedback scores with the judge’s chain-of-thought reasons attached to each.

Tip

Verify the instrumentation before trusting the numbers. Ask one question whose answer is unambiguous in the corpus, for example “What versions of OpenSSH are affected by regreSSHion?”, then open the dashboard and inspect that single record. A healthy record has three things: a retrieval span listing the chunks it pulled (three, at the default TOP_K), a generation span containing the answer, and all three feedback scores populated with a number and a reason. If the scores are blank, NaN, or stuck on “pending”, the judge LLM is not reachable: confirm ollama serve is running, that you have pulled llama3.1:8b, and that OLLAMA_API_BASE points at it. The distinction matters: a score of 0.0 with a reason is a real (bad) result worth investigating; a missing score is a plumbing problem, not a quality one.

The two are complementary: RAGAS gates your changes (config A vs config B on a fixed dataset, pre-merge CI checks, comparing chunking strategies); TruLens watches what users do (production query triage, drift detection, deep dives into single bad answers). Use both.

The judge-LLM problem

Both tools reduce metrics to LLM calls. “Is this claim entailed by the context?” is a smaller task than answering the original question, but it is still generation, and it inherits the judge’s weaknesses.

Local 7B/8B judges agree with GPT-4-class judges roughly 70-80% of the time on these metrics. About one in four per-example scores is wrong in some direction. Aggregated over 100 examples it mostly washes out, but two failure modes survive:

  • Systematic bias. A small judge may consistently over-score faithfulness because it does not catch subtle unsupported claims. Your “0.91” might be GPT-4’s “0.83”.
  • Prompt-format sensitivity. The judge’s score changes when phrasing changes, even when the underlying answer is the same.

Two mitigations:

  1. Spot-check the bottom 10%. Each eval run, sort by score and have a human grade the 10 lowest-scoring examples. If the judge agrees with you, trust the rest. If it flags things that are fine or misses things that are broken, the absolute numbers are not reliable.
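  2. Run two judges, treat disagreements as low-confidence. Run RAGAS twice with two different judge models (e.g. llama3.1:8b and qwen2.5:7b). Examples where they disagree by more than 0.3 on any metric are noise; agreement is signal. Optimize against the agreement subset.

import pandas as pd

METRICS = ["faithfulness", "answer_relevancy",
           "context_precision", "context_recall"]

# Both files must come from the same dataset in the same order,
# so that row i in each frame is the same example.
df_a = pd.read_json("eval/results_llama8b.json")
df_b = pd.read_json("eval/results_qwen7b.json")

# Flag an example when the judges differ by more than 0.3 on any metric.
# A metric that one judge failed to score (NaN) also counts as low-confidence.
diff = (df_a[METRICS] - df_b[METRICS]).abs()
low_conf = (diff > 0.3).any(axis=1) | diff.isna().any(axis=1)
print(f"{low_conf.sum()} of {len(df_a)} examples are low-confidence")
print(df_a.loc[~low_conf, METRICS].mean())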
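The first mitigation, spot-checking the lowest-scoring examples, can be sketched in plain Python. It assumes records shaped like the rows of a per-example results file (a JSON array of objects with `user_input` and one key per metric, where an unscored metric is null). `bottom_n` is a hypothetical helper and the records below are toy data.

```python
import math


def bottom_n(records, metric, n=10):
    """Return the n lowest-scoring records for one metric, skipping unscored rows."""
    scored = [
        r for r in records
        if isinstance(r.get(metric), (int, float)) and not math.isnan(r[metric])
    ]
    return sorted(scored, key=lambda r: r[metric])[:n]


records = [
    {"user_input": "q1", "faithfulness": 1.0},
    {"user_input": "q2", "faithfulness": 0.333},
    {"user_input": "q3", "faithfulness": None},   # the judge failed to score this one
    {"user_input": "q4", "faithfulness": 0.667},
]
worst = bottom_n(records, "faithfulness", n=2)
for r in worst:
    print(f'{r["faithfulness"]:.3f}  {r["user_input"]}')
```

Unscored rows are left out on purpose: a missing score is a plumbing problem to fix separately, not a low-quality answer for a human to grade.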

Closing the loop

The point of an eval harness is to make pipeline decisions you can defend:

  1. Run baseline. Score the current pipeline on the frozen dataset. Save results with the config.
  2. Identify the weakest dimension. Low recall → chunker. Low precision → reranker. Low faithfulness → prompt or model. Low relevancy → query understanding.
  3. Make one change. Only one. Two at once and you cannot attribute the delta.
  4. Re-run the same dataset. Same judge, same examples.
  5. Compare. Did the targeted metric improve? Did anything else regress?

A worked example, comparing two chunk sizes on the CVE corpus:

import pandas as pd

a = pd.read_json("eval/results_chunk500.json")
b = pd.read_json("eval/results_chunk800.json")
metrics = ["faithfulness", "answer_relevancy",
           "context_precision", "context_recall"]

summary = pd.DataFrame({
    "chunk_500": [a[m].mean() for m in metrics],
    "chunk_800": [b[m].mean() for m in metrics],
}, index=metrics)
summary["delta"] = summary["chunk_800"] - summary["chunk_500"]
print(summary.round(3))

Sample output:

                    chunk_500  chunk_800  delta
faithfulness            0.812      0.798 -0.014
answer_relevancy        0.874      0.881  0.007
context_precision       0.413      0.392 -0.021
context_recall          0.667      0.821  0.154

Chunk size 800 gave a 23% relative improvement in context recall at the cost of tiny faithfulness and precision regressions. Net win with caveats; if faithfulness matters more in your domain, revert.
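Aggregate deltas do not show whether a change is larger than run-to-run noise. One common way to get a rough answer is a paired bootstrap over the per-example scores. The sketch below assumes both runs scored the same examples in the same order; `paired_bootstrap_ci` is a hypothetical helper, and the recall values are made-up toy numbers, not results from this tutorial.

```python
import random


def paired_bootstrap_ci(before, after, n_resamples=2000, alpha=0.05, seed=0):
    """Mean per-example delta plus a percentile bootstrap interval for it."""
    if len(before) != len(after):
        raise ValueError("both runs must score the same examples in the same order")
    deltas = [b - a for a, b in zip(before, after)]  # after minus before
    n = len(deltas)
    rng = random.Random(seed)  # fixed seed so reruns give the same interval
    # Resample examples with replacement and record the mean delta each time.
    means = sorted(sum(rng.choices(deltas, k=n)) / n for _ in range(n_resamples))
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(deltas) / n, lo, hi


# Toy per-example context_recall scores for two configs (illustrative only).
recall_500 = [0.5, 0.667, 1.0, 0.333, 0.667, 0.5, 1.0, 0.667]
recall_800 = [0.667, 1.0, 1.0, 0.667, 0.667, 0.667, 1.0, 1.0]

mean_delta, lo, hi = paired_bootstrap_ci(recall_500, recall_800)
print(f"mean delta {mean_delta:+.3f}, 95% interval [{lo:+.3f}, {hi:+.3f}]")
```

If the interval straddles zero, treat the delta as noise at this dataset size. Small regressions on the order of 0.01 to 0.02 will often fall into that category with a few dozen examples and a local judge.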

Store the pipeline config next to the results in eval/run_config.json:

{
  "chunk_size": 800, "chunk_overlap": 80, "top_k": 3,
  "embed_model": "nomic-embed-text",
  "chat_model": "llama3.2", "judge_model": "llama3.1:8b",
  "ragas_version": "0.2.10",
  "dataset": "eval/golden.jsonl + eval/synthetic.jsonl",
  "dataset_hash": "sha256:f3a1...",
  "timestamp": "2026-04-30T14:22:11Z"
}

Commit both to version control, named after the config (eval/results-chunk800-top3-llama3.2.json). Six months from now when someone asks “wasn’t recall higher last summer?”, you can answer.
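The dataset hash is what lets you prove two runs used the same examples. A minimal sketch of producing it with the standard library follows; `dataset_hash` and `build_run_config` are hypothetical helpers, and the demo writes a throwaway file so the block runs on its own.

```python
import hashlib
import os
import tempfile
from datetime import datetime, timezone


def dataset_hash(paths):
    """SHA-256 over the concatenated bytes of the dataset files, in the given order."""
    digest = hashlib.sha256()
    for path in paths:
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(65536), b""):
                digest.update(block)
    return "sha256:" + digest.hexdigest()


def build_run_config(dataset_paths, **settings):
    """Combine pipeline settings with a fingerprint of the exact dataset used."""
    return {
        **settings,
        "dataset": " + ".join(str(p) for p in dataset_paths),
        "dataset_hash": dataset_hash(dataset_paths),
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


# Demo against a temporary file; in practice pass the real dataset paths.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "golden.jsonl")
    with open(path, "wb") as f:
        f.write(b'{"question": "q1"}\n')
    config = build_run_config([path], chunk_size=800, chunk_overlap=80, top_k=3)

print(config["dataset_hash"])
```

Because the hash covers file bytes, even a whitespace edit to the dataset changes it. That is the intended behavior: any edit to a frozen set should be visible in the run history.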

Anti-Goodhart guardrails

Goodhart’s Law: when a measure becomes a target, it ceases to be a good measure. Once you start optimizing the four metrics, the pipeline begins to drift toward gaming them.

Failure modes:

  • Faithfulness gaming. Hedge every claim with “according to the retrieved context.” Faithfulness rises; usefulness drops as answers become paraphrased chunk dumps.
  • Relevancy gaming. Restate the question in the answer. Cosine similarity rises; information content does not.
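  • Precision gaming via top-k reduction. Drop top-k from 5 to 1 and precision rises almost automatically, because only the single best-ranked chunk is scored: any query whose top chunk is relevant gets a perfect 1.0. Recall craters, but if nobody watches it, the dashboard looks great.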

Three guardrails:

  1. Rotate eval datasets quarterly. A frozen dataset becomes implicit training signal as you tune against it. Generate a fresh synthetic set quarterly and have a human curate 20 new golden examples. Keep the old set for trend tracking; make the rotated set the optimization target.
  2. Track unmoderated user feedback alongside metrics. Add thumbs-up/thumbs-down to the chat UI (one extra column in your TruLens database). When metrics improve but thumbs-down rate stays flat or rises, the metrics are lying.
  3. Watch all four metrics together, every run. A single-metric dashboard is the easiest way to start gaming. If you cannot show all four moving in a defensible direction, you only improved one number.
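The third guardrail can be enforced mechanically: compare all four aggregates on every run and flag any that dropped by more than a tolerance. The sketch below uses the chunk-size numbers from the comparison table earlier; `regressions` is a hypothetical helper and the 0.01 tolerance is an arbitrary assumption to adjust for your own noise level.

```python
def regressions(baseline, candidate, tolerance=0.01):
    """Return {metric: delta} for every metric that dropped by more than tolerance."""
    flagged = {}
    for metric, base_value in baseline.items():
        delta = candidate[metric] - base_value
        if delta < -tolerance:
            flagged[metric] = round(delta, 3)
    return flagged


chunk_500 = {"faithfulness": 0.812, "answer_relevancy": 0.874,
             "context_precision": 0.413, "context_recall": 0.667}
chunk_800 = {"faithfulness": 0.798, "answer_relevancy": 0.881,
             "context_precision": 0.392, "context_recall": 0.821}

flagged = regressions(chunk_500, chunk_800, tolerance=0.01)
print(flagged)  # {'faithfulness': -0.014, 'context_precision': -0.021}
```

A gate like this does not decide whether the trade is worth it. It only guarantees that a recall gain cannot be reported without the faithfulness and precision costs showing up next to it.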

When the instruments and your users disagree, the users are right and the instruments need recalibration.

Next steps

  • Add a reranker. If context precision is below 0.5, a cross-encoder reranker on top of dense retrieval is the next big lever. The hybrid-search tutorial covers BGE rerankers and BM25+dense fusion.
  • Wire RAGAS into CI. A GitHub Actions job that runs the full eval on pipeline-config PRs and posts the four-metric delta as a comment. Block merges where any metric regresses past a tolerance.
  • Instrument the production chat UI with TruLens. The @instrument pattern above generalizes; leave the dashboard running and triage low-scoring queries weekly.
  • Push toward harder questions. Once aggregate scores are above 0.85 across the board, the dataset is too easy. Add cross-document, version-range, and “is X affected” questions until the scores drop.

For the broader operational picture, see Private AI in Your SOC.