You changed the chunk size from 500 to 800. The five questions you typed into the chat UI came back with answers that read better than before. You ship. Two weeks later an analyst tells you the assistant has been missing the mitigation section on half their queries because the new chunk boundary lands in the middle of every “Mitigation:” paragraph. A small change moved a small visible sample in one direction and a large invisible population in the other. The fix is measurement.
This tutorial connects your existing local RAG pipeline (built across the base tutorial, the chat interface, and the live CVE feeds) to two eval systems: RAGAS for offline batch scoring, and TruLens for online per-query observability. Both run locally against an Ollama judge LLM.
Why “the answers look good” is not evaluation
The anecdote above is the most common pattern in RAG projects without an eval harness: five canonical questions held in the team’s head, run by hand after each change, ship when answers feel better. The problem is structural:
- Five questions cannot cover three failure modes. Retrieval missed the docs, the model ignored retrieved docs, or the model addressed the wrong question. Independent failures need independent measurement.
- The questions you remember are the ones the system already answers well. Confirmation bias picks the demo set; real user queries do not match it.
- Small-sample improvement is statistically meaningless. 4/5 to 5/5 on a hand-picked set is not signal. 41/50 to 47/50 on a frozen eval set is.
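To make the sample-size point concrete, here is a rough sketch that computes a Wilson score interval for a pass rate. The helper name wilson_interval and the 95% z-value are choices made for this illustration; they are not part of RAGAS or TruLens.

```python
import math

def wilson_interval(passed, total, z=1.96):
    """Approximate 95% Wilson score interval for a pass rate (illustrative helper)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half

# The same pass rates at two very different sample sizes.
for passed, total in [(4, 5), (5, 5), (41, 50), (47, 50)]:
    lo, hi = wilson_interval(passed, total)
    print(f"{passed}/{total}: {lo:.2f} to {hi:.2f}")
```

On five questions the interval covers most of the 0-1 range, so 4/5 and 5/5 cannot be told apart; at fifty it tightens considerably. For a before/after comparison on the same questions a paired test is the better tool; this sketch only shows how sample size limits what a pass rate can tell you.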
A working eval setup separates the failure categories:
| Failure | Question it answers | Where it lives in the pipeline |
|---|---|---|
| Retrieval | Did we even pull the right documents? | Embedding model, chunker, vector store |
| Grounding | Is every claim in the answer derivable from those documents? | LLM prompt, model size, system instructions |
| Relevance | Does the answer actually address the user’s question? | LLM, prompt, query understanding |
You need a metric for each. Optimizing one without watching the others is how you get a pipeline whose answers are perfectly faithful to perfectly wrong retrieved chunks.
The four metrics that matter
RAGAS (Retrieval-Augmented Generation Assessment) defines four core metrics that map cleanly onto the three failure categories above. The framing comes from the original RAGAS paper (Es et al., 2023, “RAGAS: Automated Evaluation of Retrieval Augmented Generation”); the formulas below are simplified.
Faithfulness
Of the claims the model made in its answer, what fraction are supported by the retrieved context?
$$\text{faithfulness} = \frac{|\text{claims in answer that are entailed by context}|}{|\text{claims in answer}|}$$
The hallucination detector. A judge LLM extracts atomic claims from the answer, then checks each against the retrieved chunks. Faithfulness drops when the model invents details, mis-attributes versions, or fabricates mitigations not in the source.
Answer relevancy
Does the answer actually address the question, regardless of whether it is correct?
$$\text{answer\_relevancy} = \frac{1}{n} \sum_{i=1}^{n} \operatorname{cosine\_sim}\big(\text{embed}(q_i),\ \text{embed}(\text{question})\big)$$
RAGAS asks the judge LLM to generate $n$ candidate questions $q_i$ for which the answer would be a good response, embeds them along with the original question, and averages the cosine similarities. A technically true answer that addresses the wrong question gets a low score. The metric depends on the embedding model: weak embedders collapse semantically distinct questions and inflate the score.
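The averaging step can be sketched in a few lines. The vectors below are made-up stand-ins for embeddings and answer_relevancy is a hypothetical helper name; RAGAS itself handles question generation and embedding through the judge LLM and the embedding model.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def answer_relevancy(question_vec, generated_vecs):
    """Mean cosine similarity between the original question and the
    questions generated back from the answer (toy version)."""
    if not generated_vecs:
        return 0.0
    return sum(cosine_sim(q, question_vec) for q in generated_vecs) / len(generated_vecs)

# Toy 3-dimensional "embeddings", invented for illustration.
question = [1.0, 0.0, 0.0]
on_topic = [[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]]
off_topic = [[0.0, 1.0, 0.0], [0.1, 0.9, 0.2]]
print(answer_relevancy(question, on_topic))
print(answer_relevancy(question, off_topic))
```

The on-topic set scores close to 1 and the off-topic set close to 0, which is the behavior the metric relies on.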
Context precision
Of the chunks we retrieved, are the relevant ones ranked at the top?
$$\text{context\_precision@}K = \frac{\sum_{k=1}^{K} \text{precision@}k \cdot \text{relevant}_k}{|\text{relevant chunks in top-}K|}$$
Here $\text{relevant}_k$ is 1 if the chunk at rank $k$ is relevant and 0 otherwise. This is average precision over the retrieved list; its mean across the eval set is mean average precision. It detects ranking failures: the right chunk is in the top 10 but buried at position 8 while loosely related chunks crowd the top. In a security RAG, the typical case is generic CVE chatter outranking the one chunk with the affected version range.
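A minimal sketch of the simplified formula, assuming the relevance flags are already known. In RAGAS those flags come from the judge LLM; here they are set by hand, and context_precision_at_k is an illustrative name.

```python
def context_precision_at_k(relevance):
    """relevance: 0/1 flags for the retrieved chunks, in rank order."""
    hits = 0
    weighted = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted += hits / rank  # precision@rank, counted only at relevant ranks
    return weighted / hits if hits else 0.0

print(context_precision_at_k([1, 0, 0]))  # relevant chunk ranked first
print(context_precision_at_k([0, 0, 1]))  # same chunk buried at rank 3
```

The same single relevant chunk yields a much lower score when it sits at rank 3 than at rank 1, which is the ranking failure the metric is meant to surface.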
Context recall
Of the things the ground-truth answer requires, how many are present in the retrieved chunks?
$$\text{context\_recall} = \frac{|\text{sentences in ground\_truth supported by context}|}{|\text{sentences in ground\_truth}|}$$
Low context recall means no amount of prompt engineering or model swap will save you; the information is not in the context window.
The four together form a diagnostic grid: low faithfulness with high others points at the model (tighten the prompt or upgrade); low relevancy points at query understanding; low precision points at ranking (add a reranker); low recall points at the chunker, embedder, or top-k.
Each metric attaches to a specific stage of the pipeline, which is what makes the grid diagnostic: a low score tells you where to look.
graph LR
Q[Question] --> RET[Retrieval]
RET -->|retrieved chunks| GEN[Generation]
GEN --> ANS[Answer]
GT[Ground-truth answer] -. compare .-> ANS
CP["Context precision:<br/>are relevant chunks ranked high?"] -. measures .-> RET
CR["Context recall:<br/>is the needed info retrieved at all?"] -. measures .-> RET
FA["Faithfulness:<br/>is every claim grounded in the chunks?"] -. measures .-> GEN
AR["Answer relevancy:<br/>does the answer address the question?"] -. measures .-> ANS
style Q fill:#4a9eff,stroke:#2a7edf,color:#fff
style RET fill:#868e96,stroke:#666e76,color:#fff
style GEN fill:#868e96,stroke:#666e76,color:#fff
style ANS fill:#51cf66,stroke:#31af46,color:#fff
style GT fill:#adb5bd,stroke:#8d959d,color:#fff
style CP fill:#ffa94d,stroke:#df894d,color:#fff
style CR fill:#ffa94d,stroke:#df894d,color:#fff
style FA fill:#ff6b6b,stroke:#df4b4b,color:#fff
style AR fill:#ff6b6b,stroke:#df4b4b,color:#fff
Context precision and context recall are the same precision/recall tradeoff you see in any classifier, applied to the ranked list of retrieved chunks. If that tradeoff is not yet intuitive, the Classifier Threshold Lab lets you drag a decision threshold across score distributions and watch precision and recall move against each other in real time.
Building an eval dataset
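Each example in the dataset is a (question, ground_truth_answer, ground_truth_contexts) triple. The ground-truth answer anchors context recall and context precision; faithfulness and answer relevancy are reference-free and need only the question, the retrieved chunks, and the generated answer. The ground-truth contexts record where the answer lives in the corpus, which makes retrieval misses easy to audit by hand.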
Three sources, used together:
- Real analyst questions. Pull the last 30 days of queries from your chat UI logs. Strip PII, then have a human write the ground-truth answer and mark which advisory chunks contain it.
- Synthetic questions from your corpus. RAGAS’ TestsetGenerator takes documents and produces Q/A/context triples. The questions tend to be shallower than real ones; use it for breadth, not depth.
- A small human-curated golden set. Twenty hand-written examples covering hard cases: cross-document reasoning, version ranges, “is X affected” questions.
Aim for 50-200 examples. Smaller and metric variance swamps any change you make; much larger and a local judge LLM becomes painfully slow per eval run.
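JSON Lines format, one example per line. The example below is pretty-printed for readability; in the file, each object must sit on a single line, because the loader parses the file line by line: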
{
"question": "What versions of OpenSSH are vulnerable to regreSSHion?",
"ground_truth_answer": "OpenSSH 8.5p1 through 9.7p1 on glibc-based Linux are vulnerable to CVE-2024-6387 (regreSSHion). The fix is in OpenSSH 9.8p1.",
"ground_truth_contexts": [
"CVE-2024-6387: regreSSHion. Affected versions: OpenSSH 8.5p1 through 9.7p1 (glibc-based). Fixed in: OpenSSH 9.8p1."
],
"tags": ["versions", "openssh"]
}
Save this as eval/golden.jsonl next to your existing pipeline code.
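Because a malformed line only surfaces when the eval run starts, a quick structural check can save a wasted run. The sketch below is one possible validator; validate_jsonl_lines and REQUIRED_KEYS are names invented for this example, and the required keys are simply the three fields of the triple described above.

```python
import json

REQUIRED_KEYS = {"question", "ground_truth_answer", "ground_truth_contexts"}

def validate_jsonl_lines(lines):
    """Return (line_number, problem) pairs; an empty list means every line is valid."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are skipped, as in the loader
        try:
            ex = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((lineno, f"not valid JSON: {err.msg}"))
            continue
        if not isinstance(ex, dict):
            problems.append((lineno, "expected a JSON object"))
            continue
        missing = REQUIRED_KEYS - ex.keys()
        if missing:
            problems.append((lineno, f"missing keys: {sorted(missing)}"))
        elif not isinstance(ex["ground_truth_contexts"], list):
            problems.append((lineno, "ground_truth_contexts must be a list"))
    return problems

sample = [
    json.dumps({"question": "q", "ground_truth_answer": "a",
                "ground_truth_contexts": ["c"]}),
    '{"question": "q only"}',
    "{",  # what a pretty-printed object looks like to a line-by-line reader
]
print(validate_jsonl_lines(sample))
```

To check a real file, pass the open file handle for eval/golden.jsonl as the lines argument.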
A simple loader and a synthetic-generation entrypoint live in eval/build_dataset.py:
import json
import sys
from pathlib import Path

from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_ollama import ChatOllama, OllamaEmbeddings
from ragas.testset import TestsetGenerator

EVAL_DIR = Path("eval")
GOLDEN_PATH = EVAL_DIR / "golden.jsonl"
SYNTHETIC_PATH = EVAL_DIR / "synthetic.jsonl"

def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def write_jsonl(path, examples):
    """Write one JSON object per line, creating the directory if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")

def load_all():
    """Golden examples first, then synthetic ones if they have been generated."""
    out = []
    for p in (GOLDEN_PATH, SYNTHETIC_PATH):
        if p.exists():
            out += load_jsonl(p)
    return out

def generate_synthetic(n=50):
    """Generate synthetic Q/A/context triples via RAGAS' TestsetGenerator.

    Tiny models (3B and below) produce poor questions; use llama3.1:8b+.
    """
    docs = DirectoryLoader("advisories", glob="*.txt",
                           loader_cls=TextLoader).load()
    llm = ChatOllama(model="llama3.1:8b", temperature=0)
    emb = OllamaEmbeddings(model="nomic-embed-text")
    generator = TestsetGenerator.from_langchain(llm=llm, embedding_model=emb)
    testset = generator.generate_with_langchain_docs(docs, testset_size=n)
    df = testset.to_pandas()
    # Map the RAGAS 0.2 column names onto the dataset schema used in this tutorial.
    examples = [{
        "question": row["user_input"],
        "ground_truth_answer": row["reference"],
        "ground_truth_contexts": list(row["reference_contexts"]),
        "tags": ["synthetic"],
    } for _, row in df.iterrows()]
    write_jsonl(SYNTHETIC_PATH, examples)
    print(f"Wrote {len(examples)} synthetic examples to {SYNTHETIC_PATH}")

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "synthetic":
        generate_synthetic(int(sys.argv[2]) if len(sys.argv) > 2 else 50)
    else:
        print(f"Loaded {len(load_all())} examples")
Run it:
python eval/build_dataset.py synthetic 50
python eval/build_dataset.py
Warning
Synthetic generation needs a capable judge
A 3B model will happily emit “What is CVE?” as a synthetic question and grade it as good. Use at least an 8B model for generation. Inspect the first 20 generated examples by hand before trusting the rest.
RAGAS setup with the local stack
RAGAS defaults to OpenAI. We configure it to use a local Ollama judge via the LangChain interface so nothing leaves the machine. langchain-ollama is the current package; the langchain_community.chat_models.ollama import was deprecated in 2024.
Pin the versions for this tutorial. RAGAS and TruLens both changed public APIs several times between 2024 and 2026; a loose pip install ragas trulens is likely to break imports or dataset schemas months after this is written.
cat > eval/requirements.txt <<'REQ'
ragas==0.2.10
langchain-ollama==0.2.3
langchain-community==0.3.16
datasets==3.2.0
chromadb==0.6.3
ollama==0.4.7
pandas==2.2.3
numpy==1.26.4
REQ
pip install -r eval/requirements.txt
ollama pull llama3.1:8b
The code below uses the RAGAS 0.2 column names: user_input, retrieved_contexts, response, reference, and reference_contexts. If you upgrade to RAGAS 0.3/0.4, check the migration notes before changing the pins; the metric names mostly survived, but testset generation and some schema helpers moved.
The 3B llama3.2 from the base tutorial is too small for a judge; claim extraction and entailment grading are unreliable below 7B. Create eval/run_ragas.py:
from pathlib import Path

import chromadb
import ollama
from datasets import Dataset
from langchain_ollama import ChatOllama, OllamaEmbeddings
from ragas import evaluate
from ragas.metrics import (faithfulness, answer_relevancy,
                           context_precision, context_recall)

from build_dataset import load_all

COLLECTION = "security_advisories"
EMBED_MODEL = "nomic-embed-text"
CHAT_MODEL = "llama3.2"
JUDGE_MODEL = "llama3.1:8b"
TOP_K = 3

def retrieve(question):
    """Embed the question and return the TOP_K nearest chunks from Chroma."""
    coll = chromadb.PersistentClient(path="./chroma_db").get_collection(COLLECTION)
    q_emb = ollama.embed(model=EMBED_MODEL, input=question)["embeddings"][0]
    return coll.query(query_embeddings=[q_emb], n_results=TOP_K,
                      include=["documents"])["documents"][0]

def generate(question, contexts):
    """Answer the question with the chat model, using only the retrieved chunks."""
    prompt = (
        "You are a security analyst assistant. Answer using only the context.\n"
        "If the context doesn't contain enough information, say so.\n\n"
        f"Context:\n{chr(10).join(contexts)}\n\nQuestion: {question}"
    )
    return ollama.chat(model=CHAT_MODEL,
                       messages=[{"role": "user", "content": prompt}]
                       )["message"]["content"]

def build_row(ex):
    """Run one example through the pipeline and shape it for RAGAS 0.2."""
    ctx = retrieve(ex["question"])
    return {
        "user_input": ex["question"],
        "retrieved_contexts": ctx,
        "response": generate(ex["question"], ctx),
        "reference": ex["ground_truth_answer"],
        "reference_contexts": ex["ground_truth_contexts"],
    }

def main(out_path="eval/results.json"):
    examples = load_all()
    print(f"Running {len(examples)} examples through the pipeline...")
    rows = [build_row(ex) for ex in examples]
    print("Scoring with RAGAS...")
    results = evaluate(
        dataset=Dataset.from_list(rows),
        metrics=[faithfulness, answer_relevancy,
                 context_precision, context_recall],
        llm=ChatOllama(model=JUDGE_MODEL, temperature=0),
        embeddings=OllamaEmbeddings(model=EMBED_MODEL),
    )
    df = results.to_pandas()
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    df.to_json(out_path, orient="records", indent=2)
    print("\n=== Aggregate scores ===")
    for m in ["faithfulness", "answer_relevancy",
              "context_precision", "context_recall"]:
        if m in df.columns:
            print(f" {m:22s} {df[m].mean():.3f}")
    print(f"\nPer-example results saved to {out_path}")

if __name__ == "__main__":
    main()
Run it:
python eval/run_ragas.py
Expect this to take time. 100 examples through a local 8B judge typically lands at 30-60 minutes for a full eval. Not interactive, but acceptable as a CI step or daily batch job.
Note
Local judge LLMs are noisier than GPT-4
Published RAGAS scores in the literature use GPT-4-class judges. An 8B local judge produces noisier per-example scores and a small systematic bias. The aggregate is still useful for relative comparisons on the same dataset and judge (config A vs config B), which is what you need to make pipeline decisions. It is less reliable for absolute claims like “our faithfulness is 0.91”. Treat absolute numbers as rough bands.
Reading the results
A typical first-run output, on a 60-example dataset:
=== Aggregate scores ===
faithfulness 0.812
answer_relevancy 0.874
context_precision 0.413
context_recall 0.667
Per-example results saved to eval/results.json
A useful per-example view (truncated):
| question | faith. | rel. | prec. | recall |
|---|---|---|---|---|
| What versions of OpenSSH have regreSSHion? | 1.000 | 0.911 | 1.000 | 1.000 |
| How is the XZ backdoor activated? | 0.667 | 0.823 | 0.500 | 0.667 |
| Which CVEs allow remote code execution? | 0.500 | 0.799 | 0.250 | 0.333 |
| Compare regreSSHion and XZ in terms of impact | 0.333 | 0.751 | 0.000 | 0.500 |
Concrete bands for security RAG, calibrated against typical local-stack scores:
- Faithfulness > 0.85 is acceptable for production. > 0.95 is excellent.
- Answer relevancy > 0.85 is acceptable. Below 0.75, the worst-scoring queries are usually vague (“tell me about Linux vulns”) where the model picked one slice and ran with it.
- Context recall < 0.7 signals that the chunker is dropping needed information. Try larger chunks, more overlap, or paragraph-aware splitting. Can also indicate top-k is too small.
- Context precision < 0.5 means retrieval is finding the right chunks but ranking them poorly. The fix is a reranker; the hybrid-search tutorial later in this series covers BGE rerankers, which typically pull this metric from 0.4-0.5 up to 0.7-0.8.
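The bands above can be turned into a small triage helper. This is a sketch only: diagnose is a hypothetical function, the thresholds are copied from the list above, and the suggested fixes are shorthand for the advice in the text.

```python
def diagnose(scores):
    """Map aggregate scores to the pipeline stage worth inspecting first.

    Thresholds follow the bands listed above; retrieval checks come first
    because generation metrics are hard to interpret while the needed
    chunks are missing or badly ranked.
    """
    findings = []
    if scores["context_recall"] < 0.7:
        findings.append("context_recall: check chunker, overlap, or top-k")
    if scores["context_precision"] < 0.5:
        findings.append("context_precision: ranking problem, consider a reranker")
    if scores["faithfulness"] <= 0.85:
        findings.append("faithfulness: tighten the prompt or use a larger model")
    if scores["answer_relevancy"] <= 0.85:
        findings.append("answer_relevancy: look at vague or misread queries")
    return findings

first_run = {"faithfulness": 0.812, "answer_relevancy": 0.874,
             "context_precision": 0.413, "context_recall": 0.667}
for line in diagnose(first_run):
    print(line)
```

Treat the output as a reading order, not a verdict: with a noisy local judge, a score sitting just under a band is a prompt to look at examples, not proof of a defect.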
The example output tells a clear story: relevancy is acceptable and faithfulness sits just under the acceptable band, while precision is low and recall is mediocre. Retrieval is the bottleneck, not generation. Another week of prompt tuning would yield little; a reranker would deliver most of the available improvement.
TruLens as an alternative
RAGAS is offline batch eval: collect a dataset, run it, score, repeat. The right tool for “did this change help?”
TruLens is online observability: it sits between the app and the LLM, captures every prompt and retrieval, attaches per-record metrics, and exposes them per query in a dashboard. The right tool for “what is happening in production right now?”
Note
TruLens 2.x is built on OpenTelemetry
TruLens used to ship as trulens-eval (import path trulens_eval) with a Feedback / .on_input_output() API. The current package is trulens (providers under trulens.providers.*), and 2.x is built on OpenTelemetry: you annotate your pipeline with @instrument, each retrieval and generation becomes a span, and Metric + Selector read their arguments straight from those spans rather than from a function’s input and output. The old Feedback class still works but is deprecated, and selectors are not subscriptable: you cannot pluck a field with select_record_output()["answer"]. Use @instrument attributes and Selector.select_context() instead. Set TRULENS_OTEL_TRACING=1 before importing trulens; it is the default in recent 2.x builds, but setting it explicitly keeps the code robust across versions.
Install:
cat >> eval/requirements.txt <<'REQ'
trulens==2.7.2
trulens-providers-litellm==2.7.2
streamlit==1.45.0
REQ
pip install -r eval/requirements.txt
Instead of wrapping a plain function, TruLens 2.x instruments the pipeline itself. Re-create the base pipeline from the chat interface tutorial as a small class and decorate retrieval, generation, and the top-level query with @instrument. Each call becomes an OpenTelemetry span, and the metrics read their inputs from those spans. Create chat_app_trulens.py:
import os

# Must be set before trulens is imported.
os.environ["TRULENS_OTEL_TRACING"] = "1"
os.environ["OLLAMA_API_BASE"] = "http://localhost:11434"

import numpy as np
import streamlit as st
import chromadb
import ollama
from trulens.core import TruSession, Metric, Selector
from trulens.core.otel.instrument import instrument
from trulens.otel.semconv.trace import SpanAttributes
from trulens.apps.app import TruApp
from trulens.providers.litellm import LiteLLM

COLLECTION = "security_advisories"
EMBED_MODEL = "nomic-embed-text"
CHAT_MODEL = "llama3.2"
JUDGE_MODEL = "ollama/llama3.1:8b"
TOP_K = 3

class RAG:
    """The base-tutorial pipeline, instrumented as OpenTelemetry spans."""

    @instrument(
        span_type=SpanAttributes.SpanType.RETRIEVAL,
        attributes={
            SpanAttributes.RETRIEVAL.QUERY_TEXT: "query",
            SpanAttributes.RETRIEVAL.RETRIEVED_CONTEXTS: "return",
        },
    )
    def retrieve(self, query: str) -> list[str]:
        coll = chromadb.PersistentClient(path="./chroma_db").get_collection(COLLECTION)
        q_emb = ollama.embed(model=EMBED_MODEL, input=query)["embeddings"][0]
        return coll.query(query_embeddings=[q_emb], n_results=TOP_K,
                          include=["documents"])["documents"][0]

    @instrument(span_type=SpanAttributes.SpanType.GENERATION)
    def generate(self, query: str, contexts: list[str]) -> str:
        ctx_text = "\n\n---\n\n".join(contexts)
        prompt = (
            "You are a security analyst assistant. Answer using only the context.\n"
            f"\nContext:\n{ctx_text}\n\nQuestion: {query}"
        )
        return ollama.chat(model=CHAT_MODEL,
                           messages=[{"role": "user", "content": prompt}]
                           )["message"]["content"]

    @instrument(
        span_type=SpanAttributes.SpanType.RECORD_ROOT,
        attributes={
            SpanAttributes.RECORD_ROOT.INPUT: "query",
            SpanAttributes.RECORD_ROOT.OUTPUT: "return",
        },
    )
    def query(self, query: str) -> str:
        contexts = self.retrieve(query)
        return self.generate(query, contexts)

@st.cache_resource
def get_app():
    """Build the instrumented app once and reuse it across Streamlit reruns."""
    TruSession()  # connects to the local sqlite trace database
    provider = LiteLLM(model_engine=JUDGE_MODEL)
    # Selectors read from the instrumented spans. select_context() pulls the
    # RETRIEVED_CONTEXTS attribute off the retrieval span; record input/output
    # come from the RECORD_ROOT span. No subscripting of the return value.
    f_groundedness = Metric(
        implementation=provider.groundedness_measure_with_cot_reasons,
        name="Groundedness",
        selectors={
            "source": Selector.select_context(collect_list=True),
            "statement": Selector.select_record_output(),
        },
    )
    f_answer_relevance = Metric(
        implementation=provider.relevance_with_cot_reasons,
        name="Answer Relevance",
        selectors={
            "prompt": Selector.select_record_input(),
            "response": Selector.select_record_output(),
        },
    )
    f_context_relevance = Metric(
        implementation=provider.context_relevance_with_cot_reasons,
        name="Context Relevance",
        selectors={
            "question": Selector.select_record_input(),
            # collect_list=False scores each chunk separately; agg averages them.
            "context": Selector.select_context(collect_list=False),
        },
        agg=np.mean,
    )
    rag = RAG()
    tru_app = TruApp(
        rag,
        app_name="rag-cve-assistant",
        app_version="v1",
        feedbacks=[f_groundedness, f_answer_relevance, f_context_relevance],
    )
    return rag, tru_app

# Configure the page before any other Streamlit call; the cached get_app()
# call can render a spinner, so it comes second.
st.set_page_config(page_title="RAG Assistant (TruLens)", layout="centered")
rag, tru_app = get_app()
st.title("Security Advisory Assistant (instrumented)")

if "messages" not in st.session_state:
    st.session_state.messages = []
for m in st.session_state.messages:
    with st.chat_message(m["role"]):
        st.markdown(m["content"])

if prompt := st.chat_input("Ask about a security advisory..."):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)
    with st.chat_message("assistant"):
        with tru_app as recording:
            answer = rag.query(prompt)
        st.markdown(answer)
    st.session_state.messages.append({"role": "assistant", "content": answer})
Calling rag.query() inside the with tru_app as recording: block is what produces a record. The contexts are not returned to the UI; TruLens captures them from the retrieval span, so the chat shows only the answer while the dashboard sees the full trace.
Start the chat app in one terminal:
streamlit run chat_app_trulens.py
Ask it a few questions, then launch the TruLens dashboard against the same trace database. Create dashboard.py:
import os
os.environ["TRULENS_OTEL_TRACING"] = "1"
from trulens.core import TruSession
from trulens.dashboard import run_dashboard
run_dashboard(TruSession())
Run it in a second terminal:
python dashboard.py
The dashboard opens at http://localhost:8484. Because the pipeline is instrumented span by span, each record shows the full trace: the input query, the retrieval span with its retrieved contexts, the generation span, and the three feedback scores with the judge’s chain-of-thought reasons attached to each.
Tip
Verify the instrumentation before trusting the numbers
Ask one question whose answer is unambiguous in the corpus, for example “What versions of OpenSSH are affected by regreSSHion?”, then open the dashboard and inspect that single record. A healthy record has three things: a retrieval span listing the chunks it pulled (three, at the default TOP_K), a generation span containing the answer, and all three feedback scores populated with a number and a reason. If the scores are blank, NaN, or stuck on “pending”, the judge LLM is not reachable: confirm ollama serve is running, that you have pulled llama3.1:8b, and that OLLAMA_API_BASE points at it. The distinction matters: a score of 0.0 with a reason is a real (bad) result worth investigating; a missing score is a plumbing problem, not a quality one.
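The missing-versus-zero distinction is easy to encode once scores have been exported. The sketch below works on a plain dict of feedback name to value; that flattened shape and the triage_scores helper are assumptions made for illustration, not a TruLens data structure or API.

```python
import math

def triage_scores(scores):
    """Split feedback values into plumbing problems (missing) and real results."""
    missing, real = [], {}
    for name, value in scores.items():
        if value is None or (isinstance(value, float) and math.isnan(value)):
            missing.append(name)  # no score at all: check the judge connection
        else:
            real[name] = value    # a number, even 0.0, is a genuine result
    return missing, real

# Hypothetical record: one real zero, one blank score, one NaN.
record = {"Groundedness": 0.0, "Answer Relevance": None,
          "Context Relevance": float("nan")}
missing, real = triage_scores(record)
print("check the judge connection for:", missing)
print("investigate as quality results:", real)
```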
The two are complementary: RAGAS gates your changes (config A vs config B on a fixed dataset, pre-merge CI checks, comparing chunking strategies); TruLens watches what users do (production query triage, drift detection, deep dives into single bad answers). Use both.
The judge-LLM problem
Both tools reduce metrics to LLM calls. “Is this claim entailed by the context?” is a smaller task than answering the original question, but it is still generation, and it inherits the judge’s weaknesses.
Local 7B/8B judges agree with GPT-4-class judges roughly 70-80% of the time on these metrics. About one in four per-example scores is wrong in some direction. Aggregated over 100 examples it mostly washes out, but two failure modes survive:
- Systematic bias. A small judge may consistently over-score faithfulness because it does not catch subtle unsupported claims. Your “0.91” might be GPT-4’s “0.83”.
- Prompt-format sensitivity. The judge’s score changes when phrasing changes, even when the underlying answer is the same.
Two mitigations:
- Spot-check the bottom 10%. Each eval run, sort by score and have a human grade the 10 lowest-scoring examples. If the judge agrees with you, trust the rest. If it flags things that are fine or misses things that are broken, the absolute numbers are not reliable.
- Run two judges, treat disagreements as low-confidence. Run RAGAS twice with two different judge models (e.g. llama3.1:8b and qwen2.5:7b). Examples where they disagree by more than 0.3 are noise; agreement is signal. Optimize against the agreement subset.
import pandas as pd
METRICS = ["faithfulness", "answer_relevancy",
"context_precision", "context_recall"]
df_a = pd.read_json("eval/results_llama8b.json")
df_b = pd.read_json("eval/results_qwen7b.json")
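# Low-confidence: the two judges differ by more than 0.3 on at least one metric.
# Both result files must contain the same examples in the same order.
low_conf = ((df_a[METRICS] - df_b[METRICS]).abs() > 0.3).any(axis=1)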
print(f"{low_conf.sum()} of {len(df_a)} examples are low-confidence")
print(df_a[~low_conf].mean(numeric_only=True))
Closing the loop
The point of an eval harness is to make pipeline decisions you can defend:
- Run baseline. Score the current pipeline on the frozen dataset. Save results with the config.
- Identify the weakest dimension. Low recall → chunker. Low precision → reranker. Low faithfulness → prompt or model. Low relevancy → query understanding.
- Make one change. Only one. Two at once and you cannot attribute the delta.
- Re-run the same dataset. Same judge, same examples.
- Compare. Did the targeted metric improve? Did anything else regress?
A worked example, comparing two chunk sizes on the CVE corpus:
import pandas as pd
a = pd.read_json("eval/results_chunk500.json")
b = pd.read_json("eval/results_chunk800.json")
metrics = ["faithfulness", "answer_relevancy",
"context_precision", "context_recall"]
summary = pd.DataFrame({
"chunk_500": [a[m].mean() for m in metrics],
"chunk_800": [b[m].mean() for m in metrics],
}, index=metrics)
summary["delta"] = summary["chunk_800"] - summary["chunk_500"]
print(summary.round(3))
Sample output:
                   chunk_500  chunk_800   delta
faithfulness           0.812      0.798  -0.014
answer_relevancy       0.874      0.881   0.007
context_precision      0.413      0.392  -0.021
context_recall         0.667      0.821   0.154
Chunk size 800 gave a 23% relative improvement in context recall at the cost of tiny faithfulness and precision regressions. Net win with caveats; if faithfulness matters more in your domain, revert.
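Whether a delta of this size is beyond resampling noise can be estimated with a paired bootstrap over per-example scores. The sketch below is one way to do it; paired_bootstrap_delta is a hypothetical helper, and the two score lists are made-up numbers, not output from the runs above.

```python
import random

def paired_bootstrap_delta(scores_a, scores_b, n_resamples=2000, seed=0):
    """95% bootstrap interval for mean(b) - mean(a) over the same examples."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equal-length, non-empty score lists")
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Resample the per-example differences with replacement and record each mean.
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    return means[int(0.025 * n_resamples)], means[int(0.975 * n_resamples) - 1]

# Made-up per-example context_recall scores, for illustration only.
chunk_500 = [0.5, 0.667, 1.0, 0.333, 0.667, 0.5, 1.0, 0.667, 0.5, 0.667]
chunk_800 = [0.667, 1.0, 1.0, 0.5, 1.0, 0.667, 1.0, 0.667, 0.667, 1.0]
low, high = paired_bootstrap_delta(chunk_500, chunk_800)
print(f"delta 95% interval: {low:.3f} to {high:.3f}")
```

An interval that excludes zero suggests the change is not an artifact of which examples happened to be in the set. It says nothing about judge noise, which is a separate source of error.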
Store the pipeline config next to the results in eval/run_config.json:
{
"chunk_size": 800, "chunk_overlap": 80, "top_k": 3,
"embed_model": "nomic-embed-text",
"chat_model": "llama3.2", "judge_model": "llama3.1:8b",
"ragas_version": "0.2.10",
"dataset": "eval/golden.jsonl + eval/synthetic.jsonl",
"dataset_hash": "sha256:f3a1...",
"timestamp": "2026-04-30T14:22:11Z"
}
Commit both to version control, named after the config (eval/results-chunk800-top3-llama3.2.json). Six months from now when someone asks “wasn’t recall higher last summer?”, you can answer.
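The dataset_hash field is what proves two runs scored the same examples. The text does not say how it is computed, so the sketch below is one plausible approach: a SHA-256 over the raw bytes of the dataset files, in a fixed order. The dataset_hash function name is an assumption for this example.

```python
import hashlib
from pathlib import Path

def dataset_hash(paths):
    """SHA-256 over the bytes of the dataset files, in the order given."""
    digest = hashlib.sha256()
    for path in paths:
        digest.update(Path(path).read_bytes())
    return "sha256:" + digest.hexdigest()

if __name__ == "__main__":
    # Hash whichever of the two dataset files exist, always in the same order.
    files = [p for p in ("eval/golden.jsonl", "eval/synthetic.jsonl")
             if Path(p).exists()]
    print(dataset_hash(files))
```

Any edit to either file, including whitespace, changes the hash, so a mismatch between two result files is a signal that their scores are not directly comparable.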
Anti-Goodhart guardrails
Goodhart’s Law: when a measure becomes a target, it ceases to be a good measure. Once you start optimizing the four metrics, the pipeline begins to drift toward gaming them.
Failure modes:
- Faithfulness gaming. Hedge every claim with “according to the retrieved context.” Faithfulness rises; usefulness drops as answers become paraphrased chunk dumps.
- Relevancy gaming. Restate the question in the answer. Cosine similarity rises; information content does not.
- Precision gaming via top-k reduction. Drop top-k from 5 to 1 and precision goes to 1.0 by construction. Recall craters but if nobody watches it, the dashboard looks great.
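The top-k failure mode is easy to reproduce on a toy ranking. The sketch below uses plain set-based precision and recall, a simplification of the judge-based RAGAS metrics, and assumes the top-ranked chunk is relevant; precision_recall_at_k is an illustrative helper.

```python
def precision_recall_at_k(ranked_relevance, total_relevant, k):
    """Set-based precision and recall over the top-k retrieved chunks."""
    hits = sum(ranked_relevance[:k])
    precision = hits / k if k else 0.0
    recall = hits / total_relevant if total_relevant else 0.0
    return precision, recall

# Toy ranking: 1 = relevant chunk, 0 = irrelevant. Three relevant chunks exist.
ranking = [1, 0, 1, 0, 1]
for k in (5, 3, 1):
    p, r = precision_recall_at_k(ranking, total_relevant=3, k=k)
    print(f"top_k={k}: precision={p:.2f} recall={r:.2f}")
```

Shrinking top-k from 5 to 1 lifts precision from 0.60 to 1.00 while recall falls from 1.00 to 0.33, which is why the two numbers have to be read together.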
Three guardrails:
- Rotate eval datasets quarterly. A frozen dataset becomes implicit training signal as you tune against it. Generate a fresh synthetic set quarterly and have a human curate 20 new golden examples. Keep the old set for trend tracking; make the rotated set the optimization target.
- Track unmoderated user feedback alongside metrics. Add thumbs-up/thumbs-down to the chat UI (one extra column in your TruLens database). When metrics improve but thumbs-down rate stays flat or rises, the metrics are lying.
- Watch all four metrics together, every run. A single-metric dashboard is the easiest way to start gaming. If you cannot show all four moving in a defensible direction, you only improved one number.
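Watching all four metrics can be reduced to a small check that runs after every eval. The sketch below flags any metric that dropped by more than a tolerance; find_regressions and the 0.02 tolerance are assumptions for illustration, and the numbers are the chunk-size results from the worked example.

```python
METRICS = ("faithfulness", "answer_relevancy",
           "context_precision", "context_recall")

def find_regressions(baseline, candidate, tolerance=0.02):
    """Return {metric: delta} for every metric that dropped by more than tolerance."""
    regressions = {}
    for metric in METRICS:
        delta = candidate[metric] - baseline[metric]
        if delta < -tolerance:
            regressions[metric] = round(delta, 3)
    return regressions

baseline = {"faithfulness": 0.812, "answer_relevancy": 0.874,
            "context_precision": 0.413, "context_recall": 0.667}
candidate = {"faithfulness": 0.798, "answer_relevancy": 0.881,
             "context_precision": 0.392, "context_recall": 0.821}
print(find_regressions(baseline, candidate))
```

With a tolerance of 0.02 the chunk-size change is flagged for its precision drop even though recall improved substantially. The tolerance should sit above the run-to-run noise of your judge, which you can estimate by scoring the same config twice.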
When the instruments and your users disagree, the users are right and the instruments need recalibration.
Next steps
- Add a reranker. If context precision is below 0.5, a cross-encoder reranker on top of dense retrieval is the next big lever. The hybrid-search tutorial covers BGE rerankers and BM25+dense fusion.
- Wire RAGAS into CI. A GitHub Actions job that runs the full eval on pipeline-config PRs and posts the four-metric delta as a comment. Block merges where any metric regresses past a tolerance.
- Instrument the production chat UI with TruLens. The @instrument pattern above generalizes; leave the dashboard running and triage low-scoring queries weekly.
- Push toward harder questions. Once aggregate scores are above 0.85 across the board, the dataset is too easy. Add cross-document, version-range, and “is X affected” questions until the scores drop.
For the broader operational picture, see Private AI in Your SOC.