Large language models are confident, but they don’t know your data. Ask one about a CVE published last week or an internal advisory from your team, and it will either hallucinate or punt. Retrieval-augmented generation fixes this by grounding the model’s responses in documents you control.
This tutorial builds a complete RAG pipeline that runs entirely on your machine — no API keys, no cloud services, no data leaving the boundary. You’ll use security advisories as the document corpus, which makes this directly applicable to SOC and vulnerability management workflows. By the end you’ll have a working system that retrieves relevant advisory text and generates answers grounded in that context.
If you want to see what embeddings look like under the hood, open the Embedding Space Explorer in another tab. It visualizes how models cluster words by meaning — the same principle that makes retrieval work.
How retrieval-augmented generation works
RAG splits the problem into two phases: retrieval (find the right documents) and generation (answer using those documents as context).
┌──────────────┐
│  Documents   │
└──────┬───────┘
       │
  chunk + embed
       │
┌──────▼───────┐          ┌──────────┐
│ Vector Store │◄─ embed ─│  Query   │
│  (ChromaDB)  │          └──────────┘
└──────┬───────┘
       │
 similarity search
       │
┌──────▼───────┐
│ Top-k Chunks │
└──────┬───────┘
       │
 inject as context
       │
┌──────▼───────┐
│   LLM Chat   │
│   (Ollama)   │
└──────┬───────┘
       │
┌──────▼───────┐
│    Answer    │
└──────────────┘

The key insight: the LLM never searches for documents itself. A separate retrieval system finds relevant chunks, and the LLM receives them as part of its prompt. This means the model’s answer is constrained by what was actually retrieved — which is both the strength (grounding) and the limitation (retrieval quality is the ceiling).
RAG doesn’t eliminate hallucination entirely. The model can still misinterpret or over-extrapolate from retrieved context. But it dramatically reduces the failure mode of “making things up from nothing.”
Step 1: Install Ollama
Ollama runs LLMs locally with a simple CLI. You need two models: one for generating embeddings (turning text into vectors) and one for chat (generating answers).
Install Ollama:
curl -fsSL https://ollama.com/install.sh | sh

Pull the embedding model and a chat model:
ollama pull nomic-embed-text
ollama pull llama3.2

Verify both models respond:
# Test embeddings — the response is a JSON object containing an array of floats
curl -s http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "test"}' | head -c 100

# Test chat
ollama run llama3.2 "Say hello in one sentence."

nomic-embed-text produces 768-dimensional vectors and runs well on modest hardware. For the chat model, llama3.2 (3B parameters) strikes a good balance between quality and speed. If you have a GPU with 16+ GB of VRAM, you can substitute llama3.1:8b for better answers.
Step 2: Set up the project
Create a project directory and install the Python dependencies:
mkdir rag-pipeline && cd rag-pipeline
python -m venv .venv
source .venv/bin/activate
pip install chromadb ollama

Create the project structure:
mkdir advisories

You’ll end up with:
rag-pipeline/
├── advisories/     # Document corpus
├── ingest.py       # Chunking and embedding script
├── query.py        # Query interface
└── rag.py          # Complete pipeline (final version)

Step 3: Prepare the documents
Create a small corpus of security advisories. In production you’d pull these from a feed, but for this tutorial, create them manually. The content matters more than the quantity — three well-written advisories teach the pipeline more than thirty empty ones.
Create the following files in the advisories/ directory:
advisories/CVE-2024-3094.txt
CVE-2024-3094: XZ Utils Backdoor
Severity: Critical (CVSS 10.0)
Affected versions: xz 5.6.0, 5.6.1
Fixed in: xz 5.6.2
A supply chain compromise was discovered in XZ Utils, a
widely-used compression library. Malicious code was introduced
through a series of obfuscated commits to the upstream repository
by a trusted maintainer account. The backdoor targeted the build
process of liblzma, injecting code that modified the behavior of
OpenSSH's sshd when linked against the compromised library.
The backdoor allowed unauthorized remote access by intercepting
SSH authentication. It specifically targeted x86-64 Linux systems
using glibc and systemd. The malicious payload was hidden in test
fixture files and activated through a multi-stage build script
modification.
Mitigation: Downgrade to xz 5.4.x or upgrade to 5.6.2. Audit
systems running affected versions for signs of unauthorized SSH
access. Check if sshd links against liblzma with:
ldd $(which sshd) | grep liblzma

advisories/CVE-2024-6387.txt
CVE-2024-6387: regreSSHion — OpenSSH Race Condition
Severity: High (CVSS 8.1)
Affected versions: OpenSSH 8.5p1 through 9.7p1 (glibc-based)
Fixed in: OpenSSH 9.8p1
A signal handler race condition in OpenSSH's sshd allows
unauthenticated remote code execution as root on glibc-based
Linux systems. The vulnerability exists in the SIGALRM handler
called during the LoginGraceTime period. An attacker can trigger
the race condition by failing to authenticate within the grace
period.
Exploitation requires approximately 10,000 connection attempts,
making it detectable through monitoring. The attack takes several
hours against a 32-bit target and has not been demonstrated
against 64-bit systems with ASLR, though this does not rule it
out.
Mitigation: Upgrade to OpenSSH 9.8p1. As a temporary workaround,
set LoginGraceTime to 0 in sshd_config — this eliminates the
race condition but exposes the server to connection exhaustion.
Monitor for unusual volumes of failed SSH connections.

advisories/CVE-2023-44487.txt
CVE-2023-44487: HTTP/2 Rapid Reset DDoS Attack
Severity: High (CVSS 7.5)
Affected: All HTTP/2 implementations
(nginx, Apache, IIS, envoy, Go net/http, Node.js)
Fixed in: Varies by implementation
A novel denial-of-service technique exploits the HTTP/2
protocol's stream multiplexing feature. The attacker opens a
large number of HTTP/2 streams and immediately cancels them
using RST_STREAM frames. Because the server processes each
stream request before receiving the cancellation, this creates
asymmetric resource consumption — the server does significantly
more work than the client.
This technique was observed in the wild generating over 398
million requests per second against Google Cloud infrastructure.
Unlike traditional volumetric DDoS, this attack requires
relatively little bandwidth from the attacker.
Mitigation: Apply vendor-specific patches. Configure HTTP/2
stream limits and rate limiting. Monitor for abnormal RST_STREAM
frame rates. Consider limiting maximum concurrent streams per
connection to 100 or fewer.

Step 4: Chunk and embed
Raw documents are too long to use as search results directly. You need to split them into chunks small enough to be meaningful but large enough to preserve context, then convert each chunk into an embedding vector.
Create ingest.py:
import os

import chromadb
import ollama

ADVISORY_DIR = "advisories"
COLLECTION_NAME = "security_advisories"
CHUNK_SIZE = 500     # characters per chunk
CHUNK_OVERLAP = 50   # overlap between consecutive chunks

def chunk_text(text, size=CHUNK_SIZE, overlap=CHUNK_OVERLAP):
    """Split text into overlapping chunks."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + size
        chunks.append(text[start:end])
        start += size - overlap
    return chunks

def load_advisories(directory):
    """Read all .txt files from the advisory directory."""
    documents = []
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith(".txt"):
            continue
        filepath = os.path.join(directory, filename)
        with open(filepath) as f:
            text = f.read().strip()
        documents.append({"filename": filename, "text": text})
    return documents

def embed(text):
    """Generate an embedding vector using Ollama."""
    response = ollama.embed(model="nomic-embed-text", input=text)
    return response["embeddings"][0]

def ingest():
    client = chromadb.PersistentClient(path="./chroma_db")
    # Delete any existing collection to avoid stale data. The exception
    # type raised for a missing collection varies across ChromaDB
    # versions, so catch broadly.
    try:
        client.delete_collection(COLLECTION_NAME)
    except Exception:
        pass
    collection = client.create_collection(
        name=COLLECTION_NAME,
        metadata={"hnsw:space": "cosine"},
    )

    all_ids = []
    all_embeddings = []
    all_documents = []
    all_metadatas = []

    for doc in load_advisories(ADVISORY_DIR):
        chunks = chunk_text(doc["text"])
        print(f"{doc['filename']}: {len(chunks)} chunks")
        for i, chunk in enumerate(chunks):
            chunk_id = f"{doc['filename']}::chunk{i}"
            embedding = embed(chunk)
            all_ids.append(chunk_id)
            all_embeddings.append(embedding)
            all_documents.append(chunk)
            all_metadatas.append({
                "source": doc["filename"],
                "chunk_index": i,
            })

    collection.add(
        ids=all_ids,
        embeddings=all_embeddings,
        documents=all_documents,
        metadatas=all_metadatas,
    )
    print(f"\nIngested {len(all_ids)} chunks into '{COLLECTION_NAME}'")

if __name__ == "__main__":
    ingest()

Run the ingestion:
python ingest.py

Expected output:
CVE-2023-44487.txt: 3 chunks
CVE-2024-3094.txt: 4 chunks
CVE-2024-6387.txt: 3 chunks
Ingested 10 chunks into 'security_advisories'

Tip
Chunk size is a tradeoff. Smaller chunks (200-300 chars) improve retrieval precision — each chunk covers one idea, so the best match is more likely to be exactly what was asked about. Larger chunks (800-1000 chars) preserve more context, which helps the LLM generate better answers. 500 characters is a reasonable starting point. Tune based on your documents and query patterns.
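To make the tradeoff concrete, here is a small standalone sketch. It reproduces the same sliding-window logic as chunk_text in ingest.py so it runs on its own, and counts how many chunks a 1,200-character document yields at different sizes:

```python
def chunk_text(text, size, overlap):
    """Sliding-window chunker, same logic as ingest.py."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

# A 1,200-character stand-in for an advisory body.
sample = "x" * 1200

for size in (300, 500, 1000):
    n = len(chunk_text(sample, size=size, overlap=50))
    print(f"size={size}: {n} chunks")

# Prints:
#   size=300: 5 chunks
#   size=500: 3 chunks
#   size=1000: 2 chunks
```

Because the window advances by size minus overlap, doubling the chunk size roughly halves the number of chunks: fewer, coarser candidates for the similarity search to rank.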
Step 5: Query the pipeline
Now build the retrieval and generation side. The query function embeds the user’s question, searches ChromaDB for the most similar chunks, and passes them to the chat model as context.
Create query.py:
import chromadb
import ollama

COLLECTION_NAME = "security_advisories"

def retrieve(question, n_results=3):
    """Find the most relevant chunks for a question."""
    client = chromadb.PersistentClient(path="./chroma_db")
    collection = client.get_collection(COLLECTION_NAME)
    query_embedding = ollama.embed(
        model="nomic-embed-text", input=question
    )["embeddings"][0]
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
    )
    return results["documents"][0], results["metadatas"][0]

def generate(question, context_chunks):
    """Send the question and retrieved context to the LLM."""
    context = "\n\n---\n\n".join(context_chunks)
    prompt = f"""You are a security analyst assistant. Answer the question
using only the context provided below. If the context doesn't contain
enough information to answer, say so — do not guess.

Context:
{context}

Question: {question}"""
    response = ollama.chat(
        model="llama3.2",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["message"]["content"]

def ask(question):
    """Full RAG pipeline: retrieve context, then generate an answer."""
    print(f"Question: {question}\n")
    chunks, metadata = retrieve(question)
    print("Retrieved from:")
    for meta in metadata:
        print(f"  - {meta['source']} (chunk {meta['chunk_index']})")
    print()
    answer = generate(question, chunks)
    print(f"Answer:\n{answer}\n")
    return answer

if __name__ == "__main__":
    ask("How can I detect if my system is affected by the XZ backdoor?")

Run it:
python query.py

Step 6: Test retrieval quality
A RAG pipeline is only as good as its retrieval. Run several queries to understand where it works and where it breaks.
Query that works well — directly matches advisory content:
ask("What is the mitigation for the HTTP/2 Rapid Reset attack?")

Retrieved from:
- CVE-2023-44487.txt (chunk 2)
- CVE-2023-44487.txt (chunk 1)
- CVE-2024-6387.txt (chunk 2)
Answer:
Based on the advisory, mitigation for CVE-2023-44487 includes applying
vendor-specific patches, configuring HTTP/2 stream limits and rate
limiting, monitoring for abnormal RST_STREAM frame rates, and limiting
maximum concurrent streams per connection to 100 or fewer.

The retrieval correctly finds the relevant advisory and the model grounds its answer in the mitigation section.
Query that exposes limitations — requires reasoning across documents:
ask("Which of these vulnerabilities could be used for initial access?")

This kind of comparative question is harder. The retrieval returns chunks from different advisories, but each chunk lacks the full context of its source document. The model may miss connections or fail to compare effectively. This isn’t a flaw in your code — it’s a fundamental limitation of chunk-based retrieval.
Strategies to improve this:
| Problem | Solution |
|---|---|
| Retrieved chunks lack context | Increase chunk size or add overlap |
| Wrong chunks retrieved | Improve chunking boundaries (split on paragraphs, not character count) |
| Model ignores context | Use a stronger prompt or a larger model |
| Not enough documents retrieved | Increase n_results (but watch the context window) |
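The chunking fix in the table, splitting on paragraph boundaries rather than a fixed character count, can be sketched as an alternative to chunk_text. Note that chunk_by_paragraph is a hypothetical variant for illustration, not part of the tutorial's scripts:

```python
def chunk_by_paragraph(text, max_size=500):
    """Pack whole paragraphs into chunks of at most max_size characters.

    A paragraph longer than max_size becomes its own (oversized) chunk,
    so no paragraph is ever split mid-sentence.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_size:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

# A toy document: short header, one long body paragraph, short footer.
doc = "Severity: High\n\n" + ("A" * 400) + "\n\nMitigation: upgrade."
for c in chunk_by_paragraph(doc, max_size=300):
    print(len(c), repr(c[:30]))
```

With a 300-character budget, the long middle paragraph exceeds the limit and is kept whole as its own chunk, while the header and footer land in separate chunks; each chunk stays a coherent semantic unit.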
Warning
Context window limits: Every chunk you inject into the prompt consumes tokens from the model’s context window. llama3.2 has a 128k context window, but generation quality degrades well before you hit the limit. Keep retrieved context under 2,000 tokens (roughly 8,000 characters) for reliable results.
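One way to enforce that budget is to trim the retrieved chunks before building the prompt. The 4-characters-per-token figure is a rough heuristic (real tokenizers vary), and trim_to_budget is an illustrative helper, not part of query.py:

```python
CHARS_PER_TOKEN = 4  # rough heuristic; actual token counts vary by model

def trim_to_budget(chunks, max_tokens=2000):
    """Keep retrieved chunks, in ranked order, until the budget is spent."""
    budget = max_tokens * CHARS_PER_TOKEN
    kept, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > budget:
            break  # drop this chunk and everything ranked below it
        kept.append(chunk)
        used += len(chunk)
    return kept

# Three 3,000-character chunks against an ~8,000-character budget:
chunks = ["a" * 3000, "b" * 3000, "c" * 3000]
print(len(trim_to_budget(chunks, max_tokens=2000)))  # prints 2
```

Because chunks arrive ranked by similarity, truncating from the tail discards the least relevant context first.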
The complete pipeline
Here’s the full pipeline consolidated into a single script. This combines ingestion and querying with a simple command-line interface.
Create rag.py:
import os
import sys

import chromadb
import ollama

ADVISORY_DIR = "advisories"
COLLECTION_NAME = "security_advisories"
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50
EMBED_MODEL = "nomic-embed-text"
CHAT_MODEL = "llama3.2"
TOP_K = 3

def chunk_text(text, size=CHUNK_SIZE, overlap=CHUNK_OVERLAP):
    chunks = []
    start = 0
    while start < len(text):
        end = start + size
        chunks.append(text[start:end])
        start += size - overlap
    return chunks

def embed(text):
    return ollama.embed(model=EMBED_MODEL, input=text)["embeddings"][0]

def ingest():
    client = chromadb.PersistentClient(path="./chroma_db")
    # The exception type for a missing collection varies across
    # ChromaDB versions, so catch broadly.
    try:
        client.delete_collection(COLLECTION_NAME)
    except Exception:
        pass
    collection = client.create_collection(
        name=COLLECTION_NAME,
        metadata={"hnsw:space": "cosine"},
    )
    ids, embeddings, documents, metadatas = [], [], [], []
    for filename in sorted(os.listdir(ADVISORY_DIR)):
        if not filename.endswith(".txt"):
            continue
        with open(os.path.join(ADVISORY_DIR, filename)) as f:
            text = f.read().strip()
        chunks = chunk_text(text)
        print(f"{filename}: {len(chunks)} chunks")
        for i, chunk in enumerate(chunks):
            ids.append(f"{filename}::chunk{i}")
            embeddings.append(embed(chunk))
            documents.append(chunk)
            metadatas.append({"source": filename, "chunk_index": i})
    collection.add(
        ids=ids,
        embeddings=embeddings,
        documents=documents,
        metadatas=metadatas,
    )
    print(f"\nIngested {len(ids)} chunks into '{COLLECTION_NAME}'")

def ask(question):
    client = chromadb.PersistentClient(path="./chroma_db")
    collection = client.get_collection(COLLECTION_NAME)
    query_embedding = embed(question)
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=TOP_K,
    )
    chunks = results["documents"][0]
    sources = results["metadatas"][0]
    print("Retrieved from:")
    for s in sources:
        print(f"  - {s['source']} (chunk {s['chunk_index']})")
    print()
    context = "\n\n---\n\n".join(chunks)
    prompt = f"""You are a security analyst assistant. Answer the question
using only the context provided below. If the context doesn't contain
enough information to answer, say so — do not guess.

Context:
{context}

Question: {question}"""
    response = ollama.chat(
        model=CHAT_MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    print(response["message"]["content"])

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage:")
        print("  python rag.py ingest           Load advisories into ChromaDB")
        print('  python rag.py ask "question"   Query the pipeline')
        sys.exit(1)
    command = sys.argv[1]
    if command == "ingest":
        ingest()
    elif command == "ask":
        if len(sys.argv) < 3:
            print('Provide a question: python rag.py ask "your question"')
            sys.exit(1)
        ask(sys.argv[2])
    else:
        print(f"Unknown command: {command}")
        sys.exit(1)

Usage:
python rag.py ingest
python rag.py ask "What systems are affected by the XZ backdoor?"

Common mistakes
Chunk size too large. If your chunks are 2,000+ characters, retrieval becomes imprecise — every chunk contains multiple topics, so similarity search can’t distinguish between them. Start at 500 and adjust based on retrieval quality.
Wrong embedding dimensions. If you switch embedding models, you must re-ingest your documents. ChromaDB collections are locked to the dimensionality of the first embedding added. Delete the collection or create a new one after changing models.
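A cheap guard is to compare dimensionalities before querying, for example against one vector fetched from the collection (ChromaDB's collection.peek() can supply one). The sketch below is illustrative; the 768 and 1024 figures are example dimensions, not values your store necessarily contains:

```python
def check_embedding_dim(stored_vector, query_vector):
    """Fail fast if the query embedding doesn't match the stored dimensionality."""
    if len(stored_vector) != len(query_vector):
        raise ValueError(
            f"Embedding dimension mismatch: collection holds "
            f"{len(stored_vector)}-dim vectors, query is {len(query_vector)}-dim. "
            "Re-ingest with the same embedding model you query with."
        )

stored = [0.0] * 768   # e.g. one vector peeked from the collection
query = [0.0] * 1024   # e.g. a query embedded with a different model
try:
    check_embedding_dim(stored, query)
except ValueError as e:
    print(e)
```

Failing loudly at query time beats silently getting nonsense similarity scores, or a cryptic error from deep inside the vector store.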
Stale collections. If you update your advisory documents but don’t re-run ingestion, the vector store still contains the old chunks. The ingest() function above handles this by deleting the collection before re-creating it, but be aware of this if you modify the script.
Ignoring the context window. Retrieving 20 chunks and stuffing them all into the prompt can degrade answer quality or cause truncation. Retrieve 3-5 chunks and keep total context under a few thousand tokens.
Note
Treat your vector store like any other data store: ChromaDB persists data to disk in the chroma_db/ directory. In a production deployment, apply the same access controls you’d use for any database — restrict file permissions, don’t expose the directory to untrusted users, and back it up if the data is valuable.
Embedding model mismatch. Use the same model for ingestion and querying. If you embed documents with nomic-embed-text but query with a different model, the vectors live in incompatible spaces and similarity search returns garbage.
Next steps
This pipeline is deliberately minimal. Here are directions to extend it:
- Add metadata filtering. ChromaDB supports filtering by metadata fields. You could filter results by severity, date range, or affected product — useful when your corpus grows beyond a handful of advisories.
- Improve chunking. Split on paragraph boundaries instead of character count. This preserves semantic units and improves retrieval for documents with clear structure.
- Connect to a live feed. Replace the static advisory files with an ingestion script that pulls from a CVE feed, RSS source, or your Wazuh alert output. The pipeline pattern stays the same — only the document source changes.
- Evaluate systematically. Build a set of test questions with known answers and measure retrieval precision. RAG quality is hard to assess by gut feel.
- Understand adversarial risks. Once your pipeline retrieves from external or user-supplied documents, it becomes vulnerable to poisoning. The RAG Poisoning Simulator lets you test how injection payloads affect retrieval ranking and model output.
For the broader picture of running AI tooling in a security operations context — model selection, hardware requirements, and operational considerations — see Private AI in Your SOC: How to Run LLMs Locally.
To build intuition for how embeddings represent meaning and why similarity search works, explore the Embedding Space Explorer.