Local RAG with PDF Documents · Steven Foerster

The base RAG tutorial works with plain text files. That is fine for a proof of concept, but security teams deal with PDFs: vendor advisories shipped as formatted reports, audit findings, compliance frameworks like NIST SP 800-53, and incident postmortems. If your pipeline cannot ingest PDFs, it cannot answer questions about most of the documents your team actually produces.

This tutorial adds PDF support to the pipeline. You will parse PDFs with PyMuPDF, extract text with layout awareness, handle tables, pull metadata, and integrate everything with the existing ChromaDB ingestion flow. By the end, your pipeline ingests .txt and .pdf files from the same directory.

How PDF parsing fits the pipeline

The only thing that changes is the document source. Once text is extracted from a PDF, the rest of the pipeline (chunking, embedding, storing, querying) stays identical.

                  ┌──────────────┐
                  │  .txt files  │───── read ──────┐
                  └──────────────┘                  │
                                                    ▼
                  ┌──────────────┐          ┌──────────────┐
                  │  .pdf files  │── parse ─►│  Text + Meta │
                  └──────────────┘          └──────┬───────┘
                                                   │
                                              chunk + embed
                                                   │
                                            ┌──────▼───────┐
                                            │ Vector Store │
                                            │  (ChromaDB)  │
                                            └──────────────┘

The challenge is extraction quality. PDFs are a display format, not a data format. Text is positioned by coordinates, not by semantic structure. A paragraph that looks contiguous on screen might be stored as dozens of disconnected text fragments. Tables are especially problematic because rows and columns have no explicit structure in the PDF. A naive text extraction produces garbled output that confuses both the chunker and the embedding model.
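To make the coordinate problem concrete, here is a minimal sketch that rebuilds lines from positioned fragments. The (x, y, text) tuples, the rebuild_lines helper, and the line_tolerance value are illustrative assumptions, not PyMuPDF's data model; real parsers use more elaborate heuristics.

```python
def rebuild_lines(fragments, line_tolerance=2.0):
    """Group (x, y, text) fragments into lines by y, then order each line by x.

    Hypothetical helper for illustration. Coordinates are in points, with y
    growing downward.
    """
    lines = []  # each entry: [reference_y, [(x, text), ...]]
    for x, y, text in sorted(fragments, key=lambda f: (f[1], f[0])):
        if lines and abs(y - lines[-1][0]) <= line_tolerance:
            # Close enough vertically: same visual line.
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    return [" ".join(t for _, t in sorted(parts)) for _, parts in lines]


# Fragments in storage order, which differs from reading order.
fragments = [
    (200.0, 100.5, "escape"),
    (72.0, 114.0, "Severity: High"),
    (72.0, 100.0, "runc container"),
]
rebuilt = rebuild_lines(fragments)
print(rebuilt)
```

Reading the fragments in storage order would put "escape" first. Sorting by position recovers the two lines, but only because this toy page has a single column.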

Step 1: Install PyMuPDF

pip install pymupdf

PyMuPDF (imported as fitz) is fast, has no system dependencies beyond Python, and handles most PDF features well. It extracts text, images, and metadata, and provides page-level and block-level access to content.

Other options exist. pdfplumber is better at table extraction. unstructured handles more document formats but is heavier. For this tutorial, PyMuPDF covers enough ground to build a working pipeline and you can swap in a more specialized parser later if needed.

Verify the install:

import fitz
print(fitz.__doc__)

Step 2: Extract text from a PDF

Create pdf_loader.py with a basic extraction function:

import fitz
import os


def extract_text_from_pdf(filepath):
    """Extract text from a PDF, preserving page structure."""
    doc = fitz.open(filepath)
    pages = []

    for page_num, page in enumerate(doc):
        text = page.get_text("text")
        if text.strip():
            pages.append({
                "page": page_num + 1,
                "text": text.strip(),
            })

    doc.close()
    return pages

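The "text" option in get_text() returns plain text with line breaks. By default, PyMuPDF emits text in the order it is stored in the file, which usually matches reading order for simple single-column documents but can come out scrambled for multi-column layouts and sidebars. Passing sort=True (page.get_text("text", sort=True)) orders the output top to bottom, then left to right, which helps with some layouts.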
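When the extracted order does not match the layout, one option is to pull positioned blocks with get_text("blocks") and order them yourself. The sketch below assumes a simple two-column page split at the horizontal midpoint; order_blocks_two_columns and the midpoint rule are illustrative and not part of PyMuPDF. It works on plain tuples shaped like the items get_text("blocks") returns, so it can be tried without a PDF.

```python
def order_blocks_two_columns(blocks, page_width):
    """Order text blocks left column first, then right column.

    blocks: tuples shaped like page.get_text("blocks") items:
    (x0, y0, x1, y1, text, block_no, block_type).
    Assumes a two-column page split at the horizontal midpoint.
    """
    midpoint = page_width / 2
    text_blocks = [b for b in blocks if b[6] == 0]  # 0 = text, 1 = image
    left = sorted(
        (b for b in text_blocks if b[0] < midpoint), key=lambda b: b[1]
    )
    right = sorted(
        (b for b in text_blocks if b[0] >= midpoint), key=lambda b: b[1]
    )
    return [b[4].strip() for b in left + right]


# Synthetic blocks in an order that interleaves the two columns.
blocks = [
    (320.0, 100.0, 540.0, 200.0, "Right column, top\n", 0, 0),
    (72.0, 100.0, 290.0, 200.0, "Left column, top\n", 1, 0),
    (72.0, 220.0, 290.0, 300.0, "Left column, bottom\n", 2, 0),
    (320.0, 220.0, 540.0, 300.0, "<image>", 3, 1),
]
ordered = order_blocks_two_columns(blocks, page_width=612)
print(ordered)
```

With a real page, you would pass page.get_text("blocks") and page.rect.width. Full-width headings land in the left group under this rule, so check the result on your own documents.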

You will create a sample PDF in the next step. After that file exists, test the extractor with:

pages = extract_text_from_pdf("advisories/CVE-2024-21626.pdf")
for p in pages:
    print(f"--- Page {p['page']} ---")
    print(p["text"][:200])
    print()

Step 3: Create sample PDF documents

For testing, create a Python script that generates sample advisory PDFs. In production you would use real vendor reports, but controlled test documents let you verify extraction quality.

Create create_sample_pdfs.py:

import fitz
import os


def create_advisory_pdf(filename, title, content):
    """Create a simple PDF advisory document."""
    doc = fitz.open()
    page = doc.new_page()

    # Title
    title_rect = fitz.Rect(72, 72, 540, 120)
    page.insert_textbox(
        title_rect, title,
        fontsize=16, fontname="helv",
    )

    # Body
    body_rect = fitz.Rect(72, 140, 540, 720)
    page.insert_textbox(
        body_rect, content,
        fontsize=11, fontname="helv",
    )

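    # Create the output directory only if the path includes one;
    # os.makedirs("") raises FileNotFoundError.
    out_dir = os.path.dirname(filename)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)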
    doc.save(filename)
    doc.close()


advisory_content = """CVE-2024-21626: runc Container Escape

Severity: High (CVSS 8.6)
Affected versions: runc 1.0.x, 1.1.0 through 1.1.11
Fixed in: runc 1.1.12

A container escape vulnerability was discovered in runc, the low-level \
container runtime used by Docker, Kubernetes, and other container \
platforms. The vulnerability allows an attacker with control over a \
container image or the ability to run arbitrary commands inside a \
container to escape to the host filesystem.

The flaw exists in how runc handles the working directory (cwd) \
specification. By setting the working directory to a path under \
/proc/self/fd/, an attacker can access file descriptors that reference \
the host filesystem. This allows reading and writing arbitrary files on \
the host.

Impact: An attacker who can create or modify a container image, or who \
has command execution inside a container, can break out of the container \
boundary and compromise the host system. This affects multi-tenant \
container environments where untrusted workloads share a host.

Mitigation: Upgrade runc to version 1.1.12 or later. If immediate \
upgrade is not possible, restrict who can create container images and \
monitor for containers with unusual working directory configurations. \
Use seccomp profiles to limit syscall access from containers."""

create_advisory_pdf(
    "advisories/CVE-2024-21626.pdf",
    "CVE-2024-21626: runc Container Escape",
    advisory_content,
)

print("Created advisories/CVE-2024-21626.pdf")

python create_sample_pdfs.py

Step 4: Layout-aware chunking

The base tutorial chunks text by character count with a fixed overlap. This works for uniform text but creates poor chunks when a PDF has headings, bullet lists, or tables. A heading split across two chunks loses its association with the content it introduces.

Add a paragraph-aware chunker to pdf_loader.py:

def chunk_by_paragraphs(text, max_chunk_size=500, overlap=50):
    """Split text on paragraph boundaries, merging small paragraphs."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

    chunks = []
    current_chunk = []
    current_length = 0

    for para in paragraphs:
        para_length = len(para)

        if current_length + para_length > max_chunk_size and current_chunk:
            chunk_text = "\n\n".join(current_chunk)
            chunks.append(chunk_text)

            # Keep the last paragraph as overlap
            if overlap > 0 and current_chunk:
                last = current_chunk[-1]
                current_chunk = [last] if len(last) <= overlap * 2 else []
                current_length = len(last) if current_chunk else 0
            else:
                current_chunk = []
                current_length = 0

        current_chunk.append(para)
        current_length += para_length

    if current_chunk:
        chunks.append("\n\n".join(current_chunk))

    return chunks

This chunker:

Splits on double newlines (paragraph boundaries).
Merges short paragraphs until the chunk approaches max_chunk_size.
Carries the last paragraph forward as overlap when it is short (at most twice the overlap value), which keeps headings attached to the content they introduce.
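One gap in the chunker above: a single paragraph longer than max_chunk_size is never split, so it becomes one oversized chunk. A minimal sketch of a fallback follows. split_long_paragraph is a hypothetical helper, not part of the tutorial's pdf_loader.py, and its sentence regex is a rough approximation that will mis-split abbreviations such as "e.g.".

```python
import re


def split_long_paragraph(paragraph, max_chunk_size=500):
    """Split one paragraph into pieces no longer than max_chunk_size.

    Breaks on sentence boundaries where possible. A single sentence longer
    than the limit is hard-split by character count as a last resort.
    """
    sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    pieces, current = [], ""
    for sentence in sentences:
        # Hard-split any sentence that cannot fit in a chunk on its own.
        while len(sentence) > max_chunk_size:
            if current:
                pieces.append(current)
                current = ""
            pieces.append(sentence[:max_chunk_size])
            sentence = sentence[max_chunk_size:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) > max_chunk_size and current:
            # Adding this sentence would overflow: close the current piece.
            pieces.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces


text = "A" * 30 + ". " + "B" * 30 + ". " + "C" * 30 + "."
pieces = split_long_paragraph(text, max_chunk_size=70)
print([len(p) for p in pieces])
```

One way to wire it in would be to run each paragraph through this helper before the merge loop in chunk_by_paragraphs, so no single paragraph can exceed the limit.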

Compare the two approaches on the same text:

# Character chunking (from base tutorial)
def chunk_text(text, size=500, overlap=50):
    chunks = []
    start = 0
    while start < len(text):
        end = start + size
        chunks.append(text[start:end])
        start += size - overlap
    return chunks

sample = extract_text_from_pdf("advisories/CVE-2024-21626.pdf")
text = sample[0]["text"]

char_chunks = chunk_text(text)
para_chunks = chunk_by_paragraphs(text)

print(f"Character chunking: {len(char_chunks)} chunks")
print(f"Paragraph chunking: {len(para_chunks)} chunks")

Paragraph chunking typically produces fewer, more semantically coherent chunks. Each chunk contains complete thoughts rather than arbitrary slices of text.
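Chunk counts alone hide how uneven the chunks are. The sketch below summarizes chunk lengths so the two strategies can be compared side by side; chunk_stats is a hypothetical helper introduced here for illustration.

```python
from statistics import mean


def chunk_stats(chunks):
    """Summarize chunk lengths (in characters) for a list of chunks."""
    lengths = [len(c) for c in chunks]
    if not lengths:
        return {"count": 0, "min": 0, "max": 0, "mean": 0.0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": round(mean(lengths), 1),
    }


# Synthetic chunks standing in for real chunker output.
stats = chunk_stats(["a" * 500, "b" * 500, "c" * 120])
print(stats)
```

Calling chunk_stats(char_chunks) and chunk_stats(para_chunks) on the lists from the comparison above shows the spread each chunker produces on your documents.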

Tip

When to use which chunker. Character chunking is predictable: every chunk is roughly the same size, which makes retrieval scoring consistent. Paragraph chunking produces better chunks but with variable sizes. For security advisories with clear section structure (Description, Impact, Mitigation), paragraph chunking is almost always better. For dense, unstructured text (log files, raw output), character chunking is safer.

Step 5: Handle tables

Tables in PDFs are the hardest extraction problem. PyMuPDF can find tables on a page, but the quality varies with the PDF’s internal structure.

Warning

Set realistic expectations for find_tables(). PyMuPDF’s page.find_tables() is a heuristic. In our internal benchmarks against vendor advisories and compliance documents, it tends to land in roughly these buckets:

Tables with explicit ruling lines (most NIST SP 800-53 / DISA STIG style docs): cell-level extraction is usually solid.

Tables built from whitespace alignment (most CVSS bulletins, PCI ASV reports): cells are detected, but column boundaries shift on rows where text wraps. Plan to spot-check.

Multi-page tables: the header on page N+1 is rarely re-detected; rows are extracted but lose their column labels.

Tables with merged cells: extract() returns the un-merged grid with None in skipped cells. The pipe-formatted output below renders these as empty cells, which can confuse the embedding model.

If your corpus is heavy on tabular data, use a dedicated extractor as the second pass: pdfplumber for ruled tables, camelot for whitespace tables, or tabula-py for documents with strong column structure. None of them are silver bullets either, but on hard documents the extraction quality differential is large enough to matter.
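For the merged-cell case, one possible cleanup is to forward-fill None cells from the row above before formatting. This is a sketch under a strong assumption: that a None cell means "same value as the cell above", which holds for vertically merged cells but not for cells that are genuinely empty. fill_merged_cells is a hypothetical helper and the sample rows are made up.

```python
def fill_merged_cells(rows):
    """Forward-fill None cells from the row above (vertical merges only).

    rows: a list of lists, shaped like the output of table.extract().
    None cells in the first row stay None because there is nothing above.
    """
    filled = []
    for row in rows:
        previous = filled[-1] if filled else [None] * len(row)
        new_row = []
        for i, cell in enumerate(row):
            if cell is None and i < len(previous):
                cell = previous[i]  # inherit the value from the row above
            new_row.append(cell)
        filled.append(new_row)
    return filled


rows = [
    ["Control", "Family", "Baseline"],
    ["AC-2", "Access Control", "Low"],
    ["AC-3", None, "Moderate"],  # "Family" cell merged with the row above
]
filled = fill_merged_cells(rows)
print(filled[2])
```

Spot-check the result against the source PDF: on tables with horizontally merged or truly blank cells, this fill would invent values.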

Warning

get_text() does not OCR scanned PDFs. If a “PDF” is actually a scanned image (every page is one big bitmap, copy-paste returns nothing), get_text() returns an empty string and find_tables() finds nothing, and both do so silently. To detect this, check whether len(page.get_text().strip()) < 20 holds for most pages of the document; if it does, you are looking at a scan. To get text out, render each page as a 300 DPI image (page.get_pixmap(dpi=300)), run Tesseract over it, and feed the OCR’d text into the same chunking pipeline. OCR introduces its own error class (substituted characters, broken word boundaries) that affects retrieval quality, so flag OCR’d documents in metadata so you can audit them later.
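The detection rule above can be sketched as a small function over per-page text. looks_scanned is a hypothetical helper, and both thresholds (20 characters per page, 80% of pages) are starting points to tune rather than established values.

```python
def looks_scanned(page_texts, min_chars=20, scanned_ratio=0.8):
    """Return True when most pages yield almost no extractable text.

    page_texts: one string per page, for example
    [page.get_text() for page in doc] with PyMuPDF.
    """
    if not page_texts:
        return False
    sparse = sum(1 for text in page_texts if len(text.strip()) < min_chars)
    return sparse / len(page_texts) >= scanned_ratio


# Synthetic examples: an all-empty document and a mostly-text document.
print(looks_scanned(["", " \n", "3", ""]))
print(looks_scanned(["A" * 300, "", "B" * 200]))
```

A ratio is used instead of requiring every page to be empty because scans often carry a text cover page, and text PDFs often contain a few blank or image-only pages.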

Note

The sample PDF has no tables. The advisory PDF you generated in Step 3 is text-only, so the table extractor will return an empty list when run against it. To see table extraction work, test this code against a real vendor advisory or compliance report (NIST SP 800-53, a CVSS bulletin with a CPE table, etc.). The integration in Step 7 still works correctly on text-only PDFs: it simply produces no table blocks.

Add table extraction to pdf_loader.py:

def extract_tables_from_page(page):
    """Extract tables from a PDF page as formatted text."""
    tables = page.find_tables()
    extracted = []

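    for table in tables:
        rows = table.extract()
        if not rows:
            continue

        # Collapse line breaks inside wrapped cells so each table row stays
        # on one output line; None (merged or empty cells) becomes "".
        cleaned = [
            [" ".join(str(cell or "").split()) for cell in row]
            for row in rows
        ]

        # Format as readable text: header, dashed rule, then data rows
        lines = [" | ".join(cleaned[0])]
        lines.append("-" * len(lines[0]))
        lines.extend(" | ".join(row) for row in cleaned[1:])

        extracted.append("\n".join(lines))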

    return extracted

This converts each table into a pipe-separated text format that the embedding model can process. The format preserves column relationships without relying on visual alignment.
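Since the sample PDF has no tables, here is a standalone sketch of the pipe layout described above, run on hand-written rows. rows_to_pipe_text is a hypothetical stand-in that takes rows directly instead of a PDF page, and the sample rows are made up.

```python
def rows_to_pipe_text(rows):
    """Format rows (lists of cell values) as pipe-separated text."""
    header = " | ".join(str(cell or "") for cell in rows[0])
    body = [" | ".join(str(cell or "") for cell in row) for row in rows[1:]]
    # Header line, dashed rule of the same width, then one line per row.
    return "\n".join([header, "-" * len(header), *body])


rows = [
    ["Version", "Status", "Fixed in"],
    ["runc 1.1.11", "Vulnerable", "1.1.12"],
    ["runc 1.0.x", None, "1.1.12"],  # None, as returned for a merged cell
]
table_text = rows_to_pipe_text(rows)
print(table_text)
```

The None cell shows up as an empty slot between two pipes, which is the merged-cell artifact mentioned in the warning earlier in this step.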

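Integrate table extraction into the main loader. This function supersedes extract_text_from_pdf from Step 2: it returns structured blocks (text and tables) with page numbers, which is what the rest of the pipeline needs. Because get_text() also returns the text inside table cells, table content appears twice: once in the page’s text block and once as a formatted table block. You can delete extract_text_from_pdf now, or keep it around for quick debugging.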

def extract_pdf_content(filepath):
    """Extract text and tables from a PDF as page-tagged blocks."""
    doc = fitz.open(filepath)
    content_blocks = []

    for page_num, page in enumerate(doc):
        # Extract main text
        text = page.get_text("text").strip()
        if text:
            content_blocks.append({
                "type": "text",
                "page": page_num + 1,
                "content": text,
            })

        # Extract tables
        for table_text in extract_tables_from_page(page):
            content_blocks.append({
                "type": "table",
                "page": page_num + 1,
                "content": table_text,
            })

    doc.close()
    return content_blocks

Warning

Table extraction is best-effort. PyMuPDF’s table detection works well for PDFs with explicit table structures (borders, gridlines). For PDFs where tables are just aligned text with no borders, extraction will be partial or wrong. Always inspect the output on representative documents before trusting it in your pipeline.

Step 6: Extract PDF metadata

PDFs carry metadata: title, author, creation date, and keywords. Store this in ChromaDB as filterable fields.

def extract_pdf_metadata(filepath):
    """Pull metadata from a PDF file."""
    doc = fitz.open(filepath)
    meta = doc.metadata

    result = {
        "source": os.path.basename(filepath),
        "title": meta.get("title", ""),
        "author": meta.get("author", ""),
        "created": meta.get("creationDate", ""),
        "page_count": len(doc),
    }

    doc.close()
    return {k: v for k, v in result.items() if v}

Not all PDFs have metadata populated. The source field (filename) is always available as a fallback.
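The creationDate value arrives in PDF date syntax (for example D:20240131143005+01'00'), which is awkward to read or compare. Below is a minimal sketch that converts it to ISO 8601 before storing. pdf_date_to_iso is a hypothetical helper, and the regex covers the common form of the syntax, not every variant found in the wild.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSS followed by an optional Z or +HH'mm' / -HH'mm' offset.
# Every field after the year is optional.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([+\-Z])(?:(\d{2})'?(\d{2})?'?)?)?"
)


def pdf_date_to_iso(value):
    """Convert a PDF date string to ISO 8601, or return "" if unparseable."""
    match = _PDF_DATE.match(value or "")
    if not match:
        return ""
    year, month, day, hour, minute, second, sign, tz_h, tz_m = match.groups()
    try:
        parsed = datetime(
            int(year), int(month or 1), int(day or 1),
            int(hour or 0), int(minute or 0), int(second or 0),
        )
        if sign:
            offset = timedelta(hours=int(tz_h or 0), minutes=int(tz_m or 0))
            if sign == "-":
                offset = -offset
            parsed = parsed.replace(tzinfo=timezone(offset))
    except ValueError:
        # Out-of-range fields in a malformed file, such as month 13.
        return ""
    return parsed.isoformat()


created = pdf_date_to_iso("D:20240131143005+01'00'")
print(created)
```

In extract_pdf_metadata, this could wrap the meta.get("creationDate", "") value. An empty result is then dropped by the existing filter on falsy values.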

Step 7: Integrate with the pipeline

Create a PDF-aware ingest.py that handles both .txt and .pdf files. If you are following from the consolidated rag.py in the base tutorial, use this file for ingestion from this point forward and keep using python rag.py ask for querying. Both scripts read and write the same security_advisories collection in ./chroma_db.

import os
import chromadb
import ollama
from pdf_loader import (
    extract_pdf_content,
    extract_pdf_metadata,
    chunk_by_paragraphs,
)

ADVISORY_DIR = "advisories"
COLLECTION_NAME = "security_advisories"
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50
EMBED_MODEL = "nomic-embed-text"


def chunk_text(text, size=CHUNK_SIZE, overlap=CHUNK_OVERLAP):
    """Character-based chunking for plain text files."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + size
        chunks.append(text[start:end])
        start += size - overlap
    return chunks


def embed(text):
    return ollama.embed(model=EMBED_MODEL, input=text)["embeddings"][0]


def load_txt(filepath):
    """Load a plain text file and return chunks with metadata."""
    filename = os.path.basename(filepath)
    # Read as UTF-8 explicitly; the platform default encoding varies.
    with open(filepath, encoding="utf-8") as f:
        text = f.read().strip()

    chunks = chunk_text(text)
    return [
        {
            "text": chunk,
            "metadata": {"source": filename, "chunk_index": i},
        }
        for i, chunk in enumerate(chunks)
    ]


def load_pdf(filepath):
    """Load a PDF and return chunks with page-level metadata."""
    filename = os.path.basename(filepath)
    pdf_meta = extract_pdf_metadata(filepath)
    content_blocks = extract_pdf_content(filepath)

    # Chunk each block independently so page numbers survive into metadata.
    results = []
    chunk_index = 0
    for block in content_blocks:
        for chunk in chunk_by_paragraphs(block["content"]):
            results.append({
                "text": chunk,
                "metadata": {
                    **pdf_meta,
                    "source": filename,
                    "chunk_index": chunk_index,
                    "page": block["page"],
                    "block_type": block["type"],
                },
            })
            chunk_index += 1

    return results


def ingest():
    client = chromadb.PersistentClient(path="./chroma_db")
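    try:
        client.delete_collection(COLLECTION_NAME)
    except Exception:
        # Collection does not exist yet. The exception type varies across
        # chromadb releases (ValueError in older ones), so catch broadly.
        pass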

    collection = client.create_collection(
        name=COLLECTION_NAME,
        metadata={"hnsw:space": "cosine"},
    )

    ids, embeddings, documents, metadatas = [], [], [], []

    for filename in sorted(os.listdir(ADVISORY_DIR)):
        filepath = os.path.join(ADVISORY_DIR, filename)

        # Compare extensions case-insensitively so REPORT.PDF is not skipped
        if filename.lower().endswith(".txt"):
            chunks = load_txt(filepath)
        elif filename.lower().endswith(".pdf"):
            chunks = load_pdf(filepath)
        else:
            continue

        print(f"{filename}: {len(chunks)} chunks")

        for chunk in chunks:
            chunk_id = f"{filename}::chunk{chunk['metadata']['chunk_index']}"
            ids.append(chunk_id)
            embeddings.append(embed(chunk["text"]))
            documents.append(chunk["text"])
            metadatas.append(chunk["metadata"])

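    # collection.add() rejects empty lists, so stop early if nothing loaded
    if not ids:
        print(f"No .txt or .pdf content found in {ADVISORY_DIR}")
        return

    collection.add(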
        ids=ids,
        embeddings=embeddings,
        documents=documents,
        metadatas=metadatas,
    )
    print(f"\nIngested {len(ids)} chunks into '{COLLECTION_NAME}'")


if __name__ == "__main__":
    ingest()

Run it:

python ingest.py

You should see both .txt and .pdf files being processed. When your corpus includes PDFs, run python ingest.py instead of python rag.py ingest so the PDF loader is used. The query interface from the base tutorial works without modification since it only interacts with ChromaDB.
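Because each PDF chunk now carries source, page, and block_type metadata, queries can optionally be narrowed with ChromaDB's where filter. The sketch below builds such a filter. build_where is a hypothetical helper, and the $eq and $and operators follow Chroma's documented filter syntax, so verify them against your installed version.

```python
def build_where(**filters):
    """Build a ChromaDB `where` filter from keyword arguments.

    None values are skipped. A single condition is returned as-is;
    several conditions are combined under $and.
    """
    conditions = [
        {key: {"$eq": value}}
        for key, value in filters.items()
        if value is not None
    ]
    if not conditions:
        return None  # no filtering
    if len(conditions) == 1:
        return conditions[0]
    return {"$and": conditions}


where = build_where(source="CVE-2024-21626.pdf", block_type="text", page=None)
print(where)

# Then, with the collection and embed() from ingest.py:
# results = collection.query(
#     query_embeddings=[embed("How do I mitigate the runc escape?")],
#     n_results=3,
#     where=where,
# )
```

Chunks loaded from .txt files have no page or block_type fields, so a filter on those fields excludes them.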

Common mistakes

Using get_text() without a flag. page.get_text() defaults to "text" mode, which is usually correct. But for PDFs with complex layouts (multi-column, sidebars), "blocks" mode gives you more control over reading order. Test both on your specific documents.

Ignoring scanned PDFs. If a PDF contains scanned images instead of text (common in older documents), get_text() returns an empty string. You need OCR for these. PyMuPDF does not run OCR by default; its optional OCR support depends on a separate Tesseract installation, or you can add pytesseract or easyocr as a separate step. Check for empty extraction results and warn the user.

Embedding table markup. Pipe-separated table text embeds differently than prose. If your queries are natural language questions but your chunks contain column1 | column2 | column3, similarity scores will be lower. Consider prepending a natural language summary to table chunks: “Table showing affected versions and their fix status.”

Chunk size mismatch between formats. If you use character chunking for .txt at 500 characters and paragraph chunking for .pdf at 500 characters, the PDF chunks may be significantly larger (paragraphs do not split mid-sentence). This is fine for retrieval quality but watch the total context size when multiple large PDF chunks are retrieved.
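For the table-markup point above, here is a minimal sketch of the summary prefix, built from the table's own header row. describe_table is a hypothetical helper that expects the pipe-separated format from Step 5 (header line, dashed rule, data rows). The wording of the summary is an arbitrary choice to adapt to your queries.

```python
def describe_table(table_text, source, page):
    """Prepend a one-sentence description derived from the header row."""
    lines = table_text.splitlines()
    columns = []
    if lines:
        columns = [c.strip() for c in lines[0].split("|") if c.strip()]
    row_count = max(len(lines) - 2, 0)  # minus header and dashed rule
    if columns:
        summary = (
            f"Table from {source}, page {page}, with {row_count} rows "
            f"and columns: {', '.join(columns)}."
        )
    else:
        summary = f"Table from {source}, page {page}."
    # A single newline keeps the summary in the same paragraph as the
    # table, so chunk_by_paragraphs will not separate them.
    return f"{summary}\n{table_text}"


table_text = (
    "Version | Status\n"
    "----------------\n"
    "1.1.11 | Vulnerable\n"
    "1.1.12 | Fixed"
)
described = describe_table(table_text, "advisory.pdf", 2)
print(described)
```

In load_pdf, this could be applied to blocks whose type is "table" before they are chunked. A summary that names the columns gives natural-language queries something to match.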

Note

PDF extraction is never perfect. Every PDF parser makes tradeoffs. PyMuPDF prioritizes speed and reliability over edge-case accuracy. If you work with a specific vendor’s report format regularly, test extraction on several examples and adjust the parsing logic. The chunker and embedding model can compensate for minor extraction noise, but garbage in still means garbage out.

Next steps

Connect to live data sources. The Connect Your RAG Pipeline to Live CVE Feeds tutorial shows how to pull from the NVD API and OSV.dev instead of relying on static files.
Add a chat interface to your pipeline. See Add a Chat Interface to Your Local RAG Pipeline.
Experiment with chunking parameters in the RAG Pipeline Playground to understand how different chunk sizes affect retrieval for your specific documents.
Understand adversarial risks. Once your pipeline ingests documents from multiple sources, poisoning becomes a concern. See RAG Poisoning.