Tutorial

Add a Chat Interface to Your Local RAG Pipeline

Wrap your local RAG pipeline in a Streamlit chat UI with conversation history, streaming responses, and source citations that show where every answer came from.

7 min read intermediate

Prerequisites

  • Tutorial: Build a Local RAG Pipeline with Ollama and ChromaDB
  • Basic Python knowledge

Part 2 of 5 in Local RAG Pipeline

The RAG pipeline you built earlier works, but it is a single-shot command-line tool. You type a question, get an answer, and the pipeline forgets everything. That is fine for testing. It is not what an analyst wants to use during an investigation.

This tutorial wraps that pipeline in a chat interface using Streamlit. By the end you will have a browser-based assistant that streams answers token by token, remembers the conversation, and shows exactly which advisory chunks informed each response. Everything still runs locally. No data leaves your machine.

If you want to experiment with chunking parameters before building the UI, the RAG Pipeline Playground lets you visualize how chunk size and overlap affect document splitting.

What the chat layer adds

The existing pipeline has three stages: ingest, retrieve, generate. The chat interface sits on top of that without changing the underlying logic.

                    ┌───────────────────────────┐
                    │     Streamlit Chat UI     │
                    │  ┌────────────────────┐   │
                    │  │ Conversation State │   │
                    │  └────────┬───────────┘   │
                    └───────────┼───────────────┘
                                │
                    ┌───────────▼───────────────┐
                    │     RAG Pipeline Core     │
                    │                           │
          question ─►  retrieve() ─► generate() ─► streamed answer
                    │       │                   │
                    │  ┌────▼─────┐             │
                    │  │ ChromaDB │             │
                    │  └──────────┘             │
                    └───────────────────────────┘

The chat layer adds three things:

  1. Conversation history. Previous questions and answers are stored in Streamlit session state and passed to the model as context, so follow-up questions work naturally.
  2. Streaming responses. Tokens appear as they are generated instead of waiting for the full response, which makes the interface feel responsive even on slower hardware.
  3. Source citations. After each answer, an expandable section shows which document chunks were retrieved and their similarity scores.

Step 1: Install Streamlit

Make sure you are in the rag-pipeline directory with the virtual environment activated from the base tutorial.

pip install streamlit

Verify the install:

streamlit --version

Streamlit runs a local web server and opens a browser tab. It re-executes the entire Python script on every user interaction, which is important to understand: any variable not stored in st.session_state resets on each interaction.
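
The rerun behavior can be illustrated without Streamlit at all. The sketch below is a plain-Python simulation, not Streamlit code: the `session_state` dict stands in for st.session_state and `run_script` stands in for one top-to-bottom execution of the app. Both names are hypothetical and exist only for this illustration.

```python
# Plain-Python simulation of Streamlit's rerun model (not real Streamlit code).
# `session_state` stands in for st.session_state; `run_script` stands in for
# one top-to-bottom execution of the app script.

session_state = {}  # survives across reruns, like st.session_state


def run_script(user_input):
    """Simulate one rerun triggered by a user interaction."""
    local_log = []  # ordinary variable: recreated on every rerun

    # Same kind of guard the tutorial uses: initialize only once.
    if "messages" not in session_state:
        session_state["messages"] = []

    local_log.append(user_input)
    session_state["messages"].append(user_input)
    return len(local_log), len(session_state["messages"])


first = run_script("What is the XZ backdoor?")
second = run_script("How do I detect it?")
print(first, second)
```

The local list never grows past one entry because it is rebuilt on each run, while the list kept in the persistent dict accumulates both inputs.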

Step 2: Build the basic chat loop

Create chat_app.py with a minimal chat interface that connects to your existing pipeline.

import streamlit as st
import chromadb
import ollama

COLLECTION_NAME = "security_advisories"
EMBED_MODEL = "nomic-embed-text"
CHAT_MODEL = "llama3.2"
TOP_K = 3

st.set_page_config(page_title="RAG Assistant", layout="centered")
st.title("Security Advisory Assistant")

if "messages" not in st.session_state:
    st.session_state.messages = []


def retrieve(question, n_results=TOP_K):
    """Find the most relevant chunks for a question."""
    client = chromadb.PersistentClient(path="./chroma_db")
    collection = client.get_collection(COLLECTION_NAME)

    query_embedding = ollama.embed(
        model=EMBED_MODEL, input=question
    )["embeddings"][0]

    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
        include=["documents", "metadatas", "distances"],
    )

    return (
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    )


# Display conversation history
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

# Handle new input
if prompt := st.chat_input("Ask about a security advisory..."):
    # Show user message
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)

    # Retrieve relevant chunks
    chunks, metadata, distances = retrieve(prompt)

    # Build the prompt with retrieved context
    context = "\n\n---\n\n".join(chunks)
    system_prompt = f"""You are a security analyst assistant. Answer the
question using only the context provided below. If the context doesn't
contain enough information to answer, say so.

Context:
{context}"""

    # Generate response
    with st.chat_message("assistant"):
        response = ollama.chat(
            model=CHAT_MODEL,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": prompt},
            ],
        )
        answer = response["message"]["content"]
        st.markdown(answer)

    st.session_state.messages.append(
        {"role": "assistant", "content": answer}
    )

Make sure you have already ingested the advisories from the base tutorial:

python rag.py ingest

Then launch the chat:

streamlit run chat_app.py

Streamlit opens http://localhost:8501 in your browser. Try asking “What is the XZ backdoor?” and you should get a grounded answer. But notice two problems: the response appears all at once (no streaming), and you cannot see which documents informed the answer. The next steps fix both.

Step 3: Add streaming responses

Ollama supports streaming natively. Instead of waiting for the complete response, you receive tokens as they are generated and display them incrementally.

Replace the generation block (the with st.chat_message("assistant"): section) with:

    # Generate streaming response
    with st.chat_message("assistant"):
        stream = ollama.chat(
            model=CHAT_MODEL,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": prompt},
            ],
            stream=True,
        )

        response_parts = []
        container = st.empty()
        for chunk in stream:
            token = chunk["message"]["content"]
            response_parts.append(token)
            container.markdown("".join(response_parts))

        answer = "".join(response_parts)

The key change is stream=True in the ollama.chat() call. This returns an iterator of partial responses instead of a single complete response. Each chunk contains one or more tokens in chunk["message"]["content"].

st.empty() creates a placeholder that you overwrite on each token. This produces the familiar typing effect without creating duplicate elements in the Streamlit DOM.
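
The token extraction can also be pulled into a small generator, which keeps the UI loop free of response-format details. This is a minimal sketch: `iter_tokens` is a hypothetical helper, and `fake_stream` is a stand-in for the iterator returned by ollama.chat(..., stream=True) so the example runs without a model.

```python
def iter_tokens(stream):
    """Yield the text of each streamed chunk, skipping empty pieces."""
    for chunk in stream:
        token = chunk["message"]["content"]
        if token:
            yield token


# Stand-in for the iterator returned by ollama.chat(..., stream=True).
fake_stream = [
    {"message": {"content": "XZ "}},
    {"message": {"content": ""}},
    {"message": {"content": "backdoor"}},
]

answer = "".join(iter_tokens(fake_stream))
print(answer)
```

If your Streamlit version provides st.write_stream, a generator like this can be passed to it directly in place of the manual st.empty() loop. Check the Streamlit documentation for your installed version before relying on that.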

Step 4: Display source citations

After each answer, show the retrieved chunks so the user can verify where the information came from. Add this block immediately after the streaming loop and before the st.session_state.messages.append line:

        # Show source citations
        with st.expander("Sources", expanded=False):
            for chunk_text, meta, dist in zip(chunks, metadata, distances):
                similarity = 1 - dist  # ChromaDB returns distance, not similarity
                st.markdown(
                    f"**{meta['source']}** (chunk {meta['chunk_index']}) "
                    f"— similarity: {similarity:.2f}"
                )
                st.code(
                    chunk_text[:200]
                    + ("..." if len(chunk_text) > 200 else "")
                )

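ChromaDB returns a distance (lower is better), not a similarity (higher is better). The conversion similarity = 1 - distance holds only when the collection uses cosine distance. ChromaDB’s default metric is squared L2, so this assumes the collection from the base tutorial was created with cosine space. With cosine distance, a similarity of 0.85+ generally indicates a strong match, though the useful threshold depends on the embedding model and the corpus.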
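
As a small worked example of the conversion, the sketch below wraps it in a helper. The function name is hypothetical, and the arithmetic is only meaningful under the assumption that the collection was created with cosine distance; for other metrics the result is not a cosine similarity.

```python
def cosine_similarity_from_distance(distance):
    """Convert a cosine distance (0..2) to a cosine similarity (-1..1)."""
    similarity = 1.0 - distance
    # Clamp to guard against small floating-point overshoot.
    return max(-1.0, min(1.0, similarity))


for d in (0.0, 0.15, 1.0):
    s = cosine_similarity_from_distance(d)
    print(f"distance={d:.2f} -> similarity={s:.2f}")
```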

The st.expander keeps the interface clean. Sources are available immediately after a new answer is generated, but they do not clutter the conversation by default.

Note

Source citations are shown once. Because st.session_state.messages stores only the answer text, the source expander is rendered when the answer is generated but is not replayed for prior turns when Streamlit re-runs the script. If you want persistent per-turn sources, store the retrieved chunks and metadata alongside each assistant message in session state and re-render them in the display loop.
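
One way to store sources with each turn is to bundle them into the message record itself. The sketch below is an assumption about how that record could look, not part of the tutorial's code: `make_assistant_message`, the "sources" key, and the example.md file name are all hypothetical.

```python
def make_assistant_message(answer, chunks, metadata, distances):
    """Bundle an answer with the sources that informed it."""
    sources = [
        {
            "source": meta["source"],
            "chunk_index": meta["chunk_index"],
            "similarity": 1 - dist,  # assumes cosine distance
            "preview": text[:200] + ("..." if len(text) > 200 else ""),
        }
        for text, meta, dist in zip(chunks, metadata, distances)
    ]
    return {"role": "assistant", "content": answer, "sources": sources}


# Illustrative values only; real ones come from retrieve().
message = make_assistant_message(
    "Example answer.",
    chunks=["Example chunk text."],
    metadata=[{"source": "example.md", "chunk_index": 0}],
    distances=[0.25],
)
print(message["sources"][0])
```

Appending this record to st.session_state.messages instead of the plain answer dict would let the display loop read message.get("sources", []) and render an expander for each earlier turn. User messages have no "sources" key, so the default empty list keeps the loop uniform.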

Tip

Source transparency builds trust. Showing sources is not just a debugging aid. Analysts need to verify that the model’s answer is grounded in real advisory data, not hallucinated. Source citations make this a one-click check instead of a manual search.

Step 5: Add conversation context

Right now the model treats each question independently. Ask “What is the XZ backdoor?” followed by “How do I detect it?” and the model does not know what “it” refers to. Fix this by including recent conversation history in both the retrieval query and the prompt.

Build the history before the line that appends the new user message to st.session_state.messages, so that it contains only prior exchanges and not the current question (which is already sent as the user message).

Then retrieve with a history-aware query. This matters because ChromaDB only sees the text passed to retrieve(). If the current question is “How do I detect it?”, retrieval needs the previous turn to know what “it” refers to.

Replace the retrieval and system prompt construction with:

    # Build conversation context from recent history
    history_window = st.session_state.messages[-6:]  # Last 3 exchanges
    history_text = ""
    for msg in history_window:
        role = "User" if msg["role"] == "user" else "Assistant"
        history_text += f"{role}: {msg['content']}\n\n"

    # Retrieve with enough context for follow-up questions
    retrieval_query = prompt
    if history_text:
        retrieval_query = (
            "Previous conversation:\n"
            f"{history_text}\n"
            f"Current question: {prompt}"
        )

    chunks, metadata, distances = retrieve(retrieval_query)

    # Build the prompt with retrieved context and conversation history
    context = "\n\n---\n\n".join(chunks)
    system_prompt = f"""You are a security analyst assistant. Answer the
question using the retrieved context below. If the context doesn't contain
enough information, say so.

Previous conversation:
{history_text}

Retrieved context:
{context}"""

The [-6:] slice keeps the last three question-answer pairs (6 messages). This is enough for most follow-up chains without consuming too much of the model’s context window. It also helps keep the history-augmented retrieval query within nomic-embed-text’s 2048-token input window, although long answers can still push it over: longer inputs are silently truncated by the embedder, which can quietly degrade retrieval quality.
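
Because the message count alone does not bound the query length, a character cap is one way to make truncation explicit rather than silent. The sketch below is a hedged illustration: `build_retrieval_query` is a hypothetical helper, and the 6000-character default is a rough guess based on a few characters per token, not a measured limit for any embedder.

```python
def build_retrieval_query(history_text, prompt, max_chars=6000):
    """Combine history and the current question, trimming old history first."""
    if not history_text:
        return prompt

    header = "Previous conversation:\n"
    footer = f"\nCurrent question: {prompt}"
    budget = max_chars - len(header) - len(footer)
    if budget <= 0:
        return prompt  # no room for history; fall back to the question alone

    # Keep the most recent history by slicing from the end of the string.
    trimmed = history_text[-budget:]
    return f"{header}{trimmed}{footer}"


long_history = "User: an earlier question\n\n" * 1000
query = build_retrieval_query(long_history, "How do I detect it?", max_chars=200)
print(len(query))
```

Trimming from the front drops the oldest turns first and always preserves the current question, which is the part retrieval depends on most.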

Warning

Watch the context window. Every conversation turn and every retrieved chunk consumes tokens. With 3 history pairs, 3 retrieved chunks, and the system prompt, you are using roughly 1,500 to 3,000 tokens of context per query. This usually fits within Ollama’s configured runtime context, but that default is VRAM-dependent and can be smaller than the model’s advertised maximum. If you increase TOP_K or the history window, check your OLLAMA_CONTEXT_LENGTH or model num_ctx setting. Generation quality degrades long before you hit the hard limit.
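
A rough budget check can flag oversized prompts before they reach the model. The sketch below is only an approximation: the helper names are hypothetical, the four-characters-per-token ratio is a common rule of thumb rather than the model's real tokenizer, and the num_ctx value in the example is illustrative, not a statement about your Ollama configuration.

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate; real tokenizers will differ."""
    return -(-len(text) // chars_per_token)  # ceiling division


def fits_context(system_prompt, prompt, num_ctx, reserve_for_answer=512):
    """Check whether the prompt leaves room for the model's reply."""
    used = estimate_tokens(system_prompt) + estimate_tokens(prompt)
    return used + reserve_for_answer <= num_ctx


# Illustrative sizes: a 4,000-character system prompt and a short question.
ok = fits_context("a" * 4000, "b" * 40, num_ctx=2048)
too_big = fits_context("a" * 8000, "b" * 40, num_ctx=2048)
print(ok, too_big)
```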

The complete application

Here is the full chat_app.py with all features integrated:

import streamlit as st
import chromadb
import ollama

COLLECTION_NAME = "security_advisories"
EMBED_MODEL = "nomic-embed-text"
CHAT_MODEL = "llama3.2"
TOP_K = 3

st.set_page_config(page_title="RAG Assistant", layout="centered")
st.title("Security Advisory Assistant")

if "messages" not in st.session_state:
    st.session_state.messages = []


def retrieve(question, n_results=TOP_K):
    """Find the most relevant chunks for a question."""
    client = chromadb.PersistentClient(path="./chroma_db")
    collection = client.get_collection(COLLECTION_NAME)

    query_embedding = ollama.embed(
        model=EMBED_MODEL, input=question
    )["embeddings"][0]

    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
        include=["documents", "metadatas", "distances"],
    )

    return (
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    )


# Display conversation history
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

# Handle new input
if prompt := st.chat_input("Ask about a security advisory..."):
    # Build conversation context before appending the new message
    history_window = st.session_state.messages[-6:]
    history_text = ""
    for msg in history_window:
        role = "User" if msg["role"] == "user" else "Assistant"
        history_text += f"{role}: {msg['content']}\n\n"

    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)

    # Retrieve relevant chunks with enough context for follow-up questions
    retrieval_query = prompt
    if history_text:
        retrieval_query = (
            "Previous conversation:\n"
            f"{history_text}\n"
            f"Current question: {prompt}"
        )

    chunks, metadata, distances = retrieve(retrieval_query)

    # Build the prompt
    context = "\n\n---\n\n".join(chunks)
    system_prompt = f"""You are a security analyst assistant. Answer the
question using the retrieved context below. If the context doesn't contain
enough information, say so.

Previous conversation:
{history_text}

Retrieved context:
{context}"""

    # Generate streaming response
    with st.chat_message("assistant"):
        stream = ollama.chat(
            model=CHAT_MODEL,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": prompt},
            ],
            stream=True,
        )

        response_parts = []
        container = st.empty()
        for chunk in stream:
            token = chunk["message"]["content"]
            response_parts.append(token)
            container.markdown("".join(response_parts))

        answer = "".join(response_parts)

        # Show source citations
        with st.expander("Sources", expanded=False):
            for chunk_text, meta, dist in zip(chunks, metadata, distances):
                similarity = 1 - dist
                st.markdown(
                    f"**{meta['source']}** (chunk {meta['chunk_index']}) "
                    f"— similarity: {similarity:.2f}"
                )
                st.code(
                    chunk_text[:200]
                    + ("..." if len(chunk_text) > 200 else "")
                )

    st.session_state.messages.append(
        {"role": "assistant", "content": answer}
    )

Run it:

streamlit run chat_app.py

Try a multi-turn conversation:

  1. “What is the XZ backdoor?”
  2. “How can I detect if my system is affected?”
  3. “Which of the advisories has the highest CVSS score?”

Each answer should stream in and show relevant sources, and the model should resolve follow-up references correctly.

Common mistakes

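Not initializing session state. If you assign st.session_state.messages = [] without the if "messages" not in st.session_state guard, the conversation resets on every interaction, because Streamlit re-runs the entire script each time the user submits input. If you skip the initialization altogether, the first access to st.session_state.messages raises an error.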

Unbounded conversation history. Passing the full conversation history to the model sounds appealing but fills the context window quickly. Three exchanges (six messages) is a reasonable default. If you need longer memory, consider summarizing older turns instead of passing them verbatim.
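
The summarizing idea can be sketched as a split between older turns, which are condensed, and recent turns, which are kept verbatim. Everything here is hypothetical: `compact_history` and `naive_summary` are illustrative names, and the placeholder summarizer only lists earlier user questions, where a real one might call the chat model.

```python
def compact_history(messages, summarize, keep_last=6):
    """Split history into a summary of older turns and verbatim recent turns."""
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    older = messages[:-keep_last]
    recent = messages[-keep_last:]
    summary = summarize(older) if older else ""
    return summary, recent


def naive_summary(messages):
    """Placeholder summarizer: lists the earlier user questions."""
    questions = [m["content"] for m in messages if m["role"] == "user"]
    return "Earlier user questions: " + "; ".join(questions)


# Eight alternating messages: four question-answer pairs.
history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"message {i}"}
    for i in range(8)
]
summary, recent = compact_history(history, naive_summary, keep_last=6)
print(summary)
print(len(recent))
```

The summary string could be placed ahead of the formatted recent history in the system prompt, so the model keeps a compact record of older turns without paying for their full text.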

Creating a new ChromaDB client per query. retrieve() builds a new PersistentClient on every call, and PersistentClient reads from disk on initialization. In a high-traffic deployment, move client creation to a function cached with @st.cache_resource and have retrieve() call it, to avoid repeated disk reads:

@st.cache_resource
def get_collection():
    client = chromadb.PersistentClient(path="./chroma_db")
    return client.get_collection(COLLECTION_NAME)
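
To use a cached collection, retrieve() has to stop creating its own client. One hedged way to arrange that is to pass the collection and the embedding function in as arguments, as sketched below. The `embed` parameter and the `FakeCollection` class are hypothetical stand-ins so the example runs without ChromaDB or Ollama; in the app you would pass the result of get_collection() and a small wrapper around ollama.embed.

```python
def retrieve(question, collection, embed, n_results=3):
    """Query `collection` for the chunks closest to `question`."""
    results = collection.query(
        query_embeddings=[embed(question)],
        n_results=n_results,
        include=["documents", "metadatas", "distances"],
    )
    return (
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    )


class FakeCollection:
    """Stand-in for a ChromaDB collection; returns one canned result."""

    def query(self, query_embeddings, n_results, include):
        return {
            "documents": [["chunk text"]],
            "metadatas": [[{"source": "example.md", "chunk_index": 0}]],
            "distances": [[0.2]],
        }


docs, metas, dists = retrieve(
    "What is the XZ backdoor?",
    FakeCollection(),
    embed=lambda text: [0.0, 1.0],  # dummy embedding for the sketch
)
print(docs, metas, dists)
```

Passing dependencies in also makes the retrieval step easy to exercise in isolation, since a test can supply a fake collection as above.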

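Confusing distance and similarity. ChromaDB’s query() returns distances, not similarities, and the metric depends on how the collection was created (the default is squared L2, not cosine). With a cosine collection, a distance of 0.15 means a similarity of 0.85. Displaying raw distance values confuses users who expect higher numbers to mean better matches.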

Note

Streamlit’s execution model. Streamlit re-runs your script from top to bottom on every user interaction (button click, chat input, slider change). State that must persist between runs goes in st.session_state. Everything else resets. This is by design, not a bug, but it catches most people the first time.

Next steps

You now have a working chat interface for your RAG pipeline. Here are directions to extend it: