Originally published on LinkedIn. Lightly edited for clarity.
SOC teams want LLMs for summarization and triage, but they cannot send raw logs and alerts to a public API.
A private model changes the risk posture: you keep the data local, control retention, and decide what leaves the boundary.
Install Ollama on your SOC host
Use a dedicated host or VM in the same private segment as your Wazuh manager.
On Linux:

```shell
curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl enable --now ollama
```

On macOS:

```shell
brew install ollama
brew services start ollama
```

Quick health check:

```shell
ollama --version
curl http://127.0.0.1:11434/api/tags
```

If this API is reachable from the network, put it behind a firewall and allow only trusted SOC automation hosts.
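By default Ollama listens on 127.0.0.1 only; if you expose it on the network (for example via `OLLAMA_HOST`), restrict the port at the host firewall. A minimal sketch assuming `ufw`, with a hypothetical automation-host address you would replace with your own:

```shell
# Default-deny the Ollama port for everyone...
sudo ufw deny 11434/tcp
# ...then allow only the SOC automation host (10.0.20.15 is a placeholder).
sudo ufw allow from 10.0.20.15 to any port 11434 proto tcp
```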
Pull and run a real self-hosted model
For alert summarization and first-pass triage, start with llama3.1:8b.
```shell
ollama pull llama3.1:8b
ollama run llama3.1:8b "Summarize this alert: multiple failed SSH logins from 203.0.113.44 in 2 minutes."
```

For automation pipelines, call the local API directly:

```shell
curl -sS http://127.0.0.1:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Return JSON only with keys summary,confidence,next_step. Alert: SSH brute force from 203.0.113.44.",
  "stream": false,
  "options": { "temperature": 0.1 }
}'
```

Use a low temperature for repeatable output and enforce JSON output in the prompt so parsing is predictable.
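Even with a JSON-only prompt, models occasionally wrap JSON in prose or emit something invalid, so the consumer should degrade gracefully rather than crash. A sketch of that parsing pattern (`parse_llm_json` is an illustrative helper, not part of any library):

```python
import json

def parse_llm_json(text: str) -> dict:
    """Parse a model response that was asked to return JSON only,
    falling back to a low-confidence record on malformed output."""
    try:
        parsed = json.loads(text.strip())
        if isinstance(parsed, dict):
            return parsed
    except json.JSONDecodeError:
        pass
    # The model ignored the JSON-only instruction; keep the raw text
    # but flag it so an analyst knows parsing failed.
    return {"summary": text.strip(), "confidence": "low",
            "next_step": "Analyst review required"}

# Well-formed JSON parses directly:
print(parse_llm_json('{"summary": "SSH brute force", "confidence": "high", "next_step": "Block IP"}'))
# Anything else degrades gracefully instead of raising:
print(parse_llm_json("The alert looks like a brute-force attempt."))
```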
Decide what the model is allowed to see
Start with scope. Most SOC use cases do not require full packet captures or raw identity data.
Define the minimum context the model needs and redact everything else. Think in tiers:
- Tier 1: Alert metadata, signatures, severity, and timestamps.
- Tier 2: Sanitized log excerpts with IDs and IPs masked.
- Tier 3: Full raw events only when a human explicitly requests them.
Your redaction layer is the real security control. The model is just a consumer.
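A Tier 2 sanitizer can be as simple as a pair of regexes applied before any excerpt reaches the model. A minimal sketch, assuming logs carry `user=` key-value pairs (the patterns and log format here are illustrative, not exhaustive):

```python
import re

# Mask IPv4 addresses and user identifiers before the excerpt
# leaves the redaction layer.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
USER = re.compile(r"\buser=\S+")

def sanitize(excerpt: str) -> str:
    excerpt = IPV4.sub("IP_REDACTED", excerpt)
    excerpt = USER.sub("user=USER_REDACTED", excerpt)
    return excerpt

line = "Failed password for user=jsmith from 203.0.113.44 port 22"
print(sanitize(line))
# Failed password for user=USER_REDACTED from IP_REDACTED port 22
```

Real deployments need broader coverage (IPv6, hostnames, email addresses), but the control point is the same: the model only ever sees the sanitized string.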
Choose a model and runtime with operational constraints
Running locally means you own latency, cost, and performance.
Make the tradeoffs explicit:
- Smaller models are easier to run on CPU and are often sufficient for summarization.
- Larger models improve reasoning but require GPU capacity and careful scheduling.
- Quantization reduces memory but can change behavior.
Pick a runtime that supports offline operation, audit logging, and explicit model versioning.
Treat models like dependencies, not SaaS.
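One way to make that explicit is a manifest line recorded for every model you deploy, the way you would pin a dependency. A sketch; the digest and quantization values are assumptions you would fill in from whatever your runtime reports (for Ollama, `ollama list` shows a content digest):

```python
import datetime
import json

def model_manifest(name: str, digest: str, quantization: str) -> str:
    """Record exactly which model build is in service, as a JSON line.

    Quantization is tracked separately because it changes behavior
    even when the model name stays the same.
    """
    entry = {
        "model": name,
        "digest": digest,
        "quantization": quantization,
        "recorded_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    return json.dumps(entry, sort_keys=True)

# Placeholder digest for illustration:
print(model_manifest("llama3.1:8b", "sha256:placeholder", "Q4_K_M"))
```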
Wire Ollama into Wazuh alert enrichment
Use Wazuh Integrator on the manager to invoke a custom script for selected alerts.
Add a custom integration block in `/var/ossec/etc/ossec.conf`:

```xml
<integration>
  <name>custom-ollama-enrich</name>
  <level>10</level>
  <alert_format>json</alert_format>
</integration>
```

Create `/var/ossec/integrations/custom-ollama-enrich`:
```python
#!/usr/bin/env python3
import datetime
import json
import sys
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"
MODEL = "llama3.1:8b"
OUTFILE = "/var/ossec/logs/llm-enrichment.json"


def call_ollama(prompt: str) -> dict:
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0.1},
    }
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=45) as resp:
        raw = json.loads(resp.read().decode("utf-8"))
    text = raw.get("response", "").strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # The model ignored the JSON-only instruction; degrade gracefully.
        return {"summary": text, "confidence": "low", "next_step": "Analyst review required"}


def main() -> int:
    # Wazuh Integrator passes the alert file path as the first argument.
    alert_path = sys.argv[1]
    with open(alert_path, "r", encoding="utf-8") as f:
        alert = json.load(f)
    rule = alert.get("rule", {})
    agent = alert.get("agent", {})
    srcip = alert.get("data", {}).get("srcip", "unknown")
    prompt = (
        "You are a SOC assistant. Return JSON only with keys: summary, confidence, next_step. "
        f"rule_id={rule.get('id')} rule_level={rule.get('level')} agent={agent.get('name')} srcip={srcip} "
        f"description={rule.get('description')}"
    )
    llm = call_ollama(prompt)
    enriched = {
        "integration": "ollama-enrich",
        "timestamp": datetime.datetime.utcnow().isoformat() + "Z",
        "wazuh_rule_id": rule.get("id"),
        "wazuh_level": rule.get("level"),
        "agent": agent.get("name"),
        "llm": llm,
    }
    with open(OUTFILE, "a", encoding="utf-8") as out:
        out.write(json.dumps(enriched) + "\n")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
```

Set permissions:
```shell
sudo chown root:wazuh /var/ossec/integrations/custom-ollama-enrich
sudo chmod 750 /var/ossec/integrations/custom-ollama-enrich
```

Now ingest the enrichment file back into Wazuh:

```xml
<localfile>
  <location>/var/ossec/logs/llm-enrichment.json</location>
  <log_format>json</log_format>
</localfile>
```

This closes the loop: Wazuh alert -> local LLM summary -> JSON line written -> Wazuh ingests the enrichment as searchable alert context.
Validate the flow end to end
- Trigger a known alert (for example, repeated SSH login failures).
- Confirm Wazuh runs the integration script.
- Check `/var/ossec/logs/llm-enrichment.json` for appended JSON.
- Verify the enrichment event appears in Wazuh with `llm.summary`, `llm.confidence`, and `llm.next_step`.
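To exercise the script without waiting for a live detection, you can hand it a fabricated alert file containing just the fields it reads. A sketch (rule 5712 is Wazuh's SSH brute-force rule; the agent name is a placeholder):

```python
import json
import tempfile

# Minimal Wazuh-style alert matching the fields the integration
# script extracts: rule, agent, and data.srcip.
sample_alert = {
    "rule": {"id": "5712", "level": 10,
             "description": "sshd: brute force trying to get access to the system."},
    "agent": {"name": "web-01"},
    "data": {"srcip": "203.0.113.44"},
}

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(sample_alert, f)
    path = f.name

print(path)  # pass this path as the first argument to custom-ollama-enrich
```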
Keep the model assistive. Use enrichment to prioritize and summarize, then let analysts and rules decide containment.
Build a simple, auditable pipeline
A durable pattern is:
- Normalize alerts into a consistent schema.
- Redact and classify sensitive fields.
- Run local inference for summarization, prioritization, or enrichment.
- Store output with provenance and version metadata.
The SOC should be able to answer, “Which model produced this summary?” without guesswork.
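The steps above reduce to one record shape. A sketch with illustrative field names (not a standard schema): the prompt is hashed so a summary can be traced back to its exact input without storing sensitive text twice.

```python
import datetime
import hashlib
import json

def build_record(alert_id: str, summary: str, model: str, prompt: str) -> dict:
    """Wrap model output with enough provenance to answer
    'which model produced this summary?' without guesswork."""
    return {
        "alert_id": alert_id,
        "summary": summary,  # model output: treat as untrusted
        "provenance": {
            "model": model,  # exact tag, e.g. llama3.1:8b
            "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
            "generated_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        },
    }

rec = build_record("evt-001", "Probable SSH brute force", "llama3.1:8b",
                   "Summarize alert evt-001")
print(json.dumps(rec, indent=2))
```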
Keep the model inside the security boundary
Local LLMs are not magic. They are still software that can be exploited.
Treat the inference service as a sensitive system:
- Isolate it on a private network segment.
- Disable outbound network access by default.
- Log all prompts and responses for auditability.
- Apply the same hardening standards you use for other production services.
Accept the limits
Local models can help with triage and narrative building, but they do not replace detection logic or incident response.
Use them to reduce analyst toil, not to make the final call. Keep the model in an assistive role until you have strong validation.
2026 Perspective
Local models are smaller, faster, and easier to operate than they were even a couple of years ago, which makes this approach more practical.
The core discipline is unchanged: minimize what the model sees and treat outputs as untrusted. The SOC that wins is the one that keeps control of its data.