Multi-Tenant GPU Isolation · Steven Foerster

GPU clouds sell the same hardware to multiple tenants simultaneously. The economics demand it: an H100 costs $30,000+ and sits idle most of the time if dedicated to a single workload. But sharing GPUs between tenants creates isolation problems that are fundamentally different from CPU multi-tenancy. GPUs were designed for throughput, not isolation. They have no equivalent of x86 privilege rings, no hardware-enforced memory protection between workloads (outside of MIG), and a shared memory hierarchy that leaks timing information. Every GPU sharing mechanism involves trade-offs between isolation strength, performance overhead, and hardware utilization.

This tutorial examines the four primary GPU sharing mechanisms from a security perspective: what each one actually isolates, what it does not, and what an attacker on a shared GPU can observe about co-tenant workloads. The goal is not to provide exploits but to give infrastructure architects an accurate threat model for multi-tenant GPU platforms, making clear where the boundaries are thin so that platform designers can make informed decisions about tenant placement and isolation guarantees.

The GPU security model (or lack of one)

Before diving into sharing mechanisms, it’s worth understanding why GPU isolation is harder than CPU isolation. A modern CPU has decades of multi-tenant security engineering: ring-based privilege levels, page tables with per-process virtual address spaces, hardware TLBs that enforce isolation, and hypervisor extensions (VT-x/AMD-V) that create hardware-enforced boundaries between VMs. These mechanisms exist because CPUs have always run untrusted code from multiple users.

GPUs evolved differently. They were designed as coprocessors for a single application: first graphics rendering, then GPGPU compute. The threat model assumed a single trusted user submitting work through a driver. Multi-tenancy was not a design goal.

        CPU Isolation Model                 GPU Isolation Model (default)

┌────────────────────────────────┐   ┌────────────────────────────────┐
│  Process A    │   Process B    │   │  Kernel A     │   Kernel B     │
│  (ring 3)     │   (ring 3)     │   │  (user mode)  │   (user mode)  │
├───────────────┼────────────────┤   ├───────────────┴────────────────┤
│          Page Tables           │   │     Shared GPU Memory Space    │
│   (hardware-enforced per       │   │   (driver-managed, no HW       │
│    process virtual memory)     │   │    isolation between contexts)  │
├────────────────────────────────┤   ├────────────────────────────────┤
│         Kernel (ring 0)        │   │        GPU Driver (host)       │
├────────────────────────────────┤   ├────────────────────────────────┤
│   Hypervisor (ring -1, opt.)   │   │   MIG / vGPU (hardware, opt.) │
├────────────────────────────────┤   ├────────────────────────────────┤
│         Hardware (CPU)         │   │        Hardware (GPU)          │
└────────────────────────────────┘   └────────────────────────────────┘

The key differences:

Property	CPU	GPU (default)
Memory isolation	Hardware page tables per process	Driver-managed allocation, shared physical address space
Execution isolation	Preemptive scheduling, privilege rings	Cooperative or time-sliced, no privilege separation between kernels
Cache isolation	Per-core L1/L2, partitionable LLC	Shared L2 cache across all SMs, no partitioning (pre-MIG)
Context switching	Full hardware context save/restore with TLB flush	Partial. GPU context switches do not scrub shared memory or caches
Memory scrubbing	OS zeros pages on allocation (default)	Driver behavior varies; residual data may persist between allocations

Insight

The shared memory hierarchy. The most significant security difference is the GPU’s shared L2 cache and memory controllers. On a CPU, each process has its own virtual address space backed by hardware page tables. Even when two processes share a physical cache, observing the other’s activity takes a deliberate microarchitectural attack (cache timing such as Prime+Probe, or Spectre/Meltdown-class speculative execution). On a GPU without MIG, two CUDA contexts share the L2 cache and memory bus directly. Timing side-channels are trivial: a co-tenant kernel can measure memory access latency to infer the cache line usage patterns of another workload.
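
The latency-probing idea can be illustrated without a GPU. The following is a minimal sketch, assuming a toy set-associative cache with LRU replacement; `ToyCache`, `attacker_lines`, and `prime_probe` are hypothetical names, and the model says nothing about the real L2 geometry or replacement policy of any NVIDIA part. It only shows why a shared cache lets one party learn which sets another party touched.

```python
from collections import OrderedDict


class ToyCache:
    """Tiny set-associative cache with LRU replacement, shared by all callers."""

    def __init__(self, num_sets=8, ways=2, line_bytes=128):
        self.num_sets, self.ways, self.line_bytes = num_sets, ways, line_bytes
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, addr):
        """Return True on a hit (fast access) and False on a miss (slow access)."""
        line = addr // self.line_bytes
        cache_set = self.sets[line % self.num_sets]
        hit = line in cache_set
        if hit:
            cache_set.move_to_end(line)          # mark as most recently used
        else:
            if len(cache_set) >= self.ways:
                cache_set.popitem(last=False)    # evict least recently used line
            cache_set[line] = True
        return hit


def attacker_lines(cache, set_index):
    """Attacker-owned addresses that all map to one cache set."""
    stride = cache.num_sets * cache.line_bytes
    base = 1 << 20  # attacker buffer, disjoint from the victim's addresses
    return [base + set_index * cache.line_bytes + w * stride
            for w in range(cache.ways)]


def prime_probe(cache, victim_addrs):
    """Return the cache sets the victim touched, as seen by the attacker."""
    for s in range(cache.num_sets):              # prime: fill every set
        for addr in attacker_lines(cache, s):
            cache.access(addr)
    for addr in victim_addrs:                    # victim runs
        cache.access(addr)
    touched = []
    for s in range(cache.num_sets):              # probe: misses reveal evictions
        misses = sum(not cache.access(a) for a in attacker_lines(cache, s))
        if misses:
            touched.append(s)
    return touched


print(prime_probe(ToyCache(), [0x0, 0x280]))
```

In this model the attacker never reads the victim's data; it only learns which sets were disturbed, which is the kind of access-pattern leakage the callout describes.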

Sharing mechanism 1: Time-slicing

Time-slicing is the simplest GPU sharing mechanism. The GPU driver runs one workload at a time and switches between them on a fixed schedule (typically every 1-7ms). It requires no special hardware and works on any NVIDIA GPU.

Time ──────────────────────────────────────────────────────►

┌──────────┐           ┌──────────┐           ┌──────────┐
│ Tenant A │           │ Tenant A │           │ Tenant A │
│ kernels  │           │ kernels  │           │ kernels  │
└──────────┘           └──────────┘           └──────────┘
            ┌──────────┐           ┌──────────┐
            │ Tenant B │           │ Tenant B │
            │ kernels  │           │ kernels  │
            └──────────┘           └──────────┘

◄── 1ms ──►◄── 1ms ──►◄── 1ms ──►◄── 1ms ──►

GPU resources:  Shared SMs, shared memory, shared L2 cache
Isolation:      Temporal only — no spatial isolation

What is isolated: Execution is temporally separated. While Tenant A’s kernels run, Tenant B’s kernels are not executing.

What is not isolated:

GPU memory: Both tenants’ allocations coexist in GPU DRAM. The driver prevents direct pointer access across contexts, but the physical memory is shared and not scrubbed between context switches.
L2 cache: No flush or partition between time slices. Tenant B’s kernels execute with Tenant A’s cache lines still warm, enabling cache timing attacks.
Memory bandwidth: Bursty workloads from one tenant affect the other’s memory access latency even across time slices (bank conflicts, row buffer misses).
Performance counters: NVIDIA’s hardware performance counters are global. A tenant monitoring GPU utilization metrics can infer characteristics of the co-tenant’s workload.

Configuring time-slicing on Kubernetes

Time-slicing is configured through the NVIDIA GPU Operator’s device plugin. Each physical GPU is exposed as multiple Kubernetes resources.

# time-slicing-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4

# Apply the config, then patch the cluster policy so that the "any" entry
# in the ConfigMap becomes the default device-plugin config for all nodes
kubectl create -f time-slicing-config.yaml
kubectl patch clusterpolicy/cluster-policy \
  -n gpu-operator --type merge \
  -p '{"spec":{"devicePlugin":{"config":{"name":"time-slicing-config","default":"any"}}}}'

After applying, each physical GPU appears as 4 nvidia.com/gpu resources. Pods requesting 1 GPU get a time-sliced share.

Warning

Time-slicing provides no memory isolation and no performance guarantees. A pod requesting 1 time-sliced GPU can allocate all available GPU memory, starving co-tenants. There is no admission control for memory. Time-slicing is appropriate for development clusters and batch workloads with trusted users, not for multi-tenant production environments.
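
Because the device plugin performs no memory admission, a platform that still wants to use time-slicing has to track memory itself. The following is a minimal sketch, assuming each pod declares a peak GPU memory figure somewhere the platform can read it (for example an annotation); the function name, pod names, and numbers are all hypothetical.

```python
def oversubscription(gpu_mem_gib, pod_peak_gib):
    """Return (total_demand, shortfall) for pods sharing one time-sliced GPU.

    pod_peak_gib maps pod name -> declared peak GPU memory in GiB.
    A shortfall above zero means some pod can hit an out-of-memory error
    depending purely on what its neighbours allocate first.
    """
    total = sum(pod_peak_gib.values())
    return total, max(0.0, total - gpu_mem_gib)


# Four replicas of one 80 GiB GPU, as in the ConfigMap above.
pods = {"train-a": 30.0, "infer-b": 24.0, "notebook-c": 20.0, "batch-d": 16.0}
total, shortfall = oversubscription(80.0, pods)
print(f"demand={total} GiB, shortfall={shortfall} GiB")
```

A check like this only catches honest mistakes. It does nothing against a tenant that deliberately allocates more than it declared, which is why the warning above restricts time-slicing to trusted users.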

Time-slicing attack surface

An attacker co-located on a time-sliced GPU can:

Observe cache residue: Launch a CUDA kernel that probes L2 cache set access times immediately after a context switch. Cache lines touched by the previous tenant’s workload will be warm. This leaks information about the co-tenant’s memory access patterns, which can reveal model architecture details (layer sizes, activation patterns).
Measure memory allocation patterns: Call cudaMemGetInfo() repeatedly to observe how much free memory remains. Changes in free memory between time slices reveal the co-tenant’s allocation behavior, which correlates with model size and batch size.
Infer kernel execution characteristics: Monitor GPU clock speeds and power draw through nvidia-smi or NVML. Different model architectures produce different power signatures. Transformer inference has a distinct pattern from CNN inference.
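
To make the free-memory observation concrete, here is a minimal sketch of the arithmetic involved. It assumes the readings have already been collected (on a real device they would come from repeated cudaMemGetInfo() calls); `allocation_events`, the noise floor, and the sample values are hypothetical.

```python
GiB = 1 << 30


def allocation_events(free_bytes_samples, noise_floor=1 << 20):
    """Turn successive free-memory readings into (sample_index, delta) events.

    A positive delta means free memory dropped (someone allocated);
    a negative delta means free memory rose (someone freed).
    Changes smaller than noise_floor bytes are ignored.
    """
    events = []
    for i in range(1, len(free_bytes_samples)):
        delta = free_bytes_samples[i - 1] - free_bytes_samples[i]
        if abs(delta) >= noise_floor:
            events.append((i, delta))
    return events


# Synthetic readings: a 14 GiB allocation, a further 2 GiB, then a full release.
samples = [40 * GiB, 40 * GiB, 26 * GiB, 26 * GiB, 24 * GiB, 40 * GiB]
for index, delta in allocation_events(samples):
    print(f"sample {index}: {delta / GiB:+.0f} GiB")
```

The observer needs no special privilege for this: the sizes and timing of a neighbour's allocations fall out of a counter that every context can read.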

Sharing mechanism 2: MPS (Multi-Process Service)

MPS allows multiple CUDA contexts to submit work to the GPU simultaneously, unlike time-slicing, which serializes them. MPS runs a daemon that multiplexes CUDA contexts onto a shared GPU context, enabling kernel-level interleaving.

┌──────────────────────────────────────────────────────────┐
│                    GPU Hardware                          │
│                                                          │
│  SM 0-15: Tenant A kernels    SM 16-31: Tenant B kernels│
│  (concurrent execution)       (concurrent execution)     │
│                                                          │
│  ┌──────────────────────────────────────────────────┐    │
│  │              Shared L2 Cache                      │   │
│  └──────────────────────────────────────────────────┘    │
│  ┌──────────────────────────────────────────────────┐    │
│  │              Shared HBM Memory                    │   │
│  └──────────────────────────────────────────────────┘    │
└──────────────────────────────────────────────────────────┘
         │                              │
    ┌────┴────┐                    ┌────┴────┐
    │ MPS     │                    │ MPS     │
    │ Client  │                    │ Client  │
    │ (A)     │                    │ (B)     │
    └────┬────┘                    └────┬────┘
         │                              │
    ┌────┴──────────────────────────────┴────┐
    │          MPS Daemon (host)             │
    └───────────────────────────────────────┘

What MPS improves over time-slicing:

Concurrent kernel execution, better GPU utilization
Lower context switch overhead
Configurable SM partitioning (percentage of SMs per client)
Pinned device memory limits per client

What MPS does not isolate:

L2 cache: Fully shared between all MPS clients. Concurrent execution makes cache side-channels even more practical than under time-slicing because the attacker’s probe kernels run simultaneously with the victim’s workload.
Memory bus: Shared, with contention between concurrent clients.
Scheduling and control plane: All clients still depend on the same MPS server. Resource contention, daemon failures, and recovery events remain shared-fate problems.
Fault containment: Pre-Volta MPS had effectively no fault isolation: a fatal GPU fault in one client could terminate every client sharing the server. Volta and newer GPUs improved this with isolated GPU address spaces and limited fault containment, but this is still weaker than MIG or VM boundaries.

Warning

MPS was designed for single-user workloads that span multiple processes (e.g., MPI ranks). NVIDIA explicitly says it is not recommended for use in multi-tenant environments. On Volta+ hardware MPS is materially better than the pre-Volta model because clients get isolated GPU address spaces and partial fault containment, but clients still share caches, memory bandwidth, scheduling, and the MPS control plane. That makes it inappropriate for hostile-tenant isolation.

Configuring MPS resource limits

# Start MPS daemon with resource limits
export CUDA_VISIBLE_DEVICES=0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log

nvidia-cuda-mps-control -d

# Set active thread percentage per client (limits SM usage)
echo "set_default_active_thread_percentage 50" | \
  nvidia-cuda-mps-control

# Set an 8 GiB pinned device memory limit per client
echo "set_default_device_pinned_mem_limit 0 8192M" | \
  nvidia-cuda-mps-control
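
When these limits are applied from provisioning code rather than by hand, it helps to build and validate the control commands in one place. The following is a minimal sketch using only the two control commands shown above; the helper names are hypothetical, and the sketch defaults to a dry run so it can be inspected on a machine without MPS installed.

```python
import subprocess


def mps_limit_commands(thread_pct, device, pinned_mem_mib):
    """Build the MPS control commands for default per-client limits."""
    if not 0 < thread_pct <= 100:
        raise ValueError("thread_pct must be in (0, 100]")
    if pinned_mem_mib <= 0:
        raise ValueError("pinned_mem_mib must be positive")
    return [
        f"set_default_active_thread_percentage {thread_pct}",
        f"set_default_device_pinned_mem_limit {device} {pinned_mem_mib}M",
    ]


def send(commands, dry_run=True):
    """Print the commands, or pipe each one to nvidia-cuda-mps-control."""
    for cmd in commands:
        if dry_run:
            print(cmd)
        else:
            # Equivalent to: echo "<cmd>" | nvidia-cuda-mps-control
            subprocess.run(["nvidia-cuda-mps-control"],
                           input=cmd + "\n", text=True, check=True)


send(mps_limit_commands(thread_pct=50, device=0, pinned_mem_mib=8192))
```

These are defaults for clients that connect afterwards. They bound SM share and pinned memory, but they do not partition the L2 cache or the memory bus.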

Sharing mechanism 3: MIG (Multi-Instance GPU)

MIG is the only GPU sharing mechanism that provides hardware-enforced isolation. Available on A100, A30, H100, and newer GPUs, MIG partitions a single GPU into up to seven independent instances, each with dedicated SMs, L2 cache slices, and memory controllers.

┌────────────────────────────────────────────────────────────────┐
│                     H100 GPU (80 GB HBM3)                      │
│                                                                │
│  ┌─────────────────────┐  ┌─────────────────────┐             │
│  │   MIG Instance 0    │  │   MIG Instance 1    │             │
│  │   (3g.40gb)         │  │   (3g.40gb)         │             │
│  │                     │  │                     │             │
│  │  SMs: 3/7 slices   │  │  SMs: 3/7 slices   │             │
│  │  L2: 4/8 slices    │  │  L2: 4/8 slices    │             │
│  │  HBM: 40 GB        │  │  HBM: 40 GB        │             │
│  │  Mem ctrl: 4 of 8  │  │  Mem ctrl: 4 of 8  │             │
│  │                     │  │                     │             │
│  │  ┌───────────────┐  │  │  ┌───────────────┐  │             │
│  │  │ Dedicated L2  │  │  │  │ Dedicated L2  │  │             │
│  │  │ cache slice   │  │  │  │ cache slice   │  │             │
│  │  └───────────────┘  │  │  └───────────────┘  │             │
│  │  ┌───────────────┐  │  │  ┌───────────────┐  │             │
│  │  │ Dedicated mem │  │  │  │ Dedicated mem │  │             │
│  │  │ controllers   │  │  │  │ controllers   │  │             │
│  │  └───────────────┘  │  │  └───────────────┘  │             │
│  └─────────────────────┘  └─────────────────────┘             │
│                                                                │
│  P2P: Driver-dependent; same-GPU MIG P2P on recent R570+      │
│  Error isolation: per-instance (Xid errors are contained)      │
└────────────────────────────────────────────────────────────────┘

What MIG isolates (hardware-enforced):

SMs: Each instance gets dedicated Streaming Multiprocessors. A kernel in Instance 0 cannot execute on Instance 1’s SMs.
L2 cache: Physically partitioned. Each instance has its own L2 cache slices. No shared cache lines between instances.
Memory controllers: Each instance has dedicated HBM memory controllers. No memory bandwidth contention between instances.
Memory: Dedicated HBM allocation per instance. One instance cannot address another’s memory.
Error isolation: An Xid error in one instance does not affect other instances. This is critical for production multi-tenancy.

What MIG does not isolate:

Fabric/interconnect semantics: MIG does not give you a general-purpose multi-GPU fabric between tenants. Recent R570 drivers support peer-to-peer access between MIG instances on the same GPU, but cross-GPU P2P and NCCL behavior remain more constrained than full-GPU mode. Do not assume MIG slices can participate in arbitrary distributed-training topologies.
PCIe bandwidth: All instances share the PCIe link to the host. Heavy host-to-device transfers in one instance reduce available PCIe bandwidth for others.
GPU clock frequency: All instances share the same clock. One instance hitting thermal limits throttles all instances.
Power budget: The GPU’s TDP is shared. A compute-heavy workload in one instance reduces available power for others.
NVDEC/NVENC: Video codec engines are shared (not partitioned by MIG).

MIG partition profiles

MIG profiles follow the naming convention <GPU_instance_slices>g.<memory_GB>gb. The H100 supports these partition layouts (among others):

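Profile	Compute slices	L2 cache share	Memory	Max instances
7g.80gb	7 of 7 (all SMs)	Full (50 MB on H100)	80 GB	1
4g.40gb	4 of 7	4/8	40 GB	1
3g.40gb	3 of 7	4/8	40 GB	2
2g.20gb	2 of 7	2/8	20 GB	3
1g.10gb	1 of 7	1/8	10 GB	7
1g.10gb+me	1 of 7	1/8	10 GB	1 (with media extensions)

Absolute SM counts per profile depend on the GPU SKU; nvidia-smi mig -lgip lists them for the installed device.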
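
The naming convention is regular enough to parse, which is useful when validating tenant requests before they reach the scheduler. The following is a minimal sketch, assuming a GPU with 7 compute slices and 80 GB of memory; `parse_profile` and `may_fit` are hypothetical helpers. The check is necessary but not sufficient, because the driver also enforces placement rules that this sketch does not model.

```python
import re

# <compute_slices>g.<memory_GB>gb, optionally followed by "+me"
PROFILE_RE = re.compile(r"^(\d+)g\.(\d+)gb(\+me)?$")


def parse_profile(name):
    """Return (compute_slices, memory_gb) for a MIG profile name."""
    match = PROFILE_RE.match(name)
    if not match:
        raise ValueError(f"not a MIG profile name: {name!r}")
    return int(match.group(1)), int(match.group(2))


def may_fit(profiles, compute_slices=7, memory_gb=80):
    """Coarse capacity check: do the profiles fit the slice and memory budget?

    Passing this check does not guarantee the driver will accept the layout.
    """
    parsed = [parse_profile(p) for p in profiles]
    return (sum(c for c, _ in parsed) <= compute_slices
            and sum(m for _, m in parsed) <= memory_gb)


print(parse_profile("3g.40gb"))
print(may_fit(["3g.40gb", "3g.40gb"]))
print(may_fit(["3g.40gb", "3g.40gb", "3g.40gb"]))
```

Rejecting impossible layouts early keeps reconfiguration failures out of the tenant-facing path, but the authoritative answer still comes from the driver when the instances are created.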

Configuring MIG

# Enable MIG mode (requires GPU reset)
sudo nvidia-smi -i 0 -mig 1

# Create two 3g.40gb instances
sudo nvidia-smi mig -i 0 -cgi 9,9 -C

# Verify instances
nvidia-smi mig -i 0 -lgi

+-------------------------------------------------------+
| GPU instances:                                        |
| GPU   Name             ID                             |
|   0   MIG 3g.40gb      1                              |
|   0   MIG 3g.40gb      2                              |
+-------------------------------------------------------+
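
For audit logging of MIG layouts, the listing can be turned into structured records. The following is a minimal sketch that parses exactly the simplified layout shown above; real nvidia-smi output carries additional columns (such as profile ID and placement) and its format may change between driver versions, so the regular expression here is an assumption to adapt, not a stable interface.

```python
import re

# Matches rows such as: |   0   MIG 3g.40gb      1      |
ROW_RE = re.compile(r"^\|\s*(\d+)\s+MIG\s+(\S+)\s+(\d+)\b")


def parse_lgi(text):
    """Extract GPU index, profile name, and instance ID from the listing."""
    rows = []
    for line in text.splitlines():
        match = ROW_RE.match(line.strip())
        if match:
            rows.append({
                "gpu": int(match.group(1)),
                "profile": match.group(2),
                "instance_id": int(match.group(3)),
            })
    return rows


sample = """\
+-------------------------------------------------------+
| GPU instances:                                        |
| GPU   Name             ID                             |
|   0   MIG 3g.40gb      1                              |
|   0   MIG 3g.40gb      2                              |
+-------------------------------------------------------+
"""

for row in parse_lgi(sample):
    print(row)
```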

MIG on Kubernetes

The NVIDIA GPU Operator exposes MIG instances as distinct Kubernetes resources.

# Pod requesting a specific MIG profile
apiVersion: v1
kind: Pod
metadata:
  name: inference-workload
spec:
  containers:
    - name: model-server
      image: nvcr.io/nvidia/tritonserver:24.01-py3
      resources:
        limits:
          nvidia.com/mig-3g.40gb: 1

Tip

MIG strategy modes. The GPU Operator supports two MIG strategies: single (every GPU on a node is partitioned into the same profile, and the instances are advertised as plain nvidia.com/gpu resources) and mixed (profiles can differ, and each is advertised under its own resource name such as nvidia.com/mig-3g.40gb, which is what the pod spec above requests). Mixed mode gives more flexibility but complicates scheduling. For multi-tenant platforms, single strategy is simpler to reason about and avoids fragmentation issues.

MIG security analysis

MIG is the strongest isolation mechanism available, but it has residual attack surface:

Shared PCIe bandwidth: An attacker in one MIG instance can measure PCIe transfer latency. When a co-tenant instance performs heavy host-to-device copies, the attacker’s PCIe transfers slow down. This leaks information about the co-tenant’s data loading patterns, which can reveal batch sizes and data pipeline characteristics.
Shared thermal/power domain: GPU clock throttling is global. An attacker running a power-virus workload can force thermal throttling that affects co-tenant performance. This is a denial-of-service vector, not a data leak, but it matters for SLA guarantees.
GPU firmware and driver bugs: MIG isolation is enforced jointly by GPU firmware and the host driver, so a bug in either can in principle allow cross-instance memory access. Most public NVIDIA security advisories have been general driver bugs rather than MIG-specific escapes. CVE-2024-0090 (out-of-bounds write) and CVE-2024-0092 (improper handling of exceptional conditions), for example, are driver-wide and affect non-MIG configurations as well; NVIDIA’s bulletin does not call out MIG isolation as the primary impact. Don’t read those CVEs as evidence that MIG itself is broken; read them as a reminder that the driver layer underneath MIG is part of your TCB. Keep both the kernel-mode driver and GPU firmware on versions that include the fixes from the latest security bulletin, and track the NVIDIA Product Security page for advisories that explicitly list MIG impact.
Metadata leakage via NVML: Some GPU-wide metrics remain visible to all MIG instances through NVML: total GPU power draw, GPU temperature, and driver version. Power draw correlates with compute intensity across all instances.
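
One way a platform might narrow the NVML leakage is to serve tenants a filtered metrics view instead of raw device access. The following is a minimal sketch, assuming metrics have already been collected into a dictionary by a trusted host-side agent; the field names and `tenant_view` are hypothetical and do not correspond to any NVML or DCGM schema.

```python
# GPU-wide readings that reflect every instance on the device, not just
# the tenant's own. These mirror the three items listed above.
GPU_WIDE_FIELDS = {"power_draw_w", "temperature_c", "driver_version"}


def tenant_view(metrics):
    """Return a copy of metrics with GPU-wide fields removed."""
    return {k: v for k, v in metrics.items() if k not in GPU_WIDE_FIELDS}


raw = {
    "instance_id": 1,
    "sm_utilization_pct": 63,
    "memory_used_mib": 21504,
    "power_draw_w": 512.0,
    "temperature_c": 71,
    "driver_version": "example",
}
print(tenant_view(raw))
```

Filtering only helps if tenants cannot query the device directly, so it has to be paired with container settings that withhold raw NVML access.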

Insight

MIG as the minimum viable isolation. For multi-tenant GPU clouds serving untrusted workloads, MIG is the minimum acceptable isolation mechanism. Time-slicing and MPS should be considered internal-only sharing modes for trusted workloads. If your threat model includes adversarial tenants (which any public cloud must assume), MIG or dedicated GPUs are the only defensible positions.

Sharing mechanism 4: vGPU (Virtual GPU)

NVIDIA vGPU (formerly GRID) provides GPU virtualization at the hypervisor level. A virtual GPU manager running in the hypervisor mediates access to the physical GPU, presenting each VM with a virtual GPU device.

┌───────────────┐  ┌───────────────┐  ┌───────────────┐
│    VM 1        │  │    VM 2        │  │    VM 3        │
│  ┌──────────┐  │  │  ┌──────────┐  │  │  ┌──────────┐  │
│  │ vGPU     │  │  │  │ vGPU     │  │  │  │ vGPU     │  │
│  │ driver   │  │  │  │ driver   │  │  │  │ driver   │  │
│  └────┬─────┘  │  │  └────┬─────┘  │  │  └────┬─────┘  │
└───────┼────────┘  └───────┼────────┘  └───────┼────────┘
        │                   │                   │
┌───────┴───────────────────┴───────────────────┴────────┐
│              vGPU Manager (Hypervisor)                  │
│         Scheduling  │  Memory management               │
│         Isolation   │  Error containment               │
├────────────────────────────────────────────────────────┤
│                Physical GPU                             │
└────────────────────────────────────────────────────────┘

vGPU profiles define the resources allocated to each virtual GPU: framebuffer size, display heads, and encode/decode sessions. Unless the vGPU is backed by MIG, SMs are shared in time rather than partitioned. Like MIG, vGPU enforces memory isolation: each vGPU gets a dedicated framebuffer partition.

vGPU vs MIG: On GPUs that support both (A100, H100), vGPU can use MIG as the underlying partitioning mechanism, combining hypervisor-level isolation with hardware-enforced GPU partitioning. This is the strongest available isolation:

Property	Time-slicing	MPS	MIG	vGPU	vGPU + MIG
SM isolation	None	Soft (active threads)	Hardware	Temporal	Hardware + hypervisor
L2 cache isolation	None	None	Hardware	Time-sliced	Hardware
Memory isolation	None	Address space (Volta+)	Hardware	Hypervisor/framebuffer	Hardware + hypervisor
Error isolation	None	Limited (Volta+)	Per-instance	Per-VM	Per-VM + per-instance
P2P / NVLink	Yes	Yes	Driver-dependent	Yes (some profiles)	Driver-dependent
Performance overhead	~5%	~2%	~0%	~5-10%	~5-10%
Licensing	Free	Free	Free	Paid (per vGPU)	Paid (per vGPU)

Note

Two nuances matter here. First, MPS on Volta and newer GPUs is materially stronger than pre-Volta MPS because clients have isolated GPU address spaces and limited fault containment, but caches and scheduling are still shared. Second, MIG peer-to-peer behavior is driver- and topology-dependent: recent R570 drivers support same-GPU MIG-to-MIG P2P, but cross-GPU communication remains more constrained than full-GPU mode.
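
The comparison table can also be encoded as data so that placement code can query it instead of hard-coding mechanism names. The following is a minimal sketch covering three rows of the table; the dictionary keys, the value labels, and `with_hardware_l2_isolation` are hypothetical.

```python
# Subset of the comparison table: SM, L2 cache, and memory isolation,
# plus whether the mechanism requires a paid license.
ISOLATION = {
    "time-slicing": {"sm": "none", "l2": "none", "memory": "none", "paid": False},
    "mps": {"sm": "soft", "l2": "none", "memory": "address-space", "paid": False},
    "mig": {"sm": "hardware", "l2": "hardware", "memory": "hardware", "paid": False},
    "vgpu": {"sm": "temporal", "l2": "time-sliced", "memory": "hypervisor", "paid": True},
    "vgpu+mig": {"sm": "hardware", "l2": "hardware", "memory": "hardware", "paid": True},
}


def with_hardware_l2_isolation(table=ISOLATION):
    """Mechanisms whose L2 cache is partitioned in hardware."""
    return sorted(name for name, row in table.items() if row["l2"] == "hardware")


print(with_hardware_l2_isolation())
```

Filtering on the L2 row is a reasonable first cut for hostile-tenant placement, since the shared cache is the side channel that time-slicing and MPS leave open.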

Memory residue between tenants

One of the most overlooked risks in GPU multi-tenancy is memory residue: data from a previous tenant’s workload that remains accessible in GPU memory after deallocation. This is the GPU equivalent of the “cold boot attack” on DRAM, except it doesn’t require physical access.

What happens on deallocation

When a CUDA context frees GPU memory with cudaFree(), the driver marks the pages as available but does not necessarily zero them. The next allocation with cudaMalloc() may return pages that still contain the previous tenant’s data: model weights, activations, gradients, or input data.

# Demonstrate memory residue within a single process (illustration only).
# Real exploitation would use raw CUDA memory access across contexts,
# not cupy.

python3 -c "
import cupy as cp
import numpy as np

# Tenant A: Allocate and write sensitive data
secret = cp.array([0xDEADBEEF] * 1024, dtype=cp.uint32)
secret_ptr = secret.data.ptr
print(f'Tenant A wrote to GPU address: {secret_ptr:#x}')
del secret  # Free memory (pages NOT zeroed)
cp.get_default_memory_pool().free_all_blocks()

# Tenant B: Allocate same region
# The driver may return the same physical pages
probe = cp.empty(1024, dtype=cp.uint32)
residue = cp.asnumpy(probe)
matches = np.sum(residue == 0xDEADBEEF)
print(f'Tenant B found {matches} residual values from Tenant A')
"

Warning

This demonstration works within a single process for illustration. In real multi-tenant scenarios, the attack requires the driver to reuse physical GPU pages across CUDA contexts without clearing them, which is possible whenever memory is not zeroed on allocation or on context teardown. MIG instances have separate memory pools, which prevents cross-instance residue.

Mitigations

Mitigation	Mechanism	Overhead	Coverage
Explicit `cudaMemset` (after `cudaMalloc` and before `cudaFree`)	Workload zeros its own buffers	2-5% per allocation	Application-level, must be implemented per workload
MIG	Separate memory pools per instance	~0%	Hardware, eliminates cross-instance residue
Driver-level scrubbing	Zero pages on context destroy	1-3% per context teardown	Requires driver support (recent drivers)
`nvidia-smi -r` (GPU reset)	Full GPU reset between tenants	Seconds of downtime	Complete but slow, impractical for elastic scheduling
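
The difference between reuse without scrubbing and scrub-on-free can be shown with a toy allocator. The following is a minimal sketch, assuming a fixed pool of pages where the most recently freed page is handed out first; `ToyDeviceMemory` is hypothetical and models page reuse only, not the behavior of any real driver.

```python
class ToyDeviceMemory:
    """Fixed pool of pages; models page reuse, not a real GPU driver."""

    def __init__(self, pages=4, page_bytes=16, scrub_on_free=False):
        self.page_bytes = page_bytes
        self.pages = [bytearray(page_bytes) for _ in range(pages)]
        self.free_list = list(range(pages))
        self.scrub_on_free = scrub_on_free

    def malloc(self):
        """Hand out the page at the front of the free list, contents untouched."""
        return self.free_list.pop(0)

    def free(self, page):
        """Return a page to the pool, zeroing it first if scrubbing is enabled."""
        if self.scrub_on_free:
            self.pages[page][:] = bytes(self.page_bytes)
        self.free_list.insert(0, page)  # most recently freed page is reused first


def residue_after_reuse(scrub_on_free):
    """Tenant A writes and frees a page; return what tenant B then reads."""
    mem = ToyDeviceMemory(scrub_on_free=scrub_on_free)
    a = mem.malloc()
    mem.pages[a][:] = b"\xde\xad\xbe\xef" * 4   # tenant A's sensitive data
    mem.free(a)
    b = mem.malloc()                             # tenant B gets the same page
    return bytes(mem.pages[b])


print(residue_after_reuse(scrub_on_free=False).hex())
print(residue_after_reuse(scrub_on_free=True).hex())
```

Scrubbing at free time is placed inside the allocator here because that is the only point the platform controls; relying on each tenant to clear its own buffers leaves the protection in the hands of the party with the least incentive to provide it.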

Practical guidance for platform architects

The choice of GPU isolation mechanism should be driven by your threat model.

Threat model 1: Internal platform (trusted tenants)

Tenants are internal teams within the same organization. The threat is accidental interference (noisy neighbors, memory leaks), not adversarial attacks.

Recommendation: Time-slicing or MPS with resource limits. MIG for workloads with strict SLA requirements.

┌──────────────────────────────────────────────┐
│           Internal GPU Platform              │
│                                              │
│  Kubernetes + NVIDIA GPU Operator            │
│  Time-slicing: dev/test workloads            │
│  MIG: production inference (SLA-bound)       │
│  Dedicated GPUs: training jobs               │
│                                              │
│  Key controls:                               │
│  • Resource quotas per namespace             │
│  • Priority classes for preemption           │
│  • Monitoring: DCGM + Prometheus             │
│  • No untrusted code execution               │
└──────────────────────────────────────────────┘

Threat model 2: Multi-tenant cloud (untrusted tenants)

Tenants run arbitrary code and must be assumed adversarial. This is the neocloud/GPU-as-a-service model.

Recommendation: MIG at minimum. vGPU + MIG for strongest isolation. Dedicated nodes for training workloads that require NVLink.

┌──────────────────────────────────────────────┐
│         Multi-Tenant GPU Cloud               │
│                                              │
│  Per-tenant Kubernetes namespaces            │
│  MIG instances: inference serving            │
│  Dedicated nodes: training (NVLink needed)   │
│  vGPU + MIG: highest isolation tier          │
│                                              │
│  Key controls:                               │
│  • MIG-only sharing (no time-slicing)        │
│  • Network policies between tenant namespaces│
│  • GPU driver pinned to tested version       │
│  • NVML metrics restricted per tenant        │
│  • Memory scrubbing on instance teardown     │
│  • Regular GPU firmware updates              │
│  • Audit logging of MIG reconfigurations     │
└──────────────────────────────────────────────┘

Threat model 3: Regulated environment (compliance-driven)

Compliance frameworks (FedRAMP, SOC 2, HIPAA) may require documented isolation boundaries. GPU sharing mechanisms are not yet well-covered by most compliance frameworks, which creates a documentation burden on the platform provider.

Recommendation: Dedicated GPUs or vGPU + MIG with documented security controls. Prepare for auditors who don’t understand GPU isolation boundaries: have architecture diagrams and testing evidence ready.

Tip

Document what MIG does and does not isolate. Compliance auditors will ask “are tenant workloads isolated?” The answer for MIG is nuanced: compute and memory are hardware-isolated, but PCIe bandwidth, power, and thermal domains are shared. Document this explicitly in your System Security Plan rather than claiming “full isolation”; auditors who discover shared resources you didn’t disclose will question everything else.
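
The three recommendations can be condensed into a small policy function for placement tooling. The following is a minimal sketch; `recommend` and its category labels are hypothetical, and routing every NVLink-dependent workload to dedicated nodes regardless of trust level is a simplifying assumption drawn from the training guidance above.

```python
def recommend(tenants, needs_nvlink=False):
    """Map a threat model to the sharing mechanism suggested in this section.

    tenants: "trusted", "untrusted", or "regulated".
    needs_nvlink: True for multi-GPU training that depends on NVLink.
    """
    if tenants not in {"trusted", "untrusted", "regulated"}:
        raise ValueError(f"unknown tenant category: {tenants!r}")
    if needs_nvlink:
        # Assumption: NVLink-dependent workloads always get whole nodes.
        return "dedicated nodes"
    if tenants == "trusted":
        return "time-slicing or MPS with resource limits; MIG for SLA-bound workloads"
    if tenants == "untrusted":
        return "MIG at minimum; vGPU + MIG for the strongest isolation"
    return "dedicated GPUs, or vGPU + MIG with documented controls"


for category in ("trusted", "untrusted", "regulated"):
    print(f"{category}: {recommend(category)}")
```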

Where to go from here

GPU isolation is one piece of a larger multi-tenant infrastructure security story. The interconnect topology between GPUs determines what sharing is even possible: a single NVSwitch-connected baseboard cannot provide NVLink access to MIG instances across tenants, which means training workloads requiring multi-GPU communication must be dedicated to whole nodes. Explore the GPU Topology Visualizer to understand how NVLink and NVSwitch affect isolation boundaries.

On the Kubernetes side, GPU isolation mechanisms must integrate with the broader cluster security model: network policies between tenant namespaces, pod security standards preventing privileged containers from accessing raw GPU devices, and RBAC controls on MIG reconfiguration. The Container Escape tutorial covers the Linux namespace and capability boundaries that underpin container isolation: the same principles apply to GPU workload containers.

For inference-specific isolation, emerging approaches like confidential computing on GPUs (NVIDIA H100 Confidential Computing) and TEE-backed GPU enclaves promise hardware-enforced isolation that extends into the GPU memory space. These are early-stage and carry significant performance overhead, but they represent the trajectory of GPU security for truly zero-trust multi-tenancy.
