GPU Topology Visualizer

Explore PCIe, NVLink, and NVSwitch GPU interconnect topologies and see how they affect collective communication performance.

Interactive infrastructure tool

Hover over a GPU to see its connections, and animate the all-reduce to watch data flow between GPUs.

Builder notes

GPU topology is one of the largest determinants of distributed training performance. The interconnect between GPUs dictates how fast gradients can be synchronized during all-reduce and how efficiently tensor-parallel layers can communicate. This tool visualizes three common topologies found in modern GPU servers and simulates ring all-reduce on each to show the performance impact.

Why it matters

  • PCIe-only: ~32 GB/s per direction, common in consumer and entry-level servers
  • NVLink bridges: ~450 GB/s per direction within bridged pairs, PCIe between pairs
  • NVSwitch: ~450 GB/s per direction between any two GPUs, used in DGX/HGX systems
  • Topology awareness is required for optimal NCCL configuration

Topology

Select a GPU interconnect topology to visualize.

GPUs communicate through the PCIe bus via the CPU root complex. Each GPU-to-GPU transfer must traverse the PCIe switch or root complex, creating a bottleneck for collective operations.

Legend: PCIe, NVLink, NVSwitch, active transfer

Ring All-Reduce Simulation

Simulate a ring all-reduce collective across the current topology.

Payload size: 1.0 GB

Bottleneck Analysis

Performance characteristics and multi-tenant security implications of the current topology.

Topology Comparison

Ring all-reduce performance across all three topologies with a 1 GB payload.

Understanding GPU Interconnects

PCIe Gen5

The baseline interconnect. A PCIe x16 link provides ~32 GB/s per direction at Gen4 and ~64 GB/s per direction at Gen5. GPU-to-GPU transfers route through the CPU root complex or a PCIe switch, which adds latency and forces transfers to share bandwidth. This is adequate for inference serving but a bottleneck for large-scale training.

NVLink

NVIDIA's proprietary high-speed GPU interconnect. NVLink 4.0 (H100) provides up to 900 GB/s of bidirectional bandwidth per GPU (450 GB/s per direction). In a bridged configuration, NVLink bridges connect pairs of adjacent GPUs, creating fast local domains; traffic between pairs falls back to PCIe.
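To make the "fast pairs, slow cross-pair traffic" point concrete, here is a minimal Python sketch of a bridged topology. It assumes GPUs are bridged as (0,1), (2,3), and so on, and reuses the approximate per-direction figures quoted in this article. The pairing rule and all function names are illustrative assumptions, not part of any NVIDIA API.

```python
# Approximate per-direction bandwidths in GB/s, taken from the figures in this article.
NVLINK_GBPS = 450.0
PCIE_GBPS = 32.0


def link_bandwidth(a: int, b: int) -> float:
    """Bandwidth between GPUs a and b, assuming GPUs (0,1), (2,3), ... are bridged."""
    if a == b:
        raise ValueError("a GPU has no link to itself")
    same_pair = a // 2 == b // 2  # hypothetical pairing rule, for illustration only
    return NVLINK_GBPS if same_pair else PCIE_GBPS


def ring_links(n_gpus: int) -> list[tuple[int, int, float]]:
    """Return (src, dst, bandwidth) for each hop of the ring 0 -> 1 -> ... -> 0."""
    hops = []
    for src in range(n_gpus):
        dst = (src + 1) % n_gpus
        hops.append((src, dst, link_bandwidth(src, dst)))
    return hops


def ring_bottleneck(n_gpus: int) -> float:
    """The slowest hop limits the throughput of the whole ring."""
    return min(bw for _, _, bw in ring_links(n_gpus))
```

Under these assumptions, an 8-GPU ring alternates between NVLink and PCIe hops, so its bottleneck is the PCIe figure even though half of the hops are fast. Only a ring confined to a single bridged pair runs at NVLink speed.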

NVSwitch

A dedicated switch chip that connects all GPUs on a baseboard through NVLink. NVSwitch 3.0 (H100) gives each GPU 900 GB/s of bidirectional bandwidth into the fabric, enabling full bisection bandwidth: every GPU can talk to every other GPU at maximum speed simultaneously. This is the topology in DGX H100 (8 GPUs, 4 NVSwitches).

All-Reduce

One of the most important collective operations in distributed training. All-reduce combines (typically sums) gradient tensors across all GPUs so that every GPU ends up with the same result. Ring all-reduce is bandwidth-optimal: each GPU sends and receives 2(N-1)/N times the payload size, where N is the GPU count, so per-GPU traffic approaches twice the payload as N grows. The slowest link in the ring determines overall throughput.
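The 2(N-1)/N factor can be turned into a rough time estimate. The sketch below is a bandwidth-only model: it ignores per-step latency, protocol overhead, and overlap with compute, so treat its results as a lower bound and not as a prediction. The function name and parameters are illustrative.

```python
def ring_allreduce_seconds(payload_gb: float, n_gpus: int, bottleneck_gbps: float) -> float:
    """Estimate ring all-reduce time from the slowest link in the ring.

    payload_gb      -- size of the tensor being reduced, in GB
    n_gpus          -- number of GPUs in the ring (N)
    bottleneck_gbps -- per-direction bandwidth of the slowest link, in GB/s
    """
    if bottleneck_gbps <= 0:
        raise ValueError("bandwidth must be positive")
    if n_gpus < 2:
        return 0.0  # nothing to exchange with a single GPU
    # Each GPU sends (and receives) 2(N-1)/N times the payload:
    # (N-1)/N during reduce-scatter and (N-1)/N during all-gather.
    traffic_gb = 2 * (n_gpus - 1) / n_gpus * payload_gb
    return traffic_gb / bottleneck_gbps


if __name__ == "__main__":
    # 1 GB payload on 8 GPUs, using the approximate figures from this article.
    for name, gbps in [("PCIe", 32.0), ("NVLink/NVSwitch", 450.0)]:
        ms = ring_allreduce_seconds(1.0, 8, gbps) * 1000
        print(f"{name}: {ms:.1f} ms")
```

With 8 GPUs and a 1 GB payload, each GPU moves 1.75 GB, which gives about 54.7 ms at 32 GB/s and about 3.9 ms at 450 GB/s in this simplified model.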

Infrastructure Decisions

NCCL Tuning

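NCCL (NVIDIA Collective Communications Library) detects the hardware topology itself at initialization; running nvidia-smi topo -m shows a comparable link matrix for manual inspection. NCCL then selects algorithms and transports accordingly. On NVSwitch systems it can run its ring and tree collectives entirely over NVLink. On PCIe-only systems it falls back to PCIe peer-to-peer transfers or staging through host memory. Incorrect topology detection, for example inside containers or virtual machines that hide the PCIe hierarchy, is a common cause of unexpectedly slow training.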
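One way to check what NCCL actually detected is to turn on its debug logging before launching a job. The sketch below builds an environment dictionary using the NCCL_DEBUG, NCCL_DEBUG_SUBSYS, and NCCL_TOPO_DUMP_FILE environment variables. The helper name, the dump path, and the launch command in the closing comment are hypothetical.

```python
import os


def nccl_debug_env(base_env=None, dump_path=None):
    """Return a copy of the environment with NCCL topology logging enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"               # print initialization and graph details
    env["NCCL_DEBUG_SUBSYS"] = "INIT,GRAPH"  # limit output to topology-related subsystems
    if dump_path is not None:
        # Ask NCCL to write the detected topology as XML to this path.
        env["NCCL_TOPO_DUMP_FILE"] = dump_path
    return env


# Example use with a hypothetical training command:
#   import subprocess
#   subprocess.run(["python", "train.py"], env=nccl_debug_env(dump_path="topo.xml"))
```

The resulting log lines show which links NCCL found between GPUs, which can then be compared against the matrix printed by nvidia-smi topo -m.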

Topology-Aware Scheduling

Kubernetes GPU scheduling should be topology-aware. On NVLink bridge systems, co-scheduling tensor-parallel ranks onto a bridged GPU pair gives roughly 14x more interconnect bandwidth than a placement that spans pairs (about 450 GB/s over NVLink versus about 32 GB/s over PCIe, using the figures above). The NVIDIA GPU Operator and topology-aware scheduling plugins expose this information to the scheduler.
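The placement rule can be sketched in a few lines of Python. This is a toy illustration of the idea, not how any Kubernetes plugin is implemented. The function name, the input format, and the fallback policy are all assumptions.

```python
def pick_tensor_parallel_pair(free_gpus, bridged_pairs):
    """Choose two free GPUs for a 2-way tensor-parallel job.

    Returns ((gpu_a, gpu_b), uses_nvlink), or (None, False) if fewer than
    two GPUs are free. Prefers a pair that shares an NVLink bridge.
    """
    free = set(free_gpus)
    for a, b in bridged_pairs:
        if a in free and b in free:
            return (a, b), True  # both ends of a bridge are free
    ordered = sorted(free)
    if len(ordered) < 2:
        return None, False
    # No intact bridged pair is free: fall back to a cross-pair (PCIe) placement.
    return (ordered[0], ordered[1]), False


# Hypothetical 8-GPU server with four NVLink bridges.
BRIDGES = [(0, 1), (2, 3), (4, 5), (6, 7)]
placement, uses_nvlink = pick_tensor_parallel_pair({0, 2, 3, 5}, BRIDGES)
```

Here GPUs 2 and 3 are selected because they are the only free pair that shares a bridge. A scheduler that picked the first two free GPUs (0 and 2) would leave the job communicating over PCIe.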

Multi-Tenant Isolation

On NVSwitch systems, all 8 GPUs share the NVLink fabric. A workload on GPU 0 could potentially infer the memory access patterns of a workload on GPU 7 through timing side channels on the shared switch. Multi-tenant GPU clouds should therefore either dedicate entire nodes to a single tenant or use MIG partitioning, under which GPU-to-GPU peer-to-peer traffic, including NVLink, is not available to the partitions. Read the full tutorial on GPU isolation.

Keyboard shortcuts

  • 1: PCIe topology
  • 2: NVLink bridges
  • 3: Full NVSwitch
  • Space: Animate / stop all-reduce

Security model

Everything runs in your browser. No GPU hardware is accessed and no data is sent to any server. The topology data and the all-reduce simulation are computed locally in JavaScript.
