Lab
GPU Topology Visualizer
Explore PCIe, NVLink, and NVSwitch GPU interconnect topologies and see how they affect collective communication performance.
Interactive infrastructure tool
Builder notes
GPU topology is one of the largest determinants of distributed training performance. The interconnect between GPUs sets how fast gradients can be synchronized during all-reduce and how efficiently tensor-parallel layers can communicate. This tool visualizes three common topologies found in modern GPU servers and simulates ring all-reduce to show the performance impact.
Why it matters
- PCIe-only: ~32 GB/s per direction (PCIe Gen4 x16), common in consumer and entry-level servers
- NVLink bridges: ~450 GB/s per direction within bridged pairs, PCIe between pairs (mixed topology)
- NVSwitch mesh: ~450 GB/s per direction between any two GPUs, used in DGX/HGX systems
- Topology awareness is required for optimal NCCL configuration
Topology
Select a GPU interconnect topology to visualize.
GPUs communicate over the PCIe bus. Every GPU-to-GPU transfer must traverse a PCIe switch or the CPU root complex, which all GPUs share, creating a bottleneck for collective operations.
Ring All-Reduce Simulation
Simulate a ring all-reduce collective across the current topology.
Bottleneck Analysis
Performance characteristics and multi-tenant security implications of the current topology.
Topology Comparison
Ring all-reduce performance across all three topologies with a 1 GB payload.
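The comparison can be approximated with a bandwidth-only cost model. The sketch below is a rough estimate, not the tool's actual simulation code: the function names are hypothetical, the bandwidths are the per-direction figures quoted on this page, and the bridged layout assumes GPUs (0,1), (2,3), and so on are paired and visited in index order. Latency and protocol overhead are ignored.

```python
# Assumed per-direction link bandwidths in GB/s, taken from the figures on this page.
PCIE_GBPS = 32.0
NVLINK_GBPS = 450.0


def ring_link_bandwidths(topology: str, num_gpus: int = 8) -> list[float]:
    """Return the bandwidth of each hop i -> (i + 1) % num_gpus in a simple index-order ring."""
    if topology == "pcie":
        return [PCIE_GBPS] * num_gpus
    if topology == "nvswitch":
        return [NVLINK_GBPS] * num_gpus
    if topology == "nvlink_bridge":
        # Assumption: GPUs (0,1), (2,3), ... are bridged; hops between pairs fall back to PCIe.
        return [NVLINK_GBPS if i % 2 == 0 else PCIE_GBPS for i in range(num_gpus)]
    raise ValueError(f"unknown topology: {topology}")


def ring_allreduce_seconds(payload_gb: float, link_gbps: list[float]) -> float:
    """Bandwidth-only estimate: each GPU moves 2(N-1)/N of the payload at the slowest hop's rate."""
    n = len(link_gbps)
    return 2 * (n - 1) / n * payload_gb / min(link_gbps)


for name in ("pcie", "nvlink_bridge", "nvswitch"):
    t = ring_allreduce_seconds(1.0, ring_link_bandwidths(name))
    print(f"{name:14s} {t * 1000:7.2f} ms")
```

Under these assumptions a 1 GB all-reduce takes about 55 ms on PCIe and about 4 ms on NVSwitch. The bridged ring lands at the PCIe figure because half of its hops cross PCIe, which is why bridged systems benefit most when a collective stays inside one pair.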
Understanding GPU Interconnects
PCIe Gen5
The baseline interconnect. PCIe Gen5 x16 provides ~64 GB/s per direction; the ~32 GB/s figure used elsewhere on this page corresponds to Gen4 x16. GPU-to-GPU transfers route through the CPU root complex or a PCIe switch, which adds latency and shares bandwidth with other devices. Adequate for inference serving but a bottleneck for large-scale training.
NVLink
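NVIDIA's proprietary high-speed GPU interconnect. NVLink 4.0 (H100) provides up to 900 GB/s of bidirectional bandwidth per GPU (450 GB/s per direction), aggregated across all of the GPU's links. NVLink bridges connect pairs of adjacent GPUs, creating fast local domains with PCIe fallback for cross-pair traffic. Bridged PCIe cards can expose fewer links than SXM modules, so pair bandwidth may be lower than this peak.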
NVSwitch
A dedicated switch chip that connects all GPUs on a baseboard through NVLink. With third-generation NVSwitch (H100), each GPU has 900 GB/s of bidirectional NVLink bandwidth into the switch fabric, enabling full bisection bandwidth: every GPU can talk to every other at full NVLink speed simultaneously. This is the topology in DGX H100 (8 GPUs, 4 NVSwitches).
All-Reduce
The most critical collective operation in distributed training. All-reduce sums gradient tensors across all GPUs so that each ends up with the same result. Ring all-reduce is bandwidth-optimal: each GPU sends and receives 2(N-1)/N times the buffer size, where N is the GPU count, so per-GPU traffic stays nearly constant as N grows. Because every step waits on every hop, the slowest link in the ring determines overall throughput.
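The two phases of ring all-reduce, reduce-scatter followed by all-gather, can be sketched in plain Python. This is an illustrative simulation on lists, not NCCL's implementation and not necessarily how this tool animates the ring. The function name is hypothetical, and the buffer length is assumed to divide evenly by the GPU count.

```python
def ring_allreduce(buffers: list[list[float]]) -> tuple[list[list[float]], int]:
    """Simulate a summing ring all-reduce over per-GPU buffers.

    Returns the reduced buffers and the number of elements each GPU sent.
    """
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0, "illustrative sketch: length must divide evenly by GPU count"
    chunk = size // n
    data = [list(b) for b in buffers]  # copy so the inputs are not modified
    sent_per_gpu = 0

    def span(c: int) -> slice:
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: after N-1 steps GPU r holds the full sum of chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing chunks first so all sends within a step happen "at once".
        outgoing = [((r - step) % n, data[r][span((r - step) % n)]) for r in range(n)]
        for r, (c, payload) in enumerate(outgoing):
            dst = (r + 1) % n
            data[dst][span(c)] = [a + b for a, b in zip(data[dst][span(c)], payload)]
        sent_per_gpu += chunk

    # Phase 2, all-gather: pass the finished chunks around the ring for another N-1 steps.
    for step in range(n - 1):
        outgoing = [((r + 1 - step) % n, data[r][span((r + 1 - step) % n)]) for r in range(n)]
        for r, (c, payload) in enumerate(outgoing):
            data[(r + 1) % n][span(c)] = payload
        sent_per_gpu += chunk

    return data, sent_per_gpu


gpus = [[float(rank)] * 8 for rank in range(4)]  # GPU r starts with eight copies of r
reduced, sent = ring_allreduce(gpus)
print(reduced[0])  # each slot holds the sum over all four GPUs
print(sent)        # elements sent per GPU: 2 * (N - 1) * (size / N)
```

Each GPU sends one chunk of size/N elements per step for 2(N-1) steps, which is where the 2(N-1)/N factor comes from.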
Infrastructure Decisions
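NCCL (NVIDIA Collective Communications Library) detects the interconnect topology on its own at initialization; running nvidia-smi topo -m prints the same link matrix for manual inspection. NCCL then chooses between its ring and tree algorithms based on the detected topology and the message size, and on NVSwitch systems it can use the full NVLink fabric. On PCIe-only systems all traffic crosses PCIe switches or the CPU root complex. Incorrect topology detection (for example in virtual machines that hide the PCIe hierarchy) is a common cause of unexpectedly slow training.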
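When training is slower than expected, a first check is whether the link matrix from nvidia-smi topo -m matches the hardware you think you have. The sketch below parses that matrix into a dictionary. The sample text is made up for illustration, not captured from a real machine; real output has more columns and a legend. The parser assumes the GPU columns come first and are numbered in order. The function names are hypothetical.

```python
import re

# Illustrative sample only; not captured from real hardware.
SAMPLE = """\
        GPU0    GPU1    GPU2    GPU3    CPU Affinity
GPU0     X      NV12    SYS     SYS     0-31
GPU1    NV12     X      SYS     SYS     0-31
GPU2    SYS     SYS      X      NV12    32-63
GPU3    SYS     SYS     NV12     X      32-63
"""

GPU_LABEL = re.compile(r"^GPU(\d+)$")


def parse_topo_matrix(text: str) -> dict[tuple[int, int], str]:
    """Map each (src_gpu, dst_gpu) pair to its link label, such as 'NV12', 'PIX', or 'SYS'."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    # Assumption: GPU columns are listed first, in index order.
    num_gpus = sum(1 for tok in header if GPU_LABEL.match(tok))
    links: dict[tuple[int, int], str] = {}
    for tokens in rows:
        m = GPU_LABEL.match(tokens[0])
        if not m:
            continue  # skip NIC rows and legend lines
        src = int(m.group(1))
        for dst, label in enumerate(tokens[1:1 + num_gpus]):
            if src != dst:
                links[(src, dst)] = label
    return links


def nvlink_pairs(links: dict[tuple[int, int], str]) -> list[tuple[int, int]]:
    """Return GPU pairs whose label starts with 'NV', meaning they are joined by NVLink."""
    return sorted((a, b) for (a, b), label in links.items() if a < b and label.startswith("NV"))


print(nvlink_pairs(parse_topo_matrix(SAMPLE)))
```

On a real machine the text could come from subprocess.run(["nvidia-smi", "topo", "-m"], capture_output=True, text=True).stdout. Setting NCCL_DEBUG=INFO in the training job's environment makes NCCL log what it detected, which can be compared against this matrix.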
Kubernetes GPU scheduling should be topology-aware. On NVLink bridge systems, co-scheduling tensor-parallel ranks onto a bridged GPU pair gives roughly 14x the interconnect bandwidth (~450 GB/s versus ~32 GB/s) of a placement that splits them across PCIe. The NVIDIA GPU Operator and topology-aware scheduling plugins can expose this information to the scheduler.
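The placement preference can be illustrated with a small helper. This is a toy sketch of the idea only: it is not a Kubernetes API or part of any NVIDIA component, the function name is hypothetical, and the bridged pairs are assumed to be known in advance.

```python
from typing import Optional


def place_tp_pair(
    free_gpus: set[int],
    bridged_pairs: list[tuple[int, int]],
) -> Optional[tuple[tuple[int, int], str]]:
    """Pick two free GPUs for a 2-way tensor-parallel group and report the link they share."""
    # Prefer a bridged pair whose GPUs are both free.
    for a, b in bridged_pairs:
        if a in free_gpus and b in free_gpus:
            return (a, b), "nvlink"
    # Otherwise fall back to any two free GPUs, which will communicate over PCIe.
    candidates = sorted(free_gpus)
    if len(candidates) < 2:
        return None
    return (candidates[0], candidates[1]), "pcie"


bridged = [(0, 1), (2, 3), (4, 5), (6, 7)]  # assumed pairing on an 8-GPU bridged server
print(place_tp_pair({1, 2, 3, 6}, bridged))  # a whole pair is free, so it is chosen
print(place_tp_pair({1, 2, 6}, bridged))     # no whole pair is free, so PCIe is the fallback
```

A placement that ignores the pairing could pick GPUs 1 and 2 in the first case even though the bridged pair (2, 3) is free, leaving the group on the PCIe path.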
On NVSwitch systems, all 8 GPUs share the NVLink fabric. A workload on GPU 0 can potentially infer memory access patterns on GPU 7 through timing side channels on the shared switch. Multi-tenant GPU clouds must either dedicate entire nodes to each tenant or use MIG partitioning, under which GPU-to-GPU peer-to-peer transfers (over NVLink or PCIe) are not available to MIG instances. Read the full tutorial on GPU isolation.
Keyboard shortcuts
- 1: PCIe topology
- 2: NVLink bridges
- 3: Full NVSwitch
- Space: Animate / stop all-reduce
Security model
Everything runs in your browser. No GPU hardware is accessed and no data is sent to any server. The topology data and the all-reduce simulation are computed locally in JavaScript.