Virtualizing Nvidia HGX B200 GPUs with Open Source
NVIDIA's powerful HGX B200 GPUs present unique virtualization challenges due to their integrated NVLink/NVSwitch fabric. This deep dive from Ubicloud details the open-source process of enabling multi-tenant GPU VMs, navigating hurdles like PCI topology quirks and the GPUs' colossal 256 GB BARs. It's a must-read for anyone taming these AI behemoths outside of proprietary cloud ecosystems.
The Lowdown
Ubicloud provides a comprehensive guide to virtualizing NVIDIA's advanced HGX B200 GPUs using entirely open-source tools. Unlike previous generations, the B200 has a tightly integrated NVLink and NVSwitch fabric that makes it particularly challenging to virtualize efficiently and securely for multi-tenant environments. The post documents Ubicloud's journey and solutions, offering a path for others to replicate high-performance GPU VMs.
- Hardware Complexity: The B200 HGX platform uses SXM modules and a high-bandwidth, all-to-all NVLink/NVSwitch fabric, making it excellent for performance but difficult to virtualize.
- Virtualization Models Explored: The article discusses Full Passthrough (all 8 GPUs or isolated 1-GPU), Shared NVSwitch Multitenancy (partitioned GPU groups with internal NVLink), and vGPU-based Multitenancy (fractional GPU sharing).
- Ubicloud's Choice: Shared NVSwitch Multitenancy was chosen for its flexibility in GPU assignment (1, 2, 4, or 8 GPUs) while preserving full NVLink bandwidth within partitions, ideal for high-performance ML workloads.
- Host Preparation: Steps include detaching GPUs from the NVIDIA driver, binding them to vfio-pci, enabling the IOMMU via GRUB kernel parameters, preloading VFIO modules, and blacklisting host NVIDIA drivers for permanent passthrough (a rebinding sketch follows the list).
- Driver Alignment: A critical requirement is an exact version match between the VM's nvidia-open driver and the host's Fabric Manager, since the host manages the NVSwitch fabric on the guest's behalf (see the version-check sketch below).
- PCI Topology Trap: Initial hypervisor attempts produced a flat PCI topology inside the VM, causing CUDA initialization failures; the fix was to have QEMU construct a multi-level PCIe hierarchy with pcie-root-port devices (illustrated in the QEMU sketch below).
- Large-BAR Stall Problem: The B200's massive 256 GB Base Address Registers (BARs) caused long VM boot stalls; the remedies are upgrading to QEMU 10.1+ or using the x-no-mmap=true option to avoid direct BAR mapping (also covered in the QEMU sketch below).
- Fabric Manager Integration: The host's Fabric Manager runs in Shared NVSwitch Multitenancy mode (FABRIC_MODE=1) and manages predefined GPU partitions via an API (fmpm), ensuring isolated, high-bandwidth GPU clusters for VMs (config sketch below).
- GPU ID Mapping: Crucially, Fabric Manager identifies and partitions GPUs by their "Module IDs" (reported by nvidia-smi -q), which must be mapped correctly to PCI device addresses for passthrough (see the mapping sketch below).
- Open Source Commitment: Ubicloud emphasizes that all described methods and underlying management components are implemented and available in the open-source domain.

This detailed guide demystifies the complex process of virtualizing NVIDIA HGX B200 GPUs, showcasing how meticulous configuration across hardware, drivers, and hypervisors can yield a robust, flexible, and high-performance multi-tenant AI infrastructure using purely open-source components. Ubicloud's work effectively brings enterprise-grade GPU virtualization into the open-source ecosystem.
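To make the host-preparation step more concrete, here is a minimal Python sketch of rebinding a single GPU from the host's nvidia driver to vfio-pci through sysfs. The PCI address is a hypothetical placeholder, and the IOMMU/GRUB and driver-blacklisting steps are assumed to be handled separately through the distribution's usual mechanisms.

```python
"""Rebind one GPU from the host nvidia driver to vfio-pci (sketch).

Assumes root privileges and that the vfio-pci module is already loaded.
The PCI address below is a placeholder; list real ones with `lspci -d 10de:`.
"""
from pathlib import Path

PCI_ADDR = "0000:17:00.0"  # hypothetical B200 PCI address

dev = Path("/sys/bus/pci/devices") / PCI_ADDR

# 1. Detach the device from whatever driver currently owns it (e.g. nvidia).
unbind = dev / "driver" / "unbind"
if unbind.exists():
    unbind.write_text(PCI_ADDR)

# 2. Tell the PCI core that vfio-pci should claim this device on the next probe.
(dev / "driver_override").write_text("vfio-pci")

# 3. Trigger a re-probe so vfio-pci actually binds it.
Path("/sys/bus/pci/drivers_probe").write_text(PCI_ADDR)

print(f"{PCI_ADDR} now bound to", (dev / "driver").resolve().name)
```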
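For the driver-alignment requirement, a small guard can catch mismatches before workloads fail. This is a sketch, not Ubicloud's tooling: it reads the guest driver version via nvidia-smi and compares it against a Fabric Manager version string supplied by the operator, since how that version is obtained depends on the host's packaging.

```python
"""Sanity-check that the guest nvidia-open driver matches the host Fabric Manager.

Sketch only: pass the host Fabric Manager version in explicitly; how you obtain
it depends on how Fabric Manager was packaged on the host.
"""
import subprocess
import sys


def guest_driver_version() -> str:
    # Version reported by the nvidia-open driver inside the VM.
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.splitlines()[0].strip()


def check(host_fabric_manager_version: str) -> None:
    drv = guest_driver_version()
    if drv != host_fabric_manager_version:
        sys.exit(f"Version mismatch: guest driver {drv} vs "
                 f"host Fabric Manager {host_fabric_manager_version}")
    print(f"OK: driver and Fabric Manager both at {drv}")


if __name__ == "__main__":
    check(sys.argv[1])  # e.g. python3 check.py 580.65.06 (hypothetical version)
```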
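The PCI-topology and large-BAR items both come down to how the passed-through devices are described to QEMU. The sketch below builds an illustrative command line (machine options, sizing, and PCI addresses are placeholders, not Ubicloud's exact configuration): each GPU hangs off its own pcie-root-port so the guest sees a multi-level PCIe hierarchy, and on QEMU versions before 10.1 the experimental x-no-mmap=true property avoids mapping the 256 GB BARs directly.

```python
"""Build a QEMU command line with a proper PCIe hierarchy for GPU passthrough.

Sketch only: machine options, memory sizing, and PCI addresses are illustrative
placeholders.
"""
import shlex

GPU_ADDRS = ["0000:17:00.0", "0000:2a:00.0"]  # hypothetical 2-GPU partition

args = [
    "qemu-system-x86_64",
    "-machine", "q35,accel=kvm",  # q35 provides a PCI Express root complex
    "-cpu", "host",
    "-m", "512G",
]

for i, addr in enumerate(GPU_ADDRS):
    # One pcie-root-port per GPU, so the guest sees a multi-level PCIe
    # hierarchy instead of the flat topology that breaks CUDA initialization.
    args += ["-device",
             f"pcie-root-port,id=rp{i},bus=pcie.0,chassis={i + 1},slot={i}"]
    # On QEMU < 10.1, x-no-mmap=true avoids mapping the 256 GB BARs directly,
    # working around the long boot stall described above.
    args += ["-device", f"vfio-pci,host={addr},bus=rp{i},x-no-mmap=true"]

print(shlex.join(args))
```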
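Switching the host's Fabric Manager into Shared NVSwitch multitenancy mode is a one-line config change. The sketch assumes the common default config path /usr/share/nvidia/nvswitch/fabricmanager.cfg (adjust for your packaging) and leaves partition activation through the Fabric Manager partition API out of scope.

```python
"""Set FABRIC_MODE=1 (Shared NVSwitch multitenancy) in the Fabric Manager config.

Sketch only: the config path is the common default; restart the Fabric Manager
service afterwards for the change to take effect.
"""
import re
from pathlib import Path

CFG = Path("/usr/share/nvidia/nvswitch/fabricmanager.cfg")

text = CFG.read_text()
if re.search(r"^FABRIC_MODE=", text, flags=re.MULTILINE):
    # Rewrite an existing FABRIC_MODE line in place.
    text = re.sub(r"^FABRIC_MODE=.*$", "FABRIC_MODE=1", text, flags=re.MULTILINE)
else:
    # Append the setting if the config does not define it yet.
    text += "\nFABRIC_MODE=1\n"
CFG.write_text(text)
print("FABRIC_MODE set to 1 (Shared NVSwitch multitenancy)")
```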
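Finally, because Fabric Manager partitions are defined in terms of GPU Module IDs while passthrough works on PCI addresses, the two need to be correlated. The sketch below builds that mapping by parsing the plain-text output of nvidia-smi -q; the exact field labels are an assumption and may differ across driver versions.

```python
"""Map GPU Module IDs (used by Fabric Manager partitions) to PCI addresses.

Sketch only: parses `nvidia-smi -q` text output, where each GPU section starts
with a line like `GPU 00000000:17:00.0` and contains a `Module ID` field on
HGX baseboards.
"""
import re
import subprocess

out = subprocess.run(["nvidia-smi", "-q"], capture_output=True, text=True,
                     check=True).stdout

module_to_pci: dict[int, str] = {}
current_pci = None
for line in out.splitlines():
    # Section header: the GPU's PCI bus ID.
    m = re.match(r"^GPU\s+([0-9A-Fa-f:.]+)\s*$", line)
    if m:
        current_pci = m.group(1)
        continue
    # Module ID within the current GPU section.
    m = re.match(r"^\s*Module ID\s*:\s*(\d+)", line)
    if m and current_pci:
        module_to_pci[int(m.group(1))] = current_pci

for module_id, pci in sorted(module_to_pci.items()):
    print(f"Module {module_id} -> {pci}")
```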