Virtualizing Nvidia HGX B200 GPUs with Open Source
NVIDIA's powerful HGX B200 GPUs present unique virtualization challenges due to their integrated NVLink/NVSwitch fabric. This deep dive from Ubicloud details the open-source process of enabling multi-tenant GPU VMs, navigating hurdles like PCI topology quirks and the GPUs' colossal 256 GB BARs. It's a must-read for anyone taming these AI behemoths outside of proprietary cloud ecosystems.
The Lowdown
Ubicloud provides a comprehensive guide to virtualizing NVIDIA's advanced HGX B200 GPUs using entirely open-source tools. Unlike previous generations, the B200 has a tightly integrated NVLink and NVSwitch fabric that makes it particularly challenging to virtualize efficiently and securely for multi-tenant environments. The post documents Ubicloud's journey and solutions, offering a path for others to replicate high-performance GPU VMs.
- Hardware Complexity: The B200 HGX platform uses SXM modules and a high-bandwidth, all-to-all NVLink/NVSwitch fabric, making it excellent for performance but difficult to virtualize.
- Virtualization Models Explored: The article discusses Full Passthrough (all 8 GPUs or isolated 1-GPU), Shared NVSwitch Multitenancy (partitioned GPU groups with internal NVLink), and vGPU-based Multitenancy (fractional GPU sharing).
- Ubicloud's Choice: Shared NVSwitch Multitenancy was chosen for its flexibility in GPU assignment (1, 2, 4, or 8 GPUs) while preserving full NVLink bandwidth within partitions, ideal for high-performance ML workloads.
- Host Preparation: Steps include detaching GPUs from the NVIDIA driver, binding them to vfio-pci, enabling the IOMMU via GRUB kernel parameters, preloading VFIO modules, and blacklisting host NVIDIA drivers for permanent passthrough (a rebinding sketch follows the list).
- Driver Alignment: A critical requirement is an exact version match between the VM's nvidia-open driver and the host's Fabric Manager, since the host manages the NVSwitch fabric on the guest's behalf (see the version-check sketch below).
- PCI Topology Trap: Initial hypervisor attempts produced a flat PCI topology inside the VM, causing CUDA initialization failures; the fix was to have QEMU construct a multi-level PCIe hierarchy with pcie-root-port devices (illustrated in the QEMU sketch below).
- Large-BAR Stall Problem: The B200's massive 256 GB Base Address Registers (BARs) caused long VM boot stalls; the remedies are upgrading to QEMU 10.1+ or using the x-no-mmap=true option to avoid direct BAR mapping (also covered in the QEMU sketch below).
- Fabric Manager Integration: The host's Fabric Manager runs in Shared NVSwitch Multitenancy mode (FABRIC_MODE=1) and manages predefined GPU partitions via an API (fmpm), ensuring isolated, high-bandwidth GPU clusters for VMs (config sketch below).
- GPU ID Mapping: Crucially, Fabric Manager identifies and partitions GPUs by their "Module IDs" (reported by nvidia-smi -q), which must be mapped correctly to PCI device addresses for passthrough (see the mapping sketch below).
- Open Source Commitment: Ubicloud emphasizes that all described methods and underlying management components are implemented and available in the open-source domain.

This detailed guide demystifies the complex process of virtualizing NVIDIA HGX B200 GPUs, showcasing how meticulous configuration across hardware, drivers, and hypervisors can yield a robust, flexible, and high-performance multi-tenant AI infrastructure using purely open-source components. Ubicloud's work effectively brings enterprise-grade GPU virtualization into the open-source ecosystem.
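To make the host-preparation step more concrete, here is a minimal Python sketch of rebinding a single GPU from the host's nvidia driver to vfio-pci through sysfs. The PCI address is a hypothetical placeholder, and the IOMMU/GRUB and driver-blacklisting steps are assumed to be handled separately through the distribution's usual mechanisms.

```python
"""Rebind one GPU from the host nvidia driver to vfio-pci (sketch).

Assumes root privileges and that the vfio-pci module is already loaded.
The PCI address below is a placeholder; list real ones with `lspci -d 10de:`.
"""
from pathlib import Path

PCI_ADDR = "0000:17:00.0"  # hypothetical B200 PCI address

dev = Path("/sys/bus/pci/devices") / PCI_ADDR

# 1. Detach the device from whatever driver currently owns it (e.g. nvidia).
unbind = dev / "driver" / "unbind"
if unbind.exists():
    unbind.write_text(PCI_ADDR)

# 2. Tell the PCI core that vfio-pci should claim this device on the next probe.
(dev / "driver_override").write_text("vfio-pci")

# 3. Trigger a re-probe so vfio-pci actually binds it.
Path("/sys/bus/pci/drivers_probe").write_text(PCI_ADDR)

print(f"{PCI_ADDR} now bound to", (dev / "driver").resolve().name)
```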
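For the driver-alignment requirement, a small guard can catch mismatches before workloads fail. This is a sketch, not Ubicloud's tooling: it reads the guest driver version via nvidia-smi and compares it against a Fabric Manager version string supplied by the operator, since how that version is obtained depends on the host's packaging.

```python
"""Sanity-check that the guest nvidia-open driver matches the host Fabric Manager.

Sketch only: pass the host Fabric Manager version in explicitly; how you obtain
it depends on how Fabric Manager was packaged on the host.
"""
import subprocess
import sys


def guest_driver_version() -> str:
    # Version reported by the nvidia-open driver inside the VM.
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.splitlines()[0].strip()


def check(host_fabric_manager_version: str) -> None:
    drv = guest_driver_version()
    if drv != host_fabric_manager_version:
        sys.exit(f"Version mismatch: guest driver {drv} vs "
                 f"host Fabric Manager {host_fabric_manager_version}")
    print(f"OK: driver and Fabric Manager both at {drv}")


if __name__ == "__main__":
    check(sys.argv[1])  # e.g. python3 check.py 580.65.06 (hypothetical version)
```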
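The PCI-topology and large-BAR items both come down to how the passed-through devices are described to QEMU. The sketch below builds an illustrative command line (machine options, sizing, and PCI addresses are placeholders, not Ubicloud's exact configuration): each GPU hangs off its own pcie-root-port so the guest sees a multi-level PCIe hierarchy, and on QEMU versions before 10.1 the experimental x-no-mmap=true property avoids mapping the 256 GB BARs directly.

```python
"""Build a QEMU command line with a proper PCIe hierarchy for GPU passthrough.

Sketch only: machine options, memory sizing, and PCI addresses are illustrative
placeholders.
"""
import shlex

GPU_ADDRS = ["0000:17:00.0", "0000:2a:00.0"]  # hypothetical 2-GPU partition

args = [
    "qemu-system-x86_64",
    "-machine", "q35,accel=kvm",  # q35 provides a PCI Express root complex
    "-cpu", "host",
    "-m", "512G",
]

for i, addr in enumerate(GPU_ADDRS):
    # One pcie-root-port per GPU, so the guest sees a multi-level PCIe
    # hierarchy instead of the flat topology that breaks CUDA initialization.
    args += ["-device",
             f"pcie-root-port,id=rp{i},bus=pcie.0,chassis={i + 1},slot={i}"]
    # On QEMU < 10.1, x-no-mmap=true avoids mapping the 256 GB BARs directly,
    # working around the long boot stall described above.
    args += ["-device", f"vfio-pci,host={addr},bus=rp{i},x-no-mmap=true"]

print(shlex.join(args))
```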
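Switching the host's Fabric Manager into Shared NVSwitch multitenancy mode is a one-line config change. The sketch assumes the common default config path /usr/share/nvidia/nvswitch/fabricmanager.cfg (adjust for your packaging) and leaves partition activation through the Fabric Manager partition API out of scope.

```python
"""Set FABRIC_MODE=1 (Shared NVSwitch multitenancy) in the Fabric Manager config.

Sketch only: the config path is the common default; restart the Fabric Manager
service afterwards for the change to take effect.
"""
import re
from pathlib import Path

CFG = Path("/usr/share/nvidia/nvswitch/fabricmanager.cfg")

text = CFG.read_text()
if re.search(r"^FABRIC_MODE=", text, flags=re.MULTILINE):
    # Rewrite an existing FABRIC_MODE line in place.
    text = re.sub(r"^FABRIC_MODE=.*$", "FABRIC_MODE=1", text, flags=re.MULTILINE)
else:
    # Append the setting if the config does not define it yet.
    text += "\nFABRIC_MODE=1\n"
CFG.write_text(text)
print("FABRIC_MODE set to 1 (Shared NVSwitch multitenancy)")
```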
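Finally, because Fabric Manager partitions are defined in terms of GPU Module IDs while passthrough works on PCI addresses, the two need to be correlated. The sketch below builds that mapping by parsing the plain-text output of nvidia-smi -q; the exact field labels are an assumption and may differ across driver versions.

```python
"""Map GPU Module IDs (used by Fabric Manager partitions) to PCI addresses.

Sketch only: parses `nvidia-smi -q` text output, where each GPU section starts
with a line like `GPU 00000000:17:00.0` and contains a `Module ID` field on
HGX baseboards.
"""
import re
import subprocess

out = subprocess.run(["nvidia-smi", "-q"], capture_output=True, text=True,
                     check=True).stdout

module_to_pci: dict[int, str] = {}
current_pci = None
for line in out.splitlines():
    # Section header: the GPU's PCI bus ID.
    m = re.match(r"^GPU\s+([0-9A-Fa-f:.]+)\s*$", line)
    if m:
        current_pci = m.group(1)
        continue
    # Module ID within the current GPU section.
    m = re.match(r"^\s*Module ID\s*:\s*(\d+)", line)
    if m and current_pci:
        module_to_pci[int(m.group(1))] = current_pci

for module_id, pci in sorted(module_to_pci.items()):
    print(f"Module {module_id} -> {pci}")
```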