HN Today

Virtualizing Nvidia HGX B200 GPUs with Open Source

NVIDIA's powerful HGX B200 GPUs present unique virtualization challenges because of their tightly integrated NVLink/NVSwitch fabric. This deep dive from Ubicloud details the open-source process of enabling multi-tenant GPU VMs, navigating hurdles like guest PCI topology and the GPUs' enormous 256 GB BARs. It's a must-read for anyone taming these AI behemoths outside of proprietary cloud ecosystems.

Score: 17 · Comments: 0 · Highest Rank: #2 · Time on Front Page: 7h
First Seen: Dec 18, 2:00 PM · Last Seen: Dec 18, 8:00 PM

The Lowdown

Ubicloud provides a comprehensive guide to virtualizing NVIDIA's HGX B200 GPUs using entirely open-source tools. Unlike previous generations, the B200 ships with a tightly integrated NVLink and NVSwitch fabric that makes it particularly challenging to virtualize efficiently and securely for multi-tenant environments. This post documents their journey and solutions, offering a path for others to build the same kind of high-performance GPU VMs.

  • Hardware Complexity: The B200 HGX platform uses SXM modules and a high-bandwidth, all-to-all NVLink/NVSwitch fabric, making it excellent for performance but difficult to virtualize.
  • Virtualization Models Explored: The article discusses Full Passthrough (all 8 GPUs or isolated 1-GPU), Shared NVSwitch Multitenancy (partitioned GPU groups with internal NVLink), and vGPU-based Multitenancy (fractional GPU sharing).
  • Ubicloud's Choice: Shared NVSwitch Multitenancy was chosen for its flexibility in GPU assignment (1, 2, 4, or 8 GPUs) while preserving full NVLink bandwidth within partitions, ideal for high-performance ML workloads.
  • Host Preparation: Steps include detaching GPUs from the NVIDIA driver, binding them to vfio-pci, configuring IOMMU in GRUB, preloading VFIO modules, and blacklisting host NVIDIA drivers for permanent passthrough.
  • Driver Alignment: A critical requirement is ensuring an exact version match between the VM's nvidia-open driver and the host's Fabric Manager, as the host manages the NVSwitch fabric.
  • PCI Topology Trap: Initial hypervisor attempts produced a flat PCI topology inside the VM, causing CUDA initialization failures; the fix was to have QEMU construct a multi-level PCIe hierarchy with pcie-root-port devices (see the topology sketch below).
  • Large-BAR Stall Problem: The B200's massive 256 GB Base Address Registers (BARs) caused long VM boot stalls; the fixes are upgrading to QEMU 10.1+ or passing x-no-mmap=true to avoid mapping the BARs directly (see the BAR workaround sketch below).
  • Fabric Manager Integration: The host's Fabric Manager is configured for Shared NVSwitch Multitenancy (FABRIC_MODE=1) and manages predefined GPU partitions via an API (fmpm), giving each VM an isolated, full-bandwidth GPU cluster (see the FABRIC_MODE sketch below).
  • GPU ID Mapping: Crucially, Fabric Manager identifies and partitions GPUs by "Module ID" (reported by nvidia-smi -q), which must be mapped correctly to PCI device addresses for passthrough (see the mapping sketch below).
  • Open Source Commitment: Ubicloud emphasizes that all of the methods described, and the management components behind them, are implemented and available as open source.

This detailed guide demystifies the process of virtualizing NVIDIA HGX B200 GPUs, showing how careful configuration across hardware, drivers, and hypervisors yields a robust, flexible, high-performance multi-tenant AI infrastructure built entirely on open-source components. Ubicloud's work brings enterprise-grade GPU virtualization into the open-source ecosystem.
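
For readers who want to experiment, the sketches below illustrate a few of the steps from the list above. They are minimal, hedged examples rather than Ubicloud's actual implementation; every PCI address, path, and helper name is a placeholder. First, host preparation: a sketch of the standard sysfs flow for detaching one GPU from the host driver and binding it to vfio-pci.

    from pathlib import Path

    # Example GPU address; find the real ones with `lspci -d 10de:`.
    PCI_ADDR = "0000:17:00.0"
    DEV = Path("/sys/bus/pci/devices") / PCI_ADDR

    # 1. Unbind the device from whatever host driver currently owns it (e.g. nvidia).
    driver_link = DEV / "driver"
    if driver_link.exists():
        (driver_link / "unbind").write_text(PCI_ADDR)

    # 2. Prefer vfio-pci for this device, then ask the kernel to re-probe it.
    (DEV / "driver_override").write_text("vfio-pci")
    Path("/sys/bus/pci/drivers_probe").write_text(PCI_ADDR)

    # The driver symlink should now point at vfio-pci (requires root, an IOMMU
    # enabled via GRUB kernel parameters, and the vfio/vfio_pci modules loaded).
    print(PCI_ADDR, "->", (DEV / "driver").resolve().name)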
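
Next, the PCI topology fix: a sketch that builds QEMU arguments creating one pcie-root-port per GPU and attaching each vfio-pci device behind its own port, so the guest sees a multi-level PCIe hierarchy. The properties used (pcie-root-port, vfio-pci with host= and bus=) are standard QEMU options, but the exact topology Ubicloud builds may differ.

    import shlex

    # Example host PCI addresses for two passed-through GPUs (placeholders).
    GPUS = ["0000:17:00.0", "0000:3d:00.0"]

    def pcie_hierarchy_args(gpus):
        args = []
        for i, addr in enumerate(gpus):
            port_id = f"rp{i}"
            # One pcie-root-port per GPU, each with a unique chassis number, so
            # the guest enumerates a real PCIe hierarchy instead of a flat root bus.
            args += ["-device", f"pcie-root-port,id={port_id},chassis={i + 1}"]
            # Pass the physical GPU through and plug it into that root port.
            args += ["-device", f"vfio-pci,host={addr},bus={port_id}"]
        return args

    cmd = ["qemu-system-x86_64", "-machine", "q35,accel=kvm"] + pcie_hierarchy_args(GPUS)
    print(shlex.join(cmd))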
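
The large-BAR workaround for older QEMU builds is a small variation on the vfio-pci device string; x-no-mmap is an experimental QEMU property, so upgrading to QEMU 10.1+ remains the cleaner fix.

    def vfio_device_arg(host_addr: str, bus: str, qemu_older_than_10_1: bool) -> str:
        # Build the -device value for one passed-through GPU.
        props = ["vfio-pci", f"host={host_addr}", f"bus={bus}"]
        if qemu_older_than_10_1:
            # Skip direct mmap of the 256 GB BARs to avoid the long boot stall
            # on QEMU versions before 10.1.
            props.append("x-no-mmap=true")
        return ",".join(props)

    print("-device", vfio_device_arg("0000:17:00.0", "rp0", qemu_older_than_10_1=True))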
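
For Fabric Manager integration, the only host-side setting the post names is FABRIC_MODE=1; the sketch below rewrites that key in the Fabric Manager config file. The path shown is the usual location on Linux installs but should be verified on your host, and partition activation through the partition API is not shown.

    from pathlib import Path

    # Typical Fabric Manager config location; verify on your system.
    CFG = Path("/usr/share/nvidia/nvswitch/fabricmanager.cfg")

    def set_fabric_mode(cfg_path: Path, mode: int = 1) -> None:
        # Rewrite (or append) the FABRIC_MODE key; 1 = Shared NVSwitch Multitenancy.
        lines, found = [], False
        for line in cfg_path.read_text().splitlines():
            if line.strip().startswith("FABRIC_MODE="):
                lines.append(f"FABRIC_MODE={mode}")
                found = True
            else:
                lines.append(line)
        if not found:
            lines.append(f"FABRIC_MODE={mode}")
        cfg_path.write_text("\n".join(lines) + "\n")

    set_fabric_mode(CFG)  # restart the nvidia-fabricmanager service to apply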
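
Finally, the Module ID mapping: a sketch that parses nvidia-smi -q output to associate each GPU's Module ID with its PCI bus address, the lookup needed to pass the right physical GPUs through for a given Fabric Manager partition. The parsing assumes the usual nvidia-smi report layout and may need adjusting for other driver versions.

    import re
    import subprocess

    def module_id_to_pci() -> dict[int, str]:
        # Parse `nvidia-smi -q`: sections start with "GPU <pci bus id>" and, on
        # HGX systems, contain a "Module ID" line used by Fabric Manager partitioning.
        report = subprocess.run(["nvidia-smi", "-q"], capture_output=True,
                                text=True, check=True).stdout
        mapping, current_pci = {}, None
        for line in report.splitlines():
            gpu_header = re.match(r"GPU (\S+)", line)   # e.g. "GPU 00000000:17:00.0"
            if gpu_header:
                current_pci = gpu_header.group(1)
            field = re.match(r"\s*Module ID\s*:\s*(\d+)", line)
            if field and current_pci is not None:
                mapping[int(field.group(1))] = current_pci
        return mapping

    print(module_id_to_pci())  # e.g. {1: '00000000:17:00.0', 2: '00000000:3D:00.0', ...}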