
Track upstream gaps for NVIDIA vGPU SR-IOV ("vendor-specific VFIO framework") support #2608

Context

NVIDIA retired the mdev (mediated devices) framework for Ada Lovelace and newer GPUs (L40, L40S, H100, H200, GH200, B100, and successors). The replacement — NVIDIA calls it the "vendor-specific VFIO framework" — works through SR-IOV virtual functions that the proprietary nvidia.ko driver registers as VFIO providers via iommufd. The legacy mdev_supported_types sysfs surface is gone; vGPU profiles are now selected by writing into per-VF nvidia/current_vgpu_type. NVIDIA documents this in the vGPU 17.0 User Guide, section "Creating an NVIDIA vGPU on a Linux with KVM Hypervisor that Uses a Vendor-Specific VFIO Framework", quote: "A vendor-specific VFIO framework does not support the mediated VFIO mdev driver framework."
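
For orientation, a minimal shell sketch of the per-VF selection flow the new framework uses, as described above. The VF address and type id are placeholders, not values from this POC:

```bash
# New per-VF flow (vendor-specific VFIO framework); VF address is a placeholder.
VF=/sys/bus/pci/devices/0000:01:00.4

# The old mdev surface is gone on Ada+; this path no longer exists:
#   /sys/bus/pci/devices/<PF>/mdev_supported_types/

# Instead, each VF exposes an nvidia/ subdir for profile selection:
cat "$VF/nvidia/creatable_vgpu_types"      # lists the type ids selectable on this VF
echo 999 > "$VF/nvidia/current_vgpu_type"  # 999 is a placeholder type id
```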

This shift breaks the existing KubeVirt + gpu-operator flow end to end. KubeVirt has merged the minimum viable discovery patch upstream (kubevirt/kubevirt#16890, 2026-04-10), but it's not on any tagged release yet, and the surrounding pieces — display/ramfb, e2e, docs, live migration, and the entire node-side orchestration on the NVIDIA side — are not yet there. This issue tracks what cozystack is currently carrying downstream as a result, and what upstream changes would let us drop each piece.

Primary source for the upstream analysis: PR #2323 (gpu-operator: vgpu variant) and website PR #467.

POC setup these findings come from

A 3-node test cluster with 12× NVIDIA L40S (4 per node):

  • OS: Ubuntu 26.04 LTS, kernel 7.0.0-14-generic
  • Kubernetes: k3s v1.35.3
  • Cozystack: isp-full-generic variant (non-Talos bundle, see PR #2323: [gpu-operator] Update to v26.3.1 and add experimental vGPU variant)
  • GPU host driver: NVIDIA-Linux-x86_64-595.58.02-vgpu-kvm-aie.run from NVAIE 5.x
  • vGPU Manager container image: built locally from NVIDIA/gpu-driver-container, Ubuntu 26.04 base, ~376 MB
  • KubeVirt: v1.8.2 operator with virt-handler image override pinned to a nightly main build that contains #16890
  • NVIDIA License System: in-cluster appliance, Client Config Token obtained, guest licensing verified

Experimental findings — what the new framework looks like at sysfs level

With L40S + driver 595.58.02 + kernel 7.0, SR-IOV enabled (echo N > /sys/bus/pci/devices/<PF>/sriov_numvfs) and a vGPU profile written to current_vgpu_type on each VF, we observe:

| Path | State |
| --- | --- |
| /sys/class/mdev_bus/ | empty |
| /sys/bus/pci/devices/<PF>/mdev_supported_types/ | does not exist |
| /sys/bus/pci/devices/<VF>/nvidia/current_vgpu_type | populated, writable |
| /sys/bus/pci/devices/<VF>/nvidia/creatable_vgpu_types | populated |
| /sys/bus/pci/devices/<VF>/driver | symlink to nvidia, NOT vfio-pci |
| /sys/bus/pci/devices/<VF>/vfio-dev/vfio0/ | exists — the VF IS a VFIO device |
| /dev/vfio/<group> | char device exists, nvidia driver owns binding |
| Unbind nvidia + bind vfio-pci | fails with -EINVAL on probe |
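
For reference, the states above can be re-checked on any single VF with a few reads (the VF address is a placeholder):

```bash
VF=0000:01:00.4
readlink /sys/bus/pci/devices/$VF/driver   # ends in .../drivers/nvidia, not vfio-pci
ls /sys/bus/pci/devices/$VF/vfio-dev/      # vfio0: the VF is a VFIO endpoint
ls /sys/class/mdev_bus/ 2>/dev/null        # empty
ls /dev/vfio/                              # group char devices present
```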

Key takeaway: the VF is a fully functional VFIO endpoint and <hostdev type='pci'> in libvirt XML works against it. What changed is purely the discovery story — KubeVirt's virt-handler walks /sys/bus/pci/devices/* and historically only accepted devices with driver -> vfio-pci. PR #16890 relaxes that selector. Once it's in, permittedHostDevices.pciHostDevices permits the VF by vendor:device, capacity appears as nvidia.com/<profile>, and a VirtualMachine boots with the vGPU exposed. We've verified this end-to-end on the cluster above: VM boots, GRID guest driver activates, NLS issues a license, nvidia-smi reports the configured L40S-XX profile, CUDA workloads run.
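
To make the discovery point concrete, a hedged sketch of the KubeVirt CR wiring described above. The CR name and namespace assume a default install; the vendor:device pair and resource name are the L40S examples used later in this issue, not universal values:

```bash
# Permit the VF through the existing passthrough knob (no vGPU-specific API yet).
kubectl patch kubevirt kubevirt -n kubevirt --type=merge -p '
spec:
  configuration:
    permittedHostDevices:
      pciHostDevices:
      - pciVendorSelector: "10DE:26B9"       # L40S VF; check lspci -nn on your host
        resourceName: "nvidia.com/L40S-24Q"  # operator-chosen name
'
# The VMI then consumes the same name:
#   spec.domain.devices.gpus:
#   - name: vgpu0
#     deviceName: nvidia.com/L40S-24Q
```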

The Ubuntu 26.04 / gcc-15 / kernel 7.0 combination — which was the biggest known unknown going in — turned out fine: the driver's open-source wrappers compile cleanly against linux-headers-7.0.0-14-generic, vermagic matches, only cosmetic warnings (objtool frame-pointer, missing MODULE_DESCRIPTION). No patches needed against the 595.58.02 .run.

Downstream patches cozystack is currently carrying

Each row below is something cozystack carries because the upstream piece is not ready. The "Drop when" column is the exit condition.

| What we carry | Where | Drop when |
| --- | --- | --- |
| variant: vgpu for the gpu-operator package | PR #2323 | n/a (cozystack-side feature, kept) |
| KubeVirt.spec.imageTag override to a nightly main build of virt-handler on top of v1.8.2 | values in vGPU install guide (PR #467) | KubeVirt cherry-picks #16890 to release-1.8 for v1.8.3, OR v1.9.0 ships |
| Custom node-side DaemonSet that programs current_vgpu_type per VF from a ConfigMap | in PR #2323 / install guide | NVIDIA vgpu-device-manager adds support for the vendor-specific VFIO sysfs path |
| Manual nvidia-sandbox-validator disable / workaround | install guide | NVIDIA/gpu-operator#2365 fixed (validator looks for vfio-dev/ on the PF, wrong path) |
| Detailed install notes maintained downstream | website PR #467 | KubeVirt user-guide gets an SR-IOV vGPU section (does not exist today — zero PRs in kubevirt/user-guide) |

Upstream gaps being tracked

KubeVirt side (within PR #16890 thread, not yet addressed)

| # | Gap | Source |
| --- | --- | --- |
| K1 | display + ramfb passthrough for SR-IOV vGPU not wired into libvirt XML — author deferred to "later patch" | PR body |
| K2 | No e2e tests — CI has no SR-IOV-capable GPU; fake-kernel-module helper #16717 closed unmerged | PR body + #16717 thread |
| K3 | Generic validators abstraction so the path extends to AMD/Intel (shouldPermitNvidia is the current hardcode) | approver feedback in-thread |
| K4 | Additional unit-test coverage of discoverPermittedHostPCIDevices itself | review comment |
| K5 | No separate API concept for "VF bound to a vendor driver"; SR-IOV vGPU resources are indistinguishable from regular passthrough in permittedHostDevices.pciHostDevices | structural; no design proposal exists |
| K6 | Live migration of SR-IOV vGPU VMs not addressed; VEP-109 is still 1.9 Alpha and was scoped against mdev | VEP-109 thread |
| K7 | No backport to release-1.8 for a v1.8.x tagged release | confirmed via tree — pkg/virt-handler/device-manager/nvidia.go absent on release-1.8 |

There is no umbrella issue or VEP for SR-IOV vGPU in KubeVirt. Verified via kubevirt/enhancements search. VEP-186 (IOMMUFD host devices) was the obvious umbrella candidate; it was closed without merging on 2026-04-15. As of 2026-05-12 no follow-up PR has been opened on K1–K7.

NVIDIA side (node-level orchestration that KubeVirt assumes is "someone else's problem")

| # | Gap | Source |
| --- | --- | --- |
| N1 | vgpu-device-manager still iterates mdev_supported_types/ exclusively, no current_vgpu_type support | NVIDIA/vgpu-device-manager — no relevant PRs/issues |
| N2 | gpu-operator documentation does not mention the vendor-specific VFIO framework, iommufd, or current_vgpu_type | official KubeVirt + gpu-operator guide |
| N3 | nvidia-sandbox-validator looks for vfio-dev/ on the PF; crash-loops on capable hardware | NVIDIA/gpu-operator#2365 |
| N4 | k8s-device-plugin has no surface for the new framework | k8s-device-plugin search — zero results |

The only public community precedent for a node-side orchestrator that programs current_vgpu_type per VF is OpenNebula/one#6841; no shared tooling exists, so every distribution rolls its own.

"Mergeable without downstream overrides" checklist

When all of these are done, cozystack can drop the carried patches and ship vGPU SR-IOV through unmodified upstream components:

  • KubeVirt v1.9.0 released, OR #16890 cherry-picked to release-1.8 for a v1.8.x tag → drop virt-handler image override
  • NVIDIA vgpu-device-manager learns to write current_vgpu_type (N1) → drop custom DaemonSet
  • NVIDIA gpu-operator issue #2365 fixed (N3) → drop validator workaround
  • kubevirt/user-guide gains an SR-IOV vGPU section → website install guide can shrink to a delta from upstream

Items that don't block "mergeable" but block "feature-complete":

  • display + ramfb (K1) — currently console access to vGPU VMs is reduced
  • live migration (K6) — vGPU VMs are pin-to-node
  • generic validators (K3) and per-vendor API surface (K5) — needed for AMD/Intel parity, no impact on NVIDIA-only flow

What's not needed before vGPU works on a cozystack cluster

To set expectations: every numbered item above is a stretch goal for "supported, no-overrides upstream-native". The actual feature works today in cozystack with the carried patches — POC confirmed end-to-end. This issue is for tracking when we can stop carrying them, not for tracking when the feature itself becomes usable.

Reproducing this outside cozystack

Vendor-neutral checklist for someone running plain KubeVirt on any distribution with NVIDIA L40 / L40S / H100 / H200 / GH200 / B100 silicon. None of the cozystack pieces above are required — they just make the path shorter:

  1. Build a vGPU Manager driver image yourself from NVIDIA's licensed .run. The upstream base to use is NVIDIA/gpu-driver-container; the older gitlab.com/nvidia/container-images/driver is archived. You need an NVAIE subscription to obtain the .run. No public registry hosts a usable image — licensing prevents NVIDIA from distributing it.
  2. Enable SR-IOV on each PF at boot (echo N > /sys/bus/pci/devices/<PF>/sriov_numvfs — the exact count depends on the silicon and the profile family you want). NVIDIA does not ship a Kubernetes-native primitive for this; a systemd unit on each node, or a kernel cmdline, works. See the first sketch after this list.
  3. Run your own DaemonSet that watches a per-node ConfigMap or label of "this VF → this profile id" and writes the id into /sys/bus/pci/devices/<VF>/nvidia/current_vgpu_type. This is the piece NVIDIA's vgpu-device-manager does not yet do. OpenNebula's thread one#6841 is the most concrete public walkthrough; a write-loop sketch follows this list.
  4. Pick up a KubeVirt build with PR #16890 in the virt-handler image. Until v1.9.0 ships (or release-1.8 gets a cherry-pick), this means an imageTag override pointed at a main build or a custom build of your own.
  5. Configure permittedHostDevices.pciHostDevices in the KubeVirt CR to permit the vendor/device ID of your VF (10DE:26B9 for L40S; check lspci -nn on the host). The resource name you pick (nvidia.com/L40S-24Q etc.) is referenced from VMI domain.devices.gpus[].deviceName. There is no separate API surface for SR-IOV vGPU yet — it goes through the same knob as plain passthrough (see the CR sketch under the key takeaway in the findings section above).
  6. Inside each guest, install the matching GRID guest driver and either point it at an NLS appliance for a license, or accept the unlicensed 20-minute throttle. The host driver and guest driver versions must match the NVAIE bundle you obtained. A guest-side sketch follows this list.
  7. Disable / ignore nvidia-sandbox-validator if you are using the gpu-operator chart — it crash-loops on capable hardware, see NVIDIA/gpu-operator#2365.
  8. Do not expect display/ramfb (gap K1) or live migration (gap K6). VMs are pinned to their node, and console access is limited to the serial console, or whatever the guest exposes over the network.
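
For step 2, one way to persist the SR-IOV enable across reboots is a oneshot systemd unit. The PF address and VF count below are placeholders that depend on your silicon and profile family:

```bash
cat >/etc/systemd/system/nvidia-sriov.service <<'EOF'
[Unit]
Description=Enable SR-IOV VFs on the NVIDIA PF for vGPU
# Run after the nvidia module loads; adjust ordering for your distro.
After=systemd-modules-load.service

[Service]
Type=oneshot
RemainAfterExit=yes
# 0000:01:00.0 and the VF count 4 are placeholders.
ExecStart=/bin/sh -c 'echo 4 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs'

[Install]
WantedBy=multi-user.target
EOF
systemctl enable --now nvidia-sriov.service
```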
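
For step 3, the core of such a DaemonSet is a reconcile loop like the following. /etc/vgpu/mapping (lines of "VF-address type-id") stands in for a hypothetical ConfigMap mount; the names are illustrative, not an existing tool:

```bash
#!/bin/sh
# Reconcile desired vGPU types onto VFs. Mapping format: "<VF-address> <type-id>".
while read -r vf type_id; do
  nv="/sys/bus/pci/devices/$vf/nvidia"
  [ -d "$nv" ] || { echo "no nvidia/ dir on $vf, skipping" >&2; continue; }
  current=$(cat "$nv/current_vgpu_type")
  if [ "$current" != "$type_id" ]; then
    echo "$type_id" > "$nv/current_vgpu_type" || echo "failed to set $type_id on $vf" >&2
  fi
done < /etc/vgpu/mapping
```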
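
For step 6, the guest-side licensing wiring typically looks like the following; the paths follow NVIDIA's licensing documentation, but verify them against the docs shipped with your NVAIE bundle:

```bash
# Inside the guest, after installing the matching GRID driver:
install -m 644 client_configuration_token.tok /etc/nvidia/ClientConfigToken/
# FeatureType=1 requests a vGPU license from the NLS appliance.
grep -q '^FeatureType=1' /etc/nvidia/gridd.conf || echo 'FeatureType=1' >> /etc/nvidia/gridd.conf
systemctl restart nvidia-gridd
nvidia-smi -q | grep -i license   # should report Licensed after a short delay
```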

Budget for ~2 weeks of node-side glue if you are starting from zero — KubeVirt's side is the easy half.

Outlook

The KubeVirt-side gaps (K1–K7) will likely close over the next two or three releases — display/ramfb and the generic-validators refactor are small follow-ups, the SIG has shown clear interest, and the release-1.8 cherry-pick is the kind of thing that gets done once one downstream consumer asks loudly enough.

The harder open question is the NVIDIA side. Whether NVIDIA ships a vgpu-device-manager replacement (or an in-tree change) that handles the vendor-specific VFIO framework, or whether the gpu-operator stack remains de-facto mdev-only and every distribution that wants SR-IOV vGPU writes its own node-side DaemonSet, will determine how soon "carried patches" above go away. As of May 2026, the second outcome looks more likely — there is no public engineering signal from NVIDIA that the device-manager will catch up.
