
Track upstream gaps for NVIDIA vGPU SR-IOV ("vendor-specific VFIO framework") support #2608

Context

NVIDIA retired the mdev (mediated devices) framework for Ada Lovelace and newer GPUs (L40, L40S, H100, H200, GH200, B100, and successors). The replacement — NVIDIA calls it the "vendor-specific VFIO framework" — works through SR-IOV virtual functions that the proprietary nvidia.ko driver registers as VFIO providers via iommufd. The legacy mdev_supported_types sysfs surface is gone; vGPU profiles are now selected by writing into per-VF nvidia/current_vgpu_type. NVIDIA documents this in the vGPU 17.0 User Guide, section "Creating an NVIDIA vGPU on a Linux with KVM Hypervisor that Uses a Vendor-Specific VFIO Framework", quote: "A vendor-specific VFIO framework does not support the mediated VFIO mdev driver framework."
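
For orientation, a minimal shell sketch of the per-VF selection flow the new framework uses, as described above. The VF address and type id are placeholders, not values from this POC:

```bash
# New per-VF flow (vendor-specific VFIO framework); VF address is a placeholder.
VF=/sys/bus/pci/devices/0000:01:00.4

# The old mdev surface is gone on Ada+; this path no longer exists:
#   /sys/bus/pci/devices/<PF>/mdev_supported_types/

# Instead, each VF exposes an nvidia/ subdir for profile selection:
cat "$VF/nvidia/creatable_vgpu_types"      # lists the type ids selectable on this VF
echo 999 > "$VF/nvidia/current_vgpu_type"  # 999 is a placeholder type id
```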

This shift breaks the existing KubeVirt + gpu-operator flow end to end. KubeVirt has merged the minimum viable discovery patch upstream (kubevirt/kubevirt#16890, 2026-04-10), but it's not on any tagged release yet, and the surrounding pieces — display/ramfb, e2e, docs, live migration, and the entire node-side orchestration on the NVIDIA side — are not yet there. This issue tracks what cozystack is currently carrying downstream as a result, and what upstream changes would let us drop each piece.

Primary source for the upstream analysis: PR #2323 (gpu-operator: vgpu variant) and website PR #467.

POC setup these findings come from

A 3-node test cluster with 12× NVIDIA L40S (4 per node):

  • OS: Ubuntu 26.04 LTS, kernel 7.0.0-14-generic
  • Kubernetes: k3s v1.35.3
  • Cozystack: isp-full-generic variant (non-Talos bundle, see PR #2323: [gpu-operator] Update to v26.3.1 and add experimental vGPU variant)
  • GPU host driver: NVIDIA-Linux-x86_64-595.58.02-vgpu-kvm-aie.run from NVAIE 5.x
  • vGPU Manager container image: built locally from NVIDIA/gpu-driver-container, Ubuntu 26.04 base, ~376 MB
  • KubeVirt: v1.8.2 operator with virt-handler image override pinned to a nightly main build that contains #16890
  • NVIDIA License System: in-cluster appliance, Client Config Token obtained, guest licensing verified

Experimental findings — what the new framework looks like at sysfs level

With L40S + driver 595.58.02 + kernel 7.0, SR-IOV enabled (echo N > /sys/bus/pci/devices/<PF>/sriov_numvfs) and a vGPU profile written to current_vgpu_type on each VF, we observe:

| Path | State |
| --- | --- |
| /sys/class/mdev_bus/ | empty |
| /sys/bus/pci/devices/<PF>/mdev_supported_types/ | does not exist |
| /sys/bus/pci/devices/<VF>/nvidia/current_vgpu_type | populated, writable |
| /sys/bus/pci/devices/<VF>/nvidia/creatable_vgpu_types | populated |
| /sys/bus/pci/devices/<VF>/driver | symlink to nvidia, NOT vfio-pci |
| /sys/bus/pci/devices/<VF>/vfio-dev/vfio0/ | exists — the VF IS a VFIO device |
| /dev/vfio/<group> | char device exists, nvidia driver owns binding |
| Unbind nvidia + bind vfio-pci | fails with -EINVAL on probe |
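
For reference, the states above can be re-checked on any single VF with a few reads (the VF address is a placeholder):

```bash
VF=0000:01:00.4
readlink /sys/bus/pci/devices/$VF/driver   # ends in .../drivers/nvidia, not vfio-pci
ls /sys/bus/pci/devices/$VF/vfio-dev/      # vfio0: the VF is a VFIO endpoint
ls /sys/class/mdev_bus/ 2>/dev/null        # empty
ls /dev/vfio/                              # group char devices present
```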

Key takeaway: the VF is a fully functional VFIO endpoint and <hostdev type='pci'> in libvirt XML works against it. What changed is purely the discovery story — KubeVirt's virt-handler walks /sys/bus/pci/devices/* and historically only accepted devices with driver -> vfio-pci. PR #16890 relaxes that selector. Once it's in, permittedHostDevices.pciHostDevices permits the VF by vendor:device, capacity appears as nvidia.com/<profile>, and a VirtualMachine boots with the vGPU exposed. We've verified this end-to-end on the cluster above: VM boots, GRID guest driver activates, NLS issues a license, nvidia-smi reports the configured L40S-XX profile, CUDA workloads run.
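
To make the discovery point concrete, a hedged sketch of the KubeVirt CR wiring described above. The CR name and namespace assume a default install; the vendor:device pair and resource name are the L40S examples used later in this issue, not universal values:

```bash
# Permit the VF through the existing passthrough knob (no vGPU-specific API yet).
kubectl patch kubevirt kubevirt -n kubevirt --type=merge -p '
spec:
  configuration:
    permittedHostDevices:
      pciHostDevices:
      - pciVendorSelector: "10DE:26B9"       # L40S VF; check lspci -nn on your host
        resourceName: "nvidia.com/L40S-24Q"  # operator-chosen name
'
# The VMI then consumes the same name:
#   spec.domain.devices.gpus:
#   - name: vgpu0
#     deviceName: nvidia.com/L40S-24Q
```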

The Ubuntu 26.04 / gcc-15 / kernel 7.0 combination — which was the biggest known unknown going in — turned out fine: the driver's open-source wrappers compile cleanly against linux-headers-7.0.0-14-generic, vermagic matches, only cosmetic warnings (objtool frame-pointer, missing MODULE_DESCRIPTION). No patches needed against the 595.58.02 .run.

Downstream patches cozystack is currently carrying

Each row below is something cozystack carries because the upstream piece is not ready. The "Drop when" column is the exit condition.

| What we carry | Where | Drop when |
| --- | --- | --- |
| variant: vgpu for the gpu-operator package | PR #2323 | n/a (cozystack-side feature, kept) |
| KubeVirt.spec.imageTag override to a nightly main build of virt-handler on top of v1.8.2 | values in vGPU install guide (PR #467) | KubeVirt cherry-picks #16890 to release-1.8 for v1.8.3, OR v1.9.0 ships |
| Custom node-side DaemonSet that programs current_vgpu_type per VF from a ConfigMap | in PR #2323 / install guide | NVIDIA vgpu-device-manager adds support for the vendor-specific VFIO sysfs path |
| Manual nvidia-sandbox-validator disable / workaround | install guide | NVIDIA/gpu-operator#2365 fixed (validator looks for vfio-dev/ on the PF, wrong path) |
| Detailed install notes maintained downstream | website PR #467 | KubeVirt user-guide gets an SR-IOV vGPU section (does not exist today — zero PRs in kubevirt/user-guide) |

Upstream gaps being tracked

KubeVirt side (within PR #16890 thread, not yet addressed)

| # | Gap | Source |
| --- | --- | --- |
| K1 | display + ramfb passthrough for SR-IOV vGPU not wired into libvirt XML — author deferred to "later patch" | PR body |
| K2 | No e2e tests — CI has no SR-IOV-capable GPU; fake-kernel-module helper #16717 closed unmerged | PR body + #16717 thread |
| K3 | Generic validators abstraction so the path extends to AMD/Intel (shouldPermitNvidia is the current hardcode) | approver feedback in-thread |
| K4 | Additional unit-test coverage of discoverPermittedHostPCIDevices itself | review comment |
| K5 | No separate API concept for "VF bound to a vendor driver"; SR-IOV vGPU resources are indistinguishable from regular passthrough in permittedHostDevices.pciHostDevices | structural; no design proposal exists |
| K6 | Live migration of SR-IOV vGPU VMs not addressed; VEP-109 is still 1.9 Alpha and was scoped against mdev | VEP-109 thread |
| K7 | No backport to release-1.8 for a v1.8.x tagged release | confirmed via tree — pkg/virt-handler/device-manager/nvidia.go absent on release-1.8 |

There is no umbrella issue or VEP for SR-IOV vGPU in KubeVirt. Verified via kubevirt/enhancements search. VEP-186 (IOMMUFD host devices) was the obvious umbrella candidate; it was closed without merging on 2026-04-15. As of 2026-05-12 no follow-up PR has been opened on K1–K7.

NVIDIA side (node-level orchestration that KubeVirt assumes is "someone else's problem")

| # | Gap | Source |
| --- | --- | --- |
| N1 | vgpu-device-manager still iterates mdev_supported_types/ exclusively, no current_vgpu_type support | NVIDIA/vgpu-device-manager — no relevant PRs/issues |
| N2 | gpu-operator documentation does not mention the vendor-specific VFIO framework, iommufd, or current_vgpu_type | official KubeVirt + gpu-operator guide |
| N3 | nvidia-sandbox-validator looks for vfio-dev/ on the PF; crash-loops on capable hardware | NVIDIA/gpu-operator#2365 |
| N4 | k8s-device-plugin has no surface for the new framework | k8s-device-plugin search — zero results |

The only public community precedent for a node-side orchestrator that programs current_vgpu_type per VF is OpenNebula/one#6841; no shared tooling exists, so every distribution rolls its own.

"Mergeable without downstream overrides" checklist

When all of these are done, cozystack can drop the carried patches and ship vGPU SR-IOV through unmodified upstream components:

  • KubeVirt v1.9.0 released, OR #16890 cherry-picked to release-1.8 for a v1.8.x tag → drop virt-handler image override
  • NVIDIA vgpu-device-manager learns to write current_vgpu_type (N1) → drop custom DaemonSet
  • NVIDIA gpu-operator issue #2365 fixed (N3) → drop validator workaround
  • kubevirt/user-guide gains an SR-IOV vGPU section → website install guide can shrink to a delta from upstream

Items that don't block "mergeable" but block "feature-complete":

  • display + ramfb (K1) — currently console access to vGPU VMs is reduced
  • live migration (K6) — vGPU VMs are pin-to-node
  • generic validators (K3) and per-vendor API surface (K5) — needed for AMD/Intel parity, no impact on NVIDIA-only flow

What's not needed before vGPU works on a cozystack cluster

To set expectations: every numbered item above is a stretch goal for "supported, no-overrides upstream-native". The actual feature works today in cozystack with the carried patches — POC confirmed end-to-end. This issue is for tracking when we can stop carrying them, not for tracking when the feature itself becomes usable.

Reproducing this outside cozystack

Vendor-neutral checklist for someone running plain KubeVirt on any distribution with NVIDIA L40 / L40S / H100 / H200 / GH200 / B100 silicon. None of the cozystack pieces above are required — they just make the path shorter:

  1. Build a vGPU Manager driver image yourself from NVIDIA's licensed .run. The upstream base to use is NVIDIA/gpu-driver-container; the older gitlab.com/nvidia/container-images/driver is archived. You need an NVAIE subscription to obtain the .run. No public registry hosts a usable image — licensing prevents NVIDIA from distributing it.
  2. Enable SR-IOV on each PF at boot (echo N > /sys/bus/pci/devices/<PF>/sriov_numvfs — the exact count depends on the silicon and the profile family you want). NVIDIA does not ship a Kubernetes-native primitive for this; a systemd unit on each node, or a kernel cmdline, works. See the first sketch after this list.
  3. Run your own DaemonSet that watches a per-node ConfigMap or label of "this VF → this profile id" and writes the id into /sys/bus/pci/devices/<VF>/nvidia/current_vgpu_type. This is the piece NVIDIA's vgpu-device-manager does not yet do. OpenNebula's thread one#6841 is the most concrete public walkthrough; a write-loop sketch follows this list.
  4. Pick up a KubeVirt build with PR #16890 in the virt-handler image. Until v1.9.0 ships (or release-1.8 gets a cherry-pick), this means an imageTag override pointed at a main build or a custom build of your own.
  5. Configure permittedHostDevices.pciHostDevices in the KubeVirt CR to permit the vendor/device ID of your VF (10DE:26B9 for L40S; check lspci -nn on the host). The resource name you pick (nvidia.com/L40S-24Q etc.) is referenced from VMI domain.devices.gpus[].deviceName. There is no separate API surface for SR-IOV vGPU yet — it goes through the same knob as plain passthrough (see the CR sketch under the key takeaway in the findings section above).
  6. Inside each guest, install the matching GRID guest driver and either point it at an NLS appliance for a license, or accept the unlicensed 20-minute throttle. The host driver and guest driver versions must match the NVAIE bundle you obtained. A guest-side sketch follows this list.
  7. Disable / ignore nvidia-sandbox-validator if you are using the gpu-operator chart — it crash-loops on capable hardware, see NVIDIA/gpu-operator#2365.
  8. Do not expect display/ramfb (gap K1) or live migration (gap K6). VMs are pinned to their node, and console access is limited to the serial console, or whatever the guest exposes over the network.
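
For step 2, one way to persist the SR-IOV enable across reboots is a oneshot systemd unit. The PF address and VF count below are placeholders that depend on your silicon and profile family:

```bash
cat >/etc/systemd/system/nvidia-sriov.service <<'EOF'
[Unit]
Description=Enable SR-IOV VFs on the NVIDIA PF for vGPU
# Run after the nvidia module loads; adjust ordering for your distro.
After=systemd-modules-load.service

[Service]
Type=oneshot
RemainAfterExit=yes
# 0000:01:00.0 and the VF count 4 are placeholders.
ExecStart=/bin/sh -c 'echo 4 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs'

[Install]
WantedBy=multi-user.target
EOF
systemctl enable --now nvidia-sriov.service
```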
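
For step 3, the core of such a DaemonSet is a reconcile loop like the following. /etc/vgpu/mapping (lines of "VF-address type-id") stands in for a hypothetical ConfigMap mount; the names are illustrative, not an existing tool:

```bash
#!/bin/sh
# Reconcile desired vGPU types onto VFs. Mapping format: "<VF-address> <type-id>".
while read -r vf type_id; do
  nv="/sys/bus/pci/devices/$vf/nvidia"
  [ -d "$nv" ] || { echo "no nvidia/ dir on $vf, skipping" >&2; continue; }
  current=$(cat "$nv/current_vgpu_type")
  if [ "$current" != "$type_id" ]; then
    echo "$type_id" > "$nv/current_vgpu_type" || echo "failed to set $type_id on $vf" >&2
  fi
done < /etc/vgpu/mapping
```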
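
For step 6, the guest-side licensing wiring typically looks like the following; the paths follow NVIDIA's licensing documentation, but verify them against the docs shipped with your NVAIE bundle:

```bash
# Inside the guest, after installing the matching GRID driver:
install -m 644 client_configuration_token.tok /etc/nvidia/ClientConfigToken/
# FeatureType=1 requests a vGPU license from the NLS appliance.
grep -q '^FeatureType=1' /etc/nvidia/gridd.conf || echo 'FeatureType=1' >> /etc/nvidia/gridd.conf
systemctl restart nvidia-gridd
nvidia-smi -q | grep -i license   # should report Licensed after a short delay
```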

Budget for ~2 weeks of node-side glue if you are starting from zero — KubeVirt's side is the easy half.

Outlook

The KubeVirt-side gaps (K1–K7) will likely close over the next two or three releases — display/ramfb and the generic-validators refactor are small follow-ups, the SIG has shown clear interest, and the release-1.8 cherry-pick is the kind of thing that gets done once one downstream consumer asks loudly enough.

The harder open question is the NVIDIA side. Whether NVIDIA ships a vgpu-device-manager replacement (or an in-tree change) that handles the vendor-specific VFIO framework, or whether the gpu-operator stack remains de-facto mdev-only and every distribution that wants SR-IOV vGPU writes its own node-side DaemonSet, will determine how soon "carried patches" above go away. As of May 2026, the second outcome looks more likely — there is no public engineering signal from NVIDIA that the device-manager will catch up.
