Commit 3dff806
[docs] Add vGPU setup guide for GPU sharing between VMs (#467)
## What this PR does
Adds a practical guide for running VMs with NVIDIA vGPU on Cozystack to
the existing GPU passthrough page.
The guide covers the **SR-IOV vGPU path** used by current data-centre
GPUs (L4, L40, L40S, B100) on the vGPU 20.x driver branch. The
mediated-devices path used by older GPUs (Pascal–Ampere) is explicitly
out of scope; the guide points readers to the upstream NVIDIA docs instead.
Steps documented:
- Build the proprietary vGPU Manager container image from
`github.com/NVIDIA/gpu-driver-container` (the older
`gitlab.com/nvidia/container-images/driver` is archived).
- Deploy GPU Operator with the `vgpu` variant via Package CR (depends on
cozystack/cozystack#2323).
- Assign vGPU profiles to SR-IOV VFs (`current_vgpu_type` sysfs), with a
periodic profile-loader DaemonSet skeleton and an explicit experimental
warning.
- Configure DLS licensing via ClientConfigToken (the legacy NLS /
`ServerAddress=` / `ServerPort=7070` flow no longer applies).
- Patch the KubeVirt CR with `permittedHostDevices.pciHostDevices`
(requires kubevirt/kubevirt#16890; the first stable KubeVirt release
expected to include the patch is v1.9.0).
- Sample `VirtualMachine` (raw `kubevirt.io/v1`) using a CDI
`DataVolume` so the rootfs has room for in-VM driver install, with a
`cloudInitNoCloud` disk that drops the licensing token, `gridd.conf`, an
SSH key, and the build dependencies.
- vGPU profile reference table for L40S with the Q/A/B suffix taxonomy.
- Warning about the 2.4 GiB containerDisk root overflow during in-VM
driver install (we observed `SIGBUS` from a write into an mmap of a file
the kernel could no longer extend; the new example sidesteps it via
`DataVolume`).
- Talos is explicitly noted as not recommended for vGPU; passthrough on
Talos is unaffected.
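
For reviewers unfamiliar with the profile-assignment step, here is a minimal sketch of what a single pass of a profile loader might do. It is not the DaemonSet shipped in the guide. It assumes the attribute lives at `<VF PCI address>/nvidia/current_vgpu_type` under `/sys/bus/pci/devices`; the function name, the `sysfs_root` parameter, and any VF address or profile ID passed in are hypothetical and chosen only for illustration.

```python
from pathlib import Path

# Assumed sysfs layout: <root>/<VF PCI address>/nvidia/current_vgpu_type
SYSFS_PCI_ROOT = Path("/sys/bus/pci/devices")


def assign_vgpu_profile(vf_address: str, vgpu_type_id: int,
                        sysfs_root: Path = SYSFS_PCI_ROOT) -> bool:
    """Set the vGPU type of one SR-IOV VF (hypothetical helper).

    Returns True if the attribute was written, False if the VF
    already reported the requested type.
    """
    attr = sysfs_root / vf_address / "nvidia" / "current_vgpu_type"
    current = attr.read_text().strip()
    if current == str(vgpu_type_id):
        # Already configured: skip the write so repeated runs are harmless.
        return False
    attr.write_text(f"{vgpu_type_id}\n")
    return True
```

Skipping the write when the value already matches keeps a periodic loader idempotent, which matters if it reruns on a timer. The `sysfs_root` parameter exists only so the function can be exercised against a temporary directory instead of a real host.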
### Release note
```release-note
NONE
```