compatibility with k3s + nvidia container #153

@bglgwyng

I tried k3s + nvidia container + nix-snapshotter and found that it didn't work well.
I tested k3s + nvidia container and k3s + nix-snapshotter individually, and each combination worked well.
However, when I put all three together, there were some problems.

Here is the nix script I tried.
I wrote it following the k3s configuration guide in the NixOS docs and the nix-snapshotter docs.
I can provide the entire working NixOS configuration if needed, so please ask.

The problem is that when I run a container with this k8s configuration,
it fails to find the nvidia runtime. It worked well before I added the nix-snapshotter configuration.

Warning  FailedCreatePodSandBox  2m27s (x1378 over 5h)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = unable to get OCI runtime for sandbox "34209d7f367586d856ce61dfc35010997619cfcae8280d4eb319b389f790a64f": no runtime for "nvidia" is configured

Here are some of my speculations.

When nix-snapshotter is enabled, the extra flags passed to k3s are:

--container-runtime-endpoint unix:///run/containerd/containerd.sock --image-service-endpoint unix:///run/nix-snapshotter/nix-snapshotter.sock --node-name=hserver6 --tls-san=k8s.internal --node-label=nvidia.com/gpu.present=true

These are the flags when nix-snapshotter is NOT enabled:

--node-name=hserver6 --tls-san=k8s.internal --node-label=nvidia.com/gpu.present=true

Does --container-runtime-endpoint unix:///run/containerd/containerd.sock cause this problem?
I removed that flag, but it still failed.
I checked that /var/lib/rancher/k3s/agent/etc/containerd/config.toml is properly configured and includes the following part, but it still failed.

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  privileged_without_host_devices = false
  runtime_engine = ""
  runtime_root = ""
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/run/current-system/sw/bin/nvidia-container-runtime.cdi"

Also, nix-snapshotter introduces k3s.moreFlags as a replacement for k3s.extraFlags. Is that relevant? I don't see the need for that option, to be honest. It doesn't help resolve conflicts between multiple flag declarations in any way.

Has anyone experienced this issue?
