HyperNode creation fails silently if webhook is not ready during controller startup #5027

@zhangguorui304

Description

When Volcano is installed on a running Kubernetes cluster whose nodes already carry topology labels (e.g., volcano.sh/hypernode), the Volcano controller attempts to create HyperNode CRs from those labels during its initial sync. If the Volcano admission webhook is not yet ready at that point (e.g., its pod is still starting or has not been scheduled), the API server cannot reach the webhook and fails the HyperNode creation request, and the controller does not retry. The affected HyperNode resources are then permanently missing, which breaks the topology-aware scheduling features that depend on them.

Steps to reproduce the issue

  1. Set up a Kubernetes cluster with worker nodes already labeled:
    kubectl label node <node-name> volcano.sh/hypernode=hypernode-a
  2. Install Volcano (e.g., via Helm or YAML manifests) after labeling.
  3. Observe controller logs:
    "Failed to create HyperNode" err="Internal error occurred: failed calling webhook \"validatehypernodes.volcano.sh\": failed to call webhook: Post \"https://volcano-admission-service.kube-system.svc:443/hypernodes/validate?timeout=10s\": dial tcp xxx:443: connect: connection refused" name="hypernode-testtopology-tier2-pm4z2"
  4. Wait several minutes and check HyperNode resources:
    kubectl get hypernodes

Describe the results you received and expected

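Received: the HyperNode CRs whose creation failed during the initial sync are never created, even after the webhook becomes ready.

Expected: after Volcano is fully installed, HyperNode CRs should be created for all topology domains present on nodes.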

What version of Volcano are you using?

v1.13

Any other relevant information

No response

Metadata

    Labels

    kind/bug: Categorizes issue or PR as related to a bug.