
Fix: Cilium breaks when the cluster is upgraded from 2.27 to 2.28 #12254


Open · tico88612 wants to merge 2 commits into master from fix/cilium-migration

Conversation

@tico88612 (Member) commented May 24, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

If a cluster is upgraded from 2.27 to 2.28 and kube_network_plugin is set to cilium, the upgrade will fail.

So I added the cilium_remove_old_resources option, which deletes the old (templated) Cilium resources.

Be careful: while this incurs downtime, it at least ensures the installation completes.

ansible-playbook -i <INVENTORY> upgrade-cluster.yml -e "cilium_remove_old_resources=true" (if your kube_owner is not root, please add -e "kube_owner=root" to the command)

Or, if you don't want to upgrade Cilium (or can't afford downtime), you can skip it like this:

ansible-playbook -i <INVENTORY> upgrade-cluster.yml --skip-tags cilium

However, the old installation templates will then no longer be maintained. Please choose according to your situation.
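
As an alternative to passing -e flags, these settings can also live in the inventory. A minimal sketch, assuming a standard Kubespray group_vars layout (the file path below is illustrative):

# inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml  (illustrative path)
cilium_remove_old_resources: true   # delete the old templated Cilium resources on upgrade (causes downtime)
kube_owner: root                    # only needed if your original cluster's kube_owner was not root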

Which issue(s) this PR fixes:

Fixes #12252

Special notes for your reviewer:

This will be backported to 2.28. I tested it in my environment (upgrading from 2.27 to 2.28).

Add kube_owner=root (if your original cluster's kube_owner is kube) and cilium_remove_old_resources=true.

Does this PR introduce a user-facing change?:

Fix Cilium clusters breaking when upgraded from 2.27 to 2.28
Fix the Helm release re-use error message when installing repeatedly

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 24, 2025
@k8s-ci-robot k8s-ci-robot requested review from cyclinder and VannTen May 24, 2025 14:42
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 24, 2025
@tico88612 (Member, Author)

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label May 24, 2025
@tico88612 (Member, Author)

/label tide/merge-method-merge

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-merge Denotes a PR that should use a standard merge by tide when it merges. label May 24, 2025
@vdveldet commented May 25, 2025

There might be a use case where /opt/cni/bin is owned by kube. In that case, the cilium containers will error. Also, without adding the variable, the upgrade playbook just fails.

@tico88612 (Member, Author)

@vdveldet Yep, that's why I wrote "if your kube_owner is not root, please add -e "kube_owner=root" to the command" in the description.

@ledroide (Contributor) commented Jun 2, 2025

Tested on my cluster: the fix solves the error message "Unable to install Cilium: cannot re-use a name that is still in use" when running cluster.yml -> good for me

@yankay (Member) commented Jun 3, 2025

Hi @cyclinder, would you please help review it? :-)

@cyclinder (Contributor) left a comment

Hi, I think deleting all the old Cilium resources is a breaking change: it causes a breakdown of the cluster, and we also lose the prior Cilium install configuration.

Do we use the cilium install or the kubectl apply -f command when the cluster is being created and upgraded?

@tico88612 (Member, Author)

> I think deleting all the old Cilium resources is a breaking change: it causes a breakdown of the cluster, and we also lose the prior Cilium install configuration.

The original design intended for Cilium's Helm release to take over on upgrade, but due to Ansible collection version limitations this didn't work.

Deleting the old resources (the old templates managed by Kubespray) is a good way to ensure that the new installation will be fine. (This will disconnect the cluster internally; cilium_remove_old_resources defaults to false.)

I note in the PR description that this is optional: you can skip the cilium tag with --skip-tags cilium, but that means the old templates won't be maintained anymore.

> Do we use the cilium install or the kubectl apply -f command when the cluster is being created and upgraded?

Currently, cluster installations of Cilium use cilium install (#12101).

This PR fixes the situation where cilium install can't be run repeatedly: cilium_action looks at the cluster status and chooses install or upgrade.
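
One of the commit messages below notes that cilium version can detect an existing release without the helm binary. A minimal Ansible sketch of that decision, with hypothetical task and variable names (the actual tasks in this PR may differ):

- name: Cilium | Check whether a Cilium release already exists
  command: "{{ bin_dir }}/cilium version"
  register: cilium_version_result
  failed_when: false
  changed_when: false

- name: Cilium | Choose install or upgrade based on cluster status
  set_fact:
    # Assumption: a successful "cilium version" that reports a running image
    # means a release already exists, so we upgrade instead of installing.
    cilium_action: "{{ 'upgrade' if cilium_version_result.rc == 0 and 'running' in cilium_version_result.stdout else 'install' }}"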

@cyclinder (Contributor) commented Jun 3, 2025

> Currently, cluster installations of Cilium use cilium install (#12101).

Thanks for the information. What if we add a new task that upgrades Cilium with the cilium upgrade command, run only when upgrading the cluster? Does that work? It would keep the original install configuration without loss, and it wouldn't affect the cluster upgrade. How about this?

If we fix this issue by removing Cilium resources, we must promptly update our tasks whenever resource manifests are added or removed in upstream Cilium. Otherwise, we may encounter similar upgrade issues. This approach would make maintenance challenging for us...

@tico88612 (Member, Author)

> What if we add a new task that upgrades Cilium with the cilium upgrade command, run only when upgrading the cluster? Does that work? It would keep the original install configuration without loss, and it wouldn't affect the cluster upgrade. How about this?

This PR determines the action automatically. If we kept only upgrade, then upgrading a cluster where Cilium had never been installed would fail.
If we kept only install, then running cluster.yml more than once would also fail.

> If we fix this issue by removing Cilium resources, we must promptly update our tasks whenever resource manifests are added or removed in upstream Cilium. Otherwise, we may encounter similar upgrade issues. This approach would make maintenance challenging for us...

wdym? This only removes the old templates that we previously maintained. All future Cilium upgrades will go through cilium upgrade.

@ledroide (Contributor) commented Jun 3, 2025

@cyclinder: I have had to run cluster.yml multiple times, for various reasons, since I merged this fix into my own fork, on a running cluster. I confirm that it did not disrupt the cilium daemonset, and I did not notice any downtime.

By the way, Ansible itself is designed to be idempotent. If I run the same set of Ansible tasks many times, Ansible should only apply changes when needed. In other words, there is no reason for the task code to differ between bootstrapping (= installing) and upgrading (even when there is nothing to upgrade). Ansible is designed to change only what differs from the desired state.
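
For instance, a minimal sketch of that idempotent style using the kubernetes.core.k8s module (the resource definition here is purely illustrative): the module compares the live object against the desired definition and only issues a write when they differ, so the same task is safe for both installs and upgrades.

- name: Ensure the Cilium config is at the desired state (illustrative)
  kubernetes.core.k8s:
    state: present
    definition:
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: cilium-config      # hypothetical object, for illustration only
        namespace: kube-system
      data:
        enable-hubble: "true"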

@cyclinder (Contributor)

> wdym? This only removes the old templates that we previously maintained. All future Cilium upgrades will go through cilium upgrade.

From my understanding, this issue only happens when upgrading from 2.27 to 2.28? In future releases, we will use cilium install for cluster installation and cilium upgrade for cluster upgrades, correct?

@tico88612 (Member, Author)

> From my understanding, this issue only happens when upgrading from 2.27 to 2.28?

Yes, this issue only happens when the cluster is upgraded from 2.27 to 2.28 (i.e., moving from the Jinja templates to the Cilium CLI).

> In future releases, we will use cilium install for cluster installation and cilium upgrade for cluster upgrades, correct?

cilium install is used on the first installation, and cilium upgrade is used if Cilium is already installed, which prevents Error: Unable to install Cilium: cannot re-use a name that is still in use.

@cyclinder (Contributor)

Thanks for your explanation, I don't have further questions.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 4, 2025
@spantaleev (Contributor)

This is still incomplete.

Back in Kubespray v2.27.0, I had Hubble installed via:

cilium_enable_hubble: true
cilium_hubble_install: true
cilium_hubble_metrics:
  - dns
  - drop
  - tcp
  - flow
  - icmp
  - http
cilium_hubble_tls_generate: true

Upgrading Cilium (for Kubespray v2.28.0) choked a few times because resources lacked Helm annotations. It first choked on some Hubble-related Secret resources, then on ClusterRoleBinding resources.

Adjusting roles/network_plugin/cilium/tasks/remove_old_resources.yml like this helps:

     - { kind: ClusterRole, name: hubble-ui }
     - { kind: ClusterRoleBinding, name: cilium }
     - { kind: ClusterRoleBinding, name: cilium-operator }
+    - { kind: ClusterRoleBinding, name: hubble-generate-certs }
+    - { kind: ClusterRoleBinding, name: hubble-relay }
+    - { kind: ClusterRoleBinding, name: hubble-ui }
+    - { kind: Secret, name: hubble-ca-secret }
+    - { kind: Secret, name: hubble-relay-client-certs }
+    - { kind: Secret, name: hubble-server-certs }
   register: patch_result

Please feel free to incorporate this patch in your pull request.
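
For context, one plausible shape for a task consuming this kind/name list is a simple loop over kubectl. A minimal sketch, assuming kubectl lives under {{ bin_dir }} and the resources sit in kube-system; note the real task registers patch_result, so the actual mechanism may patch/annotate the resources for Helm adoption rather than delete them:

- name: Cilium | Clean up resources left over from the old templated install (sketch)
  command: >-
    {{ bin_dir }}/kubectl --namespace kube-system delete --ignore-not-found
    {{ item.kind }} {{ item.name }}
  loop:
    # Two illustrative entries from the list above; the full list is longer.
    # The namespace flag is ignored for cluster-scoped kinds such as ClusterRoleBinding.
    - { kind: ClusterRoleBinding, name: hubble-generate-certs }
    - { kind: Secret, name: hubble-server-certs }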

Give users two options: besides skipping Cilium, add
`cilium_remove_old_resources` (default `false`); when set to `true`,
it removes the old version's resources, but this causes downtime,
so it must be used with care.

Signed-off-by: ChengHao Yang <[email protected]>
`cilium install` is equivalent to `helm install`; it will fail if a
cilium release already exists. `cilium version` can detect an existing
release without the helm binary.

Signed-off-by: ChengHao Yang <[email protected]>
@tico88612 tico88612 force-pushed the fix/cilium-migration branch from 62c26d7 to 1f9020f on June 5, 2025 13:15
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 5, 2025
@k8s-ci-robot (Contributor)

New changes are detected. LGTM label has been removed.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tico88612

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 5, 2025