
Fix: Cilium breaks when the cluster is upgraded from 2.27 to 2.28 #12254


Open · tico88612 wants to merge 2 commits into master from fix/cilium-migration

Conversation

@tico88612 (Member) commented May 24, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

If a cluster is upgraded from 2.27 to 2.28 and kube_network_plugin is set to cilium, the upgrade will fail.

So I added the cilium_remove_old_resources option, which deletes the old (templated) Cilium resources.

Be careful: while this incurs downtime, it at least ensures the installation completes.

ansible-playbook -i <INVENTORY> upgrade-cluster.yml -e "cilium_remove_old_resources=true" (if your kube_owner is not root, please add -e "kube_owner=root" to the command)

Or, if you don't want to upgrade Cilium (or can't afford downtime), you can skip it like this:

ansible-playbook -i <INVENTORY> upgrade-cluster.yml --skip-tags cilium

However, the old installation templates will then no longer be maintained. Please choose according to your situation.
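
As an alternative to passing -e flags, these settings can also live in the inventory. A minimal sketch, assuming a standard Kubespray group_vars layout (the file path below is illustrative):

# inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml  (illustrative path)
cilium_remove_old_resources: true   # delete the old templated Cilium resources on upgrade (causes downtime)
kube_owner: root                    # only needed if your original cluster's kube_owner was not root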

Which issue(s) this PR fixes:

Fixes #12252

Special notes for your reviewer:

This will be backported to 2.28. I tested it in my environment (upgrading from 2.27 to 2.28).

Add kube_owner=root (if your original cluster's kube_owner is kube) and cilium_remove_old_resources=true.

Does this PR introduce a user-facing change?:

Fix Cilium clusters breaking when upgraded from 2.27 to 2.28
Fix the Helm release re-use error message when installing repeatedly

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 24, 2025
@k8s-ci-robot k8s-ci-robot requested review from cyclinder and VannTen May 24, 2025 14:42
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 24, 2025
@tico88612 (Member, Author)

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label May 24, 2025
@tico88612 (Member, Author)

/label tide/merge-method-merge

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-merge Denotes a PR that should use a standard merge by tide when it merges. label May 24, 2025
@vdveldet commented May 25, 2025

There might be a use case where /opt/cni/bin is owned by kube. In that case, the cilium containers will error. Also, without adding the variable, the upgrade playbook just fails.

@tico88612 (Member, Author)

@vdveldet Yep, that's why I wrote "if your kube_owner is not root, please add -e "kube_owner=root" to the command" in the description.

@ledroide (Contributor) commented Jun 2, 2025

Tested on my cluster: the fix solves the error message "Unable to install Cilium: cannot re-use a name that is still in use" when running cluster.yml -> good for me

@yankay (Member) commented Jun 3, 2025

Hi @cyclinder, would you please help review it? :-)

@cyclinder (Contributor) left a comment

Hi, I think deleting all the old Cilium resources is a breaking change: it causes a breakdown of the cluster, and we also lose the prior Cilium install configuration.

Do we use the cilium install or the kubectl apply -f command when the cluster is being created and upgraded?

@tico88612 (Member, Author)

> I think deleting all the old Cilium resources is a breaking change: it causes a breakdown of the cluster, and we also lose the prior Cilium install configuration.

The original design intended for Cilium's Helm release to take over on upgrade, but due to Ansible collection version limitations this didn't work.

Deleting the old resources (the old templates managed by Kubespray) is a good way to ensure that the new installation will be fine. (This will disconnect the cluster internally; cilium_remove_old_resources defaults to false.)

I note in the PR description that this is optional: you can skip the cilium tag with --skip-tags cilium, but that means the old templates won't be maintained anymore.

> Do we use the cilium install or the kubectl apply -f command when the cluster is being created and upgraded?

Currently, cluster installations of Cilium use cilium install (#12101).

This PR fixes the situation where cilium install can't be run repeatedly: cilium_action looks at the cluster status and chooses install or upgrade.
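
One of the commit messages below notes that cilium version can detect an existing release without the helm binary. A minimal Ansible sketch of that decision, with hypothetical task and variable names (the actual tasks in this PR may differ):

- name: Cilium | Check whether a Cilium release already exists
  command: "{{ bin_dir }}/cilium version"
  register: cilium_version_result
  failed_when: false
  changed_when: false

- name: Cilium | Choose install or upgrade based on cluster status
  set_fact:
    # Assumption: a successful "cilium version" that reports a running image
    # means a release already exists, so we upgrade instead of installing.
    cilium_action: "{{ 'upgrade' if cilium_version_result.rc == 0 and 'running' in cilium_version_result.stdout else 'install' }}"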

@cyclinder (Contributor) commented Jun 3, 2025

> Currently, cluster installations of Cilium use cilium install (#12101).

Thanks for the information. What if we add a new task that upgrades Cilium with the cilium upgrade command, run only when upgrading the cluster? Does that work? It would keep the original install configuration without loss, and it wouldn't affect the cluster upgrade. How about this?

If we fix this issue by removing Cilium resources, we must promptly update our tasks whenever resource manifests are added or removed in upstream Cilium. Otherwise, we may encounter similar upgrade issues. This approach would make maintenance challenging for us...

@tico88612 (Member, Author)

> What if we add a new task that upgrades Cilium with the cilium upgrade command, run only when upgrading the cluster? Does that work? It would keep the original install configuration without loss, and it wouldn't affect the cluster upgrade. How about this?

This PR determines the action automatically. If we kept only upgrade, then upgrading a cluster where Cilium had never been installed would fail.
If we kept only install, then running cluster.yml more than once would also fail.

> If we fix this issue by removing Cilium resources, we must promptly update our tasks whenever resource manifests are added or removed in upstream Cilium. Otherwise, we may encounter similar upgrade issues. This approach would make maintenance challenging for us...

wdym? This only removes the old templates that we previously maintained. All future Cilium upgrades will go through cilium upgrade.

@ledroide (Contributor) commented Jun 3, 2025

@cyclinder: I have had to run cluster.yml multiple times, for various reasons, since I merged this fix into my own fork, on a running cluster. I confirm that it did not disrupt the cilium daemonset, and I did not notice any downtime.

By the way, Ansible itself is designed to be idempotent. If I run the same set of Ansible tasks many times, Ansible should only apply changes when needed. In other words, there is no reason for the task code to differ between bootstrapping (= installing) and upgrading (even when there is nothing to upgrade). Ansible is designed to change only what differs from the desired state.
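
For instance, a minimal sketch of that idempotent style using the kubernetes.core.k8s module (the resource definition here is purely illustrative): the module compares the live object against the desired definition and only issues a write when they differ, so the same task is safe for both installs and upgrades.

- name: Ensure the Cilium config is at the desired state (illustrative)
  kubernetes.core.k8s:
    state: present
    definition:
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: cilium-config      # hypothetical object, for illustration only
        namespace: kube-system
      data:
        enable-hubble: "true"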

@cyclinder (Contributor)

> wdym? This only removes the old templates that we previously maintained. All future Cilium upgrades will go through cilium upgrade.

From my understanding, this issue only happens when upgrading from 2.27 to 2.28? In future releases, we will use cilium install for cluster installation and cilium upgrade for cluster upgrades, correct?

@tico88612 (Member, Author)

> From my understanding, this issue only happens when upgrading from 2.27 to 2.28?

Yes, this issue only happens when the cluster is upgraded from 2.27 to 2.28 (i.e., moving from the Jinja templates to the Cilium CLI).

> In future releases, we will use cilium install for cluster installation and cilium upgrade for cluster upgrades, correct?

cilium install is used on the first installation, and cilium upgrade is used if Cilium is already installed, which prevents Error: Unable to install Cilium: cannot re-use a name that is still in use.

@cyclinder (Contributor)

Thanks for your explanation, I don't have further questions.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 4, 2025
@spantaleev (Contributor)

This is still incomplete.

Back in Kubespray v2.27.0, I had Hubble installed via:

cilium_enable_hubble: true
cilium_hubble_install: true
cilium_hubble_metrics:
  - dns
  - drop
  - tcp
  - flow
  - icmp
  - http
cilium_hubble_tls_generate: true

Upgrading Cilium (for Kubespray v2.28.0) choked a few times because resources lacked Helm annotations. It first choked on some Hubble-related Secret resources, then on ClusterRoleBinding resources.

Adjusting roles/network_plugin/cilium/tasks/remove_old_resources.yml like this helps:

     - { kind: ClusterRole, name: hubble-ui }
     - { kind: ClusterRoleBinding, name: cilium }
     - { kind: ClusterRoleBinding, name: cilium-operator }
+    - { kind: ClusterRoleBinding, name: hubble-generate-certs }
+    - { kind: ClusterRoleBinding, name: hubble-relay }
+    - { kind: ClusterRoleBinding, name: hubble-ui }
+    - { kind: Secret, name: hubble-ca-secret }
+    - { kind: Secret, name: hubble-relay-client-certs }
+    - { kind: Secret, name: hubble-server-certs }
   register: patch_result

Please feel free to incorporate this patch in your pull request.
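
For context, one plausible shape for a task consuming this kind/name list is a simple loop over kubectl. A minimal sketch, assuming kubectl lives under {{ bin_dir }} and the resources sit in kube-system; note the real task registers patch_result, so the actual mechanism may patch/annotate the resources for Helm adoption rather than delete them:

- name: Cilium | Clean up resources left over from the old templated install (sketch)
  command: >-
    {{ bin_dir }}/kubectl --namespace kube-system delete --ignore-not-found
    {{ item.kind }} {{ item.name }}
  loop:
    # Two illustrative entries from the list above; the full list is longer.
    # The namespace flag is ignored for cluster-scoped kinds such as ClusterRoleBinding.
    - { kind: ClusterRoleBinding, name: hubble-generate-certs }
    - { kind: Secret, name: hubble-server-certs }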

Give users two options: besides skipping Cilium, add
`cilium_remove_old_resources` (default `false`); when set to `true`,
it removes the old version's resources, but this causes downtime,
so it must be used with care.

Signed-off-by: ChengHao Yang <[email protected]>
`cilium install` is equivalent to `helm install`; it will fail if a
cilium release already exists. `cilium version` can detect an existing
release without the helm binary.

Signed-off-by: ChengHao Yang <[email protected]>
@tico88612 tico88612 force-pushed the fix/cilium-migration branch from 62c26d7 to 1f9020f on June 5, 2025 13:15
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 5, 2025
@k8s-ci-robot (Contributor)

New changes are detected. LGTM label has been removed.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tico88612

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 5, 2025