Fix: Cilium will break when the cluster is upgraded from 2.27 to 2.28 #12254
Conversation
/ok-to-test |
/label tide/merge-method-merge |
There might be a use case where /opt/cni/bin is owned by kube. |
@vdveldet Yep, that's why I wrote the note about adding `-e "kube_owner=root"` in the PR description. |
Tested on my cluster: the fix solves the error message "Unable to install Cilium: cannot re-use a name that is still in use" when running cluster.yml -> good for me |
Hi @cyclinder |
Hi, I think it's a breaking change to delete all the old Cilium resources, which causes a breakdown of the cluster, and we also lose the previous Cilium install configuration. |
Do we use the `cilium install` or the `kubectl apply -f` command when creating and upgrading the cluster?
The original design wanted the Cilium Helm chart to take over when upgrading, but due to Ansible collection version limitations this didn't work. Deleting the old version of the resources (the old templates managed by Kubespray) is a good way to ensure that the new installation will be fine. (This will cause the cluster to be disconnected internally; I note in the PR description that it is optional, and you can skip the `cilium` tag.)
Currently, cluster installations of Cilium use `cilium install`. This PR fixes the situation where `cilium install` can't be run repeatedly. |
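To make the comment above concrete, here is a minimal sketch, assuming the `kubernetes.core` collection and a working kubeconfig on the host running the task; it is not the actual Kubespray task list, and the listed objects and the `cilium_remove_old_resources` gate are shown purely for illustration:

```yaml
# Hypothetical sketch only -- not the exact Kubespray implementation.
# Removes a few of the old template-managed Cilium objects so that a fresh
# `cilium install` (Helm-based) does not collide with them.
- name: Remove old template-managed Cilium resources before `cilium install`
  kubernetes.core.k8s:
    state: absent
    kind: "{{ item.kind }}"
    name: "{{ item.name }}"
    namespace: kube-system
  loop:
    - { kind: DaemonSet, name: cilium }
    - { kind: Deployment, name: cilium-operator }
    - { kind: ConfigMap, name: cilium-config }
  when: cilium_remove_old_resources | default(false)
```

Deleting the DaemonSet is what causes the temporary in-cluster network disruption mentioned above, which is why the option is opt-in.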
Thanks for your information. What if we had a new task to upgrade Cilium by using the Cilium CLI instead? If we fix this issue by removing Cilium resources, we must promptly update our tasks whenever there are any additions or deletions of resource manifests in upstream Cilium. Otherwise, we may encounter similar upgrade issues. This approach would make maintenance challenging for us... |
This PR will automatically determine that only when `cilium_remove_old_resources` is set to `true`.
wdym? This only removes old templates that we previously maintained. All future Cilium upgrade processes will be handled by the Cilium CLI. |
@cyclinder: I had to run cluster.yml multiple times, for various reasons, since I merged this fix into my own fork, on a running cluster. I confirm that it did not restart the cilium daemonset, and I did not notice any downtime. By the way, Ansible itself is designed to be idempotent: if I run the same set of Ansible tasks many times, Ansible should only apply changes when needed. In other words, there is no reason to differentiate in the task code between bootstrapping (= installing) and upgrading (even when there is nothing to upgrade). Ansible is designed to only change what differs from the desired state. |
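As a side note to the idempotence point above, here is a minimal illustration, not taken from this PR, of an Ansible task that only reports a change when the target differs from the desired state; `/opt/cni/bin` and `kube_owner` are mentioned elsewhere in this thread and are used here purely as an example:

```yaml
# Re-running this task is harmless: it reports "changed" only when the
# directory's owner or mode actually differs from the desired state.
- name: Ensure the CNI bin directory exists with the expected owner
  ansible.builtin.file:
    path: /opt/cni/bin
    state: directory
    owner: "{{ kube_owner | default('root') }}"
    mode: "0755"
```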
From my understanding, this issue only happened when we upgraded from 2.27 to 2.28? In future releases, we will be using the Cilium CLI. |
Yes, this issue only happened when the cluster was upgraded from 2.27 to 2.28 (i.e., moving from the Jinja templates to the Cilium CLI). |
Thanks for your explanation, I don't have further questions. /lgtm |
This is still incomplete. In Kubespray v2.27.0 times, I had Hubble installed via:

```yaml
cilium_enable_hubble: true
cilium_hubble_install: true
cilium_hubble_metrics:
  - dns
  - drop
  - tcp
  - flow
  - icmp
  - http
cilium_hubble_tls_generate: true
```

Upgrading Cilium (for Kubespray v2.28.0) choked a few times, because resources were lacking Helm annotations. It first choked on some Hubble-related resources. Adjusting the patched resource list as follows:

```diff
   - { kind: ClusterRole, name: hubble-ui }
   - { kind: ClusterRoleBinding, name: cilium }
   - { kind: ClusterRoleBinding, name: cilium-operator }
+  - { kind: ClusterRoleBinding, name: hubble-generate-certs }
+  - { kind: ClusterRoleBinding, name: hubble-relay }
+  - { kind: ClusterRoleBinding, name: hubble-ui }
+  - { kind: Secret, name: hubble-ca-secret }
+  - { kind: Secret, name: hubble-relay-client-certs }
+  - { kind: Secret, name: hubble-server-certs }
   register: patch_result
```

Please feel free to incorporate this patch in your pull request. |
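For context on what "lacking Helm annotations" means here: Helm 3 only adopts pre-existing objects that carry its ownership metadata, so each resource in the list above has to be patched with it before `cilium install` runs. A rough sketch of such a patch; the task shape and release values are assumptions, not the exact Kubespray implementation:

```yaml
# Hypothetical patch task: adds the label and annotations Helm checks before
# it will take ownership of an object that already exists in the cluster.
- name: Annotate an existing Hubble resource so Helm can adopt it
  kubernetes.core.k8s:
    state: patched
    api_version: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    name: hubble-generate-certs
    definition:
      metadata:
        labels:
          app.kubernetes.io/managed-by: Helm
        annotations:
          meta.helm.sh/release-name: cilium
          meta.helm.sh/release-namespace: kube-system
  register: patch_result
```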
Give users two options: besides skipping Cilium, add `cilium_remove_old_resources` (default `false`); when set to `true`, it will remove the old version's resources, but this will cause downtime, so it needs to be used carefully. Signed-off-by: ChengHao Yang <[email protected]>
`cilium install` is equivalent to `helm install`; it will fail if a Cilium release already exists. `cilium version` can tell that the release exists without the helm binary. Signed-off-by: ChengHao Yang <[email protected]>
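A rough sketch of the detection idea in this commit message; the task name, registered variable, and `bin_dir` default are illustrative assumptions rather than the exact Kubespray code:

```yaml
# `cilium version` talks to the cluster and reports the running Cilium
# release (if any), so it can be used to decide between a fresh
# `cilium install` and an upgrade path, without requiring the helm binary.
- name: Check whether a Cilium release already exists
  ansible.builtin.command: "{{ bin_dir | default('/usr/local/bin') }}/cilium version"
  register: cilium_version_check
  changed_when: false
  failed_when: false
```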
New changes are detected. LGTM label has been removed. |
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: tico88612 |
What type of PR is this?
/kind bug
What this PR does / why we need it:
If the cluster is upgraded from 2.27 to 2.28 and `kube_network_plugin` is set to `cilium`, it will fail. So, I added the `cilium_remove_old_resources` option, which will delete the old Cilium resources (templates).
Be careful: while this will cause downtime, at least the installation will complete.
`ansible-playbook -i <INVENTORY> upgrade-cluster.yml -e "cilium_remove_old_resources=true"`
(If your `kube_owner` is not `root`, please add `-e "kube_owner=root"` to the command.)
Or, if you don't want to upgrade Cilium (or don't want downtime), you can skip it like this:
`ansible-playbook -i <INVENTORY> upgrade-cluster.yml --skip-tags cilium`
However, the old installation templates will not be maintained. Please choose according to your situation.
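As an alternative to passing `-e` on the command line, the same settings could presumably live in the inventory; a small sketch (the file path is just a common Kubespray convention, not something this PR adds):

```yaml
# e.g. inventory/<cluster>/group_vars/k8s_cluster/k8s-net-cilium.yml
cilium_remove_old_resources: true   # opt in to deleting the old template-managed resources
kube_owner: root                    # only needed if the original cluster's kube_owner is kube
```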
Which issue(s) this PR fixes:
Fixes #12252
Special notes for your reviewer:
It will be backported to 2.28. I tested it in my environment (upgrade from 2.27 to 2.28).
Add `kube_owner=root` (if your original cluster's `kube_owner` is `kube`) and `cilium_remove_old_resources=true`.
Does this PR introduce a user-facing change?: