OTA-1637: ClusterOperators should not go Progressing only for cluster scaling #30297
Conversation
@hongkailiu: This pull request references OTA-1637 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
This is what I expect to see (from this job):

And the time matches perfectly. Interestingly, it is the same list caught by another case. I suspect they are caused by the same underlying issue, and the same bug can be shared by the two cases.
/cc

/cc
LGTM
/lgtm
This is to cover the cluster scaling case from the rule [1] that was recently introduced:

```
Operators should not report Progressing only because DaemonSets owned by them are adjusting to a new node from cluster scaleup or a node rebooting from cluster upgrade.
```

The test plugs into the existing scaling test. It checks each CO's Progressing condition before and after the test, and identifies every CO that either left Progressing=False or re-entered Progressing=False with a different LastTransitionTime.

[1]. https://github.com/openshift/api/blob/61248d910ff74aef020492922d14e6dadaba598b/config/v1/types_cluster_operator.go#L163-L164
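For readers skimming the thread, here is a minimal sketch of what such a before/after check could look like, assuming the openshift/api config/v1 types. The helper names and structure are hypothetical, not the actual test code from this PR:

```go
// Package scaling sketches the before/after Progressing check described above.
package scaling

import (
	configv1 "github.com/openshift/api/config/v1"
)

// progressingSnapshot maps each ClusterOperator name to its Progressing
// condition at a point in time.
type progressingSnapshot map[string]configv1.ClusterOperatorStatusCondition

// snapshotProgressing extracts the Progressing condition of every
// ClusterOperator in the list; call it once before and once after the
// scaling test.
func snapshotProgressing(operators []configv1.ClusterOperator) progressingSnapshot {
	snapshot := progressingSnapshot{}
	for _, co := range operators {
		for _, cond := range co.Status.Conditions {
			if cond.Type == configv1.OperatorProgressing {
				snapshot[co.Name] = cond
				break
			}
		}
	}
	return snapshot
}

// violators returns the operators that either left Progressing=False during
// the test, or sit at Progressing=False again but with a different
// LastTransitionTime, i.e. they went Progressing at some point in between.
func violators(before, after progressingSnapshot) []string {
	var names []string
	for name, pre := range before {
		if pre.Status != configv1.ConditionFalse {
			continue // only operators that started at Progressing=False are checked
		}
		post, ok := after[name]
		if !ok {
			continue
		}
		if post.Status != configv1.ConditionFalse ||
			!post.LastTransitionTime.Equal(&pre.LastTransitionTime) {
			names = append(names, name)
		}
	}
	return names
}
```

Comparing LastTransitionTime is what catches the "went Progressing and came back" case: if an operator bounced through Progressing=True during the test, its Progressing=False condition afterwards carries a newer transition timestamp even though the status looks unchanged.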
Force-pushed from 5891e83 to 787e9be (Compare)
/wip Creating bugs for exceptions ...
/hold
Job Failure Risk Analysis for sha: 787e9be
Force-pushed from 6e43bdc to 0d05b6a (Compare)
The bugs were created for the node-rebooting case. The condition transitions to Progressing=True with the same reason we observed for cluster scaling up/down, so we reuse those bugs instead of creating a new set that would likely be closed as duplicates.
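To make the reboot/scaling symmetry concrete, here is a hypothetical sketch of the transient condition both cases produce; the Reason and Message strings below are invented for illustration and are not taken from the actual operators:

```go
package main

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Hypothetical shape of the transient state the test flags: during a node
	// reboot or a scale event the operator briefly reports Progressing=True,
	// then returns to Progressing=False with a fresh LastTransitionTime.
	transient := configv1.ClusterOperatorStatusCondition{
		Type:               configv1.OperatorProgressing,
		Status:             configv1.ConditionTrue,
		Reason:             "DaemonSetRollout", // hypothetical reason string
		Message:            "DaemonSet is adjusting to a new node",
		LastTransitionTime: metav1.Now(),
	}
	fmt.Printf("%s=%s (%s): %s\n",
		transient.Type, transient.Status, transient.Reason, transient.Message)
}
```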
Force-pushed from 0d05b6a to c9c5fa5 (Compare)
The result from e2e-aws-ovn-serial-2of2 looks good:

```
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/30297/pull-ci-openshift-origin-main-e2e-aws-ovn-serial-2of2/1973475766385512448/artifacts/e2e-aws-ovn-serial/openshift-e2e-test/artifacts/e2e.log | grep grow
started: 0/5/36 "[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Timeout:30m][apigroup:machine.openshift.io] [Suite:openshift/conformance/serial]"
passed: (5m9s) 2025-10-01T22:58:36 "[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Timeout:30m][apigroup:machine.openshift.io] [Suite:openshift/conformance/serial]"
```

I wanted to show some logs for the exceptions, but that does not seem easy to do when the job succeeds. 🤷
/hold cancel
@hongkailiu: The following tests failed:

Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Job Failure Risk Analysis for sha: c9c5fa5
The e2e-aws-ovn-single-node-serial failure is unrelated:

```
: [Monitor:audit-log-analyzer][sig-arch][Late] operators should not create watch channels very often 0s
{Operator "prometheus-operator" produces more watch requests than expected: watchrequestcount=286, upperbound=278, ratio=1.03 }
```

and the changes look good:
/lgtm
Run up some more numbers, just to reduce the risk of turning up post-merge surprises, before we add the verified label:
/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial 10
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hongkailiu, petr-muller, wking. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial 10
@hongkailiu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/c31ffd80-a237-11f0-833a-bfc530ef8961-0
2025-10-05 22:08:05 +0000 UTC: AllJobsTriggered: WithErrors: Jobs triggered with errors

/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial 10
@hongkailiu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ef487a90-a2aa-11f0-928c-a5212e7af64a-0
/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2 10
@hongkailiu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/734a11e0-a2c0-11f0-8b71-344f46514be6-0
Job execution failed: Pod got deleted unexpectedly

/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2 10
@hongkailiu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/eac04270-a2ee-11f0-8492-5239380b4088-0
/verified "periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2" |
@hongkailiu: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/verified by periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2
@hongkailiu: This PR has been marked as verified by periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2. In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.