Conversation
@hongkailiu hongkailiu commented Sep 23, 2025

This is to cover the cluster scaling case from the rule [1] that was introduced recently:

Operators should not report Progressing only because DaemonSets
owned by them are adjusting to a new node from cluster scaleup or
a node rebooting from cluster upgrade.

The test plugs into the existing scaling test. It checks each CO's Progressing condition before and after the test, and identifies every CO that either left Progressing=False or re-entered Progressing=False with a different LastTransitionTime.

The bugs were created for the node-rebooting case. The condition
goes to Progressing=True with the same reason that we found for
cluster scaling up/down. Thus, we reuse those bugs instead of
creating a new set that might be closed as duplicates.

[1]. https://github.com/openshift/api/blob/61248d910ff74aef020492922d14e6dadaba598b/config/v1/types_cluster_operator.go#L163-L164
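The before/after comparison described above can be sketched roughly as follows. This is a hypothetical simplification (the type `condition` and the function `progressedDuringScaling` are invented here for illustration); the real check lives in test/extended/machines/scale.go and operates on configv1.ClusterOperatorStatusCondition values:

```go
package main

import (
	"fmt"
	"time"
)

// condition is a simplified stand-in for the fields of
// configv1.ClusterOperatorStatusCondition that the check cares about.
type condition struct {
	Status             string // "True" or "False"
	LastTransitionTime time.Time
}

// progressedDuringScaling returns the names of operators that either left
// Progressing=False during the test, or came back to Progressing=False
// with a different LastTransitionTime (i.e. they flapped mid-test).
func progressedDuringScaling(before, after map[string]condition) []string {
	var offenders []string
	for name, b := range before {
		if b.Status != "False" {
			continue // only operators that started at Progressing=False are checked
		}
		a, ok := after[name]
		if !ok || a.Status != "False" || !a.LastTransitionTime.Equal(b.LastTransitionTime) {
			offenders = append(offenders, name)
		}
	}
	return offenders
}

func main() {
	t0 := time.Date(2025, 9, 23, 22, 51, 0, 0, time.UTC)
	before := map[string]condition{
		"dns":     {Status: "False", LastTransitionTime: t0},
		"network": {Status: "False", LastTransitionTime: t0},
	}
	after := map[string]condition{
		"dns":     {Status: "False", LastTransitionTime: t0},                  // unchanged: fine
		"network": {Status: "False", LastTransitionTime: t0.Add(time.Minute)}, // flapped during scaling
	}
	fmt.Println(progressedDuringScaling(before, after)) // prints: [network]
}
```

An operator that stayed at Progressing=False with the same LastTransitionTime passes; one that left Progressing=False, or returned to it with a newer timestamp, is reported.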

@hongkailiu hongkailiu changed the title ClusterOperators should not go Progressing only for cluster scaling OTA-1637: ClusterOperators should not go Progressing only for cluster scaling Sep 23, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Sep 23, 2025
@openshift-ci-robot

openshift-ci-robot commented Sep 23, 2025

@hongkailiu: This pull request references OTA-1637 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

This is to cover the cluster scaling case from the rule [1] that is introduced recently:

Operators should not report Progressing only because DaemonSets
owned by them are adjusting to a new node from cluster scaleup or
a node rebooting from cluster upgrade.

The test plugs into the existing scaling test. It checks each CO's Progressing condition before and after the test, and identifies every CO that either left Progressing=False or re-entered Progressing=False with a different LastTransitionTime.

[1]. https://github.com/openshift/api/blob/61248d910ff74aef020492922d14e6dadaba598b/config/v1/types_cluster_operator.go#L163-L164

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@hongkailiu
Member Author

hongkailiu commented Sep 24, 2025

This is what I expect to see (from this job):

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/30297/pull-ci-openshift-origin-main-e2e-aws-ovn-serial-2of2/1970572328379092992/artifacts/e2e-aws-ovn-serial/openshift-e2e-test/build-log.txt | rg 'failed.*scaling different machineSets simultaneously|fail.*Progressing=False'
fail [github.com/openshift/origin/test/extended/machines/scale.go:253]: those cluster operators left Progressing=False while cluster was scaling: [network image-registry node-tuning storage dns]
failed: (6m0s) 2025-09-23T22:57:26 "[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Timeout:30m][apigroup:machine.openshift.io] [Suite:openshift/conformance/serial]"

And the time matches perfectly.


Interestingly, it is the same list caught by another case. I suspect they are caused by the same underlying issue, so the same bug can be shared by the two cases.

@petr-muller
Member

/cc

@openshift-ci openshift-ci bot requested a review from petr-muller September 25, 2025 23:32
@DavidHurta

/cc

@openshift-ci openshift-ci bot requested a review from DavidHurta September 29, 2025 12:22
Member

@petr-muller petr-muller left a comment

LGTM

Member

@petr-muller petr-muller left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 30, 2025
This is to cover the cluster scaling case from the rule [1] that is
introduced recently:

```
Operators should not report Progressing only because DaemonSets
owned by them are adjusting to a new node from cluster scaleup or
a node rebooting from cluster upgrade.
```

The test plugs into the existing scaling test. It checks each
CO's Progressing condition before and after the test, and identifies
every CO that either left Progressing=False or re-entered
Progressing=False with a different LastTransitionTime.

[1]. https://github.com/openshift/api/blob/61248d910ff74aef020492922d14e6dadaba598b/config/v1/types_cluster_operator.go#L163-L164
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Sep 30, 2025
@hongkailiu
Member Author

/wip

Creating bugs for exceptions ...


@hongkailiu
Member Author

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 30, 2025

openshift-trt bot commented Sep 30, 2025

Job Failure Risk Analysis for sha: 787e9be

| Job Name | Failure Risk |
| --- | --- |
| pull-ci-openshift-origin-main-e2e-openstack-ovn | IncompleteTests: tests for this run (143) are below the historical average (2170); not enough tests ran to make a reasonable risk analysis (this could be due to infra, installation, or upgrade problems) |

The bugs are created for the case of node rebooting. The condition
goes to Progressing=True with the same reason that we found for the
cluster scaling up/down. Thus, we re-use the bugs instead of
recreating a new set of bugs that might be closed as duplicates.
@openshift-ci-robot

openshift-ci-robot commented Oct 2, 2025

@hongkailiu: This pull request references OTA-1637 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

This is to cover the cluster scaling case from the rule [1] that is introduced recently:

Operators should not report Progressing only because DaemonSets
owned by them are adjusting to a new node from cluster scaleup or
a node rebooting from cluster upgrade.

The test plugs into the existing scaling test. It checks each CO's Progressing condition before and after the test, and identifies every CO that either left Progressing=False or re-entered Progressing=False with a different LastTransitionTime.

The bugs are created for the case of node rebooting. The condition
goes to Progressing=True with the same reason that we found for the
cluster scaling up/down. Thus, we re-use the bugs instead of
recreating a new set of bugs that might be closed as duplicates.

[1]. https://github.com/openshift/api/blob/61248d910ff74aef020492922d14e6dadaba598b/config/v1/types_cluster_operator.go#L163-L164

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@hongkailiu
Member Author

The result from e2e-aws-ovn-serial-2of2 looks good:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/30297/pull-ci-openshift-origin-main-e2e-aws-ovn-serial-2of2/1973475766385512448/artifacts/e2e-aws-ovn-serial/openshift-e2e-test/artifacts/e2e.log | grep grow
started: 0/5/36 "[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Timeout:30m][apigroup:machine.openshift.io] [Suite:openshift/conformance/serial]"
passed: (5m9s) 2025-10-01T22:58:36 "[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Timeout:30m][apigroup:machine.openshift.io] [Suite:openshift/conformance/serial]"

I wanted to show some logs for the exceptions, but that does not seem easy to do when the job succeeds. 🤷

@hongkailiu
Member Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 2, 2025
Contributor

openshift-ci bot commented Oct 3, 2025

@hongkailiu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-ovn-single-node-serial | c9c5fa5 | link | false | /test e2e-aws-ovn-single-node-serial |
| ci/prow/e2e-openstack-ovn | c9c5fa5 | link | false | /test e2e-openstack-ovn |
| ci/prow/e2e-aws-ovn-single-node | c9c5fa5 | link | false | /test e2e-aws-ovn-single-node |
| ci/prow/e2e-aws-ovn-edge-zones | 0d05b6a | link | false | /test e2e-aws-ovn-edge-zones |
| ci/prow/okd-scos-e2e-aws-ovn | c9c5fa5 | link | false | /test okd-scos-e2e-aws-ovn |
| ci/prow/e2e-gcp-csi | c9c5fa5 | link | true | /test e2e-gcp-csi |
| ci/prow/e2e-aws-csi | c9c5fa5 | link | true | /test e2e-aws-csi |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


openshift-trt bot commented Oct 3, 2025

Job Failure Risk Analysis for sha: c9c5fa5

| Job Name | Failure Risk |
| --- | --- |
| pull-ci-openshift-origin-main-e2e-aws-csi | IncompleteTests: tests for this run (145) are below the historical average (647); not enough tests ran to make a reasonable risk analysis (this could be due to infra, installation, or upgrade problems) |
| pull-ci-openshift-origin-main-e2e-gcp-csi | IncompleteTests: tests for this run (146) are below the historical average (956); not enough tests ran to make a reasonable risk analysis (this could be due to infra, installation, or upgrade problems) |
| pull-ci-openshift-origin-main-e2e-openstack-ovn | IncompleteTests: tests for this run (143) are below the historical average (2510); not enough tests ran to make a reasonable risk analysis (this could be due to infra, installation, or upgrade problems) |
| pull-ci-openshift-origin-main-okd-scos-e2e-aws-ovn | IncompleteTests: tests for this run (140) are below the historical average (1542); not enough tests ran to make a reasonable risk analysis (this could be due to infra, installation, or upgrade problems) |

Member

@wking wking left a comment

e2e-aws-ovn-single-node-serial failure is unrelated:

: [Monitor:audit-log-analyzer][sig-arch][Late] operators should not create watch channels very often	0s
{Operator "prometheus-operator" produces more watch requests than expected: watchrequestcount=286, upperbound=278, ratio=1.03  }

and the changes:

/lgtm

Run up some more numbers, just to reduce the risk of turning up post-merge surprises, before we add the verified label:

/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial 10

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 4, 2025
Contributor

openshift-ci bot commented Oct 4, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hongkailiu, petr-muller, wking
Once this PR has been reviewed and has the lgtm label, please assign sosiouxme for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hongkailiu
Member Author

/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial 10

Contributor

openshift-ci bot commented Oct 5, 2025

@hongkailiu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/c31ffd80-a237-11f0-833a-bfc530ef8961-0

@hongkailiu
Member Author

2025-10-05 22:08:05 +0000 UTC: AllJobsTriggered: WithErrors: Jobs triggered with errors

/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial 10

Contributor

openshift-ci bot commented Oct 6, 2025

@hongkailiu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ef487a90-a2aa-11f0-928c-a5212e7af64a-0

@hongkailiu
Member Author

/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2 10

Contributor

openshift-ci bot commented Oct 6, 2025

@hongkailiu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/734a11e0-a2c0-11f0-8b71-344f46514be6-0

@hongkailiu
Member Author

Job execution failed: Pod got deleted unexpectedly

/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2 10

Contributor

openshift-ci bot commented Oct 6, 2025

@hongkailiu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/eac04270-a2ee-11f0-8492-5239380b4088-0

@hongkailiu
Member Author

/verified "periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2"

@openshift-ci-robot

@hongkailiu: The /verified command must be used with one of the following actions: by, later, remove, or bypass. See https://docs.ci.openshift.org/docs/architecture/jira/#premerge-verification for more information.

In response to this:

/verified "periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2"

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@hongkailiu
Member Author

/verified by periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Oct 7, 2025
@openshift-ci-robot

@hongkailiu: This PR has been marked as verified by periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2.

In response to this:

/verified by periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
