Conversation
@hongkailiu hongkailiu commented Sep 23, 2025

This is to cover the cluster scaling case from the rule [1] that was introduced recently:

Operators should not report Progressing only because DaemonSets
owned by them are adjusting to a new node from cluster scaleup or
a node rebooting from cluster upgrade.

The test plugs into the existing scaling test. It checks each CO's Progressing condition before and after the test, and identifies every CO that either left Progressing=False or re-entered Progressing=False with a different LastTransitionTime.

The bugs were created for the node-rebooting case. The condition
goes to Progressing=True with the same reason that we found for
cluster scaling up/down. Thus, we reuse those bugs instead of
creating a new set that might be closed as duplicates.

[1]. https://github.com/openshift/api/blob/61248d910ff74aef020492922d14e6dadaba598b/config/v1/types_cluster_operator.go#L163-L164
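The before/after comparison described above can be sketched roughly as follows. This is a hypothetical simplification (the type `condition` and the function `progressedDuringScaling` are invented here for illustration); the real check lives in test/extended/machines/scale.go and operates on configv1.ClusterOperatorStatusCondition values:

```go
package main

import (
	"fmt"
	"time"
)

// condition is a simplified stand-in for the fields of
// configv1.ClusterOperatorStatusCondition that the check cares about.
type condition struct {
	Status             string // "True" or "False"
	LastTransitionTime time.Time
}

// progressedDuringScaling returns the names of operators that either left
// Progressing=False during the test, or came back to Progressing=False
// with a different LastTransitionTime (i.e. they flapped mid-test).
func progressedDuringScaling(before, after map[string]condition) []string {
	var offenders []string
	for name, b := range before {
		if b.Status != "False" {
			continue // only operators that started at Progressing=False are checked
		}
		a, ok := after[name]
		if !ok || a.Status != "False" || !a.LastTransitionTime.Equal(b.LastTransitionTime) {
			offenders = append(offenders, name)
		}
	}
	return offenders
}

func main() {
	t0 := time.Date(2025, 9, 23, 22, 51, 0, 0, time.UTC)
	before := map[string]condition{
		"dns":     {Status: "False", LastTransitionTime: t0},
		"network": {Status: "False", LastTransitionTime: t0},
	}
	after := map[string]condition{
		"dns":     {Status: "False", LastTransitionTime: t0},                  // unchanged: fine
		"network": {Status: "False", LastTransitionTime: t0.Add(time.Minute)}, // flapped during scaling
	}
	fmt.Println(progressedDuringScaling(before, after)) // prints: [network]
}
```

An operator that stayed at Progressing=False with the same LastTransitionTime passes; one that left Progressing=False, or returned to it with a newer timestamp, is reported.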

@hongkailiu hongkailiu changed the title ClusterOperators should not go Progressing only for cluster scaling OTA-1637: ClusterOperators should not go Progressing only for cluster scaling Sep 23, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Sep 23, 2025
@openshift-ci-robot

openshift-ci-robot commented Sep 23, 2025

@hongkailiu: This pull request references OTA-1637 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

This is to cover the cluster scaling case from the rule [1] that is introduced recently:

Operators should not report Progressing only because DaemonSets
owned by them are adjusting to a new node from cluster scaleup or
a node rebooting from cluster upgrade.

The test plugs into the existing scaling test. It checks each CO's Progressing condition before and after the test, and identifies every CO that either left Progressing=False or re-entered Progressing=False with a different LastTransitionTime.

[1]. https://github.com/openshift/api/blob/61248d910ff74aef020492922d14e6dadaba598b/config/v1/types_cluster_operator.go#L163-L164

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@hongkailiu
Member Author

hongkailiu commented Sep 24, 2025

This is what I expect to see (from this job):

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/30297/pull-ci-openshift-origin-main-e2e-aws-ovn-serial-2of2/1970572328379092992/artifacts/e2e-aws-ovn-serial/openshift-e2e-test/build-log.txt | rg 'failed.*scaling different machineSets simultaneously|fail.*Progressing=False'
fail [github.com/openshift/origin/test/extended/machines/scale.go:253]: those cluster operators left Progressing=False while cluster was scaling: [network image-registry node-tuning storage dns]
failed: (6m0s) 2025-09-23T22:57:26 "[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Timeout:30m][apigroup:machine.openshift.io] [Suite:openshift/conformance/serial]"

And the time matches perfectly.


Interestingly, it is the same list caught by another case. I suspect they are caused by the same underlying issue, so the same bug can be shared by the two cases.

@petr-muller
Member

/cc

@openshift-ci openshift-ci bot requested a review from petr-muller September 25, 2025 23:32
@DavidHurta

/cc

@openshift-ci openshift-ci bot requested a review from DavidHurta September 29, 2025 12:22
Member

@petr-muller petr-muller left a comment

LGTM

Member

@petr-muller petr-muller left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 30, 2025
This is to cover the cluster scaling case from the rule [1] that is
introduced recently:

```
Operators should not report Progressing only because DaemonSets
owned by them are adjusting to a new node from cluster scaleup or
a node rebooting from cluster upgrade.
```

The test plugs into the existing scaling test. It checks each
CO's Progressing condition before and after the test, and identifies
every CO that either left Progressing=False or re-entered
Progressing=False with a different LastTransitionTime.

[1]. https://github.com/openshift/api/blob/61248d910ff74aef020492922d14e6dadaba598b/config/v1/types_cluster_operator.go#L163-L164
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Sep 30, 2025
@hongkailiu
Member Author

/wip

Creating bugs for exceptions ...


@hongkailiu
Member Author

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 30, 2025

openshift-trt bot commented Sep 30, 2025

Job Failure Risk Analysis for sha: 787e9be

| Job Name | Failure Risk |
| --- | --- |
| pull-ci-openshift-origin-main-e2e-openstack-ovn | IncompleteTests: tests for this run (143) are below the historical average (2170); not enough tests ran to make a reasonable risk analysis (this could be due to infra, installation, or upgrade problems) |

The bugs are created for the case of node rebooting. The condition
goes to Progressing=True with the same reason that we found for the
cluster scaling up/down. Thus, we re-use the bugs instead of
recreating a new set of bugs that might be closed as duplicates.
@openshift-ci-robot

openshift-ci-robot commented Oct 2, 2025

@hongkailiu: This pull request references OTA-1637 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

This is to cover the cluster scaling case from the rule [1] that is introduced recently:

Operators should not report Progressing only because DaemonSets
owned by them are adjusting to a new node from cluster scaleup or
a node rebooting from cluster upgrade.

The test plugs into the existing scaling test. It checks each CO's Progressing condition before and after the test, and identifies every CO that either left Progressing=False or re-entered Progressing=False with a different LastTransitionTime.

The bugs are created for the case of node rebooting. The condition
goes to Progressing=True with the same reason that we found for the
cluster scaling up/down. Thus, we re-use the bugs instead of
recreating a new set of bugs that might be closed as duplicates.

[1]. https://github.com/openshift/api/blob/61248d910ff74aef020492922d14e6dadaba598b/config/v1/types_cluster_operator.go#L163-L164

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@hongkailiu
Member Author

The result from e2e-aws-ovn-serial-2of2 looks good:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/30297/pull-ci-openshift-origin-main-e2e-aws-ovn-serial-2of2/1973475766385512448/artifacts/e2e-aws-ovn-serial/openshift-e2e-test/artifacts/e2e.log | grep grow
started: 0/5/36 "[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Timeout:30m][apigroup:machine.openshift.io] [Suite:openshift/conformance/serial]"
passed: (5m9s) 2025-10-01T22:58:36 "[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously [Timeout:30m][apigroup:machine.openshift.io] [Suite:openshift/conformance/serial]"

I wanted to show some logs for the exceptions, but that does not seem easy to do when the job succeeds. 🤷

@hongkailiu
Member Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 2, 2025
Contributor

openshift-ci bot commented Oct 3, 2025

@hongkailiu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-ovn-single-node-serial | c9c5fa5 | link | false | /test e2e-aws-ovn-single-node-serial |
| ci/prow/e2e-openstack-ovn | c9c5fa5 | link | false | /test e2e-openstack-ovn |
| ci/prow/e2e-aws-ovn-single-node | c9c5fa5 | link | false | /test e2e-aws-ovn-single-node |
| ci/prow/e2e-aws-ovn-edge-zones | 0d05b6a | link | false | /test e2e-aws-ovn-edge-zones |
| ci/prow/okd-scos-e2e-aws-ovn | c9c5fa5 | link | false | /test okd-scos-e2e-aws-ovn |
| ci/prow/e2e-gcp-csi | c9c5fa5 | link | true | /test e2e-gcp-csi |
| ci/prow/e2e-aws-csi | c9c5fa5 | link | true | /test e2e-aws-csi |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


openshift-trt bot commented Oct 3, 2025

Job Failure Risk Analysis for sha: c9c5fa5

| Job Name | Failure Risk |
| --- | --- |
| pull-ci-openshift-origin-main-e2e-aws-csi | IncompleteTests: tests for this run (145) are below the historical average (647); not enough tests ran to make a reasonable risk analysis (this could be due to infra, installation, or upgrade problems) |
| pull-ci-openshift-origin-main-e2e-gcp-csi | IncompleteTests: tests for this run (146) are below the historical average (956); not enough tests ran to make a reasonable risk analysis (this could be due to infra, installation, or upgrade problems) |
| pull-ci-openshift-origin-main-e2e-openstack-ovn | IncompleteTests: tests for this run (143) are below the historical average (2510); not enough tests ran to make a reasonable risk analysis (this could be due to infra, installation, or upgrade problems) |
| pull-ci-openshift-origin-main-okd-scos-e2e-aws-ovn | IncompleteTests: tests for this run (140) are below the historical average (1542); not enough tests ran to make a reasonable risk analysis (this could be due to infra, installation, or upgrade problems) |

Member

@wking wking left a comment

e2e-aws-ovn-single-node-serial failure is unrelated:

: [Monitor:audit-log-analyzer][sig-arch][Late] operators should not create watch channels very often	0s
{Operator "prometheus-operator" produces more watch requests than expected: watchrequestcount=286, upperbound=278, ratio=1.03  }

and the changes:

/lgtm

Run up some more numbers, just to reduce the risk of turning up post-merge surprises, before we add the verified label:

/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial 10

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 4, 2025
Contributor

openshift-ci bot commented Oct 4, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hongkailiu, petr-muller, wking
Once this PR has been reviewed and has the lgtm label, please assign sosiouxme for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hongkailiu
Member Author

/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial 10

Contributor

openshift-ci bot commented Oct 5, 2025

@hongkailiu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/c31ffd80-a237-11f0-833a-bfc530ef8961-0

@hongkailiu
Member Author

2025-10-05 22:08:05 +0000 UTC: AllJobsTriggered: WithErrors: Jobs triggered with errors

/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial 10

Contributor

openshift-ci bot commented Oct 6, 2025

@hongkailiu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ef487a90-a2aa-11f0-928c-a5212e7af64a-0

@hongkailiu
Member Author

/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2 10

Contributor

openshift-ci bot commented Oct 6, 2025

@hongkailiu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/734a11e0-a2c0-11f0-8b71-344f46514be6-0

@hongkailiu
Member Author

Job execution failed: Pod got deleted unexpectedly

/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2 10

Contributor

openshift-ci bot commented Oct 6, 2025

@hongkailiu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/eac04270-a2ee-11f0-8492-5239380b4088-0

@hongkailiu
Member Author

/verified "periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2"

@openshift-ci-robot

@hongkailiu: The /verified command must be used with one of the following actions: by, later, remove, or bypass. See https://docs.ci.openshift.org/docs/architecture/jira/#premerge-verification for more information.

In response to this:

/verified "periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2"

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@hongkailiu
Member Author

/verified by periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Oct 7, 2025
@openshift-ci-robot

@hongkailiu: This PR has been marked as verified by periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2.

In response to this:

/verified by periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
