
NO-JIRA: Remove fixed bugs on CO conditions (2) - 2nd try #31207

Open
hongkailiu wants to merge 12 commits into openshift:main from hongkailiu:remove-fixed-CO-bugs-c

Conversation

@hongkailiu
Member

@hongkailiu hongkailiu commented May 21, 2026

This is to redo #31112, which was reverted by #31201 because of TRT-2669.

Now https://redhat.atlassian.net/browse/OCPBUGS-86308 is added to address TRT-2669.

I did not reuse https://issues.redhat.com/browse/OCPBUGS-62626 because the two turned out to be different cases: OCPBUGS-86308 covers scale-up, while OCPBUGS-62626 covers node reboot.

Compared to #31112, OCPBUGS-62635 is missing from this pull request because it has already been removed by #30775.

I will rebase after #30775 merges.

hongkailiu and others added 12 commits May 21, 2026 00:51
This pull skips all CO tests on SNO. Single-node clusters may briefly go Available=False for many operators during updates or Node reboots. Several operators also lack the capacity to teach their Degraded logic about single-node quality-of-service expectations. And we don't have the capacity to file and track single-node Degraded exceptions or to set Available grace periods in this test suite at the moment.

- `Available=False` and `Degraded=True` are not checked at all on SNO, regardless of whether the test case runs in an upgrade test suite. Previously they were handled as exceptions, so the job was merely flaky instead of failing. The relevant exceptions are therefore removed.

- All checks on the `Progressing` condition are skipped as well on a SNO cluster.

If determining the control plane topology fails, the existing behavior of only logging the error is kept, because I am not sure on which types of clusters such an error will show up.
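
The single-node behavior described in this commit message can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the code in this pull request: `JUnitCase`, `checkOperatorConditions`, and the test-name format are hypothetical, and `TopologyMode` is redeclared locally as a stand-in for the type in `github.com/openshift/api/config/v1`.

```go
package main

import (
	"fmt"
	"strings"
)

// TopologyMode is redeclared here so the sketch has no external dependencies;
// it stands in for the control-plane topology type of the OpenShift config API.
type TopologyMode string

const (
	SingleReplicaTopologyMode   TopologyMode = "SingleReplica"
	HighlyAvailableTopologyMode TopologyMode = "HighlyAvailable"
)

// JUnitCase is a hypothetical, trimmed-down JUnit test case.
type JUnitCase struct {
	Name        string
	SkipMessage string // non-empty: reported as skipped
	Failure     string // non-empty: reported as failed
}

// checkOperatorConditions (hypothetical) emits one case per operator. On a
// single-node cluster every case is skipped and violations are never evaluated.
func checkOperatorConditions(topology TopologyMode, operators []string, violations map[string][]string) []JUnitCase {
	cases := make([]JUnitCase, 0, len(operators))
	for _, op := range operators {
		c := JUnitCase{Name: fmt.Sprintf("clusteroperator/%s should stay Available and not Degraded", op)}
		switch {
		case topology == SingleReplicaTopologyMode:
			c.SkipMessage = "skipped on a single-node cluster"
		case len(violations[op]) > 0:
			c.Failure = strings.Join(violations[op], "\n")
		}
		cases = append(cases, c)
	}
	return cases
}

func main() {
	violations := map[string][]string{"dns": {"Available=False for 30s"}}
	for _, topology := range []TopologyMode{SingleReplicaTopologyMode, HighlyAvailableTopologyMode} {
		for _, c := range checkOperatorConditions(topology, []string{"dns"}, violations) {
			fmt.Printf("%s: skip=%q failure=%q\n", topology, c.SkipMessage, c.Failure)
		}
	}
}
```

Emitting an explicit skipped case, rather than silently returning nothing, keeps the test name visible in the JUnit results on single-node jobs.
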
OCPBUGS-23745 has been fixed and shipped with 4.15. However, the symptom is still there in 4.21, so we created OCPBUGS-66230 to track the issue.

OCPBUGS-66230 is closed as the fix is included in 4.21.0-0.nightly-2025-11-30-094855 [1].

[1]. https://redhat.atlassian.net/browse/OCPBUGS-66230?focusedCommentId=16861698
62633 is for co/service-ca and the correct one is 62634.
NTO should now correctly report ClusterOperator status conditions
Available, Progressing and Degraded.  See: OCPBUGS-62632
The symptom is still there after OCPBUGS-62630 is shipped.
The details are in the bug.
@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci-robot

@hongkailiu: This pull request explicitly references no jira issue.

Details

In response to this:

This is to redo #31112 which is revert by #31112 because of TRT-2669.

Now https://redhat.atlassian.net/browse/OCPBUGS-86009 is added for it.

I did not reuse https://issues.redhat.com/browse/OCPBUGS-62623 because they turned out to be different cases: OCPBUGS-86009 scale up vs OCPBUGS-62623 node reboot.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 21, 2026
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 21, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 21, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai

coderabbitai Bot commented May 21, 2026

Walkthrough

Refactors CVO legacy monitor tests to centralize control-plane topology fetching and pass TopologyMode to test helpers instead of *rest.Config. Removes redundant single-node-specific exception logic by consolidating single-node skip behavior into core helpers. Updates operator exception URLs in scale test and makes violations assertions conditional on cluster topology.

Changes

CVO Legacy Monitor Test Topology Refactoring

| Layer / File(s) | Summary |
|---|---|
| Exception callback signature refactor: `pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go` | exceptionCallback type drops the clientConfig parameter, and its invocation in testOperatorStateTransitions passes only operator, condition, and event interval. |
| Topology fetching and data flow in monitortest: `pkg/monitortests/clusterversionoperator/legacycvomonitortests/monitortest.go` | Adds e2e framework import for logging; EvaluateTestsFromConstructedIntervals fetches control-plane topology once and passes it to upgrade and stable-system test helpers instead of passing raw REST config. |
| Core operator state transitions helper with topology parameter: `pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go` | testOperatorStateTransitions accepts topology parameter and emits skipped JUnit cases for single-node clusters per operator test case, consolidating single-node-specific handling into the core helper. |
| Stable system operator transitions refactoring: `pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go` | testStableSystemOperatorStateTransitions accepts topology instead of clientConfig, removes single-node-specific exception logic and the network operator exception case, and passes topology to the core helper. |
| Upgrade operator transitions refactoring: `pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go` | testUpgradeOperatorStateTransitions accepts topology and derives two-node behavior from it, removes exception cases for monitoring and kube-apiserver, and passes topology to the core helper. |
| Progressing state transitions and exception URL updates: `pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go` | testUpgradeOperatorProgressingStateTransitions accepts topology and derives isTwoNode from it, adds explicit single-node skip behavior, updates network exception URL to Atlassian link, removes image-registry reason exception, and emits skipped cases instead of failure logic on single-node. |
| Scale test topology-gated enforcement: `test/extended/machines/scale.go` | Imports typed config v1 client; updates operator exception URLs to Atlassian links for dns, image-registry, and network; makes violations assertion conditional on cluster topology, only enforcing when topology is not SingleReplicaTopologyMode. |
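
The callback refactor summarized above can be illustrated with a small sketch. It assumes that the topology is captured once in a closure so the callback itself no longer needs a client config. The types `Condition` and `Interval`, the constructor `newUpgradeExceptions`, the operator name, and the exception rule are all hypothetical, and treating dual-replica and arbiter topologies as "two-node" is an assumption based on the review text, not a confirmed detail of operators.go.

```go
package main

import "fmt"

// TopologyMode is redeclared locally; it stands in for the type in
// github.com/openshift/api/config/v1 so the sketch has no dependencies.
type TopologyMode string

const (
	HighlyAvailableTopologyMode TopologyMode = "HighlyAvailable"
	DualReplicaTopologyMode     TopologyMode = "DualReplica"
	HighlyAvailableArbiterMode  TopologyMode = "HighlyAvailableArbiter"
)

// Condition and Interval are hypothetical, trimmed-down stand-ins.
type Condition struct{ Type, Status, Reason string }
type Interval struct{ Message string }

// exceptionCallback takes only operator, condition, and interval. It returns a
// non-empty explanation when the transition is a known exception.
type exceptionCallback func(operator string, condition Condition, interval Interval) string

// newUpgradeExceptions (hypothetical) derives isTwoNode from the topology once
// and closes over it, instead of querying the cluster inside the callback.
func newUpgradeExceptions(topology TopologyMode) exceptionCallback {
	isTwoNode := topology == DualReplicaTopologyMode || topology == HighlyAvailableArbiterMode
	return func(operator string, condition Condition, interval Interval) string {
		// Made-up rule, for illustration only.
		if isTwoNode && operator == "example-operator" && condition.Type == "Degraded" {
			return "tolerated on two-node topologies (illustrative rule)"
		}
		return ""
	}
}

func main() {
	degraded := Condition{Type: "Degraded", Status: "True", Reason: "Example"}
	for _, topology := range []TopologyMode{HighlyAvailableTopologyMode, DualReplicaTopologyMode} {
		cb := newUpgradeExceptions(topology)
		fmt.Printf("%s: %q\n", topology, cb("example-operator", degraded, Interval{}))
	}
}
```

With this shape, a failed topology lookup is handled in one place by the caller rather than inside every exception callback.
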

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • openshift/origin#31172: Both PRs modify the CVO legacy monitor operator transition tests to derive topology/single-node behavior and scope or remove OLM/progression exceptions.
  • openshift/origin#31112: Both PRs modify the same CVO legacy monitor exception logic in operators.go and test/extended/machines/scale.go (changing operator condition/reason exception cases and URLs).
  • openshift/origin#31201: Both PRs modify the legacy CVO monitor exception/skip logic in operators.go and the scaling test's operator condition exception handling in test/extended/machines/scale.go.

Suggested reviewers

  • sjenning
  • p0lyn0mial
🚥 Pre-merge checks | ✅ 10 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Test Structure And Quality | ⚠️ Warning | Assertion messages lack specificity (scale.go lines 287, 289 missing context), and error handling bypasses critical logic when topology lookup fails, hiding regressions per review comments. | Add meaningful messages to error assertions; make topology lookup mandatory with proper error handling instead of silent skips; ensure downstream functions don't silently degrade when topology retrieval fails. |
| Single Node Openshift (Sno) Test Compatibility | ⚠️ Warning | Test "grow and decrease when scaling machineSets" assumes multi-node clusters (tests scaling) with no SNO protection in test body or labels, only in AfterEach assertion. | Add [Skipped:SingleReplicaTopology] label to test name or guard with exutil.IsSingleNode() check and skip if true, since SNO cannot scale multiple nodes. |

✅ Passed checks (10 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately reflects the main objective of the changeset: removing fixed bugs from ClusterOperator condition test exception handling. The title clearly indicates this is the second attempt at this work. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | All test title declarations (Describe, It) in modified files use static strings with no dynamic information that changes between test runs. |
| Microshift Test Compatibility | ✅ Passed | No new Ginkgo e2e tests are added in this PR. All changes modify existing code/helpers. The existing scale.go test has [apigroup:machine.openshift.io] protection. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | Changes are test/monitoring code, not deployment manifests. They add topology-aware validation (skip SNO tests, conditional exceptions), not scheduling constraints. |
| Ote Binary Stdout Contract | ✅ Passed | PR uses only proper logging frameworks (e2e.Logf via klog, logrus) that default to stderr, not stdout. No fmt.Print/Println/Printf or direct os.Stdout writes found. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | No NEW Ginkgo e2e tests are added. All modified files (monitortest.go, operators.go, scale.go) are existing code; only refactoring of helpers and updates to existing test assertion logic. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


@openshift-ci
Contributor

openshift-ci Bot commented May 21, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hongkailiu
Once this PR has been reviewed and has the lgtm label, please assign stbenjam for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@pkg/monitortests/clusterversionoperator/legacycvomonitortests/monitortest.go`:
- Around line 94-107: getControlPlaneTopology can fail and leave topology empty,
causing downstream helpers (testUpgradeOperatorStateTransitions,
testUpgradeOperatorProgressingStateTransitions,
testStableSystemOperatorStateTransitions) to misbehave; change the error
handling after the getControlPlaneTopology call so that if err != nil you either
return the error from the enclosing function or short-circuit/skip
topology-dependent tests immediately (do not proceed with empty topology).
Locate the call to getControlPlaneTopology and replace the current
e2e.Logf-only branch with logic to return fmt.Errorf(...) (or a defined
skip/short-circuit path) when err != nil so topology-dependent calls never run
with an invalid topology value.

In `@test/extended/machines/scale.go`:
- Around line 286-296: The code currently logs and skips the topology-based
assertion when exutil.GetControlPlaneTopologyFromConfigClient returns an error
or nil; change this to fail fast by asserting the topology call succeeded and
returned a non-nil value. Replace the manual error log with
o.Expect(err).NotTo(o.HaveOccurred()) for the
exutil.GetControlPlaneTopologyFromConfigClient call and add
o.Expect(topo).NotTo(o.BeNil()) (and then check *topo !=
configv1.SingleReplicaTopologyMode) so a failed or missing topology lookup fails
the test instead of silently skipping the violations assertion; use the existing
symbols cfg, configv1client.NewForConfig,
exutil.GetControlPlaneTopologyFromConfigClient, topo, and violations to locate
and update the logic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 5f2f414c-4146-4330-b2af-1fadbfae691b

📥 Commits

Reviewing files that changed from the base of the PR and between 00c4cba and ff1ae2c.

📒 Files selected for processing (3)
  • pkg/monitortests/clusterversionoperator/legacycvomonitortests/monitortest.go
  • pkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.go
  • test/extended/machines/scale.go

Comment on lines +94 to +107
+	topology, err := getControlPlaneTopology(w.adminRESTConfig)
+	if err != nil {
+		e2e.Logf("failed to get control plane topology: %v", err)
+	}
+
 	if isUpgrade {
-		junits = append(junits, testUpgradeOperatorStateTransitions(finalIntervals, w.adminRESTConfig)...)
+		junits = append(junits, testUpgradeOperatorStateTransitions(finalIntervals, w.adminRESTConfig, topology)...)
 		level, err := getUpgradeLevel(w.adminRESTConfig)
 		if err != nil || level == unknownUpgradeLevel {
 			return nil, fmt.Errorf("failed to determine upgrade level: %w", err)
 		}
-		junits = append(junits, testUpgradeOperatorProgressingStateTransitions(finalIntervals, level == patchUpgradeLevel, w.adminRESTConfig)...)
+		junits = append(junits, testUpgradeOperatorProgressingStateTransitions(finalIntervals, level == patchUpgradeLevel, topology)...)
 	} else {
-		junits = append(junits, testStableSystemOperatorStateTransitions(finalIntervals, w.adminRESTConfig)...)
+		junits = append(junits, testStableSystemOperatorStateTransitions(finalIntervals, topology)...)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail or skip when topology lookup fails.

If getControlPlaneTopology errors here, topology stays empty and the downstream helpers behave as if the cluster is neither single-node nor two-node. That bypasses the new SNO skips and dual-replica/arbiter exception paths, so a topology read failure can turn into false monitor-test failures. Return the error or short-circuit the topology-dependent checks instead.
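
One way to read the "return the error" option in this comment is sketched below. It is a self-contained illustration, not the actual monitortest.go code: `evaluateWithTopology`, the getter and runner function parameters, and `errLookup` are hypothetical names introduced only for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// TopologyMode is redeclared locally so the sketch has no dependencies.
type TopologyMode string

// errLookup simulates a failed topology lookup in this sketch.
var errLookup = errors.New("infrastructure lookup failed")

// evaluateWithTopology (hypothetical) fetches the topology once and refuses to
// run topology-dependent checks when the lookup fails, instead of continuing
// with an empty topology value.
func evaluateWithTopology(
	getTopology func() (TopologyMode, error),
	run func(TopologyMode) []string,
) ([]string, error) {
	topology, err := getTopology()
	if err != nil {
		return nil, fmt.Errorf("failed to get control plane topology: %w", err)
	}
	return run(topology), nil
}

func main() {
	run := func(t TopologyMode) []string {
		return []string{"checked with topology " + string(t)}
	}

	// Successful lookup: the checks run with a known topology.
	results, err := evaluateWithTopology(func() (TopologyMode, error) { return "HighlyAvailable", nil }, run)
	fmt.Println(results, err)

	// Failed lookup: no checks run and the wrapped error is returned.
	results, err = evaluateWithTopology(func() (TopologyMode, error) { return "", errLookup }, run)
	fmt.Println(results, errors.Is(err, errLookup))
}
```

Whether to return the error or to emit skipped cases is a judgment call for the author; the sketch only shows that the topology-dependent helpers are never invoked with an empty topology.
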


Comment on lines +286 to +296
cfg, err := e2e.LoadConfig()
o.Expect(err).NotTo(o.HaveOccurred())
configV1Client, err := configv1client.NewForConfig(cfg)
o.Expect(err).NotTo(o.HaveOccurred())
topo, err := exutil.GetControlPlaneTopologyFromConfigClient(configV1Client)
if err != nil {
e2e.Logf("failed to get control plane topology: %v", err)
}
if topo != nil && *topo != configv1.SingleReplicaTopologyMode {
o.Expect(violations).To(o.BeEmpty(), "those cluster operators left Progressing=False while cluster was scaling: %v", violations)
}


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast when topology lookup fails instead of skipping the assertion.

If topology retrieval fails or returns nil, the violations check is silently skipped, which can hide real regressions. Please make topology discovery mandatory in this path.

Proposed fix
 		cfg, err := e2e.LoadConfig()
 		o.Expect(err).NotTo(o.HaveOccurred())
 		configV1Client, err := configv1client.NewForConfig(cfg)
 		o.Expect(err).NotTo(o.HaveOccurred())
 		topo, err := exutil.GetControlPlaneTopologyFromConfigClient(configV1Client)
-		if err != nil {
-			e2e.Logf("failed to get control plane topology: %v", err)
-		}
-		if topo != nil && *topo != configv1.SingleReplicaTopologyMode {
+		o.Expect(err).NotTo(o.HaveOccurred(), "failed to get control plane topology")
+		o.Expect(topo).NotTo(o.BeNil(), "control plane topology must be discoverable")
+		if *topo != configv1.SingleReplicaTopologyMode {
 			o.Expect(violations).To(o.BeEmpty(), "those cluster operators left Progressing=False while cluster was scaling: %v", violations)
 		}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-		cfg, err := e2e.LoadConfig()
-		o.Expect(err).NotTo(o.HaveOccurred())
-		configV1Client, err := configv1client.NewForConfig(cfg)
-		o.Expect(err).NotTo(o.HaveOccurred())
-		topo, err := exutil.GetControlPlaneTopologyFromConfigClient(configV1Client)
-		if err != nil {
-			e2e.Logf("failed to get control plane topology: %v", err)
-		}
-		if topo != nil && *topo != configv1.SingleReplicaTopologyMode {
-			o.Expect(violations).To(o.BeEmpty(), "those cluster operators left Progressing=False while cluster was scaling: %v", violations)
-		}
+		cfg, err := e2e.LoadConfig()
+		o.Expect(err).NotTo(o.HaveOccurred())
+		configV1Client, err := configv1client.NewForConfig(cfg)
+		o.Expect(err).NotTo(o.HaveOccurred())
+		topo, err := exutil.GetControlPlaneTopologyFromConfigClient(configV1Client)
+		o.Expect(err).NotTo(o.HaveOccurred(), "failed to get control plane topology")
+		o.Expect(topo).NotTo(o.BeNil(), "control plane topology must be discoverable")
+		if *topo != configv1.SingleReplicaTopologyMode {
+			o.Expect(violations).To(o.BeEmpty(), "those cluster operators left Progressing=False while cluster was scaling: %v", violations)
+		}

@hongkailiu hongkailiu marked this pull request as ready for review May 21, 2026 11:49
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 21, 2026
@openshift-ci openshift-ci Bot requested review from deads2k and p0lyn0mial May 21, 2026 11:51
@hongkailiu
Member Author

hongkailiu commented May 21, 2026

TRT-2669 says https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-ovn-serial-2of2/2057096534654193664 was failing.

/payload-aggregate periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-ovn-serial-2of2 5

@openshift-ci
Contributor

openshift-ci Bot commented May 21, 2026

@hongkailiu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-ovn-serial-2of2

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/64b4fca0-550b-11f1-9211-87f77cbc54ea-0

@openshift-merge-bot
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@openshift-ci
Contributor

openshift-ci Bot commented May 21, 2026

@hongkailiu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/e2e-vsphere-ovn-upi | ff1ae2c | link | true | /test e2e-vsphere-ovn-upi |
| ci/prow/e2e-aws-ovn-microshift | ff1ae2c | link | true | /test e2e-aws-ovn-microshift |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@hongkailiu
Member Author

The job from #31207 (comment) succeeded in 4 of 5 runs. The failing run was caused by a failed cluster installation.

Let's run it again.

/payload-aggregate periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-ovn-serial-2of2 5

@openshift-ci
Contributor

openshift-ci Bot commented May 21, 2026

@hongkailiu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-ovn-serial-2of2

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/2e00df00-5536-11f1-8c96-56c71015608c-0

