NO-JIRA: Remove fixed bugs on CO conditions (2) - 2nd try#31207
NO-JIRA: Remove fixed bugs on CO conditions (2) - 2nd try#31207hongkailiu wants to merge 12 commits into
Conversation
This pull skips all CO tests on SNO. SingleNode is may briefly go Available=False for many operators during updates or Node reboots. Several operators also lack the capacity to teach their Degraded logic about single-node quality-of-service expectations. And we don't have capacity to file and track single-node Degraded exceptions or to set Available grace periods in this test suite at the moment. - `Available=False` and `Degrade=True` are not checked at all no matter if the test case is executed in an upgrade test suite, or not. Before it was handled as an exception and thus the job would be just flaky instead of failing. Thus, the relevant exceptions are removed. - All checks on the `Progressing` condition are skipped as well on a SNO cluster. The logging logic was inherited if it fails to determine the control plane topology because I am not sure on which type of clusters an error will show up.
OCPBUGS-23745 has been fixed and shipped with 4.15. However, the symptom is still there in 4.21 and we create OCPBUGS-66230 to track the issue. OCPBUGS-66230 is closed as the fix is included in 4.21.0-0.nightly-2025-11-30-094855 [1]. [1]. https://redhat.atlassian.net/browse/OCPBUGS-66230?focusedCommentId=16861698
62633 is for co/service-ca and the correct one is 62634.
The bug is fixed [1]. [1]. https://redhat.atlassian.net/browse/OCPBUGS-62634?focusedCommentId=16981952
NTO should now correctly report ClusterOperator status conditions Available, Progressing and Degraded. See: OCPBUGS-62632
The bug is fixed [1]. [1]. https://redhat.atlassian.net/browse/OCPBUGS-62626?focusedCommentId=16983042
The bug is fixed [1]. [1]. https://redhat.atlassian.net/browse/CORENET-6605?focusedCommentId=16984425
See the details in [1]. [1]. https://redhat.atlassian.net/browse/OCPBUGS-86009
The symptom is still there after OCPBUGS-62630 is shipped.
The details are in the bug.
|
Pipeline controller notification For optional jobs, comment This repository is configured in: automatic mode |
|
@hongkailiu: This pull request explicitly references no jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Skipping CI for Draft Pull Request. |
WalkthroughRefactors CVO legacy monitor tests to centralize control-plane topology fetching and pass ChangesCVO Legacy Monitor Test Topology Refactoring
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 10 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (10 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: hongkailiu The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@pkg/monitortests/clusterversionoperator/legacycvomonitortests/monitortest.go`:
- Around line 94-107: getControlPlaneTopology can fail and leave topology empty
causing downstream helpers (testUpgradeOperatorStateTransitions,
testUpgradeOperatorProgressingStateTransitions,
testStableSystemOperatorStateTransitions) to misbehave; change the
getControlPlaneTopology(err) handling so that if err != nil you either return
the error from the enclosing function or short-circuit/skip topology-dependent
tests immediately (do not proceed with empty topology). Locate the call to
getControlPlaneTopology and replace the current e2e.Logf-only branch with logic
to return fmt.Errorf(...) (or a defined skip/short-circuit path) when err != nil
so topology-dependent calls never run with an invalid topology value.
In `@test/extended/machines/scale.go`:
- Around line 286-296: The code currently logs and skips the topology-based
assertion when exutil.GetControlPlaneTopologyFromConfigClient returns an error
or nil; change this to fail fast by asserting the topology call succeeded and
returned a non-nil value. Replace the manual error log with
o.Expect(err).NotTo(o.HaveOccurred()) for the
exutil.GetControlPlaneTopologyFromConfigClient call and add
o.Expect(topo).NotTo(o.BeNil()) (and then check *topo !=
configv1.SingleReplicaTopologyMode) so the violations assertion always runs when
topology lookup fails or is missing; use the existing symbols cfg,
configv1client.NewForConfig, exutil.GetControlPlaneTopologyFromConfigClient,
topo, and violations to locate and update the logic.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 5f2f414c-4146-4330-b2af-1fadbfae691b
📒 Files selected for processing (3)
pkg/monitortests/clusterversionoperator/legacycvomonitortests/monitortest.gopkg/monitortests/clusterversionoperator/legacycvomonitortests/operators.gotest/extended/machines/scale.go
| topology, err := getControlPlaneTopology(w.adminRESTConfig) | ||
| if err != nil { | ||
| e2e.Logf("failed to get control plane topology: %v", err) | ||
| } | ||
|
|
||
| if isUpgrade { | ||
| junits = append(junits, testUpgradeOperatorStateTransitions(finalIntervals, w.adminRESTConfig)...) | ||
| junits = append(junits, testUpgradeOperatorStateTransitions(finalIntervals, w.adminRESTConfig, topology)...) | ||
| level, err := getUpgradeLevel(w.adminRESTConfig) | ||
| if err != nil || level == unknownUpgradeLevel { | ||
| return nil, fmt.Errorf("failed to determine upgrade level: %w", err) | ||
| } | ||
| junits = append(junits, testUpgradeOperatorProgressingStateTransitions(finalIntervals, level == patchUpgradeLevel, w.adminRESTConfig)...) | ||
| junits = append(junits, testUpgradeOperatorProgressingStateTransitions(finalIntervals, level == patchUpgradeLevel, topology)...) | ||
| } else { | ||
| junits = append(junits, testStableSystemOperatorStateTransitions(finalIntervals, w.adminRESTConfig)...) | ||
| junits = append(junits, testStableSystemOperatorStateTransitions(finalIntervals, topology)...) |
There was a problem hiding this comment.
Fail or skip when topology lookup fails.
If getControlPlaneTopology errors here, topology stays empty and the downstream helpers behave as if the cluster is neither single-node nor two-node. That bypasses the new SNO skips and dual-replica/arbiter exception paths, so a topology read failure can turn into false monitor-test failures. Return the error or short-circuit the topology-dependent checks instead.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/monitortests/clusterversionoperator/legacycvomonitortests/monitortest.go`
around lines 94 - 107, getControlPlaneTopology can fail and leave topology empty
causing downstream helpers (testUpgradeOperatorStateTransitions,
testUpgradeOperatorProgressingStateTransitions,
testStableSystemOperatorStateTransitions) to misbehave; change the
getControlPlaneTopology(err) handling so that if err != nil you either return
the error from the enclosing function or short-circuit/skip topology-dependent
tests immediately (do not proceed with empty topology). Locate the call to
getControlPlaneTopology and replace the current e2e.Logf-only branch with logic
to return fmt.Errorf(...) (or a defined skip/short-circuit path) when err != nil
so topology-dependent calls never run with an invalid topology value.
| cfg, err := e2e.LoadConfig() | ||
| o.Expect(err).NotTo(o.HaveOccurred()) | ||
| configV1Client, err := configv1client.NewForConfig(cfg) | ||
| o.Expect(err).NotTo(o.HaveOccurred()) | ||
| topo, err := exutil.GetControlPlaneTopologyFromConfigClient(configV1Client) | ||
| if err != nil { | ||
| e2e.Logf("failed to get control plane topology: %v", err) | ||
| } | ||
| if topo != nil && *topo != configv1.SingleReplicaTopologyMode { | ||
| o.Expect(violations).To(o.BeEmpty(), "those cluster operators left Progressing=False while cluster was scaling: %v", violations) | ||
| } |
There was a problem hiding this comment.
Fail fast when topology lookup fails instead of skipping the assertion.
If topology retrieval fails or returns nil, the violations check is silently skipped, which can hide real regressions. Please make topology discovery mandatory in this path.
Proposed fix
cfg, err := e2e.LoadConfig()
o.Expect(err).NotTo(o.HaveOccurred())
configV1Client, err := configv1client.NewForConfig(cfg)
o.Expect(err).NotTo(o.HaveOccurred())
topo, err := exutil.GetControlPlaneTopologyFromConfigClient(configV1Client)
- if err != nil {
- e2e.Logf("failed to get control plane topology: %v", err)
- }
- if topo != nil && *topo != configv1.SingleReplicaTopologyMode {
+ o.Expect(err).NotTo(o.HaveOccurred(), "failed to get control plane topology")
+ o.Expect(topo).NotTo(o.BeNil(), "control plane topology must be discoverable")
+ if *topo != configv1.SingleReplicaTopologyMode {
o.Expect(violations).To(o.BeEmpty(), "those cluster operators left Progressing=False while cluster was scaling: %v", violations)
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| cfg, err := e2e.LoadConfig() | |
| o.Expect(err).NotTo(o.HaveOccurred()) | |
| configV1Client, err := configv1client.NewForConfig(cfg) | |
| o.Expect(err).NotTo(o.HaveOccurred()) | |
| topo, err := exutil.GetControlPlaneTopologyFromConfigClient(configV1Client) | |
| if err != nil { | |
| e2e.Logf("failed to get control plane topology: %v", err) | |
| } | |
| if topo != nil && *topo != configv1.SingleReplicaTopologyMode { | |
| o.Expect(violations).To(o.BeEmpty(), "those cluster operators left Progressing=False while cluster was scaling: %v", violations) | |
| } | |
| cfg, err := e2e.LoadConfig() | |
| o.Expect(err).NotTo(o.HaveOccurred()) | |
| configV1Client, err := configv1client.NewForConfig(cfg) | |
| o.Expect(err).NotTo(o.HaveOccurred()) | |
| topo, err := exutil.GetControlPlaneTopologyFromConfigClient(configV1Client) | |
| o.Expect(err).NotTo(o.HaveOccurred(), "failed to get control plane topology") | |
| o.Expect(topo).NotTo(o.BeNil(), "control plane topology must be discoverable") | |
| if *topo != configv1.SingleReplicaTopologyMode { | |
| o.Expect(violations).To(o.BeEmpty(), "those cluster operators left Progressing=False while cluster was scaling: %v", violations) | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/extended/machines/scale.go` around lines 286 - 296, The code currently
logs and skips the topology-based assertion when
exutil.GetControlPlaneTopologyFromConfigClient returns an error or nil; change
this to fail fast by asserting the topology call succeeded and returned a
non-nil value. Replace the manual error log with
o.Expect(err).NotTo(o.HaveOccurred()) for the
exutil.GetControlPlaneTopologyFromConfigClient call and add
o.Expect(topo).NotTo(o.BeNil()) (and then check *topo !=
configv1.SingleReplicaTopologyMode) so the violations assertion always runs when
topology lookup fails or is missing; use the existing symbols cfg,
configv1client.NewForConfig, exutil.GetControlPlaneTopologyFromConfigClient,
topo, and violations to locate and update the logic.
|
TRT-2669 says https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-ovn-serial-2of2/2057096534654193664 was failing. /payload-aggregate periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-ovn-serial-2of2 5 |
|
@hongkailiu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/64b4fca0-550b-11f1-9211-87f77cbc54ea-0 |
|
Scheduling required tests: |
|
@hongkailiu: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
The job from #31207 (comment) succeeded 4/5 runs. The failing run was due to a failing cluster installation. Let us redo. /payload-aggregate periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-ovn-serial-2of2 5 |
|
@hongkailiu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/2e00df00-5536-11f1-8c96-56c71015608c-0 |
This is to redo #31112 which is revert by #31201 because of TRT-2669.
Now https://redhat.atlassian.net/browse/OCPBUGS-86308 is added to address TRT-2669.
I did not reuse https://issues.redhat.com/browse/OCPBUGS-62626 because it turned out to be different cases: OCPBUGS-86308 scale up vs OCPBUGS-62626 node reboot.
Comparing to #31112, OCPBUGS-62635 is missing in this pull because it has been removed by #30775 already.
I will do rebase after #30775 gets in.