no fail mon tests#31212
Conversation
…t not fail junits
|
Pipeline controller notification For optional jobs, comment This repository is configured in: automatic mode |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dgoodwin The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
WalkthroughThis PR introduces a new SpotCheck cluster stability mode that enables lightweight monitoring runs without influencing test results. It migrates etcd leader-changes verification from Ginkgo tests to the monitor framework, updates multiple monitor tests to optionally suppress JUnit output, and integrates the new mode throughout the test infrastructure. ChangesSpotCheck Cluster Stability Mode
🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 10 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (10 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
/payload-job periodic-ci-openshift-release-main-nightly-5.0-e2e-gcp-ovn-etcd-scaling |
|
@dgoodwin: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/87a97980-5843-11f1-8824-e7855c3d2249-0 |
9f8b39b to
7525d8a
Compare
|
/payload-job periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-ovn-etcd-scaling |
|
@dgoodwin: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/81856720-5844-11f1-805b-5abfdac98a10-0 |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/monitortests/etcd/leaderchanges/monitortest.go`:
- Around line 83-85: The code currently only treats
configv1.ExternalTopologyMode as external by setting etcdNamespace =
"clusters-.*" in monitortest.go; update the conditional around
infra.Status.ControlPlaneTopology to also treat the DualReplica topology as
external (or explicitly map it) so etcdNamespace is set appropriately for
dual-replica clusters and avoid querying in-cluster metrics when etcd is
external; modify the branch that sets etcdNamespace (referencing
infra.Status.ControlPlaneTopology, configv1.ExternalTopologyMode, and the
etcdNamespace variable) to include configv1.DualReplicaTopologyMode (or the
correct DualReplica constant) as an additional case.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: b9a3a38a-08e7-4578-92da-2d3d995a9356
📒 Files selected for processing (11)
pkg/defaultmonitortests/types.gopkg/monitortestframework/types.gopkg/monitortests/clusterversionoperator/clusterversionchecker/monitortest.gopkg/monitortests/etcd/etcdloganalyzer/monitortest.gopkg/monitortests/etcd/leaderchanges/monitortest.gopkg/monitortests/node/watchnodes/monitortest.gopkg/monitortests/testframework/legacytestframeworkmonitortests/alerts_monitortest.gopkg/test/ginkgo/cmd_runsuite.gopkg/test/ginkgo/test_suite.gopkg/testsuites/standard_suites.gotest/extended/etcd/leader_changes.go
💤 Files with no reviewable changes (1)
- test/extended/etcd/leader_changes.go
| if infra.Status.ControlPlaneTopology == configv1.ExternalTopologyMode { | ||
| etcdNamespace = "clusters-.*" | ||
| } |
There was a problem hiding this comment.
Handle DualReplica topology as unsupported (or map it explicitly).
At Line 83, only ExternalTopologyMode is handled. DualReplica clusters can run etcd externally, so querying etcd_server_leader_changes_seen_total as if in-cluster can return no series and trigger false failures.
Suggested fix
- etcdNamespace := "openshift-etcd"
- if infra.Status.ControlPlaneTopology == configv1.ExternalTopologyMode {
+ etcdNamespace := "openshift-etcd"
+ if infra.Status.ControlPlaneTopology == configv1.DualReplicaTopologyMode {
+ return nil, nil, &monitortestframework.NotSupportedError{
+ Reason: "DualReplica topology not supported for etcd leader-change monitor",
+ }
+ }
+ if infra.Status.ControlPlaneTopology == configv1.ExternalTopologyMode {
etcdNamespace = "clusters-.*"
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/monitortests/etcd/leaderchanges/monitortest.go` around lines 83 - 85, The
code currently only treats configv1.ExternalTopologyMode as external by setting
etcdNamespace = "clusters-.*" in monitortest.go; update the conditional around
infra.Status.ControlPlaneTopology to also treat the DualReplica topology as
external (or explicitly map it) so etcdNamespace is set appropriately for
dual-replica clusters and avoid querying in-cluster metrics when etcd is
external; modify the branch that sets etcdNamespace (referencing
infra.Status.ControlPlaneTopology, configv1.ExternalTopologyMode, and the
etcdNamespace variable) to include configv1.DualReplicaTopologyMode (or the
correct DualReplica constant) as an additional case.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/monitortests/etcd/leaderchanges/monitortest.go`:
- Around line 119-123: The code truncates Prometheus leader-change counts by
converting leaderChanges (vec[0].Value, a model.SampleValue/float) to int before
comparing against allowedLeaderChanges; change the comparison to use
floating-point so no truncation occurs (e.g., compare leaderChanges as a float64
to float64(allowedLeaderChanges)), update any fmt.Sprintf/return messages to
format the original float value (leaderChanges) rather than the truncated int,
and keep the existing symbols (leaderChanges, allowedLeaderChanges,
numberOfRevisions, vec, failJunit) to locate and apply the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: a97e84c4-7b22-4803-85fb-5ad06593327b
📒 Files selected for processing (4)
pkg/defaultmonitortests/types.gopkg/monitortests/etcd/leaderchanges/monitortest.gopkg/testsuites/standard_suites.gotest/extended/etcd/leader_changes.go
💤 Files with no reviewable changes (1)
- test/extended/etcd/leader_changes.go
| allowedLeaderChanges := numberOfRevisions * 3 | ||
| leaderChanges := vec[0].Value | ||
|
|
||
| if int(leaderChanges) > allowedLeaderChanges { | ||
| return nil, failJunit(fmt.Sprintf( |
There was a problem hiding this comment.
Avoid truncating Prometheus leader-change counts before threshold checks.
leaderChanges is taken from vec[0].Value (model.SampleValue, a float). Converting to int truncates and can undercount near the boundary.
Suggested fix
- allowedLeaderChanges := numberOfRevisions * 3
- leaderChanges := vec[0].Value
-
- if int(leaderChanges) > allowedLeaderChanges {
+ allowedLeaderChanges := numberOfRevisions * 3
+ leaderChanges := vec[0].Value
+
+ if leaderChanges > model.SampleValue(allowedLeaderChanges) {
return nil, failJunit(fmt.Sprintf(
"observed %s leader changes (expected at most %d) in %s: Leader changes are a result of stopping the etcd leader process or from latency (disk or network), review etcd performance metrics",
leaderChanges, allowedLeaderChanges, testDuration,
)), nil
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/monitortests/etcd/leaderchanges/monitortest.go` around lines 119 - 123,
The code truncates Prometheus leader-change counts by converting leaderChanges
(vec[0].Value, a model.SampleValue/float) to int before comparing against
allowedLeaderChanges; change the comparison to use floating-point so no
truncation occurs (e.g., compare leaderChanges as a float64 to
float64(allowedLeaderChanges)), update any fmt.Sprintf/return messages to format
the original float value (leaderChanges) rather than the truncated int, and keep
the existing symbols (leaderChanges, allowedLeaderChanges, numberOfRevisions,
vec, failJunit) to locate and apply the change.
|
Scheduling required tests: |
|
Risk analysis has seen new tests most likely introduced by this PR. New tests seen in this PR at sha: 7525d8a
|
Summary by CodeRabbit