Fail backup validation when built-in data mover has no running node-agent #9697

Joeavaikath wants to merge 10 commits into velero-io:main
Conversation
When SnapshotMoveData is enabled but no node-agent pods are running, DataUpload CRs sit unprocessed until timeout. This adds a pre-flight check in the PVC backup item action that verifies running node-agent pods exist before creating DataUpload CRs, mirroring the check FSB already performs. Cleans up the VolumeSnapshot on failure to prevent orphaned resources. Signed-off-by: Joseph <jvaikath@redhat.com>
Force-pushed from 116fd60 to dca63e7.
Codecov Report ❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #9697      +/-   ##
==========================================
+ Coverage   60.94%   60.97%   +0.02%
==========================================
  Files         384      384
  Lines       36594    36614      +20
==========================================
+ Hits        22303    22325      +22
+ Misses      12681    12678       -3
- Partials     1610     1611       +1
```
@Joeavaikath Please check the Velero data mover design: Velero supports customized data movers that don't necessarily rely on node-agent, so it is not reasonable to check the status of node-agent.
@Lyndon-Li Could add a check to ensure it's the default data mover before the node-agent running-pods check runs.
Custom data movers operate independently of node-agent, so the HasRunningPods check should only run when the built-in "velero" data mover is in use. Adds test case verifying custom data movers bypass the check. Signed-off-by: Joseph <jvaikath@redhat.com>
```go
if datamover.IsBuiltInUploader(backup.Spec.DataMover) {
	if err := nodeagent.HasRunningPods(context.Background(), backup.Namespace, p.crClient); err != nil {
		dataUploadLog.WithError(err).Error("cannot perform snapshot data movement without running node-agent pods")
		csi.CleanupVolumeSnapshot(vs, p.crClient, p.log)
```
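For context, the built-in mover gate used in the diff above can be sketched with plain stdlib Go. Treating both an empty `spec.dataMover` and the value `"velero"` as the built-in uploader is an assumption here, mirroring what `datamover.IsBuiltInUploader` appears to do:

```go
package main

import "fmt"

// dataMoverVelero is the name of the built-in data mover. Treating an
// empty spec.dataMover the same way is an assumption based on the
// discussion in this PR, not a quote of Velero's pkg/datamover.
const dataMoverVelero = "velero"

// isBuiltInUploader reports whether the backup requests the built-in
// data mover, i.e. the one whose DataUploads node-agent processes.
func isBuiltInUploader(dataMover string) bool {
	return dataMover == "" || dataMover == dataMoverVelero
}

func main() {
	for _, m := range []string{"", "velero", "my-custom-mover"} {
		fmt.Printf("dataMover=%q builtin=%v\n", m, isBuiltInUploader(m))
	}
}
```

Only when this returns true does the node-agent check (and, on failure, the `CleanupVolumeSnapshot` call) run; custom movers skip it entirely.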
It is not reasonable to do this check after creating the snapshot
Moved the check further up
```go
// HasRunningPods checks if any node agent pod is running in the namespace through controller client. If not, return the error found.
func HasRunningPods(ctx context.Context, namespace string, crClient ctrlclient.Client) error {
```
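Stripped of the Kubernetes API machinery, the decision `HasRunningPods` makes reduces to the sketch below. The simplified `pod` type and the string phases are stand-ins for `corev1.Pod` and `corev1.PodPhase`, and the label-selector listing done through the controller-runtime client is omitted:

```go
package main

import (
	"errors"
	"fmt"
)

// pod is a pared-down stand-in for corev1.Pod: just the fields the
// check looks at. The real code lists node-agent pods by label via a
// controller-runtime client before making this decision.
type pod struct {
	Name  string
	Phase string // e.g. "Pending", "Running", "Failed"
}

// hasRunningPods returns nil if at least one listed node-agent pod is
// in the Running phase, and a descriptive error otherwise.
func hasRunningPods(namespace string, pods []pod) error {
	if len(pods) == 0 {
		return fmt.Errorf("no node-agent pods found in namespace %q", namespace)
	}
	for _, p := range pods {
		if p.Phase == "Running" {
			return nil
		}
	}
	return errors.New("no node-agent pod is in the Running phase")
}

func main() {
	fmt.Println(hasRunningPods("velero", nil))
	fmt.Println(hasRunningPods("velero", []pod{{Name: "node-agent-x", Phase: "Running"}}))
}
```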
From the plugin, we cannot make a decisive check:
- It doesn't know which node the data mover is going to run on.
- So, as written, the code only checks whether any node-agent pod is running.
- But that is not enough to guarantee the data mover could be processed by a node-agent pod.

Then to what extent would this code change help?
The intention is to fail fast. If node-agent is not deployed or no pods are running, DataUpload will time out as the issue describes. The backup times out after 4h in the WaitingForPluginOperations phase; this check stops it from hanging.
It is a "will any node-agent run this" check.
Yes, but I doubt how this "will any node-agent run this" check would help. As mentioned above, in many cases the data mover pods will run on a dedicated node only and require the node-agent pod running on that node only.
Signed-off-by: Joseph <jvaikath@redhat.com>
The node-agent running check now lives in prepareBackupRequest() alongside other pre-flight validations (BSL availability, snapshot locations). This fails the entire backup with FailedValidation instead of producing per-PVC errors, and avoids creating snapshots that would need cleanup. Signed-off-by: Joseph <jvaikath@redhat.com>
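The gating the commit describes (only when `SnapshotMoveData` is set, only for the built-in mover, failing validation for the whole backup rather than producing per-PVC errors) can be sketched as below; the function and field names are illustrative, not Velero's exact API:

```go
package main

import "fmt"

// backupSpec carries only the fields this pre-flight check reads;
// it is a stand-in for the relevant parts of Velero's Backup spec.
type backupSpec struct {
	SnapshotMoveData bool
	DataMover        string // "" or "velero" means the built-in mover (assumption)
}

// validateDataMover returns a validation error when snapshot data
// movement is requested through the built-in mover but node-agent is
// not running; custom movers are skipped entirely.
func validateDataMover(spec backupSpec, nodeAgentRunning bool) []string {
	var errs []string
	if !spec.SnapshotMoveData {
		return errs
	}
	if spec.DataMover != "" && spec.DataMover != "velero" {
		return errs // custom data mover: node-agent is not required
	}
	if !nodeAgentRunning {
		errs = append(errs, "snapshot data movement is enabled but no node-agent pod is running")
	}
	return errs
}

func main() {
	fmt.Println(validateDataMover(backupSpec{SnapshotMoveData: true}, false))
	fmt.Println(validateDataMover(backupSpec{SnapshotMoveData: true, DataMover: "custom"}, false))
}
```

Any errors returned would join the other pre-flight results (BSL availability, snapshot locations) and mark the backup FailedValidation before any snapshot is created.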
@Lyndon-Li I think it properly lives in validation now: we are ensuring that there is at least one node-agent pod running before proceeding.
The node-agent validation added in prepareBackupRequest causes test cases with SnapshotMoveData enabled to fail validation. Add a running node-agent pod to the baseline test environment so all cases pass the check. Signed-off-by: Joseph <jvaikath@redhat.com>
Signed-off-by: Joseph <jvaikath@redhat.com>
Signed-off-by: Joseph <jvaikath@redhat.com>
Is this the correct link? It's very old.
Summary
When `snapshotMoveData: true` is set with the built-in data mover but no node-agent pods are running, backups hang in `WaitingForPluginOperations` until `itemOperationTimeout` (default 4h) expires. The DataUpload CR is created but never reconciled because the DataUpload controller runs inside node-agent pods.

This PR adds a pre-flight validation check in `prepareBackupRequest()` that fails the backup with `FailedValidation` if the built-in data mover is requested but no node-agent pods are running. The check is scoped to the built-in data mover only; custom data movers that don't rely on node-agent are unaffected.

Changes:
- `pkg/controller/backup_controller.go`: add node-agent validation in `prepareBackupRequest()`, consistent with existing BSL/snapshot location checks
- `pkg/controller/backup_controller_test.go`: 4 test cases (no pods, running pod, custom mover, disabled)
- `pkg/nodeagent/node_agent.go`: add `HasRunningPods()` function
- `pkg/nodeagent/node_agent_test.go`: unit tests for `HasRunningPods()`

Does your change fix a particular issue?
Please indicate you've done the following:
- Created a changelog file (`make new-changelog`) or commented `/kind changelog-not-required` on this PR.
- Updated the corresponding documentation in `site/content/docs/main`.