Fail backup validation when built-in data mover has no running node-agent #9697

Joeavaikath wants to merge 10 commits into velero-io:main
Conversation
When SnapshotMoveData is enabled but no node-agent pods are running, DataUpload CRs sit unprocessed until timeout. This adds a pre-flight check in the PVC backup item action that verifies running node-agent pods exist before creating DataUpload CRs, mirroring the check FSB already performs. Cleans up the VolumeSnapshot on failure to prevent orphaned resources. Signed-off-by: Joseph <jvaikath@redhat.com>
Force-pushed from 116fd60 to dca63e7.
Codecov Report ❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #9697      +/-   ##
==========================================
+ Coverage   60.94%   60.97%   +0.02%
==========================================
  Files         384      384
  Lines       36594    36614      +20
==========================================
+ Hits        22303    22325      +22
+ Misses      12681    12678       -3
- Partials     1610     1611       +1
```
@Joeavaikath Please check the Velero data mover design: Velero supports customized data movers that don't necessarily rely on node-agent, so it is not reasonable to check the status of node-agent.
@Lyndon-Li Could add a check to ensure it's the default data mover before the node-agent running-pods check runs.
Custom data movers operate independently of node-agent, so the HasRunningPods check should only run when the built-in "velero" data mover is in use. Adds test case verifying custom data movers bypass the check. Signed-off-by: Joseph <jvaikath@redhat.com>
```go
if datamover.IsBuiltInUploader(backup.Spec.DataMover) {
	if err := nodeagent.HasRunningPods(context.Background(), backup.Namespace, p.crClient); err != nil {
		dataUploadLog.WithError(err).Error("cannot perform snapshot data movement without running node-agent pods")
		csi.CleanupVolumeSnapshot(vs, p.crClient, p.log)
```
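For context, the built-in mover gate used in the diff above can be sketched with plain stdlib Go. Treating both an empty `spec.dataMover` and the value `"velero"` as the built-in uploader is an assumption here, mirroring what `datamover.IsBuiltInUploader` appears to do:

```go
package main

import "fmt"

// dataMoverVelero is the name of the built-in data mover. Treating an
// empty spec.dataMover the same way is an assumption based on the
// discussion in this PR, not a quote of Velero's pkg/datamover.
const dataMoverVelero = "velero"

// isBuiltInUploader reports whether the backup requests the built-in
// data mover, i.e. the one whose DataUploads node-agent processes.
func isBuiltInUploader(dataMover string) bool {
	return dataMover == "" || dataMover == dataMoverVelero
}

func main() {
	for _, m := range []string{"", "velero", "my-custom-mover"} {
		fmt.Printf("dataMover=%q builtin=%v\n", m, isBuiltInUploader(m))
	}
}
```

Only when this returns true does the node-agent check (and, on failure, the `CleanupVolumeSnapshot` call) run; custom movers skip it entirely.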
It is not reasonable to do this check after creating the snapshot
Moved the check further up
```go
// HasRunningPods checks if any node agent pod is running in the namespace through controller client. If not, return the error found.
func HasRunningPods(ctx context.Context, namespace string, crClient ctrlclient.Client) error {
```
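Stripped of the Kubernetes API machinery, the decision `HasRunningPods` makes reduces to the sketch below. The simplified `pod` type and the string phases are stand-ins for `corev1.Pod` and `corev1.PodPhase`, and the label-selector listing done through the controller-runtime client is omitted:

```go
package main

import (
	"errors"
	"fmt"
)

// pod is a pared-down stand-in for corev1.Pod: just the fields the
// check looks at. The real code lists node-agent pods by label via a
// controller-runtime client before making this decision.
type pod struct {
	Name  string
	Phase string // e.g. "Pending", "Running", "Failed"
}

// hasRunningPods returns nil if at least one listed node-agent pod is
// in the Running phase, and a descriptive error otherwise.
func hasRunningPods(namespace string, pods []pod) error {
	if len(pods) == 0 {
		return fmt.Errorf("no node-agent pods found in namespace %q", namespace)
	}
	for _, p := range pods {
		if p.Phase == "Running" {
			return nil
		}
	}
	return errors.New("no node-agent pod is in the Running phase")
}

func main() {
	fmt.Println(hasRunningPods("velero", nil))
	fmt.Println(hasRunningPods("velero", []pod{{Name: "node-agent-x", Phase: "Running"}}))
}
```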
From the plugin, we cannot make a decisive check:
- It doesn't know which node the data mover is going to run on.
- So, as written, the code only checks whether any node-agent pod is running.
- But that is not enough to guarantee the data mover could be processed by a node-agent pod.

Then to what extent would this code change help?
The intention is to fail fast. If node-agent is not deployed or no pods are running, DataUpload will time out as the issue describes. The backup times out after 4h in the WaitingForPluginOperations phase; this check stops it from hanging.
It is a "will any node-agent run this" check.
Yes, but I doubt how this "will any node-agent run this" check would help. As mentioned above, in many cases the data mover pods will run on a dedicated node only and require the node-agent pod running on that node only.
Signed-off-by: Joseph <jvaikath@redhat.com>
The node-agent running check now lives in prepareBackupRequest() alongside other pre-flight validations (BSL availability, snapshot locations). This fails the entire backup with FailedValidation instead of producing per-PVC errors, and avoids creating snapshots that would need cleanup. Signed-off-by: Joseph <jvaikath@redhat.com>
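The gating the commit describes (only when `SnapshotMoveData` is set, only for the built-in mover, failing validation for the whole backup rather than producing per-PVC errors) can be sketched as below; the function and field names are illustrative, not Velero's exact API:

```go
package main

import "fmt"

// backupSpec carries only the fields this pre-flight check reads;
// it is a stand-in for the relevant parts of Velero's Backup spec.
type backupSpec struct {
	SnapshotMoveData bool
	DataMover        string // "" or "velero" means the built-in mover (assumption)
}

// validateDataMover returns a validation error when snapshot data
// movement is requested through the built-in mover but node-agent is
// not running; custom movers are skipped entirely.
func validateDataMover(spec backupSpec, nodeAgentRunning bool) []string {
	var errs []string
	if !spec.SnapshotMoveData {
		return errs
	}
	if spec.DataMover != "" && spec.DataMover != "velero" {
		return errs // custom data mover: node-agent is not required
	}
	if !nodeAgentRunning {
		errs = append(errs, "snapshot data movement is enabled but no node-agent pod is running")
	}
	return errs
}

func main() {
	fmt.Println(validateDataMover(backupSpec{SnapshotMoveData: true}, false))
	fmt.Println(validateDataMover(backupSpec{SnapshotMoveData: true, DataMover: "custom"}, false))
}
```

Any errors returned would join the other pre-flight results (BSL availability, snapshot locations) and mark the backup FailedValidation before any snapshot is created.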
@Lyndon-Li I think it properly lives in validation now: we are ensuring that there is at least one node-agent pod running before proceeding.
The node-agent validation added in prepareBackupRequest causes test cases with SnapshotMoveData enabled to fail validation. Add a running node-agent pod to the baseline test environment so all cases pass the check. Signed-off-by: Joseph <jvaikath@redhat.com>
Signed-off-by: Joseph <jvaikath@redhat.com>
Signed-off-by: Joseph <jvaikath@redhat.com>
Is this the correct link? It's very old.
Summary
When `snapshotMoveData: true` is set with the built-in data mover but no node-agent pods are running, backups hang in `WaitingForPluginOperations` until `itemOperationTimeout` (default 4h) expires. The DataUpload CR is created but never reconciled because the DataUpload controller runs inside node-agent pods.

This PR adds a pre-flight validation check in `prepareBackupRequest()` that fails the backup with `FailedValidation` if the built-in data mover is requested but no node-agent pods are running. The check is scoped to the built-in data mover only; custom data movers that don't rely on node-agent are unaffected.

Changes:
- `pkg/controller/backup_controller.go`: add node-agent validation in `prepareBackupRequest()`, consistent with existing BSL/snapshot location checks
- `pkg/controller/backup_controller_test.go`: 4 test cases (no pods, running pod, custom mover, disabled)
- `pkg/nodeagent/node_agent.go`: add `HasRunningPods()` function
- `pkg/nodeagent/node_agent_test.go`: unit tests for `HasRunningPods()`

Does your change fix a particular issue?
Please indicate you've done the following:
- Created a changelog file (`make new-changelog`) or commented `/kind changelog-not-required` on this PR.
- Updated the corresponding documentation in `site/content/docs/main`.