
Fail backup validation when built-in data mover has no running node-agent #9697

Open
Joeavaikath wants to merge 10 commits into velero-io:main from Joeavaikath:add-nodeagent-check-datamover

Conversation

Contributor

@Joeavaikath Joeavaikath commented Apr 9, 2026

Summary

When snapshotMoveData: true is set with the built-in data mover but no node-agent pods are running, backups hang in WaitingForPluginOperations until itemOperationTimeout (default 4h) expires. The DataUpload CR is created but never reconciled because the DataUpload controller runs inside node-agent pods.

This PR adds a pre-flight validation check in prepareBackupRequest() that fails the backup with FailedValidation if the built-in data mover is requested but no node-agent pods are running. The check is scoped to the built-in data mover only — custom data movers that don't rely on node-agent are unaffected.

Changes:

  • pkg/controller/backup_controller.go — add node-agent validation in prepareBackupRequest(), consistent with existing BSL/snapshot location checks
  • pkg/controller/backup_controller_test.go — 4 test cases (no pods, running pod, custom mover, disabled)
  • pkg/nodeagent/node_agent.go — add HasRunningPods() function
  • pkg/nodeagent/node_agent_test.go — unit tests for HasRunningPods()
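
The scoping rule described above (the check runs only for the built-in data mover) can be sketched in a few lines. This is a simplified standalone model, not Velero's actual code: it assumes the built-in mover is selected by an empty `DataMover` field or the value `"velero"`, and the helper names here are illustrative.

```go
package main

import "fmt"

// isBuiltInUploader models the datamover-package helper: the built-in
// mover is assumed to be selected by "" or "velero".
func isBuiltInUploader(dataMover string) bool {
	return dataMover == "" || dataMover == "velero"
}

// requiresNodeAgentCheck reports whether the pre-flight node-agent
// validation should run for a given backup spec.
func requiresNodeAgentCheck(snapshotMoveData bool, dataMover string) bool {
	return snapshotMoveData && isBuiltInUploader(dataMover)
}

func main() {
	fmt.Println(requiresNodeAgentCheck(true, "velero"))   // built-in mover: check runs
	fmt.Println(requiresNodeAgentCheck(true, "my-mover")) // custom mover: check bypassed
	fmt.Println(requiresNodeAgentCheck(false, "velero"))  // no data movement: check bypassed
}
```

This is why custom data movers are unaffected: the node-agent lookup never fires unless both snapshot data movement and the built-in mover are in play.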


When SnapshotMoveData is enabled but no node-agent pods are running,
DataUpload CRs sit unprocessed until timeout. This adds a pre-flight
check in the PVC backup item action that verifies running node-agent
pods exist before creating DataUpload CRs, mirroring the check FSB
already performs. Cleans up the VolumeSnapshot on failure to prevent
orphaned resources.

Signed-off-by: Joseph <jvaikath@redhat.com>
Signed-off-by: Joseph <jvaikath@redhat.com>
@Joeavaikath Joeavaikath force-pushed the add-nodeagent-check-datamover branch from 116fd60 to dca63e7 on April 9, 2026 at 19:25

codecov Bot commented Apr 9, 2026

Codecov Report

❌ Patch coverage is 78.94737% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.97%. Comparing base (37abfb4) to head (0a164d2).
⚠️ Report is 8 commits behind head on main.

Files with missing lines      Patch %   Lines
pkg/nodeagent/node_agent.go   73.33%    2 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9697      +/-   ##
==========================================
+ Coverage   60.94%   60.97%   +0.02%     
==========================================
  Files         384      384              
  Lines       36594    36614      +20     
==========================================
+ Hits        22303    22325      +22     
+ Misses      12681    12678       -3     
- Partials     1610     1611       +1     


@Lyndon-Li
Contributor

@Joeavaikath Please check the Velero data mover design: Velero supports customized data movers that don't necessarily rely on node-agent, so it is not reasonable to check the status of node-agent.

@Joeavaikath
Contributor Author

@Lyndon-Li I could add a check to ensure it's the default data mover before the node-agent pod-running check runs.
Or should this live somewhere else, as part of validation? What do you think?

Custom data movers operate independently of node-agent, so the
HasRunningPods check should only run when the built-in "velero"
data mover is in use. Adds test case verifying custom data movers
bypass the check.

Signed-off-by: Joseph <jvaikath@redhat.com>
@Joeavaikath Joeavaikath marked this pull request as draft April 10, 2026 13:12
Signed-off-by: Joseph <jvaikath@redhat.com>
@Joeavaikath Joeavaikath marked this pull request as ready for review April 10, 2026 13:27
@github-actions github-actions Bot requested a review from blackpiglet April 10, 2026 13:31
Comment thread on pkg/backup/actions/csi/pvc_action.go (Outdated)
if datamover.IsBuiltInUploader(backup.Spec.DataMover) {
    if err := nodeagent.HasRunningPods(context.Background(), backup.Namespace, p.crClient); err != nil {
        dataUploadLog.WithError(err).Error("cannot perform snapshot data movement without running node-agent pods")
        csi.CleanupVolumeSnapshot(vs, p.crClient, p.log)
Contributor

It is not reasonable to do this check after creating the snapshot

Contributor Author

Moved the check further up

}

// HasRunningPods checks whether any node-agent pod is running in the namespace via the controller client. If none is, it returns the error found.
func HasRunningPods(ctx context.Context, namespace string, crClient ctrlclient.Client) error {
Contributor

From the plugin, we cannot make a decisive check:
  • It doesn't know which node the data mover is going to run on
  • As written, the code only checks whether any node-agent pod is running
  • That is not enough to guarantee the data mover could be processed by the node-agent pod

Then to what extent would this code change help?

Contributor Author

@Joeavaikath Joeavaikath commented Apr 13, 2026

The intention is to fast-fail. If node-agent is not deployed or no pods are running, the DataUpload will time out as the issue describes: the backup hangs in the WaitingForPluginOperations phase for 4h before timing out. This check stops it from hanging.

It is a "will any node-agent run this" check

Contributor

Yes, but I doubt how much this "will any node-agent run this" check would help; as mentioned above, in many cases the data mover pods will run on a dedicated node only and require the node-agent pod to be running on that node.
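
The distinction in this thread can be sketched in plain Go. The types and names below are illustrative only (the real HasRunningPods lists corev1 Pods through the controller-runtime client); the point is the gap between the cluster-wide check the plugin can make and the per-node check it cannot.

```go
package main

import "fmt"

// agentPod is an illustrative stand-in for a node-agent pod's
// placement and phase.
type agentPod struct {
	node  string
	phase string
}

// anyRunning models what HasRunningPods can check at validation time:
// is at least one node-agent pod running anywhere in the cluster?
func anyRunning(pods []agentPod) bool {
	for _, p := range pods {
		if p.phase == "Running" {
			return true
		}
	}
	return false
}

// runningOnNode models the stricter check the reviewer describes,
// which cannot be made from the plugin because the node that will
// host the data mover pod is not yet known.
func runningOnNode(pods []agentPod, node string) bool {
	for _, p := range pods {
		if p.node == node && p.phase == "Running" {
			return true
		}
	}
	return false
}

func main() {
	pods := []agentPod{
		{node: "worker-1", phase: "Running"},
		{node: "worker-2", phase: "Pending"},
	}
	fmt.Println(anyRunning(pods))                // true: the fast-fail check passes
	fmt.Println(runningOnNode(pods, "worker-2")) // false: a mover scheduled here would still hang
}
```

So the check catches the "node-agent not deployed at all" failure mode but, as the reviewer notes, not the dedicated-node case.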

Signed-off-by: Joseph <jvaikath@redhat.com>
Signed-off-by: Joseph <jvaikath@redhat.com>
The node-agent running check now lives in prepareBackupRequest()
alongside other pre-flight validations (BSL availability, snapshot
locations). This fails the entire backup with FailedValidation
instead of producing per-PVC errors, and avoids creating snapshots
that would need cleanup.

Signed-off-by: Joseph <jvaikath@redhat.com>
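
The flow described in the commit message above can be modeled roughly as follows. This is a simplified sketch, not Velero's actual prepareBackupRequest: the struct fields and phase strings mirror Velero's Backup API by assumption, and the real code works with velerov1api types.

```go
package main

import "fmt"

// backupRequest is an illustrative slice of the backup's status.
type backupRequest struct {
	Phase            string
	ValidationErrors []string
}

// validate models how pre-flight checks accumulate: any validation
// error fails the whole backup with FailedValidation before a single
// snapshot is created, so nothing needs cleanup afterwards.
func validate(req *backupRequest, nodeAgentRunning bool) {
	// Existing pre-flight checks (BSL availability, snapshot
	// locations) would append to req.ValidationErrors the same way.
	if !nodeAgentRunning {
		req.ValidationErrors = append(req.ValidationErrors,
			"snapshot data movement is enabled but no node-agent pod is running")
	}
	if len(req.ValidationErrors) > 0 {
		req.Phase = "FailedValidation"
	} else {
		req.Phase = "InProgress"
	}
}

func main() {
	req := &backupRequest{}
	validate(req, false)
	fmt.Println(req.Phase) // FailedValidation
}
```

Failing the whole backup here, rather than per PVC inside the item action, is what removes the need for the earlier VolumeSnapshot cleanup path.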
@Joeavaikath Joeavaikath changed the title Add node-agent pod check to DataMover backup path Fail backup validation when built-in data mover has no running node-agent Apr 13, 2026
@Joeavaikath
Contributor Author

@Lyndon-Li I think it properly lives in validation now: we are ensuring that there is at least one node-agent pod running before proceeding

The node-agent validation added in prepareBackupRequest causes test
cases with SnapshotMoveData enabled to fail validation. Add a running
node-agent pod to the baseline test environment so all cases pass
the check.

Signed-off-by: Joseph <jvaikath@redhat.com>
Signed-off-by: Joseph <jvaikath@redhat.com>
Signed-off-by: Joseph <jvaikath@redhat.com>
@kaovilai
Collaborator

Fixes #2938

Is this the correct link? It's very old.
