
refactor: Optimize VSC handle readiness polling for VSS backups #9602

Draft

sseago wants to merge 8 commits into velero-io:main from sseago:windows-polling

Conversation

Collaborator

@sseago sseago commented Mar 10, 2026

Thank you for contributing to Velero!

Please add a summary of your change

When waiting for the CSI Snapshot to complete, the CSI plugin checks for the SnapHandle every 5 seconds up until csiSnapshotTimeout (default 10min) is reached. This is a problem for workloads that use Microsoft VSS because VSS will unfreeze the filesystem after 10 seconds (which is not configurable). If a workload has 2 volumes, the 5 second polling interval will almost always result in a forced unfreeze before the post hook runs and likely before the last PVC's snapshot is done.

See the VSS doc here: https://learn.microsoft.com/en-us/windows/win32/vss/overview-of-processing-a-backup-under-vss
Note that the 10-second unfreeze is not configurable.

This PR refactors the wait loop to poll every second for the first 10 seconds, then fall back to the previous 5-second interval until the snapshot timeout is reached.
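For reference, here is a minimal sketch of that two-phase polling, assuming a recent k8s.io/apimachinery where wait.PollUntilContextTimeout and wait.Interrupted are available; the function and constant names are illustrative, not the actual Velero code:

```go
package csi

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

const (
	fastInterval = 1 * time.Second  // polling interval during the VSS freeze window
	fastWindow   = 10 * time.Second // length of the fast-polling window
	slowInterval = 5 * time.Second  // previous, slower interval
)

// waitForSnapshotHandle polls checkFn until it reports done or the overall
// timeout is reached. checkFn would wrap the existing "is SnapshotHandle set"
// lookup on the VolumeSnapshotContent. Assumes timeout > fastWindow.
func waitForSnapshotHandle(ctx context.Context, timeout time.Duration, checkFn wait.ConditionWithContextFunc) error {
	// Phase 1: poll every second during the 10-second VSS freeze window.
	err := wait.PollUntilContextTimeout(ctx, fastInterval, fastWindow, true, checkFn)
	if err == nil {
		return nil
	}
	if !wait.Interrupted(err) {
		// A real error from checkFn, not just the fast window expiring.
		return err
	}
	// Phase 2: fall back to the original 5-second interval for the remainder.
	return wait.PollUntilContextTimeout(ctx, slowInterval, timeout-fastWindow, true, checkFn)
}
```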

Does your change fix a particular issue?

Fixes #9601

Please indicate you've done the following:

Comment thread pkg/util/csi/volume_snapshot.go Outdated
@kaovilai
Collaborator

@codecov

codecov Bot commented Mar 11, 2026

Codecov Report

❌ Patch coverage is 60.37736% with 63 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.74%. Comparing base (66ac235) to head (7b86343).
⚠️ Report is 7 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pkg/controller/backup_controller.go | 6.89% | 27 Missing ⚠️ |
| pkg/util/csi/volume_snapshot.go | 57.81% | 24 Missing and 3 partials ⚠️ |
| pkg/install/deployment.go | 0.00% | 4 Missing and 1 partial ⚠️ |
| pkg/install/resources.go | 0.00% | 1 Missing and 1 partial ⚠️ |
| pkg/cmd/cli/schedule/create.go | 0.00% | 1 Missing ⚠️ |
| pkg/cmd/server/server.go | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9602      +/-   ##
==========================================
- Coverage   60.75%   60.74%   -0.02%     
==========================================
  Files         387      387              
  Lines       36618    36661      +43     
==========================================
+ Hits        22248    22269      +21     
- Misses      12774    12794      +20     
- Partials     1596     1598       +2     


Comment thread pkg/util/csi/volume_snapshot.go Outdated
// Microsoft Volume Shadow Copy Service backups have a hard-coded unfreeze call after 10 seconds,
// so we need to minimize waiting time during the first 10 seconds.
// First poll with a short interval and timeout.
interval = 1 * time.Second
Contributor

It would be better if this frequent polling could be enabled or disabled via configuration.
Some community users are already complaining that 10s polling overloads the kube-apiserver.

#9582 (comment)

Collaborator Author

The above is for item operations (which inherently last longer than this since we have to wait for the entire data upload), but we could make this configurable. It could be something simple like just a boolean to enable the frequent polling for the first 10 seconds, or it could be something more flexible like --early-csi-polling-frequency=1s --early-csi-polling-duration=10s -- although this is called from within the plugin, so I'm not sure what the best method is for passing this config in.
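Purely as an illustration of the more flexible option (these flag names are hypothetical and not part of this PR), the server command could wire duration flags with spf13/pflag roughly like this:

```go
package server

import (
	"time"

	"github.com/spf13/pflag"
)

// pollingConfig is a hypothetical holder for the options floated above.
type pollingConfig struct {
	earlyCSIPollingFrequency time.Duration
	earlyCSIPollingDuration  time.Duration
}

// BindFlags registers the hypothetical flags on the server's flag set.
func (c *pollingConfig) BindFlags(flags *pflag.FlagSet) {
	flags.DurationVar(&c.earlyCSIPollingFrequency, "early-csi-polling-frequency", time.Second,
		"polling interval used during the initial fast-polling window for CSI snapshots")
	flags.DurationVar(&c.earlyCSIPollingDuration, "early-csi-polling-duration", 10*time.Second,
		"length of the initial fast-polling window for CSI snapshots")
}
```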

Contributor

To me, a boolean flag is enough.
So far, we don't have a scenario that requires fine-tuning the frequency and the period.
We can consider more sophisticated configuration when there is a need for it.

Collaborator

I was thinking longer term we could probably refactor this to watch CRs instead of constantly polling via GET.
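A rough sketch of that watch-based idea, assuming the plugin has (or could be given) a dynamic client; the GVR is the standard snapshot.storage.k8s.io/v1 volumesnapshotcontents, but the helper itself is hypothetical:

```go
package csi

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/dynamic"
)

var vscGVR = schema.GroupVersionResource{
	Group:    "snapshot.storage.k8s.io",
	Version:  "v1",
	Resource: "volumesnapshotcontents",
}

// watchForSnapshotHandle blocks until the named VolumeSnapshotContent has a
// snapshotHandle in its status, or ctx is cancelled.
func watchForSnapshotHandle(ctx context.Context, client dynamic.Interface, name string) (string, error) {
	w, err := client.Resource(vscGVR).Watch(ctx, metav1.ListOptions{
		FieldSelector: "metadata.name=" + name,
	})
	if err != nil {
		return "", err
	}
	defer w.Stop()

	for {
		select {
		case <-ctx.Done():
			return "", ctx.Err()
		case ev, ok := <-w.ResultChan():
			if !ok {
				return "", fmt.Errorf("watch on VolumeSnapshotContent %s closed", name)
			}
			if ev.Type != watch.Added && ev.Type != watch.Modified {
				continue
			}
			obj, ok := ev.Object.(*unstructured.Unstructured)
			if !ok {
				continue
			}
			handle, found, err := unstructured.NestedString(obj.Object, "status", "snapshotHandle")
			if err == nil && found && handle != "" {
				return handle, nil
			}
		}
	}
}
```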

Contributor

@blackpiglet blackpiglet Mar 12, 2026

@kaovilai
Tiger's proposal is definitely better.
However, even using an informer cannot fully resolve this issue, because if there are many volumes attached to a pod, the wait can still exceed VSS's 10-second window.

The ultimate solution is moving the logic into the item operation to make it async, but that may not be an easy change.

Collaborator Author

@blackpiglet @kaovilai ok, sounds like for now we just want a boolean flag to enable early fast polling for csi snapshots, and it's disabled by default since it's only needed in a minority of cases.

Collaborator Author

I don't know that making it async would help. The issue is that the hard-coded 10s limit is in the VM. One thing that would help would be to implement per-volume hooks, so that each PVC can be frozen/unfrozen separately, but that would be a pretty big change to the backup workflow.

Contributor

Sorry to jump in late. Per-volume hooks don't work either, because Windows VSS is an OS-level behavior: you have to add all the volumes of the VM to the same snapshotSet in order to achieve cross-volume consistency. Making it per-volume would break that.

Collaborator Author

@Lyndon-Li I guess the only real option here is the fast polling (perhaps eventually replaced by watching), plus having users use volume group snapshots for multi-volume VMs.

sseago and others added 8 commits March 12, 2026 12:01
Co-authored-by: aider (gemini/gemini-2.5-pro) <aider@aider.chat>
Signed-off-by: Scott Seago <sseago@redhat.com>
Co-authored-by: aider (gemini/gemini-2.5-pro) <aider@aider.chat>
Signed-off-by: Scott Seago <sseago@redhat.com>
Co-authored-by: aider (gemini/gemini-2.5-pro) <aider@aider.chat>
Signed-off-by: Scott Seago <sseago@redhat.com>
Co-authored-by: aider (gemini/gemini-2.5-pro) <aider@aider.chat>
Signed-off-by: Scott Seago <sseago@redhat.com>
…ilder

Co-authored-by: aider (gemini/gemini-2.5-pro) <aider@aider.chat>
Signed-off-by: Scott Seago <sseago@redhat.com>
Co-authored-by: aider (gemini/gemini-2.5-pro) <aider@aider.chat>
Signed-off-by: Scott Seago <sseago@redhat.com>
Co-authored-by: aider (gemini/gemini-2.5-pro) <aider@aider.chat>
Signed-off-by: Scott Seago <sseago@redhat.com>
Signed-off-by: Scott Seago <sseago@redhat.com>
@sseago sseago marked this pull request as draft March 12, 2026 20:17
@sseago
Collaborator Author

sseago commented Mar 12, 2026

Updated so the change is disabled by default; it can be enabled as the default via a server/installer flag, and can also be enabled/disabled per backup. Moved to draft since the changes are not yet tested.
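For illustration only (field and flag names are hypothetical), the precedence described above could resolve like this: the server/installer flag sets the default, and an optional per-backup field overrides it when set.

```go
package backup

// resolveEarlyPolling returns whether fast initial polling should be used for
// a given backup: the per-backup override wins when set, otherwise the
// server-wide default applies.
func resolveEarlyPolling(serverDefault bool, backupOverride *bool) bool {
	if backupOverride != nil {
		return *backupOverride
	}
	return serverDefault
}
```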

@reasonerjt
Contributor

Since this is introduced to fix a limitation of VSS, I assume it is not a very general problem. Is it possible to add the flag outside the spec of the backup CR? For example, make it controlled by a feature flag or an env var of the velero server?

@sseago
Collaborator Author

sseago commented Mar 16, 2026

Since this is introduced to fix a limitation of VSS, I assume it is not a very general problem. Is it possible to add the flag outside the spec of the backup CR? For example, make it controlled by a feature flag or an env var of the velero server?

The setting is needed within the plugin. I don't think we have access to feature flag info there, although I guess the env vars would be inherited, so yes, maybe an env var would work? Are you thinking an env var is better than a server arg? We'd still need an installer flag to set the env var.

Backing up a bit -- what this PR currently does is add an installer/server arg, which then gets passed along to the backup spec. We could replace the server arg with an env var. We could possibly get rid of the spec field and read the env var from the plugin. Let me know if that's what you were suggesting and I'll see if that works here.
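As a sketch of the env-var option (the variable name here is hypothetical), the plugin could read the setting directly, since it inherits the server's environment and the Backup CRD stays untouched:

```go
package csi

import (
	"os"
	"strconv"
)

// earlyPollingEnabled reports whether the fast initial polling window should
// be used, defaulting to false when the variable is unset or unparseable.
func earlyPollingEnabled() bool {
	v := os.Getenv("VELERO_EARLY_CSI_POLLING")
	if v == "" {
		return false
	}
	enabled, err := strconv.ParseBool(v)
	if err != nil {
		return false
	}
	return enabled
}
```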

@blackpiglet
Contributor

Backing up a bit -- what this PR currently does is add an installer/server arg, which then gets passed along to the backup spec. We could replace the server arg with an env var. We could possibly get rid of the spec field and read the env var from the plugin. Let me know if that's what you were suggesting and I'll see if that works here.

Yes. We want to avoid modifying the Velero CRD.
CRD modification introduces many file changes and may be an upgrade burden if we deprecate the field later.

@blackpiglet
Contributor

blackpiglet commented Mar 17, 2026

Is this PR needed in v1.18.1?
If this PR is a short-term solution for the released branch, I think we can use the environment variable as the switch for the more frequent polling during the first 10s.

If the fix is for the main branch, we should choose a long-term solution, e.g., notification instead of polling.

@sseago
Collaborator Author

sseago commented Mar 17, 2026

Is this PR needed in v1.18.1? If this PR is a short-term solution for the released branch, I think we can use the environment variable as the switch for the more frequent polling during the first 10s.

If the fix is for the main branch, we should choose a long-term solution, e.g., notification instead of polling.

@blackpiglet We definitely need this on main, but we also need something on 18.1. @kaovilai you suggested using watches/notification. How would we implement that since this is happening in the plugin process, not in a reconciler?

@blackpiglet
Contributor

@sseago
We can continue the discussion for the main branch in this PR.

Since the fix is also needed for v1.18.1, I suggest creating a new PR for the release-1.18 branch.



Development

Successfully merging this pull request may close these issues.

CSI plugin waiting for SnapHandle needs faster polling interval for first 10 seconds.

5 participants