
refactor: Optimize VSC handle readiness polling for VSS backups #9602

Draft

sseago wants to merge 8 commits into velero-io:main from sseago:windows-polling

Conversation

Collaborator

@sseago sseago commented Mar 10, 2026

Thank you for contributing to Velero!

Please add a summary of your change

When waiting for the CSI Snapshot to complete, the CSI plugin checks for the SnapHandle every 5 seconds up until csiSnapshotTimeout (default 10min) is reached. This is a problem for workloads that use Microsoft VSS because VSS will unfreeze the filesystem after 10 seconds (which is not configurable). If a workload has 2 volumes, the 5 second polling interval will almost always result in a forced unfreeze before the post hook runs and likely before the last PVC's snapshot is done.

See the VSS doc here: https://learn.microsoft.com/en-us/windows/win32/vss/overview-of-processing-a-backup-under-vss
Note that the 10-second unfreeze is not configurable.

This PR refactors the wait loop to poll every second for the first 10 seconds, then fall back to the previous 5-second interval until the snapshot timeout is reached.
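For reference, here is a minimal sketch of that two-phase polling, assuming a recent k8s.io/apimachinery where wait.PollUntilContextTimeout and wait.Interrupted are available; the function and constant names are illustrative, not the actual Velero code:

```go
package csi

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

const (
	fastInterval = 1 * time.Second  // polling interval during the VSS freeze window
	fastWindow   = 10 * time.Second // length of the fast-polling window
	slowInterval = 5 * time.Second  // previous, slower interval
)

// waitForSnapshotHandle polls checkFn until it reports done or the overall
// timeout is reached. checkFn would wrap the existing "is SnapshotHandle set"
// lookup on the VolumeSnapshotContent. Assumes timeout > fastWindow.
func waitForSnapshotHandle(ctx context.Context, timeout time.Duration, checkFn wait.ConditionWithContextFunc) error {
	// Phase 1: poll every second during the 10-second VSS freeze window.
	err := wait.PollUntilContextTimeout(ctx, fastInterval, fastWindow, true, checkFn)
	if err == nil {
		return nil
	}
	if !wait.Interrupted(err) {
		// A real error from checkFn, not just the fast window expiring.
		return err
	}
	// Phase 2: fall back to the original 5-second interval for the remainder.
	return wait.PollUntilContextTimeout(ctx, slowInterval, timeout-fastWindow, true, checkFn)
}
```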

Does your change fix a particular issue?

Fixes #9601

Please indicate you've done the following:

Comment thread pkg/util/csi/volume_snapshot.go Outdated
@kaovilai
Collaborator

@codecov

codecov Bot commented Mar 11, 2026

Codecov Report

❌ Patch coverage is 60.37736% with 63 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.74%. Comparing base (66ac235) to head (7b86343).
⚠️ Report is 7 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pkg/controller/backup_controller.go | 6.89% | 27 Missing ⚠️ |
| pkg/util/csi/volume_snapshot.go | 57.81% | 24 Missing and 3 partials ⚠️ |
| pkg/install/deployment.go | 0.00% | 4 Missing and 1 partial ⚠️ |
| pkg/install/resources.go | 0.00% | 1 Missing and 1 partial ⚠️ |
| pkg/cmd/cli/schedule/create.go | 0.00% | 1 Missing ⚠️ |
| pkg/cmd/server/server.go | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9602      +/-   ##
==========================================
- Coverage   60.75%   60.74%   -0.02%     
==========================================
  Files         387      387              
  Lines       36618    36661      +43     
==========================================
+ Hits        22248    22269      +21     
- Misses      12774    12794      +20     
- Partials     1596     1598       +2     


Comment thread pkg/util/csi/volume_snapshot.go Outdated
// Microsoft Volume Shadow Copy Service backups have a hard-coded unfreeze call after 10 seconds,
// so we need to minimize waiting time during the first 10 seconds.
// First poll with a short interval and timeout.
interval = 1 * time.Second
Contributor

It would be better if this frequent polling could be enabled or disabled via configuration.
Some community users are already complaining that 10s polling overloads the kube-apiserver.

#9582 (comment)

Collaborator Author

The above is for item operations (which inherently last longer than this since we have to wait for the entire data upload), but we could make this configurable. It could be something simple like just a boolean to enable the frequent polling for the first 10 seconds, or it could be something more flexible like --early-csi-polling-frequency=1s --early-csi-polling-duration=10s -- although this is called from within the plugin, so I'm not sure what the best method is for passing this config in.
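Purely as an illustration of the more flexible option (these flag names are hypothetical and not part of this PR), the server command could wire duration flags with spf13/pflag roughly like this:

```go
package server

import (
	"time"

	"github.com/spf13/pflag"
)

// pollingConfig is a hypothetical holder for the options floated above.
type pollingConfig struct {
	earlyCSIPollingFrequency time.Duration
	earlyCSIPollingDuration  time.Duration
}

// BindFlags registers the hypothetical flags on the server's flag set.
func (c *pollingConfig) BindFlags(flags *pflag.FlagSet) {
	flags.DurationVar(&c.earlyCSIPollingFrequency, "early-csi-polling-frequency", time.Second,
		"polling interval used during the initial fast-polling window for CSI snapshots")
	flags.DurationVar(&c.earlyCSIPollingDuration, "early-csi-polling-duration", 10*time.Second,
		"length of the initial fast-polling window for CSI snapshots")
}
```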

Contributor

To me, a boolean flag is enough.
So far, we don't have a scenario that requires fine-tuning the frequency and the period.
We can consider more sophisticated configuration when there is a need for it.

Collaborator

I was thinking longer term we could probably refactor this to watch CRs instead of constantly polling via GET.
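A rough sketch of that watch-based idea, assuming the plugin has (or could be given) a dynamic client; the GVR is the standard snapshot.storage.k8s.io/v1 volumesnapshotcontents, but the helper itself is hypothetical:

```go
package csi

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/dynamic"
)

var vscGVR = schema.GroupVersionResource{
	Group:    "snapshot.storage.k8s.io",
	Version:  "v1",
	Resource: "volumesnapshotcontents",
}

// watchForSnapshotHandle blocks until the named VolumeSnapshotContent has a
// snapshotHandle in its status, or ctx is cancelled.
func watchForSnapshotHandle(ctx context.Context, client dynamic.Interface, name string) (string, error) {
	w, err := client.Resource(vscGVR).Watch(ctx, metav1.ListOptions{
		FieldSelector: "metadata.name=" + name,
	})
	if err != nil {
		return "", err
	}
	defer w.Stop()

	for {
		select {
		case <-ctx.Done():
			return "", ctx.Err()
		case ev, ok := <-w.ResultChan():
			if !ok {
				return "", fmt.Errorf("watch on VolumeSnapshotContent %s closed", name)
			}
			if ev.Type != watch.Added && ev.Type != watch.Modified {
				continue
			}
			obj, ok := ev.Object.(*unstructured.Unstructured)
			if !ok {
				continue
			}
			handle, found, err := unstructured.NestedString(obj.Object, "status", "snapshotHandle")
			if err == nil && found && handle != "" {
				return handle, nil
			}
		}
	}
}
```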

Contributor

@blackpiglet blackpiglet Mar 12, 2026

@kaovilai
Tiger's proposal is definitely better.
However, even using an informer cannot fully resolve this issue, because if there are many volumes attached to a pod, the wait can still exceed VSS's 10-second window.

The ultimate solution is moving the logic into the item operation to make it async, but that may not be an easy change.

Collaborator Author

@blackpiglet @kaovilai ok, sounds like for now we just want a boolean flag to enable early fast polling for csi snapshots, and it's disabled by default since it's only needed in a minority of cases.

Collaborator Author

I don't know that making it async would help. The issue is that the hard-coded 10s limit is in the VM. One thing that would help would be to implement per-volume hooks, so that each PVC can be frozen/unfrozen separately, but that would be a pretty big change to the backup workflow.

Contributor

Sorry to jump in late. Per-volume hooks don't work either, because Windows VSS is an OS-level behavior: you have to add all the volumes of the VM to the same snapshotSet in order to achieve cross-volume consistency. Making it per-volume would break that.

Collaborator Author

@Lyndon-Li I guess the only real option here is the fast polling (perhaps eventually replaced by watching), plus having users use volume group snapshots for multi-volume VMs.

sseago and others added 8 commits March 12, 2026 12:01
Co-authored-by: aider (gemini/gemini-2.5-pro) <aider@aider.chat>
Signed-off-by: Scott Seago <sseago@redhat.com>
Co-authored-by: aider (gemini/gemini-2.5-pro) <aider@aider.chat>
Signed-off-by: Scott Seago <sseago@redhat.com>
Co-authored-by: aider (gemini/gemini-2.5-pro) <aider@aider.chat>
Signed-off-by: Scott Seago <sseago@redhat.com>
Co-authored-by: aider (gemini/gemini-2.5-pro) <aider@aider.chat>
Signed-off-by: Scott Seago <sseago@redhat.com>
…ilder

Co-authored-by: aider (gemini/gemini-2.5-pro) <aider@aider.chat>
Signed-off-by: Scott Seago <sseago@redhat.com>
Co-authored-by: aider (gemini/gemini-2.5-pro) <aider@aider.chat>
Signed-off-by: Scott Seago <sseago@redhat.com>
Co-authored-by: aider (gemini/gemini-2.5-pro) <aider@aider.chat>
Signed-off-by: Scott Seago <sseago@redhat.com>
Signed-off-by: Scott Seago <sseago@redhat.com>
@sseago sseago marked this pull request as draft March 12, 2026 20:17
@sseago
Collaborator Author

sseago commented Mar 12, 2026

Updated so the change is disabled by default; it can be enabled as the default via a server/installer flag, and can also be enabled/disabled per backup. Moved to draft since the changes are not yet tested.
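For illustration only (field and flag names are hypothetical), the precedence described above could resolve like this: the server/installer flag sets the default, and an optional per-backup field overrides it when set.

```go
package backup

// resolveEarlyPolling returns whether fast initial polling should be used for
// a given backup: the per-backup override wins when set, otherwise the
// server-wide default applies.
func resolveEarlyPolling(serverDefault bool, backupOverride *bool) bool {
	if backupOverride != nil {
		return *backupOverride
	}
	return serverDefault
}
```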

@reasonerjt
Contributor

Since this is introduced to fix a limitation of VSS, I assume it is not a very general problem. Is it possible to add the flag outside the spec of the backup CR? For example, make it controlled by a feature flag or an env var of the velero server?

@sseago
Collaborator Author

sseago commented Mar 16, 2026

Since this is introduced to fix a limitation of VSS, I assume it is not a very general problem. Is it possible to add the flag outside the spec of the backup CR? For example, make it controlled by a feature flag or an env var of the velero server?

The setting is needed within the plugin. I don't think we have access to feature flag info there, although I guess the env vars would be inherited, so yes, maybe an env var would work? Are you thinking an env var is better than a server arg? We'd still need an installer flag to set the env var.

Backing up a bit -- what this PR currently does is add an installer/server arg, which then gets passed along to the backup spec. We could replace the server arg with an env var. We could possibly get rid of the spec field and read the env var from the plugin. Let me know if that's what you were suggesting and I'll see if that works here.
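As a sketch of the env-var option (the variable name here is hypothetical), the plugin could read the setting directly, since it inherits the server's environment and the Backup CRD stays untouched:

```go
package csi

import (
	"os"
	"strconv"
)

// earlyPollingEnabled reports whether the fast initial polling window should
// be used, defaulting to false when the variable is unset or unparseable.
func earlyPollingEnabled() bool {
	v := os.Getenv("VELERO_EARLY_CSI_POLLING")
	if v == "" {
		return false
	}
	enabled, err := strconv.ParseBool(v)
	if err != nil {
		return false
	}
	return enabled
}
```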

@blackpiglet
Contributor

Backing up a bit -- what this PR currently does is add an installer/server arg, which then gets passed along to the backup spec. We could replace the server arg with an env var. We could possibly get rid of the spec field and read the env var from the plugin. Let me know if that's what you were suggesting and I'll see if that works here.

Yes. We want to avoid modifying the Velero CRD.
CRD modification introduces many file changes and may be an upgrade burden if we deprecate the field later.

@blackpiglet
Contributor

blackpiglet commented Mar 17, 2026

Is this PR needed in v1.18.1?
If this PR is a short-term solution for the released branch, I think we can use the environment variable as the switch for the more frequent polling during the first 10s.

If the fix is for the main branch, we should choose a long-term solution, e.g., notification instead of polling.

@sseago
Collaborator Author

sseago commented Mar 17, 2026

Is this PR needed in v1.18.1? If this PR is a short-term solution for the released branch, I think we can use the environment variable as the switch for the more frequent polling during the first 10s.

If the fix is for the main branch, we should choose a long-term solution, e.g., notification instead of polling.

@blackpiglet We definitely need this on main, but we also need something on 18.1. @kaovilai you suggested using watches/notification. How would we implement that since this is happening in the plugin process, not in a reconciler?

@blackpiglet
Contributor

@sseago
We can continue the discussion for the main branch in this PR.

Since the fix is also needed for v1.18.1, I suggest creating a new PR for the release-1.18 branch.



Development

Successfully merging this pull request may close these issues.

CSI plugin waiting for SnapHandle needs faster polling interval for first 10 seconds.

5 participants