refactor: Optimize VSC handle readiness polling for VSS backups #9602
sseago wants to merge 8 commits into velero-io:main from
Conversation
https://learn.microsoft.com/en-us/answers/questions/2238014/what-are-all-of-the-volume-shadow-copy-service-com mentions the 10-second limitation.
Codecov Report ❌ Patch coverage is … Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #9602      +/-   ##
==========================================
- Coverage   60.75%   60.74%    -0.02%
==========================================
  Files         387      387
  Lines       36618    36661       +43
==========================================
+ Hits        22248    22269       +21
- Misses      12774    12794       +20
- Partials     1596     1598        +2
```
```go
// Microsoft Volume Shadow Copy Service backups have a hard-coded unfreeze call after 10 seconds,
// so we need to minimize waiting time during the first 10 seconds.
// First poll with a short interval and timeout.
interval = 1 * time.Second
```
It would be better if this frequent polling could be enabled or disabled via configuration.
Some community users are already complaining that the 10s polling overloads the kube-apiserver.
The above is for item operations (which inherently last longer than this, since we have to wait for the entire data upload), but we could make this configurable. It could be something simple like a boolean to enable the frequent polling for the first 10 seconds, or something more flexible like --early-csi-polling-frequency=1s --early-csi-polling-duration=10s. This is called from within the plugin, though, so I'm not sure what the best method is for passing this config in.
To me, a boolean flag is enough.
So far, we don't have a scenario that requires fine-tuning the frequency and period.
We can consider more sophisticated configuration when there is a need for it.
Long term, I was thinking we could probably refactor this to watch CRs instead of constantly polling via GET.
@kaovilai
Tiger's proposal is definitely better.
However, even using an informer cannot fully resolve this issue: if there are many volumes attached to a pod, it's still possible to run past VSS's 10-second window even with a watcher.
The ultimate solution is moving the logic to the item operation to make it async.
But it may not be an easy change.
@blackpiglet @kaovilai ok, sounds like for now we just want a boolean flag to enable early fast polling for csi snapshots, and it's disabled by default since it's only needed in a minority of cases.
I don't know that making it async would help. The issue is that the hard-coded 10s limit is in the VM. One thing that would help would be to implement per-volume hooks, so that each PVC can be frozen/unfrozen separately, but that would be a pretty big change to the backup workflow.
Sorry to jump in late. A per-volume hook doesn't work either, because Windows VSS is an OS-level behavior: you have to add all the volumes of the VM to the same snapshot set in order to achieve cross-volume consistency. Making it per-volume would break that.
@Lyndon-Li I guess the only real option here is the fast polling (perhaps replaced eventually by watching) plus having users use volume group snapshots for multi-volume VMs.
Co-authored-by: aider (gemini/gemini-2.5-pro) <aider@aider.chat> Signed-off-by: Scott Seago <sseago@redhat.com>
Signed-off-by: Scott Seago <sseago@redhat.com>
Updated to make the change disabled by default, to be enabled (by default) via server/installer flag, but can also be enabled/disabled per backup. Moved to draft since the changes are not yet tested.
Since this is introduced to fix a limitation of VSS, I assume it is not a very general problem. Is it possible to add the flag outside the spec of the backup CR? For example, make it controlled by a feature flag or an env var of the Velero server?
The setting is needed within the plugin. I don't think we have access to feature flag info there, although the env vars would be inherited, so yes, maybe an env var would work? Are you thinking an env var is better than a server arg? We'd still need an installer flag to set the env var. Backing up a bit: what this PR currently does is add an installer/server arg, which then gets passed along to the backup spec. We could replace the server arg with an env var, and we could possibly get rid of the spec field and read the env var from the plugin. Let me know if that's what you were suggesting and I'll see if that works here.
Yes. We want to avoid modifying the Velero CRD. |
Is this PR needed in v1.18.1? If the fix is for the main branch, we should choose a long-term solution, e.g., notification instead of polling.
@blackpiglet We definitely need this on main, but we also need something on 18.1. @kaovilai you suggested using watches/notification. How would we implement that since this is happening in the plugin process, not in a reconciler?
@sseago Since the fix is also needed for v1.18.1, I suggest creating a new PR for the release-1.18 branch.
Thank you for contributing to Velero!
Please add a summary of your change
When waiting for the CSI Snapshot to complete, the CSI plugin checks for the SnapHandle every 5 seconds up until csiSnapshotTimeout (default 10min) is reached. This is a problem for workloads that use Microsoft VSS because VSS will unfreeze the filesystem after 10 seconds (which is not configurable). If a workload has 2 volumes, the 5 second polling interval will almost always result in a forced unfreeze before the post hook runs and likely before the last PVC's snapshot is done.
See the VSS doc here: https://learn.microsoft.com/en-us/windows/win32/vss/overview-of-processing-a-backup-under-vss
Note that the 10-second unfreeze is not configurable.
This PR refactors this to poll every second for the first 10 seconds, followed by the previous "every 5 seconds" until the snapshot timeout is reached.
Does your change fix a particular issue?
Fixes #9601
Please indicate you've done the following:
make new-changelog) or comment /kind changelog-not-required on this PR. site/content/docs/main.