fix(restore_finalizer): bound WaitRestoreExecHook poll with resourceTimeout #9747
Open
SAY-5 wants to merge 2 commits into velero-io:main
…imeout

WaitRestoreExecHook used wait.PollUntilContextCancel with a background context and no deadline. A hook that was registered via Add() but never recorded as executed (pod evicted, container never ready, goroutine panic, leaked tracker entry) left the restore stuck in Finalizing forever. Because the finalizer controller is a single-threaded controller-runtime reconciler, one stuck restore blocks every subsequent restore on the cluster, and only a Velero pod restart recovers.

Bound the poll by the existing finalizerContext.resourceTimeout, the same budget Velero already applies to other finalizer phases. Fall back to 10m when the field is unset, so production deployments (always configured) are capped at resourceTimeout and tests that don't populate it still complete.

Refs velero-io/velero issue 9744.

Signed-off-by: SAY-5 <say.apm35@gmail.com>
priyansh17
reviewed
Apr 23, 2026
	}
	pollCtx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	err := wait.PollUntilContextCancel(pollCtx, 1*time.Second, true, func(context.Context) (bool, error) {
Collaborator
We could also add a check here: if the restore CR is missing, skip it.
priyansh17
reviewed
Apr 23, 2026
	// Finalizing forever and, because the finalizer controller is a
	// single-threaded controller-runtime reconciler, blocked every
	// other restore on the cluster.
	timeout := ctx.resourceTimeout
Collaborator
I don't think we need to check for < 0; this should be sufficient:
// Bound the wait so we never poll forever on orphaned hook tracker state.
waitCtx, waitCancel := context.WithTimeout(context.Background(), ctx.resourceTimeout)
defer waitCancel()
// wait for restore exec hooks to finish
err := wait.PollUntilContextCancel(waitCtx, 1*time.Second, true, func(pollCtx context.Context) (bool, error) {
priyansh17
reviewed
Apr 23, 2026
	err := wait.PollUntilContextCancel(pollCtx, 1*time.Second, true, func(context.Context) (bool, error) {
		log.Debug("Checking the progress of hooks execution")
		if ctx.multiHookTracker.IsComplete(ctx.restore.Name) {
			return true, nil
Collaborator
Suggestion
// Check that the Restore CR still exists; abort if it has been deleted.
restore := &velerov1api.Restore{}
if err := ctx.crClient.Get(pollCtx, client.ObjectKey{
	Namespace: ctx.restore.Namespace,
	Name:      ctx.restore.Name,
}, restore); err != nil {
	if apierrors.IsNotFound(err) {
		log.Warn("Restore CR no longer exists, aborting hook wait")
		return true, nil
	}
	log.WithError(err).Warn("Error checking restore CR existence during hook wait")
}
Collaborator
Please also include a test for this. //set
…timeout Per @priyansh17's review of velero-io#9747: cover the never-recorded-hook case with a 3s resourceTimeout, confirming the poll is bounded and surfaces a namespace error instead of hanging the reconciler. Signed-off-by: SAY-5 <say.apm35@gmail.com>
Author
Added the regression test you suggested.
Fixes #9744.
Problem

WaitRestoreExecHook calls wait.PollUntilContextCancel with a background context and no deadline. A hook that was registered via Add() but never recorded as executed (pod evicted, container never ready, goroutine panic, leaked tracker entry) leaves the restore stuck in Finalizing forever. Because the finalizer controller is a single-threaded controller-runtime reconciler, one stuck restore blocks every subsequent restore on the cluster, and only a Velero pod restart recovers.

Fix

Bound the poll by the existing finalizerContext.resourceTimeout, the same budget Velero already applies to other finalizer phases. Fall back to 10m when the field is unset, so production deployments (always configured) are capped at resourceTimeout and tests that don't populate it still complete.

On timeout, the existing err != nil branch below records the error into the restore's errs so the phase transitions to PartiallyFailed instead of hanging.

Test

go test -run TestWaitRestoreExecHook -timeout 60s ./pkg/controller/... clean.