
fix(restore_finalizer): bound WaitRestoreExecHook poll with resourceTimeout #9747

Open

SAY-5 wants to merge 2 commits into velero-io:main from SAY-5:fix/wait-restore-exec-hook-timeout-9744

Conversation

SAY-5 commented Apr 23, 2026

Fixes #9744.

Problem

WaitRestoreExecHook calls wait.PollUntilContextCancel with a background context and no deadline. A hook that was registered via Add() but never recorded as executed (pod evicted, container never ready, goroutine panic, leaked tracker entry) leaves the restore stuck in Finalizing forever. Because the finalizer controller is a single-threaded controller-runtime reconciler, one stuck restore blocks every subsequent restore on the cluster, and only a Velero pod restart recovers.
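For context, the pre-fix call looks roughly like this (paraphrased sketch; the review threads below quote the actual lines):

```go
// context.Background() carries no deadline, so a tracker entry that is
// never marked complete keeps this poll spinning forever.
err := wait.PollUntilContextCancel(context.Background(), 1*time.Second, true,
	func(context.Context) (bool, error) {
		return ctx.multiHookTracker.IsComplete(ctx.restore.Name), nil
	})
```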

Fix

Bound the poll by the existing finalizerContext.resourceTimeout — the same budget Velero already applies to other finalizer phases. Fall back to 10m when the field is unset, so production deployments (always configured) are capped at resourceTimeout and tests that don't populate it still complete.

On timeout the existing err != nil branch below records the error into the restore's errs so the phase transitions to PartiallyFailed instead of hanging.
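In code, the guard amounts to roughly the following. This is a sketch, not the exact diff: `waitWithBudget` and the `isComplete` callback are illustrative stand-ins for the inline logic and `ctx.multiHookTracker.IsComplete(ctx.restore.Name)`.

```go
import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitWithBudget bounds the hook-completion poll by resourceTimeout,
// falling back to 10m when the field is unset.
func waitWithBudget(resourceTimeout time.Duration, isComplete func() bool) error {
	timeout := resourceTimeout
	if timeout <= 0 {
		// Tests may leave resourceTimeout unset; cap the wait regardless.
		timeout = 10 * time.Minute
	}
	pollCtx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	// When the deadline expires, PollUntilContextCancel returns the context
	// error, which the caller records into the restore's errs.
	return wait.PollUntilContextCancel(pollCtx, 1*time.Second, true,
		func(context.Context) (bool, error) {
			return isComplete(), nil
		})
}
```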

Test

go test -run TestWaitRestoreExecHook -timeout 60s ./pkg/controller/... passes clean.

fix(restore_finalizer): bound WaitRestoreExecHook poll with resourceTimeout

WaitRestoreExecHook used wait.PollUntilContextCancel with a
background context and no deadline. A hook that was registered via
Add() but never recorded as executed (pod evicted, container never
ready, goroutine panic, leaked tracker entry) left the restore
stuck in Finalizing forever. Because the finalizer controller is a
single-threaded controller-runtime reconciler, one stuck restore
blocks every subsequent restore on the cluster, and only a Velero
pod restart recovers.

Bound the poll by the existing finalizerContext.resourceTimeout —
the same budget Velero already applies to other finalizer phases.
Fall back to 10m when the field is unset, so production
deployments (always configured) are capped at resourceTimeout and
tests that don't populate it still complete.

Refs velero-io/velero issue 9744.

Signed-off-by: SAY-5 <say.apm35@gmail.com>
```go
}
pollCtx, cancel := context.WithTimeout(context.Background(), timeout)
defer cancel()
err := wait.PollUntilContextCancel(pollCtx, 1*time.Second, true, func(context.Context) (bool, error) {
```
Collaborator:

We can also add a check here: if the restore CR is missing, skip it.

```go
// Finalizing forever and — because the finalizer controller is a
// single-threaded controller-runtime reconciler — blocked every
// other restore on the cluster.
timeout := ctx.resourceTimeout
```
Collaborator:

We don't need the < 0 check, I think; this should be sufficient:

```go
// Bound the wait so we never poll forever on orphaned hook tracker state.
waitCtx, waitCancel := context.WithTimeout(context.Background(), ctx.resourceTimeout)
defer waitCancel()

// wait for restore exec hooks to finish
err := wait.PollUntilContextCancel(waitCtx, 1*time.Second, true, func(pollCtx context.Context) (bool, error) {
```

```go
err := wait.PollUntilContextCancel(pollCtx, 1*time.Second, true, func(context.Context) (bool, error) {
	log.Debug("Checking the progress of hooks execution")
	if ctx.multiHookTracker.IsComplete(ctx.restore.Name) {
		return true, nil
```
Collaborator:

Suggestion:

```go
// Check that the Restore CR still exists; abort if it has been deleted.
restore := &velerov1api.Restore{}
if err := ctx.crClient.Get(pollCtx, client.ObjectKey{
	Namespace: ctx.restore.Namespace,
	Name:      ctx.restore.Name,
}, restore); err != nil {
	if apierrors.IsNotFound(err) {
		log.Warn("Restore CR no longer exists, aborting hook wait")
		return true, nil
	}
	log.WithError(err).Warn("Error checking restore CR existence during hook wait")
}
```
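If this suggestion is adopted, the closure also needs the `pollCtx` parameter from the earlier snippet, plus imports along these lines (paths assumed from typical Velero code; verify against the file's existing import block):

```go
import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"

	velerov1api "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
)
```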

@priyansh17 (Collaborator) commented:

Please also include a test covering this.

```go
func TestWaitRestoreExecHook(t *testing.T) {

	hookTracker4 := hook.NewMultiHookTracker()
	restoreName4 := "restore4"
	hookTracker4.Add(restoreName4, "ns", "pod", "con1", "s1", "h1", "", 0)
```

Set:

```go
	resourceTimeout: 10 * time.Second,
```

Case:

```go
	{
		name:                   "hook never recorded should timeout instead of hanging",
		hookTracker:            hookTracker4,
		restore:                builder.ForRestore(velerov1api.DefaultNamespace, restoreName4).Result(),
		expectedHooksAttempted: 0,
		expectedHooksFailed:    0,
		expectedHookErrs:       0,
		resourceTimeout:        3 * time.Second,
		expectTimeoutErr:       true,
	},
```

Test:

```go
	if tc.expectTimeoutErr {
		// Should have a timeout error in the restore namespace
		assert.True(t, len(errs.Namespaces) > 0 || len(errs.Velero) > 0 || len(errs.Cluster) > 0,
			"expected timeout error but got none")
		continue
	}
```
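Presumably `resourceTimeout` and `expectTimeoutErr` become new fields on the table-driven test-case struct, with the harness copying `tc.resourceTimeout` into the finalizerContext before invoking the wait; the fragments above leave that wiring implicit.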

…timeout

Per @priyansh17's review of velero-io#9747: cover the never-recorded-hook
case with a 3s resourceTimeout, confirming the poll is bounded and
surfaces a namespace error instead of hanging the reconciler.

Signed-off-by: SAY-5 <say.apm35@gmail.com>
SAY-5 (Author) commented Apr 23, 2026

Added the regression test you suggested — hookTracker4 entry added but never recorded, resourceTimeout: 3 * time.Second, asserts errs.Namespaces is non-empty. Verified locally with go test ./pkg/controller/ -run TestWaitRestoreExecHook -timeout 60s — passes in ~5s (3s timeout + surrounding scaffolding). Thanks for the review!


Successfully merging this pull request may close these issues.

Bug: Restore stuck in Finalizing phase indefinitely due to unbounded hook-tracker wait
