
fix(restore_finalizer): bound WaitRestoreExecHook poll with resourceTimeout #9747

Open

SAY-5 wants to merge 2 commits into velero-io:main from SAY-5:fix/wait-restore-exec-hook-timeout-9744

Conversation

SAY-5 commented Apr 23, 2026

Fixes #9744.

Problem

WaitRestoreExecHook calls wait.PollUntilContextCancel with a background context and no deadline. A hook that was registered via Add() but never recorded as executed (pod evicted, container never ready, goroutine panic, leaked tracker entry) leaves the restore stuck in Finalizing forever. Because the finalizer controller is a single-threaded controller-runtime reconciler, one stuck restore blocks every subsequent restore on the cluster, and only a Velero pod restart recovers.
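For context, the pre-fix call looks roughly like this (paraphrased sketch; the review threads below quote the actual lines):

```go
// context.Background() carries no deadline, so a tracker entry that is
// never marked complete keeps this poll spinning forever.
err := wait.PollUntilContextCancel(context.Background(), 1*time.Second, true,
	func(context.Context) (bool, error) {
		return ctx.multiHookTracker.IsComplete(ctx.restore.Name), nil
	})
```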

Fix

Bound the poll by the existing finalizerContext.resourceTimeout — the same budget Velero already applies to other finalizer phases. Fall back to 10m when the field is unset, so production deployments (always configured) are capped at resourceTimeout and tests that don't populate it still complete.

On timeout the existing err != nil branch below records the error into the restore's errs so the phase transitions to PartiallyFailed instead of hanging.
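In code, the guard amounts to roughly the following. This is a sketch, not the exact diff: `waitWithBudget` and the `isComplete` callback are illustrative stand-ins for the inline logic and `ctx.multiHookTracker.IsComplete(ctx.restore.Name)`.

```go
import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitWithBudget bounds the hook-completion poll by resourceTimeout,
// falling back to 10m when the field is unset.
func waitWithBudget(resourceTimeout time.Duration, isComplete func() bool) error {
	timeout := resourceTimeout
	if timeout <= 0 {
		// Tests may leave resourceTimeout unset; cap the wait regardless.
		timeout = 10 * time.Minute
	}
	pollCtx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	// When the deadline expires, PollUntilContextCancel returns the context
	// error, which the caller records into the restore's errs.
	return wait.PollUntilContextCancel(pollCtx, 1*time.Second, true,
		func(context.Context) (bool, error) {
			return isComplete(), nil
		})
}
```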

Test

go test -run TestWaitRestoreExecHook -timeout 60s ./pkg/controller/... passes clean.

fix(restore_finalizer): bound WaitRestoreExecHook poll with resourceTimeout

WaitRestoreExecHook used wait.PollUntilContextCancel with a
background context and no deadline. A hook that was registered via
Add() but never recorded as executed (pod evicted, container never
ready, goroutine panic, leaked tracker entry) left the restore
stuck in Finalizing forever. Because the finalizer controller is a
single-threaded controller-runtime reconciler, one stuck restore
blocks every subsequent restore on the cluster, and only a Velero
pod restart recovers.

Bound the poll by the existing finalizerContext.resourceTimeout —
the same budget Velero already applies to other finalizer phases.
Fall back to 10m when the field is unset, so production
deployments (always configured) are capped at resourceTimeout and
tests that don't populate it still complete.

Refs velero-io/velero issue 9744.

Signed-off-by: SAY-5 <say.apm35@gmail.com>
```go
}
pollCtx, cancel := context.WithTimeout(context.Background(), timeout)
defer cancel()
err := wait.PollUntilContextCancel(pollCtx, 1*time.Second, true, func(context.Context) (bool, error) {
```
Collaborator:

We can also add a check here: if the restore CR is missing, skip it.

```go
// Finalizing forever and — because the finalizer controller is a
// single-threaded controller-runtime reconciler — blocked every
// other restore on the cluster.
timeout := ctx.resourceTimeout
```
Collaborator:

We don't need the < 0 check, I think; this should be sufficient:

```go
// Bound the wait so we never poll forever on orphaned hook tracker state.
waitCtx, waitCancel := context.WithTimeout(context.Background(), ctx.resourceTimeout)
defer waitCancel()

// wait for restore exec hooks to finish
err := wait.PollUntilContextCancel(waitCtx, 1*time.Second, true, func(pollCtx context.Context) (bool, error) {
```

```go
err := wait.PollUntilContextCancel(pollCtx, 1*time.Second, true, func(context.Context) (bool, error) {
	log.Debug("Checking the progress of hooks execution")
	if ctx.multiHookTracker.IsComplete(ctx.restore.Name) {
		return true, nil
```
Collaborator:

Suggestion:

```go
// Check that the Restore CR still exists; abort if it has been deleted.
restore := &velerov1api.Restore{}
if err := ctx.crClient.Get(pollCtx, client.ObjectKey{
	Namespace: ctx.restore.Namespace,
	Name:      ctx.restore.Name,
}, restore); err != nil {
	if apierrors.IsNotFound(err) {
		log.Warn("Restore CR no longer exists, aborting hook wait")
		return true, nil
	}
	log.WithError(err).Warn("Error checking restore CR existence during hook wait")
}
```
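If this suggestion is adopted, the closure also needs the `pollCtx` parameter from the earlier snippet, plus imports along these lines (paths assumed from typical Velero code; verify against the file's existing import block):

```go
import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"

	velerov1api "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
)
```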

@priyansh17 (Collaborator) commented:

Please also include a test covering this.

```go
func TestWaitRestoreExecHook(t *testing.T) {

	hookTracker4 := hook.NewMultiHookTracker()
	restoreName4 := "restore4"
	hookTracker4.Add(restoreName4, "ns", "pod", "con1", "s1", "h1", "", 0)
```

Set:

```go
	resourceTimeout: 10 * time.Second,
```

Case:

```go
	{
		name:                   "hook never recorded should timeout instead of hanging",
		hookTracker:            hookTracker4,
		restore:                builder.ForRestore(velerov1api.DefaultNamespace, restoreName4).Result(),
		expectedHooksAttempted: 0,
		expectedHooksFailed:    0,
		expectedHookErrs:       0,
		resourceTimeout:        3 * time.Second,
		expectTimeoutErr:       true,
	},
```

Test:

```go
	if tc.expectTimeoutErr {
		// Should have a timeout error in the restore namespace
		assert.True(t, len(errs.Namespaces) > 0 || len(errs.Velero) > 0 || len(errs.Cluster) > 0,
			"expected timeout error but got none")
		continue
	}
```
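Presumably `resourceTimeout` and `expectTimeoutErr` become new fields on the table-driven test-case struct, with the harness copying `tc.resourceTimeout` into the finalizerContext before invoking the wait; the fragments above leave that wiring implicit.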

…timeout

Per @priyansh17's review of velero-io#9747: cover the never-recorded-hook
case with a 3s resourceTimeout, confirming the poll is bounded and
surfaces a namespace error instead of hanging the reconciler.

Signed-off-by: SAY-5 <say.apm35@gmail.com>
SAY-5 (Author) commented Apr 23, 2026

Added the regression test you suggested — hookTracker4 entry added but never recorded, resourceTimeout: 3 * time.Second, asserts errs.Namespaces is non-empty. Verified locally with go test ./pkg/controller/ -run TestWaitRestoreExecHook -timeout 60s — passes in ~5s (3s timeout + surrounding scaffolding). Thanks for the review!


Successfully merging this pull request may close these issues.

Bug: Restore stuck in Finalizing phase indefinitely due to unbounded hook-tracker wait
