What steps did you take and what happened:
A filesystem backup (PVB) was created. The backup subprocess started successfully and
reported "Running data path service", but the node-agent never completed the gRPC Init
handshake. The PVB stayed in the Prepared phase indefinitely.
The node-agent logs showed the same three lines repeating every 5 seconds, reaching
"Exposed PVB is ready" on every iteration but never "Init cancelable PVB":
time="2026-04-13T12:28:00Z" level=info msg="PVB is prepared and should be processed by ()" controller=podvolumebackup logSource="pkg/controller/pod_volume_backup_controller.go:278"
podvolumebackup=velero/daily-filesystembackup-20260413121426-m9wwc
time="2026-04-13T12:28:00Z" level=info msg="Hosting pod is in running state in node " logSource="pkg/exposer/pod_volume.go:253"
time="2026-04-13T12:28:00Z" level=info msg="Exposed PVB is ready and creating data path routine" controller=podvolumebackup logSource="pkg/controller/pod_volume_backup_controller.go:303"
podvolumebackup=velero/daily-filesystembackup-20260413121426-m9wwc
time="2026-04-13T12:28:05Z" level=info msg="PVB is prepared and should be processed by ()" ...
time="2026-04-13T12:28:05Z" level=info msg="Hosting pod is in running state in node " ...
time="2026-04-13T12:28:05Z" level=info msg="Exposed PVB is ready and creating data path routine" ...
(repeating every 5s — "Init cancelable PVB" is never logged)
The backup subprocess had been running since 12:15:55 and was waiting for the gRPC Init
call the entire time:
time="2026-04-13T12:15:55Z" level=info msg="Starting Velero pod volume backup v1.18.0 (6adcf06-dirty)" logSource="pkg/cmd/cli/podvolume/backup.go:75"
time="2026-04-13T12:15:55Z" level=info msg="Starting micro service in node for PVB daily-filesystembackup-20260413121426-m9wwc" logSource="pkg/cmd/cli/podvolume/backup.go:219"
time="2026-04-13T12:16:14Z" level=info msg="Starting data path service daily-filesystembackup-20260413121426-m9wwc" logSource="pkg/cmd/cli/podvolume/backup.go:228"
time="2026-04-13T12:16:14Z" level=info msg="Running data path service daily-filesystembackup-20260413121426-m9wwc" logSource="pkg/cmd/cli/podvolume/backup.go:238"
No further output from the subprocess until the node-agent was restarted at 12:32:22 —
16 minutes later — after which the gRPC Init succeeded immediately:
time="2026-04-13T12:32:37Z" level=info msg="Init cancelable PVB" controller=podvolumebackup logSource="pkg/controller/pod_volume_backup_controller.go:500"
podvolumebackup=velero/daily-filesystembackup-20260413121426-m9wwc
time="2026-04-13T12:32:38Z" level=info msg="MicroServiceBR is initialized" controller=podvolumebackup logSource="pkg/datapath/micro_service_watcher.go:176"
podvolumebackup=velero/daily-filesystembackup-20260413121426-m9wwc
time="2026-04-13T12:32:44Z" level=info msg="PVB completed" controller=PodVolumeBackup logSource="pkg/controller/pod_volume_backup_controller.go:572"
--resource-timeout does not apply to this state — the PVB never times out on its own.
What did you expect to happen:
The PVB should either complete the gRPC handshake and proceed with the backup, or fail
with a clear error and allow the backup to be retried. Instead it loops silently every
5 seconds with no timeout and no error surfaced to the user.
The following information will help us better understand what's going on:
The node had ~25% CPU but ~73% memory utilization at the time of the failure. Memory
pressure appears to cause sufficient scheduling latency to consistently time out the
(hardcoded) gRPC Init call, even at low CPU.
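To make the suspected mechanism concrete, here is a minimal Go sketch (not Velero's
code; the constant, names, and durations are invented and the timings compressed) of
how a fixed Init deadline loses to scheduling latency no matter how often the call is
retried:

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Stand-in for the non-configurable Init deadline; the real value in
// Velero is different and lives in the data path code.
const hardcodedInitTimeout = 100 * time.Millisecond

// initDataPath models the gRPC Init handshake. delay models the extra
// scheduling latency the call sees on a memory-pressured node.
func initDataPath(ctx context.Context, delay time.Duration) error {
	select {
	case <-time.After(delay): // handshake eventually serviced
		return nil
	case <-ctx.Done(): // fixed deadline fires first
		return ctx.Err()
	}
}

func main() {
	for attempt := 1; attempt <= 3; attempt++ {
		ctx, cancel := context.WithTimeout(context.Background(), hardcodedInitTimeout)
		err := initDataPath(ctx, 150*time.Millisecond) // latency > deadline
		cancel()
		if errors.Is(err, context.DeadlineExceeded) {
			fmt.Printf("attempt %d: Init deadline exceeded\n", attempt)
		}
	}
	// While the latency persists, every retry fails the same way; only
	// raising the deadline (or reducing node load) breaks the pattern.
}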
Root cause: initCancelableDataPath failure → closeDataPath() removes asyncBR from
dataPathMgr → PVB stays in Prepared → PeriodicalEnqueueSource requeues every 5s
→ asyncBR == nil → recreate watcher → Init timeout → loop.
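The chain above maps onto a loop shaped roughly like this simplified reconstruction
(illustrative Go, not Velero's actual controller source; the names come from the chain
above but the bodies are stubs, and the endless requeue is cut to three passes):

package main

import (
	"errors"
	"fmt"
	"time"
)

var dataPathMgr = map[string]struct{}{} // asyncBR entries keyed by PVB name

func initCancelableDataPath(pvb string) error {
	return errors.New("data path init timed out") // stuck under memory pressure
}

func closeDataPath(pvb string) {
	delete(dataPathMgr, pvb) // removes asyncBR, so the next pass starts over
}

// reconcile is what PeriodicalEnqueueSource re-triggers every 5s while
// the PVB remains in Prepared.
func reconcile(pvb string) {
	if _, ok := dataPathMgr[pvb]; !ok {
		dataPathMgr[pvb] = struct{}{} // recreate the watcher
		fmt.Println("Exposed PVB is ready and creating data path routine")
		if err := initCancelableDataPath(pvb); err != nil {
			closeDataPath(pvb) // asyncBR gone again; PVB stays Prepared
			return             // no error surfaced, no phase change
		}
		fmt.Println("Init cancelable PVB") // never reached in this state
	}
}

func main() {
	for i := 0; i < 3; i++ { // stands in for the endless 5s requeue
		reconcile("daily-filesystembackup-20260413121426-m9wwc")
		time.Sleep(5 * time.Second)
	}
}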
Anything else you would like to add:
Workaround: Restart the node-agent pod on the affected node after load decreases:
kubectl delete pod -n velero -l role=node-agent \
  --field-selector spec.nodeName=<node-name>
The existing subprocess is still running and responds to the gRPC Init call immediately
after node-agent restart — no data loss occurs.
Suggested fix: Either expose the gRPC timeout as a configurable flag, or have the
node-agent detect and terminate the stale subprocess instead of looping indefinitely.
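As a rough illustration of the first option, the deadline could be wired through a
flag. The flag name and default below are hypothetical; a real change would go through
Velero's existing server flags rather than a standalone binary like this:

package main

import (
	"context"
	"flag"
	"fmt"
	"time"
)

func main() {
	// Hypothetical flag; Velero does not currently expose this.
	initTimeout := flag.Duration("data-path-init-timeout", 2*time.Minute,
		"deadline for the data path gRPC Init handshake")
	flag.Parse()

	ctx, cancel := context.WithTimeout(context.Background(), *initTimeout)
	defer cancel()

	// The Init call would take this ctx, letting operators on loaded
	// nodes raise the deadline instead of restarting the node-agent.
	deadline, _ := ctx.Deadline()
	fmt.Printf("Init must finish by %s\n", deadline.Format(time.RFC3339))
}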
Environment:
- Velero version: v1.18.0
- Kubernetes version: 1.35.3
- Kubernetes installer & version: helm
- Cloud provider or hardware configuration: OpenStack
- OS: ubuntu 24.04