What steps did you take and what happened:
A filesystem backup (PVB) was created. The backup subprocess started successfully and
reported "Running data path service", but the node-agent never completed the gRPC Init
handshake. The PVB stayed in the Prepared phase indefinitely.
The node-agent logs showed the same three lines repeating every 5 seconds, reaching
"Exposed PVB is ready" on every iteration but never "Init cancelable PVB":
time="2026-04-13T12:28:00Z" level=info msg="PVB is prepared and should be processed by ()" controller=podvolumebackup logSource="pkg/controller/pod_volume_backup_controller.go:278"
podvolumebackup=velero/daily-filesystembackup-20260413121426-m9wwc
time="2026-04-13T12:28:00Z" level=info msg="Hosting pod is in running state in node " logSource="pkg/exposer/pod_volume.go:253"
time="2026-04-13T12:28:00Z" level=info msg="Exposed PVB is ready and creating data path routine" controller=podvolumebackup logSource="pkg/controller/pod_volume_backup_controller.go:303"
podvolumebackup=velero/daily-filesystembackup-20260413121426-m9wwc
time="2026-04-13T12:28:05Z" level=info msg="PVB is prepared and should be processed by ()" ...
time="2026-04-13T12:28:05Z" level=info msg="Hosting pod is in running state in node " ...
time="2026-04-13T12:28:05Z" level=info msg="Exposed PVB is ready and creating data path routine" ...
(repeating every 5s — "Init cancelable PVB" is never logged)
The backup subprocess had been running since 12:15:55 and was waiting for the gRPC Init
call the entire time:
time="2026-04-13T12:15:55Z" level=info msg="Starting Velero pod volume backup v1.18.0 (6adcf06-dirty)" logSource="pkg/cmd/cli/podvolume/backup.go:75"
time="2026-04-13T12:15:55Z" level=info msg="Starting micro service in node for PVB daily-filesystembackup-20260413121426-m9wwc" logSource="pkg/cmd/cli/podvolume/backup.go:219"
time="2026-04-13T12:16:14Z" level=info msg="Starting data path service daily-filesystembackup-20260413121426-m9wwc" logSource="pkg/cmd/cli/podvolume/backup.go:228"
time="2026-04-13T12:16:14Z" level=info msg="Running data path service daily-filesystembackup-20260413121426-m9wwc" logSource="pkg/cmd/cli/podvolume/backup.go:238"
No further output from the subprocess until the node-agent was restarted at 12:32:22 —
16 minutes later — after which the gRPC Init succeeded immediately:
time="2026-04-13T12:32:37Z" level=info msg="Init cancelable PVB" controller=podvolumebackup logSource="pkg/controller/pod_volume_backup_controller.go:500"
podvolumebackup=velero/daily-filesystembackup-20260413121426-m9wwc
time="2026-04-13T12:32:38Z" level=info msg="MicroServiceBR is initialized" controller=podvolumebackup logSource="pkg/datapath/micro_service_watcher.go:176"
podvolumebackup=velero/daily-filesystembackup-20260413121426-m9wwc
time="2026-04-13T12:32:44Z" level=info msg="PVB completed" controller=PodVolumeBackup logSource="pkg/controller/pod_volume_backup_controller.go:572"
--resource-timeout does not apply to this state — the PVB never times out on its own.
What did you expect to happen:
The PVB should either complete the gRPC handshake and proceed with the backup, or fail
with a clear error and allow the backup to be retried. Instead it loops silently every
5 seconds with no timeout and no error surfaced to the user.
The following information will help us better understand what's going on:
The node had ~25% CPU but ~73% memory utilization at the time of the failure. Memory
pressure appears to cause sufficient scheduling latency to consistently time out the
(hardcoded) gRPC Init call, even at low CPU.
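To make the suspected mechanism concrete, here is a minimal Go sketch (not Velero's
code; the constant, names, and durations are invented and the timings compressed) of
how a fixed Init deadline loses to scheduling latency no matter how often the call is
retried:

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Stand-in for the non-configurable Init deadline; the real value in
// Velero is different and lives in the data path code.
const hardcodedInitTimeout = 100 * time.Millisecond

// initDataPath models the gRPC Init handshake. delay models the extra
// scheduling latency the call sees on a memory-pressured node.
func initDataPath(ctx context.Context, delay time.Duration) error {
	select {
	case <-time.After(delay): // handshake eventually serviced
		return nil
	case <-ctx.Done(): // fixed deadline fires first
		return ctx.Err()
	}
}

func main() {
	for attempt := 1; attempt <= 3; attempt++ {
		ctx, cancel := context.WithTimeout(context.Background(), hardcodedInitTimeout)
		err := initDataPath(ctx, 150*time.Millisecond) // latency > deadline
		cancel()
		if errors.Is(err, context.DeadlineExceeded) {
			fmt.Printf("attempt %d: Init deadline exceeded\n", attempt)
		}
	}
	// While the latency persists, every retry fails the same way; only
	// raising the deadline (or reducing node load) breaks the pattern.
}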
Root cause: initCancelableDataPath failure → closeDataPath() removes asyncBR from
dataPathMgr → PVB stays in Prepared → PeriodicalEnqueueSource requeues every 5s
→ asyncBR == nil → recreate watcher → Init timeout → loop.
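The chain above maps onto a loop shaped roughly like this simplified reconstruction
(illustrative Go, not Velero's actual controller source; the names come from the chain
above but the bodies are stubs, and the endless requeue is cut to three passes):

package main

import (
	"errors"
	"fmt"
	"time"
)

var dataPathMgr = map[string]struct{}{} // asyncBR entries keyed by PVB name

func initCancelableDataPath(pvb string) error {
	return errors.New("data path init timed out") // stuck under memory pressure
}

func closeDataPath(pvb string) {
	delete(dataPathMgr, pvb) // removes asyncBR, so the next pass starts over
}

// reconcile is what PeriodicalEnqueueSource re-triggers every 5s while
// the PVB remains in Prepared.
func reconcile(pvb string) {
	if _, ok := dataPathMgr[pvb]; !ok {
		dataPathMgr[pvb] = struct{}{} // recreate the watcher
		fmt.Println("Exposed PVB is ready and creating data path routine")
		if err := initCancelableDataPath(pvb); err != nil {
			closeDataPath(pvb) // asyncBR gone again; PVB stays Prepared
			return             // no error surfaced, no phase change
		}
		fmt.Println("Init cancelable PVB") // never reached in this state
	}
}

func main() {
	for i := 0; i < 3; i++ { // stands in for the endless 5s requeue
		reconcile("daily-filesystembackup-20260413121426-m9wwc")
		time.Sleep(5 * time.Second)
	}
}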
Anything else you would like to add:
Workaround: Restart the node-agent pod on the affected node after load decreases:
kubectl delete pod -n velero -l role=node-agent \
  --field-selector spec.nodeName=<node-name>
The existing subprocess is still running and responds to the gRPC Init call immediately
after node-agent restart — no data loss occurs.
Suggested fix: Either expose the gRPC timeout as a configurable flag, or have the
node-agent detect and terminate the stale subprocess instead of looping indefinitely.
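As a rough illustration of the first option, the deadline could be wired through a
flag. The flag name and default below are hypothetical; a real change would go through
Velero's existing server flags rather than a standalone binary like this:

package main

import (
	"context"
	"flag"
	"fmt"
	"time"
)

func main() {
	// Hypothetical flag; Velero does not currently expose this.
	initTimeout := flag.Duration("data-path-init-timeout", 2*time.Minute,
		"deadline for the data path gRPC Init handshake")
	flag.Parse()

	ctx, cancel := context.WithTimeout(context.Background(), *initTimeout)
	defer cancel()

	// The Init call would take this ctx, letting operators on loaded
	// nodes raise the deadline instead of restarting the node-agent.
	deadline, _ := ctx.Deadline()
	fmt.Printf("Init must finish by %s\n", deadline.Format(time.RFC3339))
}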
Environment:
- Velero version: v1.18.0
- Kubernetes version: 1.35.3
- Kubernetes installer & version: helm
- Cloud provider or hardware configuration: OpenStack
- OS: ubuntu 24.04