Skip to content

refactor: improve Velero server deployment retrieval and handling#9746

Open
priyansh17 wants to merge 3 commits intovelero-io:mainfrom
priyansh17:priyansh/fix-velero-image-pickup
Open

refactor: improve Velero server deployment retrieval and handling#9746
priyansh17 wants to merge 3 commits intovelero-io:mainfrom
priyansh17:priyansh/fix-velero-image-pickup

Conversation

@priyansh17
Copy link
Copy Markdown
Collaborator

refactor: improve Velero server deployment retrieval and handling
Signed-off-by: Priyansh Choudhary im1706@gmail.com
Thank you for contributing to Velero!

Please add a summary of your change

Describer in #9745

Does your change fix a particular issue?

Fixes #(issue)
#9745

Please indicate you've done the following:

Signed-off-by: Priyansh Choudhary <im1706@gmail.com>
@priyansh17 priyansh17 force-pushed the priyansh/fix-velero-image-pickup branch from 27a98cf to edb5b3f Compare April 23, 2026 14:45
@github-actions github-actions Bot added Dependencies Pull requests that update a dependency file Documentation labels Apr 23, 2026
@priyansh17 priyansh17 force-pushed the priyansh/fix-velero-image-pickup branch from edb5b3f to 27a98cf Compare April 23, 2026 14:48
@priyansh17 priyansh17 force-pushed the priyansh/fix-velero-image-pickup branch from 27a98cf to 104d36e Compare April 23, 2026 14:51
@priyansh17 priyansh17 requested review from blackpiglet, kaovilai and shubham-pampattiwar and removed request for shubham-pampattiwar April 23, 2026 14:52
@priyansh17
Copy link
Copy Markdown
Collaborator Author

This way customers only have to add install.Labels() in their own deployment files and it can pick up Velero container from it.

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 23, 2026

Codecov Report

❌ Patch coverage is 80.64516% with 6 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
pkg/util/velero/velero.go 82.14% 3 Missing and 2 partials ⚠️
pkg/restore/actions/pod_volume_restore_action.go 50.00% 0 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

@Lyndon-Li
Copy link
Copy Markdown
Contributor

See comments #9745 (comment), I believe this is a known behavior and if we need to fill this gap, we need a system wide treatment, only one the current code changes. But before that, let's evaluate the requirement first.

@priyansh17
Copy link
Copy Markdown
Collaborator Author

See comments #9745 (comment), I believe this is a known behavior and if we need to fill this gap, we need a system wide treatment, only one the current code changes. But before that, let's evaluate the requirement first.

Hello, I meant we should generalize this behavior so it works for velero as well as others adapting it.
For instance in Azure backup we have our own deployment with different containers where velero is one of them.

Example Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
    meta.helm.sh/release-name: azure-aks-backup
    meta.helm.sh/release-namespace: dataprotection-microsoft
  creationTimestamp: "2026-04-21T21:48:12Z"
  generation: 1
  labels:
    app.kubernetes.io/instance: azure-aks-backup
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: dataprotection-microsoft-kubernetes
    app.kubernetes.io/version: 0.0.3397-770
    component: velero
    helm.sh/chart: dataprotection-microsoft-kubernetes-0.0.3397-770
  name: dataprotection-microsoft-kubernetes-agent
  namespace: dataprotection-microsoft
  ---
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: azure-aks-backup
      app.kubernetes.io/name: dataprotection-microsoft-kubernetes
  strategy:
    type: Recreate
  template:
    metadata:
      annotations:
        checksum/secret: 556ac5f50170c44cf65ffa6f1ba4a1c53b98de0d712dc7032482d98ccd57fb88
        kubernetes.azure.com/set-kube-service-host-fqdn: "true"
        prometheus.io/path: /metrics
        prometheus.io/port: "8085"
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: azure-aks-backup
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: dataprotection-microsoft-kubernetes
        app.kubernetes.io/version: 0.0.3397-770
        helm.sh/chart: dataprotection-microsoft-kubernetes-0.0.3397-770
        name: velero
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              - key: dataprotection-microsoft
                operator: In
                values:
                - allow
            weight: 100
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.azure.com/cluster
                operator: Exists
              - key: type
                operator: NotIn
                values:
                - virtual-kubelet
              - key: kubernetes.io/os
                operator: In
                values:
                - linux
              - key: kubernetes.io/arch
                operator: In
                values:
                - amd64
      automountServiceAccountToken: false
      containers:
      - args:
        - -c
        - /fluent-bit/etc/fluent-bit.conf
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        image: mcr.microsoft.com/oss/v2/fluent/fluent-bit:v4.2.3-4
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 10
          httpGet:
            path: /api/v1/health
            port: 2020
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 60
          successThreshold: 1
          timeoutSeconds: 5
        name: fluent-bit
        readinessProbe:
          failureThreshold: 10
          httpGet:
            path: /api/v1/health
            port: 2020
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 60
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          limits:
            cpu: 30m
            memory: 128Mi
          requests:
            cpu: 20m
            memory: 64Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          privileged: false
          readOnlyRootFilesystem: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /fluent-bit/etc
          name: fluentbit-config
        - mountPath: /var/log/containers
          name: var-log-containers
          readOnly: true
        - mountPath: /var/log/pods
          name: var-log-pods
          readOnly: true
        - mountPath: /var/lib/docker/containers
          name: var-lib-containers
          readOnly: true
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          name: service-account-token
          readOnly: true
      - args:
        - server
        - --uploader-type=kopia
        - --default-item-operation-timeout=10m0s
        - --disable-informer-cache=true
        - --log-format=json
        - --log-level=debug
        - --terminating-resource-timeout=1m0s
        - --features=EnableCSI
        - --keep-latest-maintenance-jobs=3
        command:
        - /velero
        env:
       ---
        - name: LD_LIBRARY_PATH
          value: /plugins
        - name: AZURE_CREDENTIALS_FILE
          value: /credentials/cloud
        **image: mcr.microsoft.com/azurebackup/k8s/velero:0.0.03397.770**
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 10
          httpGet:
            path: /metrics
            port: http-monitoring
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 60
          successThreshold: 1
          timeoutSeconds: 5
        name: velero
        ports:
        - containerPort: 8085
          name: http-monitoring
          protocol: TCP
        readinessProbe:
          failureThreshold: 10
          httpGet:
            path: /metrics
            port: http-monitoring
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 60
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          ---
        securityContext:
          ---
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /plugins
          name: plugins
        - mountPath: /credentials
          name: cloud-credentials
        - mountPath: /scratch
          name: scratch
        - mountPath: /tmp
          name: tmpdir
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          name: service-account-token
          readOnly: true
      dnsPolicy: ClusterFirst
      initContainers:
      - image: mcr.microsoft.com/azurebackup/k8s/velero-plugin-for-microsoft-azure:0.0.03397.770
        imagePullPolicy: IfNotPresent
        name: velero-plugin-for-microsoft-azure
        resources:
          ---
        securityContext:
          ---
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /target
          name: plugins
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
       ---
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      volumes:
      - configMap:
          defaultMode: 420
          name: dataprotection-microsoft-kubernetes-agent-fluentbit-config
        name: fluentbit-config
      - hostPath:
          path: /var/log/containers
          type: ""
        name: var-log-containers
      - hostPath:
          path: /var/log/pods
          type: ""
        name: var-log-pods
      - hostPath:
          path: /var/lib/docker/containers
          type: ""
        name: var-lib-containers
      - name: cloud-credentials
        secret:
          defaultMode: 420
          secretName: dataprotection-microsoft-kubernetes-agent-credentials
      - emptyDir: {}
        name: plugins
      - emptyDir: {}
        name: scratch
      - emptyDir: {}
        name: tmpdir
      - name: service-account-token
        projected:
          defaultMode: 420
          sources:
          - serviceAccountToken:
              expirationSeconds: 3607
              path: token
          - configMap:
              items:
              - key: ca.crt
                path: ca.crt
              name: kube-root-ca.crt
          - downwardAPI:
              items:
              - fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.namespace
                path: namespace
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2026-04-21T21:48:12Z"
    lastUpdateTime: "2026-04-21T21:49:23Z"
    message: ReplicaSet "dataprotection-microsoft-kubernetes-agent-66c9fcd59c" has
      successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2026-04-23T07:22:22Z"
    lastUpdateTime: "2026-04-23T07:22:22Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 1
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1

@Lyndon-Li
Copy link
Copy Markdown
Contributor

For instance in Azure backup we have our own deployment with different containers where velero is one of them.

In this case, is having Velero instances in different namespace enough? Or why do you have to rename the modules of the Velero instances?
And I believe this behavior is there for a long time, is there any new change from your side exposing this problem?

@priyansh17
Copy link
Copy Markdown
Collaborator Author

For instance in Azure backup we have our own deployment with different containers where velero is one of them.

In this case, is having Velero instances in different namespace enough? Or why do you have to rename the modules of the Velero instances? And I believe this behavior is there for a long time, is there any new change from your side exposing this problem?

We also have a dependency to add a fluent-bit container with velero in same deployment to move logs to our custom databases.
This was working ealrier in previous velero versions but since 1.16 we notice that this was introduced:

err := cli.Get(ctx, types.NamespacedName{Name: "velero", Namespace: repo.Namespace}, deployment)

as the image is being used in a new field now which didn't exist before:

return &PodVolumeRestoreAction{
		logger:      logger,
		client:      client,
		crClient:    crClient,
		veleroImage: image,  -> new field introduced
	}, nil

Changed here:
eb5230e

Now in our case as name was different it failed to identify the deployment and failed going into PartiallyFailed state at the beginning.

@Lyndon-Li
Copy link
Copy Markdown
Contributor

We also have a dependency to add a fluent-bit container with velero in same deployment to move logs to our custom databases.

Then I understand why getVeleroContainer is required, but I don't understand why GetVeleroServerDeployment is required, as mentioned above, Velero doesn't support to customize its modules' name all the time.

@priyansh17
Copy link
Copy Markdown
Collaborator Author

priyansh17 commented Apr 23, 2026

We also have a dependency to add a fluent-bit container with velero in same deployment to move logs to our custom databases.

Then I understand why getVeleroContainer is required, but I don't understand why GetVeleroServerDeployment is required, as mentioned above, Velero doesn't support to customize its modules' name all the time.

I had to make that change because it finds the container in the deployment specified.
To find the deployment when the deployment name is not exactly "velero" as shown in example above, but something different we can use the labels and confirm if velero exists inside that.
This would make it easier for users and not break current behaviour as well in custom deployments.

@Lyndon-Li
Copy link
Copy Markdown
Contributor

To find the deployment when the name is not exactly "velero" but something different

This assumes that Velero supports to customize its modules' name, but it actually doesn't.

@priyansh17
Copy link
Copy Markdown
Collaborator Author

To find the deployment when the name is not exactly "velero" but something different

This assumes that Velero supports to customize its modules' name, but it actually doesn't.

Yes I agree but that's what I intend to achieve here. With this we would be able to support customization.
And the default deployment continues to work as it is.

@Lyndon-Li
Copy link
Copy Markdown
Contributor

To find the deployment when the name is not exactly "velero" but something different

This assumes that Velero supports to customize its modules' name, but it actually doesn't.

Yes I agree but that's what I intend to achieve here. With this we would be able to support customization. And the default deployment continues to work as it is.

This is my suggestion:

  • For the current PR, fix the getVeleroContainer only
  • For the requirement to support customized module name, open another issue, with the details why it is needed, for discussion. If it is approved, we can go ahead to create other PRs for design and implementation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Dependencies Pull requests that update a dependency file Documentation has-changelog has-unit-tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants