
Commit daff6ab

The modification of VGDP affinity design.
Signed-off-by: Xun Jiang <xun.jiang@broadcom.com>
1 parent 4ba61df commit daff6ab

3 files changed: 303 additions & 4 deletions


design/Implemented/node-agent-affinity.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -128,5 +128,5 @@ Once this problem happens, the backupPod stays in `Pending` phase, and the corre
 On the other hand, the backupPod is deleted after the prepare timeout, so there is no way to tell the cause is one of the above problems or others.
 To help the troubleshooting, we can add some diagnostic mechanism to discover the status of the backupPod and node-agent in the same node before deleting it as a result of the prepare timeout.
 
-[1]: Implemented/unified-repo-and-kopia-integration/unified-repo-and-kopia-integration.md
+[1]: unified-repo-and-kopia-integration/unified-repo-and-kopia-integration.md
 [2]: volume-snapshot-data-movement/volume-snapshot-data-movement.md
````

design/Implemented/repo_maintenance_job_config.md

Lines changed: 3 additions & 3 deletions

````diff
@@ -4,7 +4,7 @@
 Add this design to make the repository maintenance job can read configuration from a dedicate ConfigMap and make the Job's necessary parts configurable, e.g. `PodSpec.Affinity` and `PodSpec.Resources`.
 
 ## Background
-Repository maintenance is split from the Velero server to a k8s Job in v1.14 by design [repository maintenance job](Implemented/repository-maintenance.md).
+Repository maintenance is split from the Velero server to a k8s Job in v1.14 by design [repository maintenance job](repository-maintenance.md).
 The repository maintenance Job configuration was read from the Velero server CLI parameter, and it inherits the most of Velero server's Deployment's PodSpec to fill un-configured fields.
 
 This design introduces a new way to let the user to customize the repository maintenance behavior instead of inheriting from the Velero server Deployment or reading from `velero server` CLI parameters.
@@ -13,7 +13,7 @@ It's possible new configurations are introduced in future releases based on this
 
 For the node selection, the repository maintenance Job also inherits from the Velero server deployment before, but the Job may last for a while and cost noneligible resources, especially memory.
 The users have the need to choose which k8s node to run the maintenance Job.
-This design reuses the data structure introduced by design [node-agent affinity configuration](Implemented/node-agent-affinity.md) to make the repository maintenance job can choose which node running on.
+This design reuses the data structure introduced by design [Velero Generic Data Path affinity configuration](node-agent-affinity.md) to make the repository maintenance job can choose which node running on.
 
 ## Goals
 - Unify the repository maintenance Job configuration at one place.
@@ -118,7 +118,7 @@ For example, the following BackupRepository's key should be `test-default-kopia`
     volumeNamespace: test
 ```
 
-The `LoadAffinity` structure is reused from design [node-agent affinity configuration](Implemented/node-agent-affinity.md).
+The `LoadAffinity` structure is reused from design [Velero Generic Data Path affinity configuration](node-agent-affinity.md).
 It's possible that the users want to choose nodes that match condition A or condition B to run the job.
 For example, the user want to let the nodes is in a specified machine type or the nodes locate in the us-central1-x zones to run the job.
 This can be done by adding multiple entries in the `LoadAffinity` array.
````
Lines changed: 299 additions & 0 deletions (new file)

# Velero Generic Data Path Load Affinity Enhancement Design

## Glossary & Abbreviation

**Velero Generic Data Path (VGDP)**: VGDP is the collective term for the modules introduced in the [Unified Repository design][1]. Velero uses these modules to finish data transfer for various purposes (i.e., PodVolume backup/restore, Volume Snapshot Data Movement). VGDP modules include the uploaders and the backup repository.

**Exposer**: Exposer is a module introduced in the [Volume Snapshot Data Movement design][2]. Velero uses this module to expose volume snapshots to Velero node-agent pods or node-agent associated pods so as to complete the data movement from the snapshots.

## Background

The implemented [VGDP LoadAffinity design][3] already defined a `LoadAffinity` structure in the ConfigMap referenced by the `--node-agent-configmap` parameter. The parameter is used to set the affinity of the backupPod of VGDP.

There are still some limitations in that design:
* The affinity setting is global. Say there are two StorageClasses, and the underlying storage of each can only provision volumes to part of the cluster nodes, with no intersection between the two sets of supported nodes. A single global affinity setting then cannot work for both StorageClasses.
* The old design only takes the first element of the `[]*LoadAffinity` array, so it cannot support OR logic between affinity selectors.
* The old design focuses on the backupPod affinity, but the restorePod also needs an affinity setting.

This design is created to address these limitations.

## Goals

- Enhance the node affinity of VGDP instances for volume snapshot data movement: add per-StorageClass node affinity.
- Enhance the node affinity of VGDP instances for volume snapshot data movement: support OR logic between affinity selectors.
- Define the behaviors of node affinity of VGDP instances in node-agent for volume snapshot data movement restore, when the PVC restore doesn't require delayed binding.

## Non-Goals

- It is also beneficial to support VGDP instance affinity for PodVolume backup/restore; this will be implemented after the PodVolume micro-service work completes.

## Solution

This design still uses the ConfigMap specified by the `velero node-agent` CLI's `--node-agent-configmap` parameter to host the node affinity configurations.
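
For illustration, the affinity configurations could be hosted like below. The ConfigMap name `node-agent-config`, the `velero` namespace, and the data key are assumptions of this sketch, not requirements stated by this design:

``` yaml
apiVersion: v1
kind: ConfigMap
metadata:
  # The name and namespace are illustrative assumptions; the design only
  # requires that node-agent is started with --node-agent-configmap
  # pointing at this ConfigMap.
  name: node-agent-config
  namespace: velero
data:
  node-agent-config.json: |
    {
      "loadAffinity": [
        {
          "nodeSelector": {
            "matchLabels": {
              "kubernetes.io/os": "linux"
            }
          }
        }
      ]
    }
```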

On top of the `[]*LoadAffinity` structure introduced by the implemented [VGDP LoadAffinity design][3], this design adds a new optional field `StorageClass`.
* If a `LoadAffinity` element's `StorageClass` doesn't have a value, the element applies globally, just as in the old design.
* If a `LoadAffinity` element's `StorageClass` has a value, the element applies only to the VGDP instances whose PVCs use the specified StorageClass.
* To support OR logic between `LoadAffinity` elements, this design allows multiple `LoadAffinity` elements whose `StorageClass` fields have the same value.
* A `LoadAffinity` element whose `StorageClass` has a value takes priority over a `LoadAffinity` element whose `StorageClass` doesn't have a value (see the selection sketch after the structures below).

The existing structure:
```go
type Configs struct {
    // LoadConcurrency is the config for load concurrency per node.
    LoadConcurrency *LoadConcurrency `json:"loadConcurrency,omitempty"`

    // LoadAffinity is the config for data path load affinity.
    LoadAffinity []*LoadAffinity `json:"loadAffinity,omitempty"`
}

type LoadAffinity struct {
    // NodeSelector specifies the label selector to match nodes
    NodeSelector metav1.LabelSelector `json:"nodeSelector"`
}
```

The `LoadAffinity` structure extended with the new field:

```go
type LoadAffinity struct {
    // NodeSelector specifies the label selector to match nodes
    NodeSelector metav1.LabelSelector `json:"nodeSelector"`

    // StorageClass specifies which VGDP instances the LoadAffinity element applies to. If StorageClass doesn't have a value, the element applies to all VGDP instances; otherwise it applies only to the VGDP instances whose volumes use this StorageClass.
    StorageClass string `json:"storageClass"`
}
```
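
A minimal sketch of how the precedence above could be implemented; `selectAffinity` is a hypothetical helper name for illustration, not an API defined by this design:

```go
// selectAffinity returns the LoadAffinity elements applicable to a VGDP
// instance whose volume uses the given StorageClass: elements dedicated to
// that StorageClass take priority; otherwise the global elements (empty
// StorageClass field) are used. All returned elements are later combined
// with OR semantics.
func selectAffinity(affinities []*LoadAffinity, storageClass string) []*LoadAffinity {
    var global, dedicated []*LoadAffinity
    for _, a := range affinities {
        switch a.StorageClass {
        case "":
            global = append(global, a)
        case storageClass:
            dedicated = append(dedicated, a)
        }
    }
    if len(dedicated) > 0 {
        return dedicated
    }
    return global
}
```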

### Decision Tree

```mermaid
flowchart TD
    A[VGDP Pod Needs Scheduling] --> B{Is this a restore operation?}

    B -->|Yes| C{StorageClass has volumeBindingMode: WaitForFirstConsumer?}
    B -->|No| D[Backup Operation]

    C -->|Yes| E{restorePVC.ignoreDelayBinding = true?}
    C -->|No| F[StorageClass binding mode: Immediate]

    E -->|No| G[Wait for target Pod scheduling<br/>Use Pod's selected node<br/>⚠️ Affinity rules ignored]
    E -->|Yes| H[Apply affinity rules<br/>despite WaitForFirstConsumer]

    F --> I{Check StorageClass in loadAffinity by StorageClass field}
    H --> I
    D --> J{Using backupPVC with different StorageClass?}

    J -->|Yes| K[Use final StorageClass<br/>for affinity lookup]
    J -->|No| L[Use original PVC StorageClass<br/>for affinity lookup]

    K --> I
    L --> I

    I -->|StorageClass found| N[Filter the LoadAffinity by <br/>the StorageClass<br/>🎯 and apply the LoadAffinity HIGHEST PRIORITY]
    I -->|StorageClass not found| O{Check loadAffinity element without StorageClass field}

    O -->|No loadAffinity configured| R[No affinity constraints<br/>Schedule on any available node<br/>🌐 DEFAULT]

    N --> S{Multiple rules in array?}
    S -->|Yes| T[Apply all rules as OR conditions<br/>Pod scheduled on nodes matching ANY rule]
    S -->|No| U[Apply single rule<br/>Pod scheduled on nodes matching this rule]

    O --> S

    T --> V[Validate node-agent availability<br/>⚠️ Ensure node-agent pods exist on target nodes]
    U --> V

    V --> W{Node-agent available on selected nodes?}
    W -->|Yes| X[✅ VGDP Pod scheduled successfully]
    W -->|No| Y[❌ Pod stays in Pending state<br/>Timeout after 30min<br/>Check node-agent DaemonSet coverage]

    R --> Z[Schedule on any node<br/>✅ Basic scheduling]

    %% Styling
    classDef successNode fill:#d4edda,stroke:#155724,color:#155724
    classDef warningNode fill:#fff3cd,stroke:#856404,color:#856404
    classDef errorNode fill:#f8d7da,stroke:#721c24,color:#721c24
    classDef priorityHigh fill:#e7f3ff,stroke:#0066cc,color:#0066cc
    classDef priorityDefault fill:#f8f9fa,stroke:#6c757d,color:#6c757d

    class X,Z successNode
    class G,V warningNode
    class Y errorNode
    class N,T,U priorityHigh
    class R priorityDefault
```

### Examples

#### Multiple LoadAffinities

``` json
{
    "loadAffinity": [
        {
            "nodeSelector": {
                "matchLabels": {
                    "beta.kubernetes.io/instance-type": "Standard_B4ms"
                }
            }
        },
        {
            "nodeSelector": {
                "matchExpressions": [
                    {
                        "key": "topology.kubernetes.io/zone",
                        "operator": "In",
                        "values": [
                            "us-central1-a"
                        ]
                    }
                ]
            }
        }
    ]
}
```

This sample demonstrates how to use multiple affinities in `loadAffinity`. This supports more complicated scenarios, e.g. filtering nodes that satisfy either of two conditions, instead of nodes that satisfy both.

In this example, the VGDP pods will be assigned to nodes whose instance type is `Standard_B4ms` or whose zone is `us-central1-a`.
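
For reference, multiple `loadAffinity` entries map naturally onto multiple `nodeSelectorTerms` in the pod's node affinity, and the Kubernetes scheduler ORs `nodeSelectorTerms` together. The fragment below is a sketch of what the resulting pod spec could look like, not a verbatim output of the exposer:

``` yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      # nodeSelectorTerms are ORed: a node satisfying ANY term is eligible
      nodeSelectorTerms:
      - matchExpressions:
        - key: beta.kubernetes.io/instance-type
          operator: In
          values:
          - Standard_B4ms
      - matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values:
          - us-central1-a
```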

#### LoadAffinity interacts with storage-class-specific elements

``` json
{
    "loadAffinity": [
        {
            "nodeSelector": {
                "matchLabels": {
                    "beta.kubernetes.io/instance-type": "Standard_B4ms"
                }
            }
        },
        {
            "nodeSelector": {
                "matchExpressions": [
                    {
                        "key": "kubernetes.io/os",
                        "values": [
                            "linux"
                        ],
                        "operator": "In"
                    }
                ]
            },
            "storageClass": "kibishii-storage-class"
        },
        {
            "nodeSelector": {
                "matchLabels": {
                    "beta.kubernetes.io/instance-type": "Standard_B8ms"
                }
            },
            "storageClass": "kibishii-storage-class"
        }
    ]
}
```

This sample demonstrates how the `loadAffinity` elements with and without the `StorageClass` field work together.
If the volume mounted by a VGDP instance is created from StorageClass `kibishii-storage-class`, its pod will run on Linux nodes or on nodes whose instance type is `Standard_B8ms`.

The other VGDP instances will run on nodes whose instance type is `Standard_B4ms`.
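
Reusing the hypothetical `selectAffinity` sketch from the Solution section, a VGDP instance for a `kibishii-storage-class` volume would receive the two dedicated elements (ORed together), while any other instance falls back to the global element:

```go
// Hypothetical usage of the selectAffinity sketch above.
dedicated := selectAffinity(configs.LoadAffinity, "kibishii-storage-class") // two dedicated elements, ORed
fallback := selectAffinity(configs.LoadAffinity, "another-storage-class")   // the global Standard_B4ms element
```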

#### LoadAffinity interacts with BackupPVC

``` json
{
    "loadAffinity": [
        {
            "nodeSelector": {
                "matchLabels": {
                    "beta.kubernetes.io/instance-type": "Standard_B4ms"
                }
            },
            "storageClass": "kibishii-storage-class"
        },
        {
            "nodeSelector": {
                "matchLabels": {
                    "beta.kubernetes.io/instance-type": "Standard_B2ms"
                }
            },
            "storageClass": "worker-storagepolicy"
        }
    ],
    "backupPVC": {
        "kibishii-storage-class": {
            "storageClass": "worker-storagepolicy"
        }
    }
}
```

The Velero data mover supports using a different StorageClass to create the backupPVC, per this [design](https://github.com/vmware-tanzu/velero/pull/7982).

In this example, if the backup target PVC's StorageClass is `kibishii-storage-class`, its backupPVC uses StorageClass `worker-storagepolicy`. Because the final StorageClass is `worker-storagepolicy`, the backupPod uses the affinity specified by the `loadAffinity` element whose `StorageClass` field is set to `worker-storagepolicy`: the backupPod will be assigned to nodes whose instance type is `Standard_B2ms`.

#### LoadAffinity interacts with RestorePVC

``` json
{
    "loadAffinity": [
        {
            "nodeSelector": {
                "matchLabels": {
                    "beta.kubernetes.io/instance-type": "Standard_B4ms"
                }
            },
            "storageClass": "kibishii-storage-class"
        }
    ],
    "restorePVC": {
        "ignoreDelayBinding": false
    }
}
```

##### StorageClass's bind mode is WaitForFirstConsumer

``` yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: kibishii-storage-class
parameters:
  svStorageClass: worker-storagepolicy
provisioner: csi.vsphere.vmware.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
```

Suppose the restorePVC should be created from StorageClass `kibishii-storage-class`, whose volumeBindingMode is `WaitForFirstConsumer`.
Although `loadAffinity` has an element matching the StorageClass, `ignoreDelayBinding` is set to `false`, so the Velero exposer will wait until the target Pod is scheduled to a node and return that node as the SelectedNode for the restorePVC.
As a result, the storage-class-specific `loadAffinity` element will not take effect.

##### StorageClass's bind mode is Immediate

``` yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: kibishii-storage-class
parameters:
  svStorageClass: worker-storagepolicy
provisioner: csi.vsphere.vmware.com
reclaimPolicy: Delete
volumeBindingMode: Immediate
```

Because the StorageClass's volumeBindingMode is `Immediate`, the restorePVC is not created according to the target Pod, even though `ignoreDelayBinding` is set to `false`.

The restorePod will be assigned to nodes whose instance type is `Standard_B4ms`.
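
To summarize the restore-side behaviors described in the two subsections above:

| StorageClass `volumeBindingMode` | `restorePVC.ignoreDelayBinding` | Result |
| --- | --- | --- |
| `WaitForFirstConsumer` | `false` | The exposer waits for the target Pod's node and uses it as the restorePVC's SelectedNode; `loadAffinity` is ignored |
| `WaitForFirstConsumer` | `true` | `loadAffinity` rules (including storage-class-specific elements) are applied |
| `Immediate` | `false` or `true` | `loadAffinity` rules are applied |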

[1]: Implemented/unified-repo-and-kopia-integration/unified-repo-and-kopia-integration.md
[2]: Implemented/volume-snapshot-data-movement/volume-snapshot-data-movement.md
[3]: Implemented/node-agent-affinity.md
