
[BUG] modifying maxSurge doesn't work under some settings #1666

Open
@myname4423

Description

What happened:
Test 1:
step 1: create a CloneSet with the following settings (replicas=4, paused=false, maxSurge=100%, maxUnavailable=0):
[screenshot: CloneSet settings]
step 2: update the cloneset.spec.template to release a new version [As expected]:
get the cloneset:
[screenshot: kubectl get output for the CloneSet]
the log:
I0719 09:37:04.282526 1 cloneset_sync_utils.go:124] Calculate diffs for CloneSet bg/echoserver, replicas=4, partition=0, maxSurge=4, maxUnavailable=0, allPods=4, newRevisionPods=0, newRevisionActivePods=0, oldRevisionPods=4, oldRevisionActivePods=4, unavailableNewRevisionCount=0, unavailableOldRevisionCount=4, preDeletingNewRevisionCount=0, preDeletingOldRevisionCount=0, toDeleteNewRevisionCount=0, toDeleteOldRevisionCount=0, enabledPreparingUpdateAsUpdate=false, useDefaultIsPodUpdate=false. Result: {scaleUpNum:4 scaleUpNumOldRevision:0 scaleDownNum:0 scaleDownNumOldRevision:0 scaleUpLimit:4 deleteReadyLimit:0 useSurge:4 useSurgeOldRevision:0 updateNum:4 updateMaxUnavailable:0}
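
Note that the maxSurge value in the log is already resolved from the percentage against replicas=4. A minimal Go sketch of that resolution, assuming the standard apimachinery intstr helper is used (an assumption for illustration, not a claim about the actual CloneSet code):

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	replicas := 4
	for _, pct := range []string{"50%", "100%"} {
		v := intstr.FromString(pct)
		// Round up, as surge-style percentages are typically rounded up.
		n, err := intstr.GetScaledValueFromIntOrPercent(&v, replicas, true)
		if err != nil {
			panic(err)
		}
		fmt.Printf("maxSurge=%s with replicas=%d resolves to %d pods\n", pct, replicas, n)
	}
}

This is why the logs below show maxSurge=4 for 100% and maxSurge=2 for 50%.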

Test 2:
step 1: create a CloneSet with the following settings (replicas=4, paused=false), similar to Test 1 but with maxSurge=50%:
[screenshot: CloneSet settings]

step 2: update the cloneset.spec.template to release a new version [As expected]:
get the cloneset:
[screenshot: kubectl get output for the CloneSet]
the log:
I0719 09:28:51.366992 1 cloneset_sync_utils.go:124] Calculate diffs for CloneSet bg/echoserver, replicas=4, partition=0, maxSurge=2, maxUnavailable=0, allPods=4, newRevisionPods=0, newRevisionActivePods=0, oldRevisionPods=4, oldRevisionActivePods=4, unavailableNewRevisionCount=0, unavailableOldRevisionCount=4, preDeletingNewRevisionCount=0, preDeletingOldRevisionCount=0, toDeleteNewRevisionCount=0, toDeleteOldRevisionCount=0, enabledPreparingUpdateAsUpdate=false, useDefaultIsPodUpdate=false. Result: {scaleUpNum:2 scaleUpNumOldRevision:0 scaleDownNum:0 scaleDownNumOldRevision:0 scaleUpLimit:2 deleteReadyLimit:0 useSurge:2 useSurgeOldRevision:0 updateNum:4 updateMaxUnavailable:0}

So far, so good.

step 3: update the maxSurge from 50% to 100%
get the cloneset:
[screenshot: kubectl get output for the CloneSet]
complete YAML:

apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apps.kruise.io/v1alpha1","kind":"CloneSet","metadata":{"annotations":{},"labels":{"app":"echoserver"},"name":"echoserver","namespace":"bg"},"spec":{"minReadySeconds":99999,"replicas":4,"selector":{"matchLabels":{"app":"echoserver"}},"template":{"metadata":{"labels":{"app":"echoserver","version":"base"}},"spec":{"containers":[{"env":[{"name":"version","value":"base"},{"name":"PORT","value":"8080"},{"name":"POD_NAME","valueFrom":{"fieldRef":{"fieldPath":"metadata.name"}}},{"name":"POD_NAMESPACE","valueFrom":{"fieldRef":{"fieldPath":"metadata.namespace"}}},{"name":"POD_IP","valueFrom":{"fieldRef":{"fieldPath":"status.podIP"}}},{"name":"NODE_NAME","value":"version2"}],"image":"cilium/echoserver:latest","imagePullPolicy":"IfNotPresent","name":"echoserver","ports":[{"containerPort":8080,"name":"http"}]}]}},"updateStrategy":{"maxSurge":"100%","maxUnavailable":0,"partition":0,"paused":false,"type":"ReCreate"}}}
  creationTimestamp: "2024-07-19T09:28:24Z"
  generation: 3
  labels:
    app: echoserver
  name: echoserver
  namespace: bg
  resourceVersion: "14120354"
  uid: e9b9f64c-eb05-48a8-89be-4db735f2a011
spec:
  minReadySeconds: 99999
  replicas: 4
  revisionHistoryLimit: 10
  scaleStrategy: {}
  selector:
    matchLabels:
      app: echoserver
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: echoserver
        version: base
    spec:
      containers:
      - env:
        - name: version
          value: base
        - name: PORT
          value: "8080"
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: NODE_NAME
          value: version2
        image: cilium/echoserver:latest
        imagePullPolicy: IfNotPresent
        name: echoserver
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
  updateStrategy:
    maxSurge: 100%
    maxUnavailable: 0
    partition: 0
    type: ReCreate
status:
  availableReplicas: 0
  collisionCount: 0
  currentRevision: echoserver-54bf67658f
  expectedUpdatedReplicas: 4
  labelSelector: app=echoserver
  observedGeneration: 3
  readyReplicas: 6
  replicas: 6
  updateRevision: echoserver-5464fddcfb
  updatedReadyReplicas: 2
  updatedReplicas: 2
the log:
I0719 09:32:09.518530 1 cloneset_sync_utils.go:124] Calculate diffs for CloneSet bg/echoserver, replicas=4, partition=0, maxSurge=4, maxUnavailable=0, allPods=6, newRevisionPods=2, newRevisionActivePods=2, oldRevisionPods=4, oldRevisionActivePods=4, unavailableNewRevisionCount=2, unavailableOldRevisionCount=4, preDeletingNewRevisionCount=0, preDeletingOldRevisionCount=0, toDeleteNewRevisionCount=0, toDeleteOldRevisionCount=0, enabledPreparingUpdateAsUpdate=false, useDefaultIsPodUpdate=false. Result: {scaleUpNum:0 scaleUpNumOldRevision:0 scaleDownNum:0 scaleDownNumOldRevision:0 scaleUpLimit:0 deleteReadyLimit:0 useSurge:2 useSurgeOldRevision:0 updateNum:2 updateMaxUnavailable:2}

We can see that useSurge is still 2 instead of 4 as in Test 1, and the CloneSet still has only 50% of its pods on the new revision (the status above shows 6 pods total with only 2 updated) instead of 100% as expected.

What you expected to happen:
Test 2 should produce a result consistent with Test 1, i.e. when maxSurge is changed from 50% to 100%, the controller is expected to surge new-revision pods up to 100% of replicas, instead of having no effect as it does now.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:
the "bug" is probably caused by this logic in function calculateDiffsWithExpectation

if scaleSurge >= updateSurge {

useSurge is calculated as min(maxSurge, max(scaleSurge, updateSurge)), where maxSurge is 4, scaleSurge is 0, and updateSurge is:

  • in Test 1, it is min(|4|, |-4|) = 4, so the final useSurge is 4
  • in Test 2, it is min(|4|, |-2|) = 2, so the final useSurge is 2, which causes this bug (see the sketch below)
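
To make the arithmetic concrete, here is a minimal Go sketch of the relationship described above, with the values from the two tests plugged in; it only replays the reported numbers and is not the controller source:

package main

import "fmt"

// useSurge mirrors the relationship described above for
// calculateDiffsWithExpectation: min(maxSurge, max(scaleSurge, updateSurge)).
func useSurge(maxSurge, scaleSurge, updateSurge int) int {
	surge := scaleSurge
	if updateSurge > surge {
		surge = updateSurge
	}
	if maxSurge < surge {
		surge = maxSurge
	}
	return surge
}

func main() {
	// Test 1: maxSurge=4 (100% of 4), scaleSurge=0, updateSurge=4 -> useSurge=4.
	fmt.Println("test 1:", useSurge(4, 0, 4))

	// Test 2, step 3: maxSurge=4 after the 50% -> 100% change, but updateSurge
	// is still 2, so useSurge stays at 2 and no extra pods are surged.
	fmt.Println("test 2:", useSurge(4, 0, 2))
}

Under this relationship, raising maxSurge alone does not raise useSurge once updateSurge is capped at 2, which matches the observed behavior.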

Environment:

  • Kruise version:
  • Kubernetes version (use kubectl version):
  • Install details (e.g. helm install args):
  • Others:

Labels

kind/bug (Something isn't working)
