[DO NOT MERGE] PoC: feat: AKS machine API integration#1102
[DO NOT MERGE] PoC: feat: AKS machine API integration#1102
Conversation
Bryce-Soghigian
left a comment
There was a problem hiding this comment.
👏 wow you turned that around fast!!
matthchr
left a comment
There was a problem hiding this comment.
Didnt go through every file in depth but left some general comments. I think I got at least most of the really important files.
| github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/compute/armcompute v1.0.0 | ||
| github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/compute/armcompute/v5 v5.7.0 | ||
| github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/containerservice/armcontainerservice/v4 v4.8.0 | ||
| github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/containerservice/armcontainerservice/v7 v7.3.0-beta.1 |
There was a problem hiding this comment.
Should be split out into a separate PR probably that just does this?
| instancePromise, err := c.instanceProvider.BeginCreate(ctx, nodeClass, nodeClaim, instanceTypes) | ||
|
|
||
| // Choose provider based on provision mode | ||
| if options.FromContext(ctx).ProvisionMode == consts.ProvisionModeAKSMachineAPI { |
There was a problem hiding this comment.
Wondering if we want an interface that abstracts over at least this part so that we don't need this options.FromContext(ctx).ProvisionMode == consts.ProvisionModeAKSMachineAPI everywhere.
There was a problem hiding this comment.
That's actually the only one. Delete/IsDrifted/Get/ is per nature of the given NodeClaim. List will list both regardless.
But, to the point that "decision + calling" should be abstracted, I don't think there is much value from it:

- cloudprovider.go: it is already one use for each, to the intuitively matching method
- drift.go: each type needs different handling, so better make it clear
- inplaceupdate/controller.go: see another thread
| // Resolve fills in dynamic launch template parameters. | ||
| // The name "imageFamilyResolver.Resolve()" is potentially misleading here. | ||
| // Suggestion: refactor would help, but this won't be used by PROVISION_MODE=aksmachineapi anyway. May not be worth it. | ||
| // ATTENTION!!!: changes here may NOT be effective on AKS machine nodes (ProvisionModeAKSMachineAPI); See aksmachineinstance.go/aksmachineinstancehelpers.go. |
There was a problem hiding this comment.
Agree this is sorta confusing. Also not sure if refactor is worth it, but splitting these specific changes into a smaller PR and/or looking more closely at just these bits may give us a better idea.
| // Existing AKS machine found, reuse it. | ||
|
|
||
| // Reconstruct properties from existing AKS machine instance. | ||
| if len(resp.Machine.Zones) == 0 || resp.Machine.Zones[0] == nil { |
There was a problem hiding this comment.
This seems strange/wrong to me possibly?
What if machine exists but wrong configuration? Returning success for beginCreate seems incorrect then? Plus API should be idempotent if we PUT w/ same configuration again, so why do we need to do this GET check at all?
There was a problem hiding this comment.
Plus API should be idempotent if we PUT w/ same configuration again
This check occurs before Karpenter select SKU/priority/zone. They are not necessarily deterministic, and immutable in Machine API. PUT again would fail if they change, even though both are correct per NodeClaim requirements (which gives the list of applicable ones than the selected one).
so why do we need to do this GET check at all?
Unless we handle that PUT failure instead (which is more complicated, as it is post-selection), we would waste a machine, which will be in numbers in the case where Karpenter restarts during mass provisioning, hurting performance. That was the case in past scale tests.
What if machine exists but wrong configuration? Returning success for beginCreate seems incorrect then?
If the configuration is wrong, then drift would be detected just like other machines?
The same issue and resolution exist on the race condition between configuration changes and ongoing provisioning.
| if gotAKSMachine.Properties.NodeImageVersion == nil { | ||
| return nil, fmt.Errorf("failed to get AKS machine instance %q once after begin creation: AKS machine node image version is nil", aksMachineName) | ||
| } | ||
| if gotAKSMachine.Properties.ProvisioningState != nil && lo.FromPtr(gotAKSMachine.Properties.ProvisioningState) == "Failed" { |
There was a problem hiding this comment.
Shoul;d have consts for terminal ProvStates - I assume SDK already defines them - use them
There was a problem hiding this comment.
assume SDK already defines them
Well, that's not true. Unless my search has been broken...
ed06957 to
e125779
Compare
…s instance provider types
DO NOT MERGE. PoC only.
Let's continue the reviews/merge/implementations in #1197