[DO NOT MERGE] PoC: feat: AKS machine API integration#1102

Closed
comtalyst wants to merge 53 commits into main from comtalyst/xpmt-aks-machine-api

Conversation

Collaborator

comtalyst commented Aug 13, 2025

DO NOT MERGE. PoC only.

Let's continue the reviews/merge/implementations in #1197

Comment thread pkg/utils/image.go
Contributor

Bryce-Soghigian left a comment

👏 wow you turned that around fast!!

Member

matthchr left a comment

Didn't go through every file in depth, but left some general comments. I think I covered at least most of the really important files.

Comment thread pkg/providers/instance/aksmachineinstance.go
Comment thread go.mod
github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/compute/armcompute v1.0.0
github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/compute/armcompute/v5 v5.7.0
github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/containerservice/armcontainerservice/v4 v4.8.0
github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/containerservice/armcontainerservice/v7 v7.3.0-beta.1
Member

This should probably be split out into a separate PR that does just this?

Comment thread pkg/apis/v1beta1/labels.go
instancePromise, err := c.instanceProvider.BeginCreate(ctx, nodeClass, nodeClaim, instanceTypes)

// Choose provider based on provision mode
if options.FromContext(ctx).ProvisionMode == consts.ProvisionModeAKSMachineAPI {
Member

I'm wondering if we want an interface that abstracts over at least this part, so that we don't need the options.FromContext(ctx).ProvisionMode == consts.ProvisionModeAKSMachineAPI check everywhere.
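
For illustration, a minimal sketch of what such an abstraction could look like. Every name here (InstanceCreator, modeRouter, createWithMode, the two placeholder creators, and the string results) is hypothetical and not taken from the PR; only the aksmachineapi mode value appears in this thread. The idea is to make the provision-mode decision in one place so that call sites depend only on the interface.

```go
package main

import (
	"context"
	"fmt"
)

// ProvisionMode stands in for options.FromContext(ctx).ProvisionMode.
type ProvisionMode string

// ModeAKSMachineAPI mirrors consts.ProvisionModeAKSMachineAPI; the literal
// value is taken from the PROVISION_MODE=aksmachineapi mention in the diff.
const ModeAKSMachineAPI ProvisionMode = "aksmachineapi"

// InstanceCreator is a hypothetical interface covering only the create path.
type InstanceCreator interface {
	BeginCreate(ctx context.Context, nodeClaimName string) (string, error)
}

// vmCreator and aksMachineCreator are placeholder implementations.
type vmCreator struct{}

func (vmCreator) BeginCreate(_ context.Context, name string) (string, error) {
	return "vm/" + name, nil
}

type aksMachineCreator struct{}

func (aksMachineCreator) BeginCreate(_ context.Context, name string) (string, error) {
	return "aksmachine/" + name, nil
}

// modeRouter holds the single mode check so that callers do not repeat it.
type modeRouter struct {
	mode       ProvisionMode
	vm         InstanceCreator
	aksMachine InstanceCreator
}

func (r modeRouter) BeginCreate(ctx context.Context, name string) (string, error) {
	if r.mode == ModeAKSMachineAPI {
		return r.aksMachine.BeginCreate(ctx, name)
	}
	// Assumption: every other mode goes through the VM-based provider.
	return r.vm.BeginCreate(ctx, name)
}

// createWithMode wires up a router and issues one create call.
func createWithMode(mode ProvisionMode, name string) (string, error) {
	var c InstanceCreator = modeRouter{mode: mode, vm: vmCreator{}, aksMachine: aksMachineCreator{}}
	return c.BeginCreate(context.Background(), name)
}

func main() {
	id, err := createWithMode(ModeAKSMachineAPI, "nodeclaim-a")
	fmt.Println(id, err)
}
```

Whether this pays off depends on how many call sites actually branch on the mode, which is the question the reply below takes up.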

Collaborator Author

That's actually the only place. Delete/IsDrifted/Get are determined by the nature of the given NodeClaim. List will list both kinds regardless.

But on the point that "decision + calling" should be abstracted, I don't think there is much value in it:

  • cloudprovider.go: there is already just one use of each, in the intuitively matching method
  • drift.go: each type needs different handling, so it is better to make that explicit
  • inplaceupdate/controller.go: see the other thread

Comment thread pkg/providers/instance/aksmachineinstance.go Outdated
// Resolve fills in dynamic launch template parameters.
// The name "imageFamilyResolver.Resolve()" is potentially misleading here.
// Suggestion: refactor would help, but this won't be used by PROVISION_MODE=aksmachineapi anyway. May not be worth it.
// ATTENTION!!!: changes here may NOT be effective on AKS machine nodes (ProvisionModeAKSMachineAPI); See aksmachineinstance.go/aksmachineinstancehelpers.go.
Member

Agree this is sort of confusing. I'm also not sure a refactor is worth it, but splitting these specific changes into a smaller PR and/or looking more closely at just these bits may give us a better idea.

Comment thread pkg/providers/instance/aksmachineinstance.go Outdated
// Existing AKS machine found, reuse it.

// Reconstruct properties from existing AKS machine instance.
if len(resp.Machine.Zones) == 0 || resp.Machine.Zones[0] == nil {
Member

This seems strange, and possibly wrong, to me.

What if machine exists but wrong configuration? Returning success for beginCreate seems incorrect then? Plus API should be idempotent if we PUT w/ same configuration again, so why do we need to do this GET check at all?

Collaborator Author

> Plus API should be idempotent if we PUT w/ same configuration again

This check occurs before Karpenter selects the SKU/priority/zone. Those selections are not necessarily deterministic, and they are immutable in the Machine API. A second PUT would fail if they changed, even though both selections are correct per the NodeClaim requirements (which give the list of applicable options rather than the selected one).

> so why do we need to do this GET check at all?

Unless we handle that PUT failure instead (which is more complicated, as it happens post-selection), we would waste a machine. That adds up to many machines when Karpenter restarts during mass provisioning, which hurts performance. That was the case in past scale tests.

> What if machine exists but wrong configuration? Returning success for beginCreate seems incorrect then?

If the configuration is wrong, then drift would be detected, just as for other machines?
The same issue and resolution apply to the race condition between configuration changes and ongoing provisioning.
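
The trade-off described above can be sketched with an in-memory stand-in for the Machine API. The names fakeMachineAPI, machine, beginCreate, and selectFn are all hypothetical and exist only for illustration; the real provider and SDK calls differ. The sketch assumes, as stated above, that a PUT is idempotent for an identical configuration but is rejected when immutable fields such as SKU or zone change.

```go
package main

import (
	"errors"
	"fmt"
)

// machine holds the fields that the discussion treats as immutable.
type machine struct {
	SKU  string
	Zone string
}

var errImmutable = errors.New("immutable field changed")

// fakeMachineAPI is an in-memory stand-in, not the real client.
type fakeMachineAPI struct {
	machines map[string]machine
}

func (f *fakeMachineAPI) Get(name string) (machine, bool) {
	m, ok := f.machines[name]
	return m, ok
}

// Put accepts an identical configuration but rejects changed immutable fields.
func (f *fakeMachineAPI) Put(name string, m machine) error {
	if existing, ok := f.machines[name]; ok && existing != m {
		return fmt.Errorf("put %q: %w", name, errImmutable)
	}
	f.machines[name] = m
	return nil
}

// beginCreate checks for an existing machine before running selection.
// selectFn stands in for the non-deterministic SKU/zone choice.
// The bool result reports whether an existing machine was reused.
func beginCreate(api *fakeMachineAPI, name string, selectFn func() machine) (machine, bool, error) {
	if existing, ok := api.Get(name); ok {
		return existing, true, nil
	}
	m := selectFn()
	if err := api.Put(name, m); err != nil {
		return machine{}, false, err
	}
	return m, false, nil
}

func main() {
	api := &fakeMachineAPI{machines: map[string]machine{}}
	first, _, _ := beginCreate(api, "m1", func() machine {
		return machine{SKU: "sku-a", Zone: "1"}
	})
	// After a restart, selection may land on a different but equally valid choice.
	second, reused, err := beginCreate(api, "m1", func() machine {
		return machine{SKU: "sku-b", Zone: "2"}
	})
	fmt.Println(first == second, reused, err)
	// Without the GET check, the same retry would hit the immutability error.
	fmt.Println(api.Put("m1", machine{SKU: "sku-b", Zone: "2"}))
}
```

Note that the sketch returns the existing machine without validating its configuration, which matches the position above that a wrong configuration is left to drift detection.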

Comment thread pkg/providers/instance/aksmachineinstance.go
if gotAKSMachine.Properties.NodeImageVersion == nil {
return nil, fmt.Errorf("failed to get AKS machine instance %q once after begin creation: AKS machine node image version is nil", aksMachineName)
}
if gotAKSMachine.Properties.ProvisioningState != nil && lo.FromPtr(gotAKSMachine.Properties.ProvisioningState) == "Failed" {
Member

Should have consts for terminal ProvStates - I assume SDK already defines them - use them

Collaborator Author

> assume SDK already defines them

Well, that's not true. Unless my search has been broken...
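
If the SDK indeed does not export these values, one option is a small set of local constants. This is only a sketch: the constant and function names are hypothetical, and the list assumes the usual ARM terminal states (Succeeded, Failed, Canceled); only "Failed" appears in the diff above.

```go
package main

import "fmt"

// Locally defined provisioning states. These names are hypothetical and the
// set is an assumption; only "Failed" is confirmed by the diff in this thread.
const (
	provisioningStateSucceeded = "Succeeded"
	provisioningStateFailed    = "Failed"
	provisioningStateCanceled  = "Canceled"
)

// isTerminalProvisioningState reports whether state is one of the terminal
// values above. A nil state is treated as not terminal.
func isTerminalProvisioningState(state *string) bool {
	if state == nil {
		return false
	}
	switch *state {
	case provisioningStateSucceeded, provisioningStateFailed, provisioningStateCanceled:
		return true
	}
	return false
}

func main() {
	failed := provisioningStateFailed
	creating := "Creating"
	fmt.Println(isTerminalProvisioningState(&failed))
	fmt.Println(isTerminalProvisioningState(&creating))
	fmt.Println(isTerminalProvisioningState(nil))
}
```

The string comparison in the diff could then read lo.FromPtr(gotAKSMachine.Properties.ProvisioningState) == provisioningStateFailed, keeping the literal in one place.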

comtalyst force-pushed the comtalyst/xpmt-aks-machine-api branch from ed06957 to e125779 on September 8, 2025 06:37