Skip to content

Fix route-controller by fanning out getInstancesByIDs through the batcher#1388

Open
babyhuey wants to merge 1 commit into
kubernetes:masterfrom
babyhuey:fix-1351-getinstancesbyids-batcher
Open

Fix route-controller by fanning out getInstancesByIDs through the batcher#1388
babyhuey wants to merge 1 commit into
kubernetes:masterfrom
babyhuey:fix-1351-getinstancesbyids-batcher

Conversation

@babyhuey
Copy link
Copy Markdown

What type of PR is this?

/kind bug

What this PR does / why we need it:

describeInstanceBatcher.DescribeInstances requires an input with exactly one instance ID:

https://github.com/kubernetes/cloud-provider-aws/blob/master/pkg/providers/v1/describe_instance_batch.go#L56-L58

The batcher's BatchExecutor is what aggregates concurrent single-instance inputs back into one DescribeInstances API call. But getInstancesByIDs was passing the full slice of instance IDs as a single input, which trips that guard and errors out any time more than one instance is looked up.

The most visible consequence is that in clusters using kubenet / route-based pod networking, the route-controller's initial ListRoutes call always needs to look up every node and therefore always fails with expected to receive a single instance only, found N. No pod-CIDR VPC routes get created and source-dest-check is left enabled on new workers, so pods on new nodes have no connectivity — see #1351.

This PR fans out one goroutine per instance ID, each passing a single-instance input into the batcher. The batcher then coalesces those concurrent calls into one underlying DescribeInstances API call, preserving the original batching behavior while honoring the single-instance contract.

Which issue(s) this PR fixes:

Fixes #1351

Special notes for your reviewer:

  • go build ./... clean
  • go test ./pkg/providers/v1/... passes
  • The only other caller of getInstancesByIDs (aws.go:3412) already passes a single-ID slice, so its behavior is unchanged.

Does this PR introduce a user-facing change?

Fix route-controller and other code paths that resolve multiple instances in one call. `getInstancesByIDs` previously passed the full slice into `describeInstanceBatcher.DescribeInstances`, which rejects inputs with more than one ID, breaking the route-controller in kubenet/route-based clusters.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. labels Apr 23, 2026
@linux-foundation-easycla
Copy link
Copy Markdown

linux-foundation-easycla Bot commented Apr 23, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: babyhuey / name: Hannah Lyons (7e2c805)

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Apr 23, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

This issue is currently awaiting triage.

If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kmala for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

Welcome @babyhuey!

It looks like this is your first PR to kubernetes/cloud-provider-aws 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/cloud-provider-aws has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 23, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

Hi @babyhuey. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 23, 2026
@babyhuey
Copy link
Copy Markdown
Author

/check-cla

@babyhuey babyhuey force-pushed the fix-1351-getinstancesbyids-batcher branch from 1041ac0 to feeba59 Compare April 23, 2026 18:58
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 23, 2026
Comment thread pkg/providers/v1/aws.go
InstanceIds: instanceIDs,
}

instances, err := c.describeInstanceBatcher.DescribeInstances(ctx, request)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add the functionality in the batcher instead of the client making complex logic

@babyhuey babyhuey force-pushed the fix-1351-getinstancesbyids-batcher branch from feeba59 to 6c1ebf1 Compare May 3, 2026 14:52
@babyhuey
Copy link
Copy Markdown
Author

babyhuey commented May 3, 2026

Thanks @kmala, that makes sense. I moved the fan out into describeInstanceBatcher.DescribeInstances itself in the latest push, so multi ID inputs are split into one batcher.Add per ID inside the batcher and the caller in getInstancesByIDs is back to the simple single call. Added a regression test (TestDescribeInstanceBatchingWithMultipleInstanceIDs) covering the multi ID path. Local go test ./pkg/providers/v1/..., gofmt, golint, and vet all pass.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 3, 2026
The describeInstanceBatcher currently rejects DescribeInstances inputs
with anything other than a single instance ID, which breaks every caller
of getInstancesByIDs that legitimately passes multiple IDs (notably the
route controller's UpdateRoutes path). See kubernetes#1351.

Move the fan-out logic into the batcher itself: multi-ID inputs are
split into one batcher submission per ID so the batcher can coalesce
them with other in-flight requests, and the aggregated results (and
joined errors, if any) are returned to the caller. The caller in
getInstancesByIDs is unchanged.

Adds a regression test exercising a multi-ID input through the batcher.
@babyhuey babyhuey force-pushed the fix-1351-getinstancesbyids-batcher branch from 6c1ebf1 to 7e2c805 Compare May 3, 2026 16:12
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels May 3, 2026
@hakman
Copy link
Copy Markdown
Member

hakman commented May 8, 2026

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Route reconciliation fails with "expected to receive a single instance only" when multiple instance IDs are passed to describeInstanceBatcher

4 participants