job/presubmit/ccm-aws: bump mem and cpu limit to prevent OOMKill #35274

Merged
k8s-ci-robot merged 1 commit into kubernetes:master from mtulio:ccm-aws-presubmit-limit
Aug 8, 2025

@mtulio
Contributor

@mtulio mtulio commented Aug 7, 2025

The idea of this PR is to bump the resource limits of the e2e job, targeting
the stability of the existing presubmits, which currently have a high failure
ratio[1] and take many hours to get feedback to the user[2].

[1]
The root cause of most failures caused by CI infra appears to be
OOMKill. Here is one example of an e2e job using above mem and CPU limits:

You can see the recent instability of the e2e presubmits (almost two weeks):
https://prow.k8s.io/job-history/gs/kubernetes-ci-logs/pr-logs/directory/pull-cloud-provider-aws-e2e

[2] kubernetes-sigs/prow#210

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/config Issues or PRs related to code in /config labels Aug 7, 2025
@k8s-ci-robot
Contributor

Welcome @mtulio!

It looks like this is your first PR to kubernetes/test-infra 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/test-infra has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added area/jobs needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. area/provider/aws Issues or PRs related to aws provider labels Aug 7, 2025
@k8s-ci-robot
Contributor

Hi @mtulio. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Aug 7, 2025
@mtulio mtulio marked this pull request as draft August 7, 2025 00:15
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 7, 2025
@mtulio mtulio force-pushed the ccm-aws-presubmit-limit branch from 3c457d5 to fbc591d Compare August 7, 2025 00:51
@mtulio
Contributor Author

mtulio commented Aug 7, 2025

/test all

@k8s-ci-robot
Contributor

@mtulio: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

@mtulio
Contributor Author

mtulio commented Aug 7, 2025

cc @elmiko @kmala

@mtulio mtulio marked this pull request as ready for review August 7, 2025 02:19
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 7, 2025
@k8s-ci-robot k8s-ci-robot requested a review from wongma7 August 7, 2025 02:19
@kmala
Member

kmala commented Aug 7, 2025

/ok-to-test
/lgtm

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 7, 2025
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 7, 2025
@mtulio mtulio force-pushed the ccm-aws-presubmit-limit branch from fbc591d to 6bd91bd Compare August 7, 2025 13:04
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Aug 7, 2025
Comment thread config/jobs/kubernetes/cloud-provider-aws/cloud-provider-aws-presubmit.yaml Outdated
@mtulio mtulio force-pushed the ccm-aws-presubmit-limit branch from 6bd91bd to 4e3fb0e Compare August 7, 2025 15:14
@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Aug 7, 2025
@mtulio mtulio force-pushed the ccm-aws-presubmit-limit branch from 4e3fb0e to cb56d01 Compare August 7, 2025 15:16
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Aug 7, 2025
@mtulio mtulio force-pushed the ccm-aws-presubmit-limit branch from cb56d01 to b1cd68f Compare August 7, 2025 15:17
@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Aug 7, 2025
@mtulio mtulio force-pushed the ccm-aws-presubmit-limit branch from b1cd68f to 5705233 Compare August 7, 2025 16:33
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Aug 7, 2025
The idea of this PR is to bump the resource limits of the e2e job, targeting
the stability of the existing presubmits, which currently have a high failure
ratio[1] and take many hours to get feedback to the user[2].

Setting 3GiB/core to increase stability against OOM kills.

[1]
The root cause of most failures caused by CI infra appears to be
OOMKill. Here is one example of an e2e job using above mem and CPU limits:
https://monitoring-eks.prow.k8s.io/d/96Q8oOOZk/builds?orgId=1&var-org=kubernetes&var-repo=cloud-provider-aws&var-job=pull-cloud-provider-aws-e2e&var-build=All&from=1754491871179&to=1754494399603
https://issues.redhat.com/secure/attachment/13469904/13469904_Screenshot+From+2025-08-06+21-06-13.png
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/cloud-provider-aws/1158/pull-cloud-provider-aws-e2e/1953110200760143872
https://kubernetes.slack.com/archives/C7J9RP96G/p1754505741634999

You can see the recent instability of the e2e presubmits (almost two weeks):
https://prow.k8s.io/job-history/gs/kubernetes-ci-logs/pr-logs/directory/pull-cloud-provider-aws-e2e

[2] kubernetes-sigs/prow#210
@mtulio mtulio force-pushed the ccm-aws-presubmit-limit branch from 5705233 to 54e08a1 Compare August 7, 2025 16:39
@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Aug 7, 2025
@mtulio
Contributor Author

mtulio commented Aug 7, 2025

I made some changes based on the Slack conversation[1]. This PR is now ready for review: it keeps the current CPU value and bumps memory from 4 to 6GiB to resolve the OOMKill ASAP.

[1] https://kubernetes.slack.com/archives/C7J9RP96G/p1754505741634999
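
For readers skimming the thread, here is a minimal sketch of the container resources stanza after this change. Only `limits.cpu: 2` and the `limits.memory` bump from `4Gi` to `6Gi` come from the PR diff; the `requests` block and the surrounding nesting are assumptions added for illustration and may not match the actual job file.

```yaml
# Sketch only: fragment of a Prow presubmit container spec.
# limits.cpu and limits.memory are taken from the PR diff.
resources:
  limits:
    cpu: 2
    memory: 6Gi   # bumped from 4Gi to avoid OOMKill
  requests:       # ASSUMPTION: mirrors limits; not shown in the diff
    cpu: 2
    memory: 6Gi
```

With 2 cores and 6Gi, this works out to the 3GiB/core ratio mentioned in the commit message.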

@mtulio
Contributor Author

mtulio commented Aug 7, 2025

keeping the current CPU value and bumping memory from 4 to 6GiB to resolve asap the OOM Kill.

Is that okay with you, @kmala? Thanks!

@kmala
Member

kmala commented Aug 8, 2025

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 8, 2025
@mtulio
Contributor Author

mtulio commented Aug 8, 2025

/assign BenTheElder

  limits:
    cpu: 2
-   memory: 4Gi
+   memory: 6Gi
Member

let's start here to unblock, but I meant 6Gi per core :-)

Contributor Author

@mtulio mtulio Aug 11, 2025

We later aligned in Slack that we would not lower the existing CPU =]
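
The per-core ratios discussed in this review thread can be sketched as follows. This is illustrative arithmetic only: the first document reflects the merged values, while the `12Gi` and `cpu: 1` variants are hypothetical ways to reach 6Gi per core and were not proposed as concrete values in this PR.

```yaml
# Merged in this PR: 6Gi / 2 cores = 3Gi per core
limits:
  cpu: 2
  memory: 6Gi
---
# HYPOTHETICAL: 6Gi per core while keeping cpu: 2 (2 x 6Gi = 12Gi)
limits:
  cpu: 2
  memory: 12Gi
---
# HYPOTHETICAL: 6Gi per core by lowering CPU instead
# (the thread notes the existing CPU would not be lowered)
limits:
  cpu: 1
  memory: 6Gi
```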

Member

@BenTheElder BenTheElder left a comment

/lgtm
/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BenTheElder, mtulio

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 8, 2025
@k8s-ci-robot k8s-ci-robot merged commit a141325 into kubernetes:master Aug 8, 2025
7 checks passed
@k8s-ci-robot
Contributor

@mtulio: Updated the job-config configmap in namespace default at cluster test-infra-trusted using the following files:

  • key cloud-provider-aws-presubmit.yaml using file config/jobs/kubernetes/cloud-provider-aws/cloud-provider-aws-presubmit.yaml

@mtulio mtulio deleted the ccm-aws-presubmit-limit branch August 8, 2025 22:04
Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/config Issues or PRs related to code in /config area/jobs area/provider/aws Issues or PRs related to aws provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.

4 participants