job/presubmit/ccm-aws: bump mem and cpu limit to prevent OOMKill by mtulio · Pull Request #35274 · kubernetes/test-infra

mtulio · 2025-08-07T00:15:22Z

The idea of this PR is to bump the resource limits of the e2e job to
stabilize the existing presubmits, which currently have a high failure
ratio[1] and take many hours to get feedback to the user[2].

[1]
The root cause of most failures caused by CI infra appears to be
OOMKill. Here is one example of an e2e job using the above mem and CPU limits:

failed job: https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/cloud-provider-aws/1158/pull-cloud-provider-aws-e2e/1953110200760143872
dashboard for referenced job: https://monitoring-eks.prow.k8s.io/d/96Q8oOOZk/builds?orgId=1&var-org=kubernetes&var-repo=cloud-provider-aws&var-job=pull-cloud-provider-aws-e2e&var-build=All&from=1754491871179&to=1754494399603
snapshot image (just in case the data points aren't available when this PR is reviewed): https://issues.redhat.com/secure/attachment/13469904/13469904_Screenshot+From+2025-08-06+21-06-13.png
Slack discussion about the issue: https://kubernetes.slack.com/archives/C7J9RP96G/p1754505741634999

You can see the recent instability of the e2e presubmits (almost two weeks):
https://prow.k8s.io/job-history/gs/kubernetes-ci-logs/pr-logs/directory/pull-cloud-provider-aws-e2e

[2] kubernetes-sigs/prow#210

k8s-ci-robot · 2025-08-07T00:15:31Z

Welcome @mtulio!

It looks like this is your first PR to kubernetes/test-infra 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/test-infra has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot · 2025-08-07T00:15:32Z

Hi @mtulio. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

mtulio · 2025-08-07T00:56:21Z

/test all

k8s-ci-robot · 2025-08-07T00:56:36Z

@mtulio: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

Details

In response to this:

/test all


mtulio · 2025-08-07T01:12:31Z

cc @elmiko @kmala

kmala · 2025-08-07T03:54:39Z

/ok-to-test
/lgtm

The idea of this PR is to bump the resource limits of the e2e job to stabilize the existing presubmits, which currently have a high failure ratio[1] and take many hours to get feedback to the user[2]. Setting 3GiB/core to increase stability against OOM kills.

[1] The root cause of most failures caused by CI infra appears to be OOMKill. Here is one example of an e2e job using the above mem and CPU limits:

https://monitoring-eks.prow.k8s.io/d/96Q8oOOZk/builds?orgId=1&var-org=kubernetes&var-repo=cloud-provider-aws&var-job=pull-cloud-provider-aws-e2e&var-build=All&from=1754491871179&to=1754494399603
https://issues.redhat.com/secure/attachment/13469904/13469904_Screenshot+From+2025-08-06+21-06-13.png
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/cloud-provider-aws/1158/pull-cloud-provider-aws-e2e/1953110200760143872
https://kubernetes.slack.com/archives/C7J9RP96G/p1754505741634999

You can see the recent instability of the e2e presubmits (almost two weeks):
https://prow.k8s.io/job-history/gs/kubernetes-ci-logs/pr-logs/directory/pull-cloud-provider-aws-e2e

[2] kubernetes-sigs/prow#210

mtulio · 2025-08-07T17:06:56Z

I made some changes based on the Slack conversation[1]; this PR is now ready for review, keeping the current CPU value and bumping memory from 4 to 6GiB to resolve the OOM kills ASAP.

[1] https://kubernetes.slack.com/archives/C7J9RP96G/p1754505741634999

mtulio · 2025-08-07T18:48:19Z

keeping the current CPU value and bumping memory from 4 to 6GiB to resolve the OOM kills ASAP.

Is that okay with you, @kmala? Thanks!

kmala · 2025-08-08T05:49:35Z

/lgtm

mtulio · 2025-08-08T14:24:47Z

/assign BenTheElder

BenTheElder · 2025-08-08T16:02:17Z

          limits:
            cpu: 2
-            memory: 4Gi
+            memory: 6Gi


let's start here to unblock, but I meant 6Gi per core :-)

We later aligned in Slack that we would not lower the existing CPU =]
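
To make the "6Gi per core" remark concrete, here is a minimal sketch of what that ratio would look like if it were applied to the unchanged `cpu: 2`. The 12Gi figure is only arithmetic on that ratio (2 cores x 6Gi) and is an assumption for illustration; this PR itself merges `memory: 6Gi`, which works out to 3Gi per core.

```yaml
# Hypothetical follow-up, NOT part of this PR.
# Merged here: cpu: 2, memory: 6Gi (3Gi per core).
limits:
  cpu: 2          # unchanged, per the Slack alignment
  memory: 12Gi    # assumption: 2 cores x 6Gi per core
```

Whether a later change should adopt that value depends on the memory usage observed on the job dashboard after this bump.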

BenTheElder

/lgtm
/approve

k8s-ci-robot · 2025-08-08T16:02:32Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BenTheElder, mtulio

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~config/jobs/kubernetes/cloud-provider-aws/OWNERS~~ [BenTheElder]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2025-08-08T16:15:47Z

@mtulio: Updated the job-config configmap in namespace default at cluster test-infra-trusted using the following files:

key cloud-provider-aws-presubmit.yaml using file config/jobs/kubernetes/cloud-provider-aws/cloud-provider-aws-presubmit.yaml

k8s-ci-robot added the cncf-cla: yes and area/config labels Aug 7, 2025

k8s-ci-robot added the area/jobs, needs-ok-to-test, size/XS, and area/provider/aws labels Aug 7, 2025

k8s-ci-robot requested review from andrewsykim and nckturner August 7, 2025 00:15

k8s-ci-robot added the sig/cloud-provider and sig/testing labels Aug 7, 2025

mtulio marked this pull request as draft August 7, 2025 00:15

k8s-ci-robot added the do-not-merge/work-in-progress label Aug 7, 2025

mtulio force-pushed the ccm-aws-presubmit-limit branch from 3c457d5 to fbc591d August 7, 2025 00:51

mtulio marked this pull request as ready for review August 7, 2025 02:19

k8s-ci-robot removed the do-not-merge/work-in-progress label Aug 7, 2025

k8s-ci-robot requested a review from wongma7 August 7, 2025 02:19

k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label Aug 7, 2025

k8s-ci-robot assigned kmala Aug 7, 2025

k8s-ci-robot added the lgtm label Aug 7, 2025

mtulio force-pushed the ccm-aws-presubmit-limit branch from fbc591d to 6bd91bd August 7, 2025 13:04

k8s-ci-robot added the size/S label and removed the lgtm and size/XS labels Aug 7, 2025

BenTheElder reviewed Aug 7, 2025


Comment thread config/jobs/kubernetes/cloud-provider-aws/cloud-provider-aws-presubmit.yaml Outdated

mtulio force-pushed the ccm-aws-presubmit-limit branch from 6bd91bd to 4e3fb0e August 7, 2025 15:14

k8s-ci-robot added the size/XS label and removed the size/S label Aug 7, 2025

mtulio force-pushed the ccm-aws-presubmit-limit branch from 4e3fb0e to cb56d01 August 7, 2025 15:16

k8s-ci-robot added the size/S label and removed the size/XS label Aug 7, 2025

mtulio force-pushed the ccm-aws-presubmit-limit branch from cb56d01 to b1cd68f August 7, 2025 15:17

k8s-ci-robot added the size/XS label and removed the size/S label Aug 7, 2025

mtulio force-pushed the ccm-aws-presubmit-limit branch from b1cd68f to 5705233 August 7, 2025 16:33

k8s-ci-robot added the size/S label and removed the size/XS label Aug 7, 2025

mtulio force-pushed the ccm-aws-presubmit-limit branch from 5705233 to 54e08a1 August 7, 2025 16:39

k8s-ci-robot added the size/XS label and removed the size/S label Aug 7, 2025

k8s-ci-robot added the lgtm label Aug 8, 2025

k8s-ci-robot assigned BenTheElder Aug 8, 2025

BenTheElder reviewed Aug 8, 2025


k8s-ci-robot added the approved label Aug 8, 2025

k8s-ci-robot merged commit a141325 into kubernetes:master Aug 8, 2025
7 checks passed

mtulio deleted the ccm-aws-presubmit-limit branch August 8, 2025 22:04
