Fix: Never completionPolicy ImagePullJob timeout setting should consider backoffLimit #2072

Open
wants to merge 1 commit into master

Conversation

@MajLuu (Contributor) commented May 31, 2025

issue: 2071

@kruise-bot kruise-bot requested review from FillZpp and veophi May 31, 2025 03:30
@kruise-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign fei-guo for approval by writing /assign @fei-guo in a comment. For more information see: The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kruise-bot added the size/M (30-99) label May 31, 2025

codecov bot commented May 31, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 43.78%. Comparing base (648f933) to head (f250d07).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #2072   +/-   ##
=======================================
  Coverage   43.78%   43.78%           
=======================================
  Files         316      316           
  Lines       31617    31617           
=======================================
+ Hits        13842    13845    +3     
+ Misses      16378    16375    -3     
  Partials     1397     1397           
Flag Coverage Δ
unittests 43.78% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.


@ABNER-1 (Member) commented Jun 4, 2025

/lgtm
@zmberg, could you please review this?

@MajLuu (Contributor, Author) commented Jun 10, 2025

@zmberg @FillZpp @veophi , could you please review this?

@zmberg (Member) commented Jun 23, 2025

@MajLuu Currently the timeout semantics are pretty clear, and we use this feature heavily internally. Is there an actual scenario on your end that the current logic can't accommodate?

@MajLuu (Contributor, Author) commented Jun 23, 2025

> @MajLuu Currently the timeout semantics are pretty clear, and we use this feature heavily internally. Is there an actual scenario on your end that the current logic can't accommodate?

In our machine learning cluster, we set the pull timeout for a single image to 1 hour, with up to 3 retries, and we start many ImagePullJobs with the Never completionPolicy (10+) at the same time. Our machine learning images (including CUDA, PyTorch, etc.) are more than 20 GB each. Most nodes pull the images successfully after running normally for a while, but a node newly added to the cluster suddenly receives 10+ image pull jobs at once and cannot finish downloading the images within 1 hour. Treating the effective deadline as 3 × 1 h would cover this scenario; we do not think it is reasonable to instead change the timeout itself to 3 h with a single attempt.
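For illustration, here is a minimal Go sketch of the deadline calculation requested above. The helper function, its arguments, and the attempt-counting convention are assumptions made for this example (they are not the actual OpenKruise controller code); it simply multiplies the per-attempt timeout by the number of allowed attempts, matching the 3 × 1 h arithmetic in the comment.

```go
package main

import (
	"fmt"
	"time"
)

// effectiveDeadline multiplies the per-attempt pull timeout by the number of
// attempts the job may make. Whether backoffLimit counts total attempts or
// retries after the first attempt is a detail of the real controller; this
// sketch just uses an explicit attempt count.
func effectiveDeadline(perAttemptTimeout time.Duration, attempts int32) time.Duration {
	if attempts < 1 {
		attempts = 1
	}
	return time.Duration(attempts) * perAttemptTimeout
}

func main() {
	// Scenario from the comment: 1 hour per attempt, 3 attempts allowed.
	fmt.Println(effectiveDeadline(time.Hour, 3)) // 3h0m0s
}
```

Under this interpretation, the job in the scenario above would only be marked as timed out after 3 hours rather than after the first 1-hour attempt.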

@furykerry (Member) commented

@MajLuu we've added a command line argument for controlling the number of concurrent image-pulling workers; setting the worker limit may help increase the effective image-pulling efficiency for newly added nodes. Please check the patch for more detail: 318165b
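As a rough sketch of the worker-limit idea (the names below are illustrative and not the actual kruise-daemon flag or code), concurrent pulls on a node can be bounded with a simple semaphore so that a backlog of jobs is worked through a few images at a time:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// pullImage stands in for a real image pull; here it just sleeps briefly.
func pullImage(image string) {
	fmt.Println("pulling", image)
	time.Sleep(100 * time.Millisecond)
}

func main() {
	images := []string{"cuda:12", "pytorch:2.3", "flink:1.19", "app:v1", "app:v2"}

	// Hypothetical worker limit, analogous to a "max pulling workers" setting:
	// at most 2 pulls run at the same time, the rest wait for a free slot.
	const maxWorkers = 2
	sem := make(chan struct{}, maxWorkers)

	var wg sync.WaitGroup
	for _, img := range images {
		wg.Add(1)
		sem <- struct{}{} // acquire a worker slot
		go func(img string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			pullImage(img)
		}(img)
	}
	wg.Wait()
}
```

With such a limit, a newly added node that suddenly receives 10+ pull jobs downloads a few images at a time instead of saturating disk IO on all of them at once.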

@MajLuu (Contributor, Author) commented Jun 29, 2025

> @MajLuu we've added a command line argument for controlling the number of concurrent image-pulling workers; setting the worker limit may help increase the effective image-pulling efficiency for newly added nodes. Please check the patch for more detail: 318165b

OK. We have also implemented a waiting mechanism in our internal 1.4 version and configured it to pull at most 3 images at a time, and containerd itself limits the number of concurrent image pulls as well. Even so, disk IO remained excessive, which is why this code was modified.

In our machine learning clusters, users can simply wait for the image to finish downloading. Our Flink clusters, however, have a strict container startup timeout, so the image must already exist on the node (Flink nodes run a large number of pods, which drives the node's IO utilization up). The ImagePullJobs we create use the Never completionPolicy, and if an image cannot be downloaded within the specified time, the pull can only continue after 24 hours. The problem described in the issue was discovered while troubleshooting these inaccurate download timeouts. If the community considers this modification unnecessary, please close the issue. Thank you~

Labels: lgtm, size/M (30-99)

5 participants