Fix: Never completionPolicy ImagePullJob timeout setting should consider backoffLimit #2072

Open
wants to merge 1 commit into master

Conversation

@MajLuu (Contributor) commented May 31, 2025

issue: 2071

@kruise-bot kruise-bot requested review from FillZpp and veophi May 31, 2025 03:30
@kruise-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign fei-guo for approval by writing /assign @fei-guo in a comment. For more information see: The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kruise-bot added the size/M (30-99) label May 31, 2025

codecov bot commented May 31, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 43.78%. Comparing base (648f933) to head (f250d07).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #2072   +/-   ##
=======================================
  Coverage   43.78%   43.78%           
=======================================
  Files         316      316           
  Lines       31617    31617           
=======================================
+ Hits        13842    13845    +3     
+ Misses      16378    16375    -3     
  Partials     1397     1397           
Flag Coverage Δ
unittests 43.78% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.


@ABNER-1 (Member) commented Jun 4, 2025

/lgtm
@zmberg, could you please review this?

@MajLuu (Contributor, Author) commented Jun 10, 2025

@zmberg @FillZpp @veophi , could you please review this?

@zmberg (Member) commented Jun 23, 2025

@MajLuu Currently the timeout semantics are pretty clear, and we use this feature heavily internally. Is there an actual scenario on your end that the current logic can't accommodate?

@MajLuu (Contributor, Author) commented Jun 23, 2025

> @MajLuu Currently the timeout semantics are pretty clear, and we use this feature heavily internally. Is there an actual scenario on your end that the current logic can't accommodate?

In our machine learning cluster, we set the pull timeout for a single image to 1 hour, with up to 3 retries, and we start many ImagePullJobs with the Never completionPolicy (10+) at the same time. Our machine learning images (including CUDA, PyTorch, etc.) are more than 20 GB each. Most nodes pull the images successfully after running normally for a while, but a node newly added to the cluster suddenly receives 10+ image pull jobs at once and cannot finish downloading the images within 1 hour. Treating the effective deadline as 3 × 1 h would cover this scenario; we do not think it is reasonable to instead change the timeout itself to 3 h with a single attempt.
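For illustration, here is a minimal Go sketch of the deadline calculation requested above. The helper function, its arguments, and the attempt-counting convention are assumptions made for this example (they are not the actual OpenKruise controller code); it simply multiplies the per-attempt timeout by the number of allowed attempts, matching the 3 × 1 h arithmetic in the comment.

```go
package main

import (
	"fmt"
	"time"
)

// effectiveDeadline multiplies the per-attempt pull timeout by the number of
// attempts the job may make. Whether backoffLimit counts total attempts or
// retries after the first attempt is a detail of the real controller; this
// sketch just uses an explicit attempt count.
func effectiveDeadline(perAttemptTimeout time.Duration, attempts int32) time.Duration {
	if attempts < 1 {
		attempts = 1
	}
	return time.Duration(attempts) * perAttemptTimeout
}

func main() {
	// Scenario from the comment: 1 hour per attempt, 3 attempts allowed.
	fmt.Println(effectiveDeadline(time.Hour, 3)) // 3h0m0s
}
```

Under this interpretation, the job in the scenario above would only be marked as timed out after 3 hours rather than after the first 1-hour attempt.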

@furykerry (Member) commented

@MajLuu we've added a command line argument for controlling the number of concurrent image-pulling workers; setting the worker limit may help increase the effective image-pulling efficiency for newly added nodes. Please check the patch for more detail: 318165b
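As a rough sketch of the worker-limit idea (the names below are illustrative and not the actual kruise-daemon flag or code), concurrent pulls on a node can be bounded with a simple semaphore so that a backlog of jobs is worked through a few images at a time:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// pullImage stands in for a real image pull; here it just sleeps briefly.
func pullImage(image string) {
	fmt.Println("pulling", image)
	time.Sleep(100 * time.Millisecond)
}

func main() {
	images := []string{"cuda:12", "pytorch:2.3", "flink:1.19", "app:v1", "app:v2"}

	// Hypothetical worker limit, analogous to a "max pulling workers" setting:
	// at most 2 pulls run at the same time, the rest wait for a free slot.
	const maxWorkers = 2
	sem := make(chan struct{}, maxWorkers)

	var wg sync.WaitGroup
	for _, img := range images {
		wg.Add(1)
		sem <- struct{}{} // acquire a worker slot
		go func(img string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			pullImage(img)
		}(img)
	}
	wg.Wait()
}
```

With such a limit, a newly added node that suddenly receives 10+ pull jobs downloads a few images at a time instead of saturating disk IO on all of them at once.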

@MajLuu (Contributor, Author) commented Jun 29, 2025

> @MajLuu we've added a command line argument for controlling the number of concurrent image-pulling workers; setting the worker limit may help increase the effective image-pulling efficiency for newly added nodes. Please check the patch for more detail: 318165b

OK. We have also implemented a waiting mechanism in our internal 1.4 version and configured it to pull at most 3 images at a time, and containerd itself limits the number of concurrent image pulls as well. Even so, disk IO remained excessive, which is why this code was modified.

In our machine learning clusters, users can simply wait for the image to finish downloading. Our Flink clusters, however, have a strict container startup timeout, so the image must already exist on the node (Flink nodes run a large number of pods, which drives the node's IO utilization up). The ImagePullJobs we create use the Never completionPolicy, and if an image cannot be downloaded within the specified time, the pull can only continue after 24 hours. The problem described in the issue was discovered while troubleshooting these inaccurate download timeouts. If the community considers this modification unnecessary, please close the issue. Thank you~

Labels: lgtm, size/M (30-99)

5 participants