Fix: Never completionPolicy ImagePullJob timeout setting should consider backoffLimit #2072
base: master
Conversation
…der backoffLimit
Signed-off-by: majlu <[email protected]>
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing `/approve` in a comment.
Codecov Report

All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@ Coverage Diff @@
## master #2072 +/- ##
=======================================
Coverage 43.78% 43.78%
=======================================
Files 316 316
Lines 31617 31617
=======================================
+ Hits 13842 13845 +3
+ Misses 16378 16375 -3
Partials 1397 1397
/lgtm
@MajLuu The current timeout semantics are pretty clear, and we use them heavily internally. Is there an actual scenario on your end that the current logic cannot accommodate?
In our machine learning cluster, we set the pull timeout for a single image to 1 hour, with up to 3 retries, and we start many ImagePullJobs with the Never completionPolicy (10+). Our machine learning images (including CUDA, PyTorch, etc.) are more than 20 GB each. Nodes that have been running normally for a while manage to pull the images eventually, but a newly added node in the cluster suddenly receives the 10+ ImagePullJobs and cannot download the images within 1 hour. Treating the overall deadline as 3 × 1h solves this scenario; we do not think it is reasonable to instead change the per-attempt timeout to 3h with a single attempt.
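To make the arithmetic concrete, here is a minimal sketch of the deadline calculation this scenario asks for, assuming the job's pullPolicy exposes a per-attempt `TimeoutSeconds` and a `BackoffLimit` that bounds the number of attempts. The struct and the `effectiveDeadline` helper below are illustrative only, not the actual Kruise API types or the code changed by this PR.

```go
package main

import (
	"fmt"
	"time"
)

// pullPolicy mirrors the two ImagePullJob pullPolicy fields that matter here
// (illustrative struct, not the actual Kruise API type).
type pullPolicy struct {
	TimeoutSeconds int32 // timeout for a single pull attempt
	BackoffLimit   int32 // how many pull attempts are allowed per image
}

// effectiveDeadline returns the overall time budget for a Never
// completionPolicy job: each allowed attempt may take up to
// TimeoutSeconds, so the job-level deadline is their product.
func effectiveDeadline(p pullPolicy) time.Duration {
	attempts := p.BackoffLimit
	if attempts < 1 {
		attempts = 1 // always allow at least one attempt
	}
	return time.Duration(int64(p.TimeoutSeconds)*int64(attempts)) * time.Second
}

func main() {
	// The scenario from the comment above: 1h per attempt, 3 attempts.
	p := pullPolicy{TimeoutSeconds: 3600, BackoffLimit: 3}
	fmt.Println(effectiveDeadline(p)) // prints 3h0m0s rather than 1h0m0s
}
```

Whether backoffLimit counts total attempts or only the retries after the first attempt is an assumption here; the point is simply that the job-level deadline scales with the number of allowed attempts instead of staying at a single attempt's timeout.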
OK, we have also implemented the waiting mechanism in our internal 1.4 version and configured it to pull at most 3 images concurrently, and containerd itself limits the number of concurrent image pulls. However, disk IO was still excessive, which is why this code was modified. In our machine learning clusters, users can wait for the image to finish downloading; but our Flink clusters have a strict container startup timeout, and the container image must already exist on the node (the Flink nodes run a large number of pods, which drives node IO utilization high). The ImagePullJobs we create use the Never completionPolicy, so if an image cannot be downloaded within the specified time, it is only retried after 24 hours. The problem mentioned in the issue was discovered while troubleshooting inaccurate download timeouts. If the community considers this modification unnecessary, please close the issue. Thank you~
issue: 2071