allocator: Less aggressive retry #2021
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2021 +/- ##
==========================================
+ Coverage 53.66% 53.71% +0.05%
==========================================
Files 109 109
Lines 18991 19008 +17
==========================================
+ Hits 10191 10210 +19
+ Misses 7578 7564 -14
- Partials 1222 1234 +12
manager/allocator/network.go
Outdated
```diff
 		a.doNodeAlloc(ctx, ev)
 	case state.EventCreateTask, state.EventUpdateTask, state.EventDeleteTask:
-		a.doTaskAlloc(ctx, ev)
+		a.doTaskAlloc(ctx, ev, nc.pendingTasks)
```
Couldn't doTaskAlloc(ctx, ev) retrieve pendingTasks on its own via ctx.nc.pendingTasks?
manager/allocator/network.go
Outdated
```diff
-func (a *Allocator) procUnallocatedTasksNetwork(ctx context.Context) {
+func (a *Allocator) procTasksNetwork(ctx context.Context, toAllocate map[string]*api.Task, quiet bool) {
 	nc := a.netCtx
 	allocatedTasks := make([]*api.Task, 0, len(nc.unallocatedTasks))
```
If working on the nc retrieved from the context is equivalent, would it make sense to write this method as:

```go
func (a *Allocator) procTasksNetwork(ctx context.Context, onRetryInterval bool) {
	nc := a.netCtx
	quiet := false
	toAllocate := nc.pendingTasks
	if onRetryInterval {
		toAllocate = nc.unallocatedTasks
		quiet = true
	}
	...
```
Logic looks good to me.
Force-pushed 6e78fc2 to 456c2ec
Updated, thanks
manager/allocator/network.go
Outdated
```diff
-	allocatedTasks := make([]*api.Task, 0, len(nc.unallocatedTasks))
+	quiet := false
+	toAllocate := nc.pendingTasks
+	allocatedTasks := make([]*api.Task, 0, len(toAllocate))
```
This line should go below the if block, after which we know what toAllocate points to
Instead of retrying unallocated tasks, services, and networks every time data changes in the store, limit these retries to every 5 minutes. When a repeated attempt to allocate one of these objects fails, log it at the debug log level, to reduce noise in the logs.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
Force-pushed 456c2ec to 513d028
Looks good to me
Do we want to handle the potentially impossible case in which we don't get a commit? E.g. we receive a commit (and it turns out that it freed up an IP address), we're above the 5-minute limit so we don't retry, and no other commit comes after, so we never allocate the task.
I think that's a very good point. I had considered this but didn't want to add too much complexity, especially because I think this should be backported. Do you think it's a good idea to add a timer that triggers after 5 minutes if no commits happen during that interval?
I think it's such a rare case that we may not need to bother... I guess it depends if the fix would be extremely tiny? Can this simply be another switch case with a time.After?
Or a timer that we reset every time we receive a commit
Or maybe we shouldn't bother :) This is going to be so rare that the code to handle this case may be buggy and we'll never notice
Yeah, let's not bother. I liked the suggestion of adding a
LGTM
Instead of retrying unallocated tasks, services, and networks every time data changes in the store, limit these retries to every 5 minutes.
When a repeated attempt to allocate one of these objects fails, log it at the debug log level, to reduce noise in the logs.
cc @alexmavr @yongtang @aboch