Query Frontend: Job weights #4076

Merged
javiermolinar merged 23 commits into grafana:main from joe-elliott:job-weights
Oct 11, 2024
Collaborator

@joe-elliott joe-elliott commented Sep 12, 2024

What this PR does:
The query frontend treats all jobs as the same size when it farms them out to the queriers. This can cause querier instability b/c some jobs actually require quite a bit more resources to execute. By assigning weights to jobs we can reduce the amount of work each querier is asked to do at once, which will hopefully:

  1. reduce querier OOMs/timeouts/retries
  2. reduce querier latency
  3. increase total throughput

Other changes

  • Removed the roundtripper httpgrpc bridge and pushed the concept of pipeline.Request all the way down into the cortex frontend code. This can be a nice perf improvement b/c translating http -> httpgrpc is costly and we are pushing it to the last moment. Currently for some queries we are translating thousands of jobs and then throwing them away.
  • Removed redundant parseQuery and createFetchSpansRequest to consolidate on the Compile function in pkg/traceql
  • Check for context error before going through retry logic in retryWare. This causes retry metrics to be more accurate in the event of many cancelled jobs.

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Signed-off-by: Joe Elliott <number101010@gmail.com>
Comment thread modules/frontend/queue/queue.go Outdated
}
totalWeight += weight

if totalWeight >= requestedCount {
Contributor

I think this makes sense. I suppose what we're saying here is that we request of this batch a certain high water mark of work that we're willing to take, and the weight increases the notion of complexity for a single item above this threshold. Implicitly here I suppose is that weight and requestedCount are of the same unit of measure.

Collaborator Author

> Implicitly here I suppose is that weight and requestedCount are of the same unit of measure.

yes! currently all jobs fill a single "slot" in the batch. the "weight" is basically just making it fill more slots.
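A minimal sketch of the "slots" idea discussed in this thread, assuming a simple in-memory slice as the queue. The `job` type and `takeBatch` function are hypothetical and are not the code in modules/frontend/queue/queue.go; only the `totalWeight >= requestedCount` check mirrors the diff above.

```go
package main

import "fmt"

// job is a hypothetical stand-in for a queued request.
// Weight is the number of batch slots the job fills.
type job struct {
	Name   string
	Weight int
}

// takeBatch takes jobs from the front of queue until their combined weight
// reaches requestedCount or the queue is empty. It returns the batch and
// the jobs left in the queue.
func takeBatch(queue []job, requestedCount int) (batch, rest []job) {
	if requestedCount <= 0 {
		return nil, queue
	}
	totalWeight := 0
	for i, j := range queue {
		weight := j.Weight
		if weight < 1 {
			weight = 1 // assumption: an unweighted job fills exactly one slot
		}
		batch = append(batch, j)
		totalWeight += weight

		// Same unit of measure: weight counts slots, requestedCount is the
		// number of slots the querier asked for.
		if totalWeight >= requestedCount {
			return batch, queue[i+1:]
		}
	}
	return batch, nil
}

func main() {
	queue := []job{{"a", 1}, {"b", 3}, {"c", 1}, {"d", 1}}
	batch, rest := takeBatch(queue, 4)
	fmt.Println(len(batch), len(rest))
}
```

Here the querier asks for 4 slots and receives only jobs a and b, because b alone fills three of them. If every job had weight 1, the same request would return all four jobs.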

Comment thread modules/frontend/weights/weights.go Outdated
}
}

if conditions > 4 { // yay, magic!
Contributor

A fine starting point. I was wondering if each condition is weight++, and maybe each regex is weight+2 or some such. It means for the queue logic that if any condition is present, we'll never consume the entire requested batch. 🤔
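The additive scheme floated in this comment could look roughly like the sketch below. It illustrates the reviewer's suggestion only, not the heuristic that was merged: the `condition` type, the base weight of 1, and the +1/+2 increments are all assumptions made for the example.

```go
package main

import "fmt"

// condition is a hypothetical, simplified view of one query condition.
// Op holds the comparison operator, e.g. "=", ">" or the regex match "=~".
type condition struct {
	Op string
}

// weightFor returns an illustrative job weight: a base of 1, plus 1 for each
// plain condition and plus 2 for each regex condition.
func weightFor(conds []condition) int {
	weight := 1 // assumption: a job with no conditions fills one slot
	for _, c := range conds {
		if c.Op == "=~" || c.Op == "!~" {
			weight += 2 // regex matches are assumed to cost more
		} else {
			weight++
		}
	}
	return weight
}

func main() {
	conds := []condition{{Op: "="}, {Op: "=~"}, {Op: ">"}}
	fmt.Println(weightFor(conds))
}
```

Under this scheme any job with at least one condition weighs 2 or more, so a request for N slots always returns fewer than N such jobs. That is the side effect the comment points out.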

Comment thread modules/frontend/weights/weights.go Outdated
Comment thread modules/frontend/transport/roundtripper.go
Comment thread modules/frontend/pipeline/async_weight_middleware.go Outdated
@javiermolinar javiermolinar marked this pull request as ready for review October 9, 2024 15:04
Comment thread modules/frontend/config.go
Comment thread modules/frontend/pipeline/pipeline.go
Comment thread modules/frontend/metrics_query_range_sharder.go Outdated
Comment thread modules/frontend/pipeline/async_weight_middleware.go Outdated
Comment thread modules/frontend/pipeline/pipeline.go Outdated
@javiermolinar javiermolinar self-requested a review October 11, 2024 14:47
3 participants