
Frontend batching#2677

Merged
joe-elliott merged 22 commits into grafana:main from joe-elliott:frontend-batching on Jul 20, 2023

Conversation

Collaborator

@joe-elliott joe-elliott commented Jul 19, 2023

What this PR does:
Batches jobs in the requests from the query-frontend queue to the queriers. Previously, the frontend sent each job one at a time in an individual HTTP request. This PR adds a configurable parameter that allows the frontend to send more than one job per request.

Other changes:

  • Docs of course! Including an update to the search performance tuning doc with some more current information.
  • Adds a new histogram metric, tempo_query_frontend_actual_batch_size, to track the actual size of the batches farmed out to the queriers.
  • Better testing of the queues and frontend worker.
  • Added the ability for the querier to signal to the frontend the features it supports for seamless rollouts.

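For reference, the new setting would be enabled with something like the following query-frontend configuration. The key name `max_batch_size` and the value shown here are assumptions for illustration; check the docs added in this PR for the exact option name and default:

```yaml
query_frontend:
  # Hypothetical key name: the maximum number of jobs the frontend packs
  # into a single request to a querier. A value of 1 preserves the old
  # one-job-per-request behavior.
  max_batch_size: 5
```
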
Performance testing
The goal of this setup was to create a cluster that could execute the 36k jobs created by the test query simultaneously. This way, job throughput from frontend -> querier could be tested more directly.

  • 80 queriers
  • 500 jobs per querier
  • Total cluster capacity 40k jobs
  • No reliance on serverless

Results

| batch size | overall query latency | p99 job time in queue |
|------------|-----------------------|-----------------------|
| 1          | 8.5s                  | 4.9s                  |
| 2          | 7.6s                  | 2.4s                  |
| 5          | 6.7s                  | 1s                    |
| 10         | 9s                    | 4.4s                  |
| 1*         | 9.6s                  | 9s                    |

*current image

Overall latency for queries whose total jobs exceed total cluster capacity was not reduced as impressively, but this is a good step in the right direction.
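The batching idea on the frontend side can be sketched as a non-blocking drain of the job queue: take what is available, up to the batch size, and send it in one request. This is a simplified illustration with hypothetical names (`batchJobs` is not Tempo's actual request_batch.go implementation):

```go
package main

import "fmt"

// batchJobs drains up to maxBatch jobs from the queue channel without
// blocking: it keeps appending until the channel is momentarily empty
// or the batch is full, then returns whatever it gathered.
func batchJobs(queue <-chan string, maxBatch int) []string {
	batch := make([]string, 0, maxBatch)
	for len(batch) < maxBatch {
		select {
		case job, ok := <-queue:
			if !ok {
				return batch // queue closed
			}
			batch = append(batch, job)
		default:
			return batch // queue empty, send what we have
		}
	}
	return batch
}

func main() {
	queue := make(chan string, 10)
	for i := 0; i < 7; i++ {
		queue <- fmt.Sprintf("job-%d", i)
	}
	// With a max batch size of 5, the first dequeue yields 5 jobs
	// and the second yields the remaining 2.
	fmt.Println(len(batchJobs(queue, 5)), len(batchJobs(queue, 5)))
}
```

With a batch size of 1 this degenerates to the old one-request-per-job behavior, which is why the feature can roll out seamlessly.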

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Signed-off-by: Joe Elliott <number101010@gmail.com>
Contributor

@mdisibio mdisibio left a comment

This looks great, and I like the way it is controlled by querier features. A few small questions, but none blocking, so I will go ahead and approve.

Comment thread modules/frontend/queue/queue.go Outdated
Comment thread modules/frontend/v1/request_batch.go
Comment thread modules/querier/worker/frontend_processor.go Outdated
Comment thread modules/frontend/v1/frontend.go
Comment thread modules/frontend/v1/frontend.go Outdated
Co-authored-by: Martin Disibio <mdisibio@gmail.com>
Contributor

@zalegrala zalegrala left a comment

This looks pretty good to me; nice improvement. It will be interesting to see the results on the dashboard. I had a question about the context handling, but it is not blocking.

```go
// then error out this upstream request _and_ stream.
case err := <-errs:
	req.err <- err
	err = reportResponseUpstream(reqBatch, errs, resps)
```
Contributor


Do we have a context to pass? Wondering if it might simplify the context handling below.

Collaborator Author


If the streaming gRPC server connection itself drops or the context is cancelled, then .Send() returns an error and this case is hit:
https://github.com/grafana/tempo/pull/2677/files#diff-0914703aed52090bd72851004df203444207d9d48677c10860b0459afef1a0b9R311

If the request is cancelled upstream then this case is hit:
https://github.com/grafana/tempo/pull/2677/files#diff-0914703aed52090bd72851004df203444207d9d48677c10860b0459afef1a0b9R304

If the requests are cancelled downstream then we get an http response and this case is hit:
https://github.com/grafana/tempo/pull/2677/files#diff-0914703aed52090bd72851004df203444207d9d48677c10860b0459afef1a0b9R304

I think everything is covered.

Signed-off-by: Joe Elliott <number101010@gmail.com>