Skip to content

[TraceQL] quantile_over_time 2 of 2 - engine#3633

Merged
mdisibio merged 18 commits intografana:mainfrom
mdisibio:quantile-engine
May 7, 2024
Merged

[TraceQL] quantile_over_time 2 of 2 - engine#3633
mdisibio merged 18 commits intografana:mainfrom
mdisibio:quantile-engine

Conversation

@mdisibio
Copy link
Copy Markdown
Contributor

@mdisibio mdisibio commented Apr 30, 2024

What this PR does:
Adds remaining engine and plumbing for quantile_over_time. To dig into a few things:

(1) Calculations and accuracy: This reuses the same approach from the metrics summary api, which creates a histogram based on powers-of-2 buckets, 2,4,8,16, etc for any int64 attribute (duration is considered in nanos). I think this is a good 90%-useful first approach for simplicity. Fairly good for continuous values like Duration, but less so for discrete values like http status code. Long-term this should move towards a "native histogram" approach, but didn't want to introduce that amount of complexity yet.

(2) This PR introduces different variations of a query for each module. I.e. quantile_over_time has to execute 3 different ways between the generator, queriers, and frontend. Previously rate() and count_over_time() were always basic sums, so the combiner in the frontend assumed addition. Not true anymore. The data flow is like this Generator does Spans -> partial histograms -> Querier sums partials across generators -> Frontend sums partials across jobs, and then -> computes final quantiles. Mimir and Loki achieve this through query rewrites example example. For now, this is taking a simpler approach where you pass the AggregateMode to CompileMetricsQuery and it returns a different implementation as needed.

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

return
}

// There is an issue where multiple conditions &&'ed on the same
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fixes a pre-existing unrelated bug. Queries asserting multiple conditions on the same attribute like span.http.status_code>=500 && span.http.status_code<600, which should only fetch spans between 500 and 599, actually fetches them all since the conditions are checked independently, and the second pass engine logic filters them to the correct output. It means in this case we can't optimize away the second pass like for other metrics queries. Ideally this is fixed in the Fetch layer and we bring the optimization back, but it wasn't trivial, so better to incur overhead instead of wrong results.

@mdisibio mdisibio marked this pull request as ready for review May 2, 2024 16:22
Comment thread modules/frontend/metrics_query_range_handler.go
Comment thread pkg/traceql/ast.go
mdisibio added 2 commits May 3, 2024 12:31
…log2, for several reasons: support non-log2 or native histogram buckets in the future, and where queriers on different versions may have different buckets during rollout
Start: 1,
End: uint64(10000 * time.Second),
Step: uint64(1 * time.Second),
Start: uint64(1100 * time.Second),
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test changes here are due to a change in response combining. Previously the combiner did simple addition by timestamp, so only the timestamps populated by jobs were returned. Now the new SimpleAdditionAggregator and HistogramAggregator start with a zero-slice for all expected time slots in the final request range. The tests had to be restricted to match the test blocks in the test module here, so the output was still readable (not 10,000 zeros). I think this change overall is good but open to discuss. The question boils down to: If you rate() over a single span in the last hour, should it return 1 data point, or fill in the zeros for the whole response?

@mdisibio mdisibio merged commit 8df9670 into grafana:main May 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants