enforce max series for metrics queries #4525
Conversation
knylander-grafana
left a comment
Thank you for adding docs.
benchmark shows no noticeable difference with the added call to Length()/length()
require.NoError(t, tempo.WaitSumMetricsWithOptions(e2e.GreaterOrEqual(1), []string{"tempo_metrics_generator_processor_local_blocks_spans_total"}, e2e.WaitMissingMetrics))
require.NoError(t, tempo.WaitSumMetricsWithOptions(e2e.GreaterOrEqual(1), []string{"tempo_metrics_generator_processor_local_blocks_cut_blocks"}, e2e.WaitMissingMetrics))
query := "{} | rate() by (span:id)"
I love that this is our test case 💯
}
// if a limit is being enforced, honor the request if it is less than the limit
// else set it to max limit
if cfg.Metrics.Sharder.MaxResponseSeries > 0 && (qr.MaxSeries > uint32(cfg.Metrics.Sharder.MaxResponseSeries) || qr.MaxSeries == 0) {
Do the handlers need to do this too, when the sharder is already doing it? It looks like doing it in the one place in the sharder should work.
the handler needs it to pass it to the combiner
Ah ok. Let's still consolidate it to 1 function?
i think it only needs to be done here? the sharder can assume that the value is set correctly?
although looking at search, it looks like we repeat the logic 2x: once in the sharder and once in the handler:
https://github.com/grafana/tempo/blob/main/modules/frontend/search_sharder.go#L82
https://github.com/grafana/tempo/blob/main/modules/frontend/search_handlers.go#L128
it seems like we should establish a better pattern and only do this once, but agree with marty that consolidating the logic in one func is the right choice
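To make the consolidation concrete, here is a minimal sketch of what a single shared clamp function could look like. The function name `clampMaxSeries` is hypothetical (not from the PR); the logic mirrors the quoted condition above: honor the requested value when it is below the configured cap, otherwise (or when the request is unset) fall back to the cap, with a cap of 0 disabling the limit.

```go
package main

import "fmt"

// clampMaxSeries is a hypothetical consolidation of the limit logic
// discussed above, callable from both the handler and the sharder.
// requested is the client's MaxSeries; configured is the
// MaxResponseSeries cap from config (0 = feature disabled).
func clampMaxSeries(requested, configured uint32) uint32 {
	if configured == 0 {
		return requested // limit disabled, pass the request through
	}
	if requested == 0 || requested > configured {
		return configured // unset or over-cap requests get the cap
	}
	return requested // under-cap requests are honored as-is
}

func main() {
	fmt.Println(clampMaxSeries(0, 1000))    // unset -> 1000
	fmt.Println(clampMaxSeries(500, 1000))  // under cap -> 500
	fmt.Println(clampMaxSeries(2000, 1000)) // over cap -> 1000
}
```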
}
jobEval.ObserveSeries(resp)
seriesCount := jobEval.Length()
syncAtomic.AddInt32(&totalSeriesCount, int32(seriesCount))
Can you clarify this part a bit - The atomic is used to skip remaining blocks when we exceed the limit, but should it be a Set instead of Add? It looks like if we have 100 complete blocks all returning the same 10 series, it can still hit the (default) 1000 limit.
Ah nice catch. The atomic should only be used to keep track of the raw eval for wal and head blocks; jobEval.Length() is already tracking the series for the completed blocks. The check should now be:
raw series count + jobEval.Length() > max series
Sorry I think I am missing something or there is still a change needed. If there are 2 wal blocks and they each result in 501 series, it would add to the atomic twice and consider the limit exceeded, even if it's the same series. Likewise if wal rawEval and complete jobEval each have (the same) 501 series, totalRawResultsSeriesCount + jobEval.Length would be considered above the limit. I think what we can do is run and check them both independently, skipping blocks of each type if the matching eval is exceeded. Instance.go already checks again when do the final combine. Would it be helpful to add func (m *MetricsEvaluator) Length() and we don't need the atomics at all?
updated to use atomic bool for maxSeriesReached which will be set to true if either rawEval.length() > maxSeries or jobEval.length() > maxSeries
What this PR does: Add config to enforce the maximum number of time series returned in a metrics query. This is enforced at 4 levels: front-end combiner, querier combiner, metrics-generator local blocks, and metrics evaluation. The configuration is set in the query-frontend config and is passed to all levels as maxSeries in the QueryRangeRequest proto.
New config: max_response_series (default: 1000)
Setting the value to 0 will disable this feature.
Approach: Keep track of the number of series by calling Length() until the limit is reached. Whatever series were generated up to that point are truncated at the limit and returned as partial results. This may mean that the partial response contains inaccurate values.
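A minimal sketch of the truncation behavior described above (function name and types are hypothetical, not the PR's actual code): once more series exist than the limit allows, keep only the first maxSeries entries and flag the response as partial.

```go
package main

import "fmt"

// truncateSeries is a hypothetical illustration of the described
// approach: cap the result set at maxSeries and report whether
// truncation happened. maxSeries == 0 disables the limit, matching
// the max_response_series config semantics above.
func truncateSeries(series []string, maxSeries int) ([]string, bool) {
	if maxSeries > 0 && len(series) > maxSeries {
		return series[:maxSeries], true // partial (possibly inaccurate) results
	}
	return series, false
}

func main() {
	s, partial := truncateSeries([]string{"a", "b", "c", "d"}, 2)
	fmt.Println(len(s), partial) // 2 true
}
```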
Which issue(s) this PR fixes:
Fixes #4219
Checklist
- CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]
- benchmarks