enforce max series for metrics queries #4525
Conversation
knylander-grafana
left a comment
Thank you for adding docs.
benchmark shows no noticeable difference with the added call to Length()/length()
require.NoError(t, tempo.WaitSumMetricsWithOptions(e2e.GreaterOrEqual(1), []string{"tempo_metrics_generator_processor_local_blocks_spans_total"}, e2e.WaitMissingMetrics))
require.NoError(t, tempo.WaitSumMetricsWithOptions(e2e.GreaterOrEqual(1), []string{"tempo_metrics_generator_processor_local_blocks_cut_blocks"}, e2e.WaitMissingMetrics))
query := "{} | rate() by (span:id)"
I love that this is our test case 💯
}
// if a limit is being enforced, honor the request if it is less than the limit
// else set it to max limit
if cfg.Metrics.Sharder.MaxResponseSeries > 0 && (qr.MaxSeries > uint32(cfg.Metrics.Sharder.MaxResponseSeries) || qr.MaxSeries == 0) {
Do the handlers need to do this too, when the sharder is already doing it? It looks like doing it in the one place in the sharder should work.
the handler needs it to pass it to the combiner
Ah ok. Let's still consolidate it to 1 function?
i think it only needs to be done here? the sharder can assume that the value is set correctly?
although looking at search, it looks like we repeat the logic 2x: once in the sharder and once in the handler:
https://github.com/grafana/tempo/blob/main/modules/frontend/search_sharder.go#L82
https://github.com/grafana/tempo/blob/main/modules/frontend/search_handlers.go#L128
it seems like we should establish a better pattern and only do this once, but agree with marty that consolidating the logic in one func is the right choice
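To make the consolidation concrete, here is a minimal sketch of what a single shared clamp function could look like. The function name `clampMaxSeries` is hypothetical (not from the PR); the logic mirrors the quoted condition above: honor the requested value when it is below the configured cap, otherwise (or when the request is unset) fall back to the cap, with a cap of 0 disabling the limit.

```go
package main

import "fmt"

// clampMaxSeries is a hypothetical consolidation of the limit logic
// discussed above, callable from both the handler and the sharder.
// requested is the client's MaxSeries; configured is the
// MaxResponseSeries cap from config (0 = feature disabled).
func clampMaxSeries(requested, configured uint32) uint32 {
	if configured == 0 {
		return requested // limit disabled, pass the request through
	}
	if requested == 0 || requested > configured {
		return configured // unset or over-cap requests get the cap
	}
	return requested // under-cap requests are honored as-is
}

func main() {
	fmt.Println(clampMaxSeries(0, 1000))    // unset -> 1000
	fmt.Println(clampMaxSeries(500, 1000))  // under cap -> 500
	fmt.Println(clampMaxSeries(2000, 1000)) // over cap -> 1000
}
```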
}
jobEval.ObserveSeries(resp)
seriesCount := jobEval.Length()
syncAtomic.AddInt32(&totalSeriesCount, int32(seriesCount))
Can you clarify this part a bit - The atomic is used to skip remaining blocks when we exceed the limit, but should it be a Set instead of Add? It looks like if we have 100 complete blocks all returning the same 10 series, it can still hit the (default) 1000 limit.
Ah nice catch. The atomic should only be used to keep track of the raw eval for wal and head blocks; jobEval.Length() is already tracking the series for the completed blocks. The check should now be:
raw series count + jobEval.Length() > max series
Sorry I think I am missing something or there is still a change needed. If there are 2 wal blocks and they each result in 501 series, it would add to the atomic twice and consider the limit exceeded, even if it's the same series. Likewise if wal rawEval and complete jobEval each have (the same) 501 series, totalRawResultsSeriesCount + jobEval.Length would be considered above the limit. I think what we can do is run and check them both independently, skipping blocks of each type if the matching eval is exceeded. Instance.go already checks again when do the final combine. Would it be helpful to add func (m *MetricsEvaluator) Length() and we don't need the atomics at all?
updated to use atomic bool for maxSeriesReached which will be set to true if either rawEval.length() > maxSeries or jobEval.length() > maxSeries
What this PR does: Add config to enforce the maximum number of time series returned in a metrics query. This is enforced at 4 levels: front-end combiner, querier combiner, metrics-generator local blocks, and metrics evaluation. The configuration is set in the query-frontend config and is passed to all levels as maxSeries in the QueryRangeRequest proto.
New config: max_response_series (default: 1000)
Setting the value to 0 will disable this feature.
Approach: Keep track of the number of series by calling Length() until the limit is reached. Whatever series were generated up to that point are truncated at the limit and returned as partial results. This may mean that the partial response contains inaccurate values.
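A minimal sketch of the truncation behavior described above (function name and types are hypothetical, not the PR's actual code): once more series exist than the limit allows, keep only the first maxSeries entries and flag the response as partial.

```go
package main

import "fmt"

// truncateSeries is a hypothetical illustration of the described
// approach: cap the result set at maxSeries and report whether
// truncation happened. maxSeries == 0 disables the limit, matching
// the max_response_series config semantics above.
func truncateSeries(series []string, maxSeries int) ([]string, bool) {
	if maxSeries > 0 && len(series) > maxSeries {
		return series[:maxSeries], true // partial (possibly inaccurate) results
	}
	return series, false
}

func main() {
	s, partial := truncateSeries([]string{"a", "b", "c", "d"}, 2)
	fmt.Println(len(s), partial) // 2 true
}
```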
Which issue(s) this PR fixes:
Fixes #4219
Checklist
- CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]
- benchmarks