Vulture: fix for query_end_cutoff#7018

Merged
ruslan-mikhailov merged 2 commits into grafana:main from ruslan-mikhailov:bugfix/vulture-step
Apr 21, 2026

Conversation

ruslan-mikhailov (Contributor) commented Apr 21, 2026

What this PR does: it fixes the vulture metrics check when query_end_cutoff is enabled. With step=1m, for traces written close to longWriteBackoff (default: 1 minute), the request hits the cutoff, which removes the last bucket. To solve the problem, I reduced the step duration to 10 seconds.
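
A minimal sketch of the bucket arithmetic, assuming a hypothetical 30s query_end_cutoff over the 1h query window vulture uses (the cutoff value and the helper are illustrative, not Tempo's actual frontend code):

	package main

	import (
		"fmt"
		"time"
	)

	// bucketsAfterCutoff returns how many step-sized buckets of the query
	// window survive once the query end is pulled back by the cutoff.
	func bucketsAfterCutoff(window, step, cutoff time.Duration) int {
		return int((window - cutoff) / step)
	}

	func main() {
		window := time.Hour
		cutoff := 30 * time.Second // hypothetical query_end_cutoff value

		for _, step := range []time.Duration{time.Minute, 10 * time.Second} {
			fmt.Printf("step=%v: %d of %d buckets survive\n",
				step, bucketsAfterCutoff(window, step, cutoff), int(window/step))
		}
		// step=1m0s: 59 of 60 buckets survive -> the newest bucket is gone,
		// and that is exactly where a trace near longWriteBackoff lands.
		// step=10s: 357 of 360 buckets survive -> only the last 30s are lost,
		// so a trace written ~1 minute ago still falls in a surviving bucket.
	}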

I ran it for some time to confirm it solves the problem:
[screenshot confirming the fix]

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Comment thread: cmd/tempo-vulture/main.go

	resp, err := client.MetricsQueryRange(
		fmt.Sprintf(`{.%s = "%s"} | count_over_time()`, attr.Key, util.StringifyAnyValue(attr.Value)),
-		start, end, "1m", 0,
+		start, end, "10s", 0,
ruslan-mikhailov (Contributor, Author) commented:

I was thinking of a config parameter, but wouldn't that make the configuration too complex? I think 10 seconds is a good middle ground.
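
For reference, a sketch of what the config-parameter route could look like, assuming the standard flag package vulture already uses; the flag name tempo-metrics-query-step is hypothetical, not an existing vulture flag:

	import (
		"flag"
		"time"
	)

	// tempoQueryStep would let operators tune the metrics query resolution;
	// the default stays below query_end_cutoff so the newest bucket survives.
	var tempoQueryStep = flag.Duration("tempo-metrics-query-step", 10*time.Second,
		"step for the TraceQL metrics range query")

The call site would then pass tempoQueryStep.String() (e.g. "10s") as the step argument to MetricsQueryRange.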

Copilot AI (Contributor) left a comment

Pull request overview

Adjusts tempo-vulture’s TraceQL metrics range query resolution to avoid false negatives when Tempo’s query_end_cutoff truncates the most recent bucket, particularly for traces near the long write backoff window.

Changes:

  • Change MetricsQueryRange step from 1m to 10s in vulture’s metrics checker.

Comment thread: cmd/tempo-vulture/main.go
Comment on lines 695 to 698

	resp, err := client.MetricsQueryRange(
		fmt.Sprintf(`{.%s = "%s"} | count_over_time()`, attr.Key, util.StringifyAnyValue(attr.Value)),
-		start, end, "1m", 0,
+		start, end, "10s", 0,
	)

Copilot AI commented Apr 21, 2026

Changing the metrics query step from 1m to 10s increases the number of samples returned per call by ~6x (for the 1h window used here), and vulture runs this on a ticker (default 10s). That can noticeably increase query load/response sizes in environments with many vulture instances. Would it make sense to (a) derive the step from a config/flag (or at least a named constant with a short rationale about query_end_cutoff), or (b) avoid a dense range query by switching this check to an instant query / smaller time window so the cutoff workaround doesn’t translate into extra steady-state load?
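
A sketch of the named-constant option from (a), assuming the call shape shown in the diff above; metricsQueryStep and its comment are illustrative, not code from this PR:

	// metricsQueryStep is kept well below query_end_cutoff so the most recent
	// bucket is not truncated away; with a 1m step, traces written close to
	// longWriteBackoff landed in the bucket the cutoff removed.
	const metricsQueryStep = "10s"

	resp, err := client.MetricsQueryRange(
		fmt.Sprintf(`{.%s = "%s"} | count_over_time()`, attr.Key, util.StringifyAnyValue(attr.Value)),
		start, end, metricsQueryStep, 0,
	)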

ruslan-mikhailov merged commit 2e87d43 into grafana:main on Apr 21, 2026
27 checks passed