filter out stale spans from metrics generator #1612
yvrhdn merged 11 commits into grafana:main from ie-pham:jennie/1537
Conversation
yvrhdn
left a comment
This looks good! I've left a bunch of comments, no major stuff just some suggestions/nitpicks.
@@ -261,14 +266,27 @@ func (i *instance) pushSpans(ctx context.Context, req *tempopb.PushSpansRequest)
func (i *instance) updatePushMetrics(req *tempopb.PushSpansRequest) {
I suggest renaming this method since its scope has changed. This method used to only read the spans, but now it will also modify the received request. Maybe change it to something like preprocessSpans? (Better suggestions are welcome.)
agree that the function needs to be renamed if it's going to mutate the contents of the push request.
Side question. Currently the processor interface takes a complete PushSpansRequest. Can we change that to taking individual spans? Then we could avoid the potentially costly slice reallocations being done here.
> Side question. Currently the processor interface takes a complete PushSpansRequest. Can we change that to taking individual spans? Then we could avoid the potentially costly slice reallocations being done here.
Yeah, that should be possible. Both the span metrics and the service graphs processor loop through the batches anyway and they process spans one by one.
It will be a bit tricky how we deal with the resource attributes though. We currently extract some data out of the resource attributes before looping through the instrumentation library spans individually.
Oh, that is kind of gross. We could nil out the span pointers and make it an expectation of the processors that some spans may be nil.
I think ideally we leave the processors alone and do some clever in place manipulation of the slices to remove the "bad" spans. This logic could get rather gross though.
I could filter out the outdated spans inside the aggregateMetricsForSpan function and the consume function for service graphs. But then we wouldn't be keeping track of the number of spans dropped in this situation.
But that would require every processor to implement the same logic, which would lead to duplicated work and code. I'd be interested to see how many spans we drop in practice; if only a small number of batches have to be reallocated, the impact will be fine. If we are constantly dropping spans that are too old we might have to re-evaluate.
If we need to get rid of these reallocations we could change the interface of Processor so it passes the resource attributes of the batch next to each span.
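An interface along the lines proposed here might look like the sketch below. It is hypothetical: `SpanProcessor`, `Resource`, and the method names are illustrative stand-ins, not Tempo's actual API. The idea is that each span arrives together with its batch's resource attributes, so the caller can skip stale spans without reallocating any slices:

```go
package main

import "fmt"

// Resource and Span are stand-ins for the OTLP types.
type Resource struct{ ServiceName string }
type Span struct{ Name string }

// SpanProcessor is a hypothetical alternative to a processor that
// receives a whole PushSpansRequest: the caller iterates batches
// itself and hands each span over alongside its batch's resource.
type SpanProcessor interface {
	ProcessSpan(res *Resource, span *Span)
}

// countingProcessor is a minimal implementation used for the demo.
type countingProcessor struct{ seen int }

func (p *countingProcessor) ProcessSpan(res *Resource, span *Span) {
	p.seen++
	fmt.Printf("%s: %s\n", res.ServiceName, span.Name)
}

func main() {
	p := &countingProcessor{}
	res := &Resource{ServiceName: "checkout"}
	for _, s := range []*Span{{Name: "GET /cart"}, {Name: "POST /pay"}} {
		// The caller could drop stale spans here, before this call,
		// without touching the underlying batch slices.
		p.ProcessSpan(res, s)
	}
	fmt.Println(p.seen) // 2
}
```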
}, []string{"tenant"})
metricSpansDiscarded = promauto.NewCounterVec(prometheus.CounterOpts{
	Namespace: "tempo",
	Name:      "metrics_generator_spans_discarded_total",
I'd consider renaming to metrics_generator_discarded_spans_total just to make it similar to this other metric: https://github.com/grafana/tempo/blob/main/modules/overrides/discarded_spans.go#L12 Not a big deal though, both should show up in grafana 🤷🏻
I think it's good to have separate metrics since a span discarded in the metrics-generator is very different from a span discarded in the ingester/compactor.
Hmm, should we make it similar to the other discarded-span metric name, or should we keep it similar to the other metrics in the same space?
https://github.com/grafana/tempo/blob/main/modules/generator/instance.go#L39
Oh I see 🙃 Err, either is fine I guess? Maybe a slight preference for keeping it consistent with the other tempo_metrics_generator_ metrics then.
Naming is hard 😅
// setting default for max span age before discarding to 30 sec
cfg.MaxSpanAge = 30
I'm curious how this default behaves in practice. I honestly have no clue what a typical latency is between span creation and ingestion by Tempo.


These are the ingestion latency numbers on ops for a few days. Do we think setting it at 30s is a right or is it too aggressive? @kvrhdn @joe-elliott
30s as a default looks good to me. It seems this should include 99% of the data while excluding the 1% that is lagging behind.
It's also configurable, so other people can switch it up.
var newSpansArr []*v1.Span
timeNow := time.Now().UnixNano()
for _, span := range ils.Spans {
	if span.EndTimeUnixNano >= uint64(timeNow-i.cfg.MaxSpanAge*1000000000) {
both sides of this time range need to be checked. If the user sends a span that's 5 days in the future it should not impact metrics.
we have similar code in the wal: tempo/tempodb/wal/append_block.go, line 235 in 5c885f3
Is there a reason why we pick 5 days?
I think 5 days was just an example. I think we can start with a symmetrical time range, i.e. use the same duration before and after time.Now(). If this doesn't work well we could still break it out into two config options.
knylander-grafana
left a comment
Doc updates look good.
What this PR does: This PR adds a configurable variable under the metrics generator to filter out any span that is older than metrics_ingestion_time_range_slack before metrics are aggregated. The current default is 30s.
Which issue(s) this PR fixes:
Fixes #1537
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]