filter out stale spans from metrics generator #1612
yvrhdn merged 11 commits into grafana:main from ie-pham:jennie/1537
Conversation
yvrhdn
left a comment
This looks good! I've left a bunch of comments, no major stuff just some suggestions/nitpicks.
@@ -261,14 +266,27 @@ func (i *instance) pushSpans(ctx context.Context, req *tempopb.PushSpansRequest)
func (i *instance) updatePushMetrics(req *tempopb.PushSpansRequest) {
I suggest renaming this method since its scope has changed. This method used to only read the spans, but now it will also modify the received request. Maybe change it to something like preprocessSpans? (Better suggestions are welcome.)
agree that the function needs to be renamed if it's going to mutate the contents of the push request.
Side question. Currently the processor interface takes a complete PushSpansRequest. Can we change that to taking individual spans? Then we could avoid the potentially costly slice reallocations being done here.
> Side question. Currently the processor interface takes a complete PushSpansRequest. Can we change that to taking individual spans? Then we could avoid the potentially costly slice reallocations being done here.
Yeah, that should be possible. Both the span metrics and the service graphs processor loop through the batches anyway and they process spans one by one.
It will be a bit tricky how we deal with the resource attributes though. We currently extract some data out of the resource attributes before looping through the instrumentation library spans individually.
Oh, that is kind of gross. We could nil out the span pointers and make it an expectation of the processors that some spans may be nil.
I think ideally we leave the processors alone and do some clever in place manipulation of the slices to remove the "bad" spans. This logic could get rather gross though.
I could filter out the outdated spans inside the aggregateMetricsForSpan function and the consume function for service graphs. But then we wouldn't be keeping track of the number of spans dropped in this situation.
But that would require every processor to implement the same logic, which would lead to duplicated work and code. I'd be interested to see how many spans we drop in practice; if only a small number of batches have to be reallocated, the impact will be fine. If we are constantly dropping spans that are too old we might have to re-evaluate.
If we need to get rid of these reallocations we could change the interface of Processor so it passes the resource attributes of the batch next to each span.
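An interface along the lines proposed here might look like the sketch below. It is hypothetical: `SpanProcessor`, `Resource`, and the method names are illustrative stand-ins, not Tempo's actual API. The idea is that each span arrives together with its batch's resource attributes, so the caller can skip stale spans without reallocating any slices:

```go
package main

import "fmt"

// Resource and Span are stand-ins for the OTLP types.
type Resource struct{ ServiceName string }
type Span struct{ Name string }

// SpanProcessor is a hypothetical alternative to a processor that
// receives a whole PushSpansRequest: the caller iterates batches
// itself and hands each span over alongside its batch's resource.
type SpanProcessor interface {
	ProcessSpan(res *Resource, span *Span)
}

// countingProcessor is a minimal implementation used for the demo.
type countingProcessor struct{ seen int }

func (p *countingProcessor) ProcessSpan(res *Resource, span *Span) {
	p.seen++
	fmt.Printf("%s: %s\n", res.ServiceName, span.Name)
}

func main() {
	p := &countingProcessor{}
	res := &Resource{ServiceName: "checkout"}
	for _, s := range []*Span{{Name: "GET /cart"}, {Name: "POST /pay"}} {
		// The caller could drop stale spans here, before this call,
		// without touching the underlying batch slices.
		p.ProcessSpan(res, s)
	}
	fmt.Println(p.seen) // 2
}
```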
}, []string{"tenant"})
metricSpansDiscarded = promauto.NewCounterVec(prometheus.CounterOpts{
	Namespace: "tempo",
	Name:      "metrics_generator_spans_discarded_total",
I'd consider renaming to metrics_generator_discarded_spans_total just to make it similar to this other metric: https://github.com/grafana/tempo/blob/main/modules/overrides/discarded_spans.go#L12 Not a big deal though, both should show up in grafana 🤷🏻
I think it's good to have separate metrics since a span discarded in the metrics-generator is very different from a span discarded in the ingester/compactor.
Hmm, should we make it similar to the other discarded-span metric name, or should we keep it similar to the other metrics in the same space?
https://github.com/grafana/tempo/blob/main/modules/generator/instance.go#L39
Oh I see 🙃 Err, either is fine I guess? Maybe a slight preference for keeping it consistent with the other tempo_metrics_generator_ metrics then.
Naming is hard 😅
// setting default for max span age before discarding to 30 sec
cfg.MaxSpanAge = 30
I'm curious how this default behaves in practice. I honestly have no clue what a typical latency is between span creation and ingestion by Tempo.


These are the ingestion latency numbers on ops for a few days. Do we think setting it at 30s is a right or is it too aggressive? @kvrhdn @joe-elliott
30s as a default looks good to me. It seems this should include 99% of the data while excluding the 1% that is lagging behind.
It's also configurable, so other people can switch it up.
var newSpansArr []*v1.Span
timeNow := time.Now().UnixNano()
for _, span := range ils.Spans {
	if span.EndTimeUnixNano >= uint64(timeNow-i.cfg.MaxSpanAge*1000000000) {
both sides of this time range need to be checked. If the user sends a span that's 5 days in the future it should not impact metrics.
we have similar code in the wal: tempo/tempodb/wal/append_block.go, line 235 in 5c885f3
Is there a reason why we pick 5 days?
I think 5 days was just an example. I think we can start with a symmetrical time range, i.e. use the same duration before and after time.Now(). If this doesn't work well we could still break it out into two config options.
knylander-grafana
left a comment
Doc updates look good.
What this PR does: This PR adds a configurable variable under the metrics generator to filter out any span that is older than metrics_ingestion_time_range_slack before metrics are aggregated. The current default is 30s.
Which issue(s) this PR fixes:
Fixes #1537
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]