[metrics-generator] filter out spans based on policy #2274
zalegrala merged 44 commits into grafana:main
Conversation
Thank you for adding documentation!
knylander-grafana left a comment
Docs look good! Thank you for adding them.
spanMetricsCallsTotal        registry.Counter
spanMetricsDurationSeconds   registry.Histogram
spanMetricsSizeTotal         registry.Counter
spanMetricsFilterDropsTotal  registry.Counter
any particular reason we're pushing this and not just recording it as a normal prometheus metric in tempo?
Nope, good call out.
I spoke too soon. For a generator instance, we don't currently have the tenant ID available, so creating a metric that includes the tenant label isn't feasible without a bit of refactoring. As written, this pushes the metric to the remote endpoint and would increase the series count, so perhaps it isn't something we want to do. If we instead host the metric on the generator itself, we'd know which tenants are filtering which spans, but that likely wouldn't have the desired value for folks who don't have access to those metrics. I.e., if a cloud user has a filter, they wouldn't be able to see how many spans are being rejected by it, which seems like the primary utility. I'm a little torn on even including this, but it does seem like we want some indication that spans are being filtered out.
yeah, i really think we shouldn't be pushing this using remote write. the metrics we push are the generated ones. this describes the operational state of tempo which would just publish normally.
> if a cloud user has a filter, they wouldn't be able to see how many spans are being rejected by their filter, which seems like the primary utility. I'm a little torn on even including this, but it does seem like we want some indication that the spans are being filtered out.
we can push this back through billing and expose it to the end user. i think we may just need to push the tenant id down into the instance. alternatively you can push a counter metric into the instance that already has the tenant id configured
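For context, a minimal sketch of what the "publish normally" alternative could look like: a plain Prometheus counter exposed on Tempo's own /metrics endpoint with a tenant label, rather than a remote-written series. The metric and function names here are illustrative, not what the PR ultimately used:

```go
package spanmetrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Hypothetical operational metric: spans dropped by filter policies,
// labeled by tenant, exposed on Tempo's own /metrics endpoint rather
// than remote-written alongside the generated series.
var metricFilteredSpans = promauto.NewCounterVec(prometheus.CounterOpts{
	Namespace: "tempo",
	Subsystem: "metrics_generator",
	Name:      "filtered_spans_total", // illustrative name
	Help:      "Number of spans dropped by filter policies.",
}, []string{"tenant"})

// onSpanFiltered would be called by the processor when a policy drops
// a span; it assumes the tenant ID has been plumbed into the instance.
func onSpanFiltered(tenantID string) {
	metricFilteredSpans.WithLabelValues(tenantID).Inc()
}
```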
}

for _, policy := range p.filterPolicies {
	if policy.Include != nil {
nit: could move the policy.include check into policyMatch() which would clean this up.
I'm not sure what policyMatch() should do in the case of a nil policy. I agree it would be a little cleaner, but include is the inverse of exclude here, so for example returning a true value from policyMatch() isn't as clean. Perhaps I can make exclude and include behave the same here, and then make the suggested adjustment.
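To illustrate the shape being discussed, here is one way include and exclude could be made symmetric so the nil checks collapse into a single pattern. This is a rough sketch with assumed type names and signatures, not the code the PR settled on:

```go
package spanfilter

// PolicyMatch holds the match criteria for one rule (match type plus
// the attributes/intrinsics to compare); details elided in this sketch.
type PolicyMatch struct{}

// FilterPolicy pairs an optional include rule with an optional exclude
// rule, mirroring the OTEL collector filterspan shape.
type FilterPolicy struct {
	Include *PolicyMatch
	Exclude *PolicyMatch
}

// policyMatch reports whether the span satisfies the rule; the real
// implementation compares span attributes and intrinsics.
func policyMatch(rule *PolicyMatch, span any) bool { return true }

// applyFilterPolicies keeps the nil checks in one place: a nil rule is
// simply skipped, so include and exclude read symmetrically.
func applyFilterPolicies(policies []FilterPolicy, span any) bool {
	for _, policy := range policies {
		if policy.Include != nil && !policyMatch(policy.Include, span) {
			return false // failed an include rule: drop the span
		}
		if policy.Exclude != nil && policyMatch(policy.Exclude, span) {
			return false // matched an exclude rule: drop the span
		}
	}
	return true // kept
}
```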
joe-elliott left a comment
one broader question that just occurred to me: should all of this logic apply to both span metrics and service graph metrics? should we move this up a level?
Thanks for the review @joe-elliott. If we move this up a level, do you suppose the processors should have an independent filter config, or share one? Sharing one would be simpler and likely more performant, but I wonder if folks would want to tune them independently.
In lieu of a specific idea about how widely to apply the filtering, I'm going to restructure the span filtering into …
joe-elliott left a comment
Really liking the refactor to pull out the filters. Some thoughts, but this is close.
	}
	matches++
case traceql.IntrinsicKind:
	if !stringMatch(policy.MatchType, span.GetKind().String(), pa.Value.(string)) {
any way we can do int matches here and on status? would be much faster
Probably yes, but it might be a little clumsy, since we'd need to take this as an int in the config I think. There is a special yaml struct tag we could use iirc to parse a string as an int, or perhaps some custom unmarshaling. For now I'm inclined to leave it and come back to it later.
I think I can work this. I'll take a closer look next week.
I've added a commit for this. It made some parts less readable, but seems okay to me. How's that look?
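A minimal sketch of the custom-unmarshaling idea floated above: accept the span kind as a string in YAML but store the OTLP enum's integer value, so the hot path compares ints. The type and map names are hypothetical; this assumes gopkg.in/yaml.v2, which invokes UnmarshalYAML with a decode callback:

```go
package spanfilter

import "fmt"

// kindValues maps the YAML string form to the OTLP SpanKind enum
// values, so matching can compare ints instead of strings.
var kindValues = map[string]int32{
	"SPAN_KIND_UNSPECIFIED": 0,
	"SPAN_KIND_INTERNAL":    1,
	"SPAN_KIND_SERVER":      2,
	"SPAN_KIND_CLIENT":      3,
	"SPAN_KIND_PRODUCER":    4,
	"SPAN_KIND_CONSUMER":    5,
}

// SpanKindValue is a hypothetical config field that unmarshals a
// string from YAML into the integer enum value.
type SpanKindValue int32

// UnmarshalYAML satisfies gopkg.in/yaml.v2's Unmarshaler interface.
func (k *SpanKindValue) UnmarshalYAML(unmarshal func(interface{}) error) error {
	var s string
	if err := unmarshal(&s); err != nil {
		return err
	}
	v, ok := kindValues[s]
	if !ok {
		return fmt.Errorf("unknown span kind %q", s)
	}
	*k = SpanKindValue(v)
	return nil
}
```

With something like this, a config value such as SPAN_KIND_SERVER decodes once at load time, and per-span matching becomes a plain integer comparison rather than calling span.GetKind().String() on every span.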
What this PR does:
This PR implements an approach to filtering out spans based on a policy, loosely modeled on the OTEL collector filterspan config format.
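For a sense of the config surface, here is a rough Go sketch of a filterspan-style policy structure: include/exclude rules with a match type and an attribute list. Field and type names approximate the OTEL collector layout and aren't necessarily the exact schema merged here:

```go
package spanfilter

// Approximate Go shape of a filterspan-style policy config.
//
// Example (illustrative) YAML:
//   filter_policies:
//     - exclude:
//         match_type: regex
//         attributes:
//           - key: resource.service.name
//             value: unknown_service.*

// MatchPolicyAttribute is one key/value pair a span is compared against.
type MatchPolicyAttribute struct {
	Key   string      `yaml:"key"`
	Value interface{} `yaml:"value"`
}

// PolicyMatch groups the attributes for one rule under a match type.
type PolicyMatch struct {
	MatchType  string                 `yaml:"match_type"` // "strict" or "regex"
	Attributes []MatchPolicyAttribute `yaml:"attributes"`
}

// FilterPolicy pairs an optional include rule with an optional exclude rule.
type FilterPolicy struct {
	Include *PolicyMatch `yaml:"include"`
	Exclude *PolicyMatch `yaml:"exclude"`
}
```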
Which issue(s) this PR fixes:
Fixes #1482
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]