Feature: Distributor usage trackers#4162
Conversation
|
Should we create a doc issue for this when it's ready? |
joe-elliott
left a comment
There was a problem hiding this comment.
core code looks great. have some Qs but mainly just needs changelog and docs
I'm happy to add docs in this PR if we want. At the minimum I will update the config and url sections. |
|
Pushed some changes to configure via |
…f the previous span if they were missing values
…ys having a value
…e manifest and config index
|
Pushed new behavior for overflow and missing labels:
These two changes ensure that the output metrics always have the expected labels, and are consistent across edge cases. |
| return &tempopb.PushResponse{}, nil | ||
| } | ||
| // check limits | ||
| // todo - usage tracker include discarded bytes? |
There was a problem hiding this comment.
i believe the answer is no? we don't want it to included discarded?
There was a problem hiding this comment.
To clarify this TODO is about creating separate metrics for discards.
tempo_usage_tracker_bytes_received_total vs
tempo_usage_tracker_bytes_discarded_total
I think for now let's proceed without it. We can add it later if needed.
joe-elliott
left a comment
There was a problem hiding this comment.
partway done. will review the Observe method next
joe-elliott
left a comment
There was a problem hiding this comment.
final thoughts. overall lgtm just had some Qs. i'll drop a approval whenever you're ready
| // to rehash, but couldn't figure out a better way for now. | ||
| // The difficulty is tracking bucket dirty status while | ||
| // resetting to batch values and recording the span values. | ||
| if bucket == nil || !slices.Equal(buffer2, last) { |
There was a problem hiding this comment.
i feel like the normal case here is that people will configure resource level attributes only so we will almost always use the previously grabbed bucket.
the best idea i have to dodge slices.Equal is if we configured each dimension with resource or span level. then we would only search those attributes that could match and we'd know if we needed to look in span level at all.
that might not work with the broader goals of the project though if we intend people to specify attribute names only w/o the overhead of knowing its span or resource.
There was a problem hiding this comment.
Yes we discussed some of this in: #4162 (comment) For now I'd like to leave the configuration as-is (scopeless) because of the broader goals of the project and also it's similar to how dimensions for the metrics-generator are configured.
* First working draft of cost attribution usage tracker * Add missing tracker name label, more efficient batch proportioning, cleanup * Reduce series hashing * Fix user-configurable overrides tests for new json element * lint * Add per-tenant override for max cardinality * lint, review feedback * Default to not enabled, cleanup test config * Explicitly check for usage_metrics handler * review feedback * Update tracker to support many-to-one mapping with relabel * lint * New behavior for missing and overflow * Fix issue where subsequent spans would incorrectly reuse the series of the previous span if they were missing values * Revert maps back to slices now that we can depend on a dimension always having a value * Please ignore benchmark profiles * Tweak config to have specific cost attribution tracker section. Update manifest and config index * lint * changelog * Update api docs for new endpoint * Review feedback * review feedback * Swap loop order for a tad more performance
What this PR does:
This a new feature that allows a tenant to accurately track the amount of ingested traffic by a set of custom labels. It's similar to the existing
traces_spanmetrics_size_totalmetric created by the generators but improves on it in some key ways.** Need **
The core need is to export a set of highly accurate metrics on ingested traffic that tenants can use for cost-attribution. This means that every ingested byte can be attributed to something, i.e. a team or department, so that tracing costs can be reconciled. Any attribute in the tracing data can be used.
** Reasons for a new feature**:
service.name) and historical conventions (app). The per-tenant dimensions are a map of input->output. If multiple inputs correspond to the same output then they are added together.** Important concepts about this new feature **
/usage_metrics. This is so that they aren't mixed with the existing operational/metrics, because they are expected to have much higher cardinality and a different purpose.proto.Size(), and the series tracking contains buffering that enables it to be zero-allocation for existing series. If you run the benchmark you will see:BenchmarkUsageTracker 1995282 5904 ns/op 0 B/op 0 allocs/opTODOs
There is some remaining questions:1. Behavior for unconfigured tenants. Currently it's opt-in: ifoverrides.cost_attribution.dimensionsisn't set then the distributor records nothing. But maybe it makes sense to have default behavior here? Service name may be a good default.2. Behavior when hitting against max cardinality. Currently if we already have the max number of series, it records the data into the unlabeled series. But the existing series keep working. This means that it is not possible to assess the data quality after max cardinality is reached. Thinking about alternative behaviors here.Which issue(s) this PR fixes:
Fixes #
Checklist
CHANGELOG.mdupdated - the order of entries should be[CHANGE],[FEATURE],[ENHANCEMENT],[BUGFIX]