
Design proposal: metrics-generator#1206

Merged
yvrhdn merged 5 commits into grafana:main from yvrhdn:kvrhdn/metrics-generator-proposal
Apr 13, 2022

Conversation

@yvrhdn
Contributor

@yvrhdn yvrhdn commented Jan 5, 2022

What this PR does:
This is a design-proposal to generate metrics from trace data in Tempo. It describes adding a new optional component to Tempo: the metrics-generator.

Rendered preview (easier to read): 2022-01 Metrics-generator.md

This PR will remain open for several weeks to allow anyone (both maintainers and community members) to comment on this proposal. As suggestions are applied and we gather more insights this PR will be updated.
If you want to be notified of new developments related to this feature, subscribe to this PR.

This effort is being led by @mapno and @kvrhdn (that's me).

@thejosephstevens
Contributor

thejosephstevens commented Jan 6, 2022

This is really exciting! A couple of thoughts as a user: my initial reaction is that if the metrics-generator emits metrics for all spans and edges, you could blow your cardinality through the roof really quickly. Looking at the processors proposal, that sounds a bit like recording rules from Loki and Cortex to me. Is there any way this could be written to operate like that, with something resembling PromQL?

My guess is you could pretty quickly put together rules for global processing if you wanted, like this:

        - record: le_client_server:service_graph_request_server_seconds:99quantile
          expr: histogram_quantile(0.99, sum by(le, client, server) (rate(service_graph_request_server_seconds[1m])))

This could also allow users to select non-standard labels to aggregate on (for example, we run many different environments, so we would likely want to customize the metric as sum by(le, client, server, environment)).

But you could also opt to not produce metrics for all of your data if you don't care about all of it. Additionally you could perform pre-aggregation to get metrics you want without having to hit your metrics backend with all of that cardinality.

This would be even more exciting if it could then be managed by cortextool for rule management 😄

@yvrhdn
Contributor Author

yvrhdn commented Jan 7, 2022

This is really exciting! A couple of thoughts as a user: my initial reaction is that if the metrics-generator emits metrics for all spans and edges, you could blow your cardinality through the roof really quickly.

Yes, that's an issue we have already seen/experienced with the span metrics processor in the Grafana Agent. Since 1) Tempo as a centralized database will often ingest a huge amount of traces and 2) the person/team sending traces might not be the operator of the cluster, we can only expect this problem to reappear and be even worse.

Excessive cardinality is something we want to consider from the get-go. While this hasn't been described much in the proposal yet, the way we want to work on this is basically 1) get metrics working and 2) get those metrics under control. Ideally our first (experimental) release already exposes metrics to observe cardinality and some levers to control it.

I think we have two options to keep cardinality under control (and we should adopt both in some way imo):

  • Provide configuration options to tweak how the processors work. This should allow a user to, for instance, include/exclude spans from specific services, tweak which labels are set and configure the number of buckets to create. I think these settings will be mostly specific to each processor, but we might find a common superset.
    I can imagine a single Tempo tenant might be used by multiple teams/services, so it should be possible to only generate metrics for the team/service that wants that.

  • Add limits to, well, limit the number of series being created. We already have limits to protect the Tempo cluster against ingesting too much data; similarly we should add limits to protect the metrics database we use (or the person paying its bill).
    I'm not sure yet how we can implement this, but ideally we have a limit that goes something like "tenant X can generate max. 10k unique series per some interval". So if cardinality suddenly blows up, we drop metric series instead of exhausting the metrics database (a rough sketch of what such a limiter could look like follows below this list).
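
To make that second point a bit more concrete, here is a minimal Go sketch of such a per-tenant series limit. The names and the reset strategy are hypothetical; this is an illustration of the idea, not the actual design.

    package generator

    import "sync"

    // seriesLimiter tracks how many unique series a tenant has generated and
    // rejects new series once a configured maximum is reached. Illustration of
    // the "max N unique series per tenant" idea only; names are hypothetical.
    type seriesLimiter struct {
        mtx       sync.Mutex
        maxSeries int
        active    map[string]struct{} // keyed by the series' label fingerprint
    }

    func newSeriesLimiter(maxSeries int) *seriesLimiter {
        return &seriesLimiter{maxSeries: maxSeries, active: map[string]struct{}{}}
    }

    // allowSeries returns true if the series identified by fingerprint may be
    // created. Known series are always allowed; new series are allowed only
    // while the tenant is below its limit, otherwise they are dropped (and
    // should be counted in a "dropped series" metric so operators notice).
    func (l *seriesLimiter) allowSeries(fingerprint string) bool {
        l.mtx.Lock()
        defer l.mtx.Unlock()

        if _, ok := l.active[fingerprint]; ok {
            return true
        }
        if len(l.active) >= l.maxSeries {
            return false
        }
        l.active[fingerprint] = struct{}{}
        return true
    }

In practice the set of active series would also have to be reset or aged out periodically, so the limit behaves like "max N unique series per interval" rather than forever.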

I'll take some time next week to describe this in more detail in the proposal.

Looking at the processors proposal, that sounds a bit like recording rules from Loki and Cortex to me,

It does! In fact, we have been looking at the Loki ruler a lot: the way we generate metrics will be different (the ruler executes a LogQL query while we process the ingress stream), but the way we remote write metrics will be very similar. We also share the same multi-tenancy requirements.

💡 I think I should add a section to the doc comparing this with the ruler and describe what our setup will have in common and what the differences are.

is there any way this could be written to operate like that with something resembling PromQL?

My guess is you could pretty quickly put together rules for global processing if you wanted, like this:

        - record: le_client_server:service_graph_request_server_seconds:99quantile
          expr: histogram_quantile(0.99, sum by(le, client, server) (rate(service_graph_request_server_seconds[1m])))

This could also allow users to select non-standard labels to aggregate on (for example, we run many different environments, so we would likely want to customize the metric as sum by(le, client, server, environment)).

But you could also opt to not produce metrics for all of your data if you don't care about all of it. Additionally you could perform pre-aggregation to get metrics you want without having to hit your metrics backend with all of that cardinality.

I'm sensing two ideas here:

  1. The ability to generate metrics using a query language, just like the Cortex and Loki ruler do with PromQL/LogQL
  2. The ability to aggregate metrics before sending them off

1. Using a query language

We are currently working on a query language native to Tempo: TraceQL. This is still in the design phase, but the intention is to support metric queries. That is, you will be able to write a TraceQL query that returns one or more metric series (just like LogQL can).
When TraceQL is implemented, I think it definitely makes sense to add a ruler component to Tempo. I think we could evolve the metrics-generator to become a metrics-generator+ruler combo.

Still, I think there are a couple of advantages for building a metrics-generator versus a ruler:

  • I think realistically it will not be possible to express everything using a TraceQL query. For instance to render service graphs in Grafana we have to generate very specific metrics about pairs of spans bridging services. In this case it's just easier to build a dedicated processor which can be (partly) shared with the Grafana Agent.

  • The metrics-generator directly consumes the ingress stream, which is more efficient than a recording rule that queries the ingesters/backend. We are currently seeing that the load of a lot of search queries can slow down the ingesters, so I expect recording rules would have a similar effect. This is a tradeoff between flexibility and performance.

  • Finally, timing-wise we can deliver these features much faster with the metrics-generator than if we'd wait on TraceQL being production-ready.

2. Pre-aggregation

This is a super interesting suggestion and it would directly address the pain of dealing with high-cardinality metrics!

For this to work, I believe we would almost have to run a full Prometheus TSDB and query engine: if you want to execute a query operating on a range vector, we would have to buffer at least as much data as the range vector needs. As this introduces a lot of complexity I think we should keep this out of the proposal for now, but I would love to reconsider this once we get a basic metrics-generator going.
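
To illustrate why a range vector forces buffering, here is a small Go sketch of a simplified rate() over a window (it ignores counter resets and extrapolation, and the names are purely illustrative):

    package generator

    import "time"

    // sample is one (timestamp, value) observation for a single series.
    type sample struct {
        ts  time.Time
        val float64
    }

    // rateOverWindow is a simplified version of what rate(x[1m]) does. The
    // point of the example: evaluating a range vector requires every sample
    // inside the window to still be in memory for each series, which is the
    // kind of state a full TSDB and query engine would manage for us.
    func rateOverWindow(samples []sample, now time.Time, window time.Duration) float64 {
        var inWindow []sample
        for _, s := range samples {
            if s.ts.After(now.Add(-window)) && !s.ts.After(now) {
                inWindow = append(inWindow, s)
            }
        }
        if len(inWindow) < 2 {
            return 0
        }
        first, last := inWindow[0], inWindow[len(inWindow)-1]
        elapsed := last.ts.Sub(first.ts).Seconds()
        if elapsed <= 0 {
            return 0
        }
        return (last.val - first.val) / elapsed
    }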

I also have the reflex to push this responsibility to other components that are dedicated to working with metrics. I'm thinking of Prometheus or the Grafana Agent in this case: if they support aggregating metrics at ingest time, multiple components will be able to benefit from this, not just Tempo. Additionally, it will be easier for us to adopt a similar system after they have done the design and implementation work.

This would be even more exciting if it could then be managed by cortextool for rule management 😄

Yes, having a tool to manage configuration is definitely the intention! I've briefly described remote management under the Metrics processor configuration section. I have something in mind like cortextool, though I doubt we will be able to reuse cortextool as our configuration is likely to be different.
I haven't explored this as much yet because this is less of a priority for now, but you can rest assured that we often look at Cortex and Loki to pick up lessons learned. So when we add this we'll check whether cortextool fits our needs or whether we need to make our own tool (tempotool?).

I'll try to clarify this in the proposal as well 🙂

@thejosephstevens
Contributor

Sounds like this is really well thought out, makes a ton of sense to me!

@yvrhdn
Contributor Author

yvrhdn commented Jan 13, 2022

Just pushed an update; some of the changes include:

  • Add Table of Contents
  • Add comparison with Cortex/Loki ruler
  • Update metrics-generator diagram and its components
  • Some restructuring, improved wording etc.

Changes:
- Improve wording, fix typos etc.
- Clarified some sections with new insights
- Changed metric names to align closer with the Grafana Agent
- Flesh out the example configuration
@yvrhdn
Contributor Author

yvrhdn commented Feb 7, 2022

Hi all, I've pushed another update mostly tweaking some wording and clarifying some sections. At this point I think the proposal is complete and I'd like some final reviews 🙂

What's next?

Our goal is to implement the metrics-generator as described in this proposal by Tempo 1.4. IMO this should be a strong foundation for the metrics-generator going forward. Once we have this initial release, we can look into adding additional features like a WAL and a replication factor.

We have already been experimenting with this design on a separate branch (metrics-generator-dev); this has given us a lot of insight and hands-on experience, which in turn made the proposal stronger and more realistic. At this point we feel confident that this design will work and can be implemented by Tempo 1.4 (in 2-3 months).

We will be submitting a PR soon (this or next week) to merge everything into main; from then on it will be easier to follow along with the development. You will even be able to run it yourself if you are feeling adventurous 🙈

That said, I really don't see merging this proposal as the end of the discussion about what the metrics-generator can do and how it should work. I would love to continue this discussion as we actually build this component and we have more people try it out!

Collaborator

@joe-elliott joe-elliott left a comment


Overall excellent. This is looking good.

I think we should also mention that we currently use ingest time (and not span time) to calculate metrics, and describe ways to eventually use span time directly.


The distributor has to find metrics-generator instances present in the cluster. When multiple instances of the metrics-generator are running, the distributor should load balance writes across these instances. Load balancing has to be trace-aware: all spans with the same trace ID should consistently be sent to the same instance.

To achieve this we propose using the [dskit ring](https://github.com/grafana/dskit/tree/main/ring) backed by memberlist. This will be the same mechanism as used by the ingesters. The distributor will shard requests across metrics-generator instances based upon the tokens they own.
Collaborator


By default Tempo uses memberlist, but technically this should be configurable like all our other rings, i.e. if an operator wants to use Consul or etcd to store this ring then that should be possible.

But overall, +1 on the choice to use the same ring mechanism we use for sharding writes to the ingesters.

Contributor Author


Good point, I don't see why this wouldn't work with any other supported KV store. I mentioned memberlist because it's the default in Tempo. I'll clarify this.
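
To illustrate the trace-aware load balancing described in the quoted section, here is a tiny Go sketch. This is not the dskit ring API (the real implementation would resolve the owning instance through the ring, and the sharding key may well include more than just the trace ID); it only shows the property that matters: the shard key is derived from the trace ID, so all spans of a trace consistently land on the same metrics-generator instance.

    package distributor

    import "hash/fnv"

    // shardByTraceID picks a metrics-generator instance based only on the
    // trace ID, so every span of the same trace maps to the same instance.
    // Purely illustrative: a static slice plus modulo stands in for the ring,
    // and instances is assumed to be non-empty.
    func shardByTraceID(traceID []byte, instances []string) string {
        h := fnv.New32a()
        h.Write(traceID) // hash only the trace ID, not span attributes
        return instances[h.Sum32()%uint32(len(instances))]
    }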

// and processors. We can reduce the amount of data sent to the metrics-generator by trimming spans
// and their metadata in the distributor.
message PushSpansRequest {
// Note: these are full traces. For further optimisation we should consider using a slimmer span
Collaborator


+1

Comment thread docs/design-proposals/2022-01 Metrics-generator.md

Since the metrics-generator is directly in the write path, an increase in ingress will directly impact the metrics-generator. To reduce the amount of data sent from the distributor to the metrics-generator, the distributor should only send spans that are relevant for the configured metrics processors and tenants. If, for example, a processor only requires a subset of spans, the distributor should drop the irrelevant spans before sending them. This should allow the metrics-generator to scale at a slower rate than the ingesters and save bandwidth/processing time.

This will require that the distributor is aware of the tenants and processors configured in the metrics-generator. This configuration will thus have to be shared with both components.
Collaborator


Do the processors impact which spans would need to be sent? My impression is that if we enable the metrics-generator for a tenant, then all spans will be sent.

Contributor Author


There is another potential optimisation here: some processors don't need all spans and most processors don't need all tags of a span. So by knowing which processors are enabled and how they are configured, we can trim down the data we send.

Two examples (a rough sketch of the corresponding filter functions follows after them):

  • The service graphs processor only uses spans with SpanKind client or server. All other spans are ignored. If a tenant only uses the service graphs processor, we could drop all other spans. I'd estimate this would easily drop over half of the spans for a specific tenant.

  • The span metrics processor uses a limited set of tags as labels in the generated metrics. By default these are the span name, service name, span kind and status. So instead of sending the entire span (which might contain very large tags) we could send a minimal version with just these 4 tags.
    ⚠️ Issue: we want to make the dimensions used by the span metrics processor configurable. That is, a user might want to include tags like http.target as well. This setting will have to be shared with the distributor to ensure it doesn't drop this tag accidentally.
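
To make the filter-function idea concrete, here is a rough Go sketch with deliberately simplified types and hypothetical function names (the real code would operate on the actual OTLP span type):

    package distributor

    // span is a simplified stand-in for the real span type; only the fields
    // the example needs are modeled.
    type span struct {
        Kind       string            // e.g. "client", "server", "internal"
        Attributes map[string]string // span tags/attributes
    }

    // keepForServiceGraphs mirrors the first example: the service graphs
    // processor only looks at client and server spans, so everything else
    // could be dropped in the distributor before being sent to the
    // metrics-generator.
    func keepForServiceGraphs(s span) bool {
        return s.Kind == "client" || s.Kind == "server"
    }

    // trimForSpanMetrics mirrors the second example: instead of forwarding
    // every attribute, keep only the dimensions the span metrics processor is
    // configured to use for this tenant.
    func trimForSpanMetrics(s span, dimensions []string) span {
        trimmed := span{Kind: s.Kind, Attributes: map[string]string{}}
        for _, d := range dimensions {
            if v, ok := s.Attributes[d]; ok {
                trimmed.Attributes[d] = v
            }
        }
        return trimmed
    }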

Overall this optimisation seems promising, but the implementation is likely to be difficult to maintain and operate. We would have to:

  • Share most of the processor config with the distributor; this is necessary because the distributor has to know what to filter out. This will be tricky because some config lives in the metrics_generator config block and some in the runtime config. At some point we might also store user config in a bucket, similar to the rules in Cortex/Loki. That would mean the distributor needs to access this bucket as well.

  • For each processor, we would have to add a filter function. That is, each processor would have to tell the distributor whether a span should be kept or not. This would mean that the implementation of the processor gets spread out over both components, making it difficult to maintain.

I think it's a cool idea, but maybe we should keep it simple for now and just say: "the distributor has to know which tenants are included" and keep it at that.

Collaborator

@joe-elliott joe-elliott left a comment


Excellent work defining this new feature. Excited to see it in Tempo!

@yvrhdn
Contributor Author

yvrhdn commented Apr 13, 2022

We can merge this design doc already, but I plan to update it soon with some changes we made to the design in the run-up to Tempo 1.4.

@yvrhdn yvrhdn enabled auto-merge (squash) April 13, 2022 14:56
@joe-elliott
Collaborator

We can merge this design doc already, but I plan to update it soon with some changes we made to the design in the run-up to Tempo 1.4.

Whatever is easiest for you. I would like this merged before 1.4 since it has the feature :). Other than that don't care.

@yvrhdn yvrhdn merged commit 93196d9 into grafana:main Apr 13, 2022
@yvrhdn yvrhdn deleted the kvrhdn/metrics-generator-proposal branch April 13, 2022 15:58