Add optional anonymous Tempo usage reporting #1481

Merged
zalegrala merged 37 commits into grafana:main from zalegrala:usageStats
Jul 29, 2022

Contributor

@zalegrala zalegrala commented Jun 9, 2022

What this PR does:

This PR implements the Loki squad's approach to sending anonymous usage information to Grafana Labs, so that we can better understand how Tempo is used in the wild.

Fixes https://github.com/grafana/tempo-squad/issues/81
Related grafana/loki#5062

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@zalegrala zalegrala force-pushed the usageStats branch 2 times, most recently from 03da46a to 5d634ba Compare June 9, 2022 14:14
@zalegrala zalegrala changed the title Add Tempo usage stats Add optional anonymous Tempo usage reporting Jun 9, 2022
Comment thread pkg/usagestats/config.go
@zalegrala zalegrala marked this pull request as ready for review June 28, 2022 13:04
Comment thread modules/distributor/receiver/shim.go Outdated
Collaborator

@joe-elliott joe-elliott left a comment

some thoughts

Comment thread cmd/tempo/app/app.go Outdated
Comment thread cmd/tempo/app/modules.go
Comment thread docs/tempo/website/configuration/_index.md
Comment thread go.mod Outdated
Comment thread modules/distributor/receiver/shim.go Outdated
Comment thread modules/distributor/receiver/shim.go Outdated
case "opencensus":
receiverOpencensusStats.Set(1)
case "kafka":
receiverKafkaStats.Set(1)
Collaborator

default, log an error?

is there a way to associate a string label with this "stat"? so instead of a bunch of individual stats we could say receiverStats.Type(<name>).Set(1) or something?
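
To make both questions concrete, here is a minimal sketch of label-keyed stats with an explicit "unknown receiver" path. Everything in it (`stat`, `labeledStats`, `newLabeledStats`, `recordReceiver`) is hypothetical and written only for illustration; it is not the usagestats API, and the `Type` method here returns an extra `ok` value rather than chaining exactly like the call suggested above.

```go
package main

import (
	"fmt"
	"log"
)

// stat is a hypothetical stand-in for a single usage stat value.
type stat struct{ val int64 }

// Set stores v as the stat's current value.
func (s *stat) Set(v int64) { s.val = v }

// labeledStats keeps one stat per label, so callers do not need a
// separate package variable for every receiver type.
type labeledStats struct {
	known map[string]*stat
}

// newLabeledStats pre-registers the labels that may be reported.
func newLabeledStats(labels ...string) *labeledStats {
	ls := &labeledStats{known: make(map[string]*stat, len(labels))}
	for _, l := range labels {
		ls.known[l] = &stat{}
	}
	return ls
}

// Type returns the stat for name and whether name was pre-registered.
func (ls *labeledStats) Type(name string) (*stat, bool) {
	s, ok := ls.known[name]
	return s, ok
}

// recordReceiver marks a receiver as enabled. Unknown names take the
// "default" path: they are logged and not recorded.
func recordReceiver(ls *labeledStats, name string) bool {
	s, ok := ls.Type(name)
	if !ok {
		log.Printf("unknown receiver type %q, not recording usage stat", name)
		return false
	}
	s.Set(1)
	return true
}

func main() {
	rs := newLabeledStats("opencensus", "kafka")
	fmt.Println(recordReceiver(rs, "kafka"))
	fmt.Println(recordReceiver(rs, "not-a-receiver"))
}
```

Pre-registering the labels keeps the set of reported stats visible in one place, at the cost of having to extend that list whenever a receiver is added.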

Contributor

@mdisibio mdisibio Jun 30, 2022

> is there a way to associate a string label with this "stat"? so instead of a bunch of individual stats we could say receiverStats.Type(<name>).Set(1) or something?

Discussed offline a bit, and the idea here is to centralize the stats more concretely in the usagestats package. Instead of calling something like usagestats.NewInt("feature_enabled_search"), int.Set(1) throughout the code base, this could be usagestats.SetFeatureEnabledSearch(1). The SetFeatureEnabledSearch method would call NewInt/Set.

Couple different ways to do this, but that's the idea.

Thoughts?
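
A minimal sketch of the centralized idea, assuming a simplified stat type. The `Int` type, the registry, and this `NewInt` are local stand-ins written for illustration and are not the real usagestats implementation; only the names `NewInt`, `Set`, and `SetFeatureEnabledSearch` come from the discussion above.

```go
package main

import (
	"fmt"
	"sync"
)

// Int is a hypothetical stand-in for an integer usage stat.
type Int struct {
	mu  sync.Mutex
	val int64
}

// Set stores v as the current value.
func (i *Int) Set(v int64) {
	i.mu.Lock()
	defer i.mu.Unlock()
	i.val = v
}

// Value returns the current value.
func (i *Int) Value() int64 {
	i.mu.Lock()
	defer i.mu.Unlock()
	return i.val
}

var (
	registryMu sync.Mutex
	registry   = map[string]*Int{}
)

// NewInt returns the stat registered under name, creating it on first use.
func NewInt(name string) *Int {
	registryMu.Lock()
	defer registryMu.Unlock()
	if existing, ok := registry[name]; ok {
		return existing
	}
	created := &Int{}
	registry[name] = created
	return created
}

// SetFeatureEnabledSearch is the single named entry point for this stat.
// Callers elsewhere in the code base never see the string key.
func SetFeatureEnabledSearch(v int64) {
	NewInt("feature_enabled_search").Set(v)
}

func main() {
	SetFeatureEnabledSearch(1)
	fmt.Println(NewInt("feature_enabled_search").Value())
}
```

With this shape, the list of exported setters doubles as the list of everything that gets reported.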

Contributor Author

The current design of the usagestats package is such that it doesn't care at all about which stats are being set throughout the code base. Creating some helper methods in the package might help readability on the implementation side, but I'm thinking the separation of concerns here is somewhat nice. If the variable names with a Set() method are ugly to look at, we could also make package-local functions like setFeatureEnabledSearch() that would make the necessary calls.

With a package variable like featureEnabledSearch, calling featureEnabledSearch.Set(1) feels almost as readable as usagestats.SetFeatureEnabledSearch(), the difference being that now the usagestats package needs modification for each stat we want to include. Is there another advantage of moving the stats to the usagestats package here that I'm not thinking of?

Contributor Author

Just playing with it a little: moving this noise out of the New function could be as simple as something like recordConfigBoolStat(cfg.AuthEnabled, statAuthEnabled). Additionally, we could move all the stats into a stats file within the packages that implement them.
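
Roughly, as a sketch: `boolStat`, the `config` struct, and the stat key `feature_enabled_auth` are made up for illustration, and only `recordConfigBoolStat`, `cfg.AuthEnabled`, and `statAuthEnabled` are names from the comment above.

```go
package main

import "fmt"

// boolStat is a hypothetical stand-in for a usage stat holding 0 or 1.
type boolStat struct {
	name string
	val  int64
}

// Set stores v as the stat's current value.
func (s *boolStat) Set(v int64) { s.val = v }

// statAuthEnabled would live in a stats file next to the code that sets it.
// The key name is an assumption.
var statAuthEnabled = &boolStat{name: "feature_enabled_auth"}

// config is a minimal stand-in for the module config.
type config struct {
	AuthEnabled bool
}

// recordConfigBoolStat sets stat to 1 when enabled is true and to 0 otherwise,
// so the constructor needs one line per reported config flag.
func recordConfigBoolStat(enabled bool, stat *boolStat) {
	if enabled {
		stat.Set(1)
		return
	}
	stat.Set(0)
}

func main() {
	cfg := config{AuthEnabled: true}
	recordConfigBoolStat(cfg.AuthEnabled, statAuthEnabled)
	fmt.Println(statAuthEnabled.name, statAuthEnabled.val)
}
```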

Collaborator

I'm fine with either way. I like @mdisibio's suggestion to consolidate the stats b/c it will help users of the software quickly see what we are reporting. I don't consider this a blocker.

Contributor Author

Fair enough. I see that it also might prevent sharing the package in the future, but I don't know how much to hold on to that idea.

Contributor Author

@zalegrala zalegrala Jul 13, 2022

I think we could move all the variables to modules.go if we wanted one place to look. We mostly need access to the config, which could be done in a helper on the Server. How's that sound? I think that might also help smooth out a dependency loop for the backend creation when trying to implement backend.New() for shared use.

Comment thread pkg/usagestats/config.go Outdated
Comment thread pkg/usagestats/reporter.go
Comment thread tempodb/backend/raw.go Outdated
@zalegrala
Contributor Author

That's great, thanks for the feedback.

Comment thread pkg/usagestats/reporter.go Outdated
Contributor

@annanay25 annanay25 left a comment

Some concerns as discussed offline:

  1. I'm a little skeptical about adding a dependency on the backend storage to the usage reporting module - not only because of the complexity of running the module but also because an unexpected failure in the backend might result in a component crash. I believe we could hold the token in the memberlist kv for components, in which case it will only ever be recreated if all components are wiped and restarted - at which point we might as well call it a new cluster. (Maybe we can go ahead for now but ease the dependency at some point.)
  2. With the additional dependency on the backend, components like the distributor, which initially had no dependency on the backend, will have to be configured with the read token from GCS/S3/Azure. This is a breaking change and needs to be communicated clearly and updated in all docs.
  3. Schema changes - I'm not sure what happens if we change the usage report schema: does an old payload get rejected with a 400? Or do we accept it with a 200 OK but drop it internally? How hard is it to change the associated dashboards?
  4. The backoff to send data / wait for the token to be created should definitely be made configurable. In a large cluster with 100 components, it can result in pretty spiky memberlist traffic if all components query the kv store every second.

Comment thread pkg/usagestats/reporter.go Outdated
Contributor Author

zalegrala commented Jul 6, 2022

Thanks for the review @annanay25.

  1. You are probably right that we should call it a new cluster if all components go down. I don't have too strong a feeling about it, but persistent storage for cluster identification seems somewhat nice to me, since all the trace data would persist after all systems were shut down. I think about power outages also, which are probably rare for the environments in which we are most likely deployed. I can test without storage and start refactoring if we have strong concerns about the distributor having access to the backend.
  2. Good callout on the breaking change here. I'll make sure to update some docs if we choose to keep this dependency.
  3. This has come up a few times, but I'm not sure where to communicate this in docs. The actual JSON payload that gets sent has a few rigid items in there about the build information of the binary, so we would need to keep that structure in order to allow the BigQuery data to be useful long term. Additionally, there is a map[string]interface{} metrics section of the payload that we have full control over. When we go to query, if the string keys there change, then dashboards will need to be updated. Also keep in mind that the speed at which those change is the speed at which folks are upgrading, so ideally this remains somewhat stable. We can add new metrics easily enough. Nothing server side receiving this payload will reject it based on the schema.
  4. a: We can make the backoff for token retrieval configurable, but note too that only the ingesters currently will be querying memberlist for the data, so the load on the memberlist is somewhat reduced. That is the current form, though. If we make changes so that the backend is never used for cluster ID, then this would increase the load on the memberlist.
     b: As for the data sending interval, I'm slightly of two minds about this. On one hand a configuration option seems good. On the other, having it hardcoded means users can't change the load placed on us receiving the payload, since the interval at which the reports are sent would be constant in the binary.
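
To illustrate point 3, a rough sketch of that payload shape: a couple of fixed fields plus a free-form metrics map. The struct, its field names, and the JSON keys below are assumptions made for illustration and are not the actual report format; only the map[string]interface{} metrics section is taken from the description above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// report is a hypothetical payload shape. The fixed fields stand in for the
// "rigid" build information; the real field names may differ.
type report struct {
	ClusterID string `json:"clusterID"`
	Version   string `json:"version"`
	// Metrics is free-form: keys can be added without changing the struct,
	// but renaming a key means any dashboard querying it must be updated.
	Metrics map[string]interface{} `json:"metrics"`
}

// buildReport serializes the fixed fields and the metrics map to JSON.
func buildReport(clusterID, version string, metrics map[string]interface{}) ([]byte, error) {
	return json.Marshal(report{
		ClusterID: clusterID,
		Version:   version,
		Metrics:   metrics,
	})
}

func main() {
	out, err := buildReport("example-cluster", "v0.0.0", map[string]interface{}{
		"feature_enabled_search": 1,
	})
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(string(out))
}
```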

Comment thread docs/tempo/website/configuration/_index.md Outdated
Contributor

@knylander-grafana knylander-grafana left a comment

Made a minor suggestion to the text.

- [storage](#storage)
- [memberlist](#memberlist)
- [overrides](#overrides)
- [usage-report](#usage-report)
Contributor

nit: this is placed below search in the actual description

Comment thread pkg/usagestats/config.go Outdated
Comment on lines +21 to +23
f.DurationVar(&cfg.Backoff.MaxBackoff, util.PrefixConfig(prefix, "backoff.max_backoff"), time.Minute, "maximum time to back off retry")
f.DurationVar(&cfg.Backoff.MinBackoff, util.PrefixConfig(prefix, "backoff.min_backoff"), time.Second, "minimum time to back off retry")
f.IntVar(&cfg.Backoff.MaxRetries, util.PrefixConfig(prefix, "backoff.max_retries"), 0, "maximum number of times to retry")
Contributor

We should replace this with cfg.Backoff.RegisterFlagsWithPrefix("usage-report", f)

// RegisterFlagsWithPrefix for Config.
func (cfg *Config) RegisterFlagsWithPrefix(prefix string, f *flag.FlagSet) {
	f.DurationVar(&cfg.MinBackoff, prefix+".backoff-min-period", 100*time.Millisecond, "Minimum delay when backing off.")
	f.DurationVar(&cfg.MaxBackoff, prefix+".backoff-max-period", 10*time.Second, "Maximum delay when backing off.")
	f.IntVar(&cfg.MaxRetries, prefix+".backoff-retries", 10, "Number of times to backoff and retry before failing.")
}

And then we can override the defaults if needed. But the flag names should be consistent.
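
A sketch of what that could look like, using only the standard `flag` package. `BackoffConfig` is a local copy of the fields from the snippet above so the example is self-contained, and `registerUsageReportBackoff` and `parseArgs` are hypothetical helpers. The override values are the defaults from the lines under review.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// BackoffConfig mirrors the fields used by the quoted RegisterFlagsWithPrefix.
type BackoffConfig struct {
	MinBackoff time.Duration
	MaxBackoff time.Duration
	MaxRetries int
}

// RegisterFlagsWithPrefix registers the shared, consistently named flags.
func (cfg *BackoffConfig) RegisterFlagsWithPrefix(prefix string, f *flag.FlagSet) {
	f.DurationVar(&cfg.MinBackoff, prefix+".backoff-min-period", 100*time.Millisecond, "Minimum delay when backing off.")
	f.DurationVar(&cfg.MaxBackoff, prefix+".backoff-max-period", 10*time.Second, "Maximum delay when backing off.")
	f.IntVar(&cfg.MaxRetries, prefix+".backoff-retries", 10, "Number of times to backoff and retry before failing.")
}

// registerUsageReportBackoff registers the shared flags, then overrides the
// values they start with. Flags passed on the command line still win, because
// parsing happens after this function returns. The defaults printed in the
// flag help text remain the shared ones.
func registerUsageReportBackoff(cfg *BackoffConfig, f *flag.FlagSet) {
	cfg.RegisterFlagsWithPrefix("usage-report", f)
	cfg.MinBackoff = time.Second
	cfg.MaxBackoff = time.Minute
	cfg.MaxRetries = 0
}

// parseArgs builds a flag set, applies the overrides, and parses args.
func parseArgs(args []string) (BackoffConfig, error) {
	var cfg BackoffConfig
	fs := flag.NewFlagSet("usage-report-example", flag.ContinueOnError)
	registerUsageReportBackoff(&cfg, fs)
	err := fs.Parse(args)
	return cfg, err
}

func main() {
	cfg, err := parseArgs([]string{"-usage-report.backoff-retries=3"})
	if err != nil {
		fmt.Println("flag parsing failed:", err)
		return
	}
	fmt.Println(cfg.MinBackoff, cfg.MaxBackoff, cfg.MaxRetries)
}
```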

Contributor Author

This has been updated. How's that?

r, err := NewReporter(Config{Leader: true, Enabled: true}, kv.Config{
Store: "",
}, objectClient, objectClient, log.NewLogfmtLogger(os.Stdout), prometheus.NewPedanticRegistry())
require.NoError(t, err)
Contributor

Shouldn't we error on a wrong k/v store value?

Contributor Author

I don't think so, because only the ingesters will require the k/v store. This will get checked when the ingester is started during the call to running().

Contributor

@annanay25 annanay25 left a comment

Left a final few comments but approving to unblock once done!

@zalegrala zalegrala merged commit a72a095 into grafana:main Jul 29, 2022
@zalegrala zalegrala deleted the usageStats branch July 29, 2022 15:48

6 participants