Add nativeHistograms IngestionRate limit #6794
Conversation
Signed-off-by: Paurush Garg <[email protected]>
Force-pushed from 6c2fcba to e7d86bc
Why do we need new limits for nativeHistogram? (cortex/pkg/distributor/distributor.go, line 775 at 844fa55)

NH samples are much more expensive than float samples.

I think this is more of a timeseries limit, no? I am not sure about the memory impact of ingestionRate. But @PaurushGarg talked to me offline; it seems there are good reasons for this. Let's just update the description.
Signed-off-by: Paurush Garg <[email protected]>
There is one test failure related to this change.
Overall the change looks good.
Signed-off-by: Paurush Garg <[email protected]>
Force-pushed from bb248fa to 0b4cb0d
LGTM!
pkg/distributor/distributor.go (Outdated)
d.validateMetrics.DiscardedSamples.WithLabelValues(validation.NativeHistogramsRateLimited, userID).Add(float64(totalSamples))
d.validateMetrics.DiscardedExemplars.WithLabelValues(validation.NativeHistogramsRateLimited, userID).Add(float64(validatedExemplars))
d.validateMetrics.DiscardedMetadata.WithLabelValues(validation.NativeHistogramsRateLimited, userID).Add(float64(len(validatedMetadata)))
Are we always returning NativeHistogramsRateLimited? Can't this be triggered only by rateLimited?
Thanks very much. I needed to set label value validation.RateLimited in case it is rateLimited due to the IngestionRate limit, and set label value validation.NativeHistogramsRateLimited in case it is nhRateLimited due to the nativeHistogramsIngestionRate limit. Updated now.
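As a minimal sketch, the branching described above might look like the following (the reason variable is introduced here just for illustration; the other names come from the diffs quoted in this thread, and the real code may differ):

// Sketch only: pick the discard reason based on which limiter fired.
reason := validation.RateLimited
if nhRateLimited {
	reason = validation.NativeHistogramsRateLimited
}
d.validateMetrics.DiscardedSamples.WithLabelValues(reason, userID).Add(float64(totalSamples))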
Why do we need to drop all samples, exemplars, and metadata if native histograms are rate limited?
The default ingestion rate limit drops everything today because it passes everything to the rate limiter. But the NH limiter only checks native histograms, so it should only throttle native histograms.
I think it doesn't make sense for this limit to impact the existing ingestion rate limit if the NH limit is set very small but there is still plenty of room under the default ingestion rate.
+1. We can just block NH.
This is still the same, no? Ben suggested we drop only validatedHistogramSamples and don't fail the request right away.
Is there a use case for ingesting partial samples? I feel it's simpler to drop everything.
Also, we don't do partial ingestion for float samples. For example, in a remote write request with 10K samples, even if ingesting 9K samples is within the rate limit, we still reject the entire request.
To me this seems like a different type of rejection. The one we have today is a global limiter; if we want to accept up to the limit, we need to "choose" what goes through and what is rejected, which seems more complicated.
For a new limit specific to NH, we can just reject all NH samples, even if we could accept part of them, but allow the rest of the request to go through.
Thanks. Updated now.
I want to confirm the two points below:
1. Even when NH samples are dropped by the NHRateLimiter, this PR does not reduce the totalN value (by subtracting the count of NH samples dropped by the NHRateLimiter). This is to keep the existing rate limiter behaviour untouched. So, where NH samples were discarded by the NHRateLimiter, the existing rate limiter still rate-limits based on the total samples received in the request, regardless of whether some were already discarded by the NHRateLimiter. Similarly, the dropped NH samples still consume tokens from the current rate limiter.
2. In case of NH rate limiting, this PR doesn't return any error message or 429 to the client. It only publishes the discarded samples metric with the label value exclusive to the NHRateLimiter.
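A minimal sketch of the two behaviours confirmed above, reusing the identifiers quoted in this thread (placement and surrounding code are indicative only, not the PR's exact implementation):

// Sketch: NH rate limiting drops only the NH series and records a metric,
// without returning a 429 to the client.
nhRateLimited := !d.nativeHistogramsIngestionRateLimiter.AllowN(now, userID, validatedHistogramSamples)
if nhRateLimited {
	d.validateMetrics.DiscardedSamples.WithLabelValues(validation.NativeHistogramsRateLimited, userID).Add(float64(validatedHistogramSamples))
} else {
	seriesKeys = append(seriesKeys, nhSeriesKeys...)
	validatedTimeseries = append(validatedTimeseries, nhValidatedTimeseries...)
}
// totalN is intentionally left unchanged, so dropped NH samples still
// consume tokens from the existing ingestion rate limiter.
if !d.ingestionRateLimiter.AllowN(now, userID, totalN) {
	// return 429 as before
}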
pkg/distributor/distributor.go (Outdated)
// Return a 429 here to tell the client it is going too fast.
// Client may discard the data or slow down and re-send.
// Prometheus v2.26 added a remote-write option 'retry_on_http_429'.
if nhRateLimited {
Nit: can this still be inside the previous if for sending metrics? Any reason to move it out?
Thanks. Updated.
Signed-off-by: Paurush Garg <[email protected]>
@@ -3427,6 +3427,10 @@ The `limits_config` configures default and per-tenant limits imposed by Cortex s
# CLI flag: -distributor.ingestion-rate-limit
[ingestion_rate: <float> | default = 25000]

# Per-user nativeHistograms ingestion rate limit in samples per second.
# CLI flag: -distributor.native-histograms-ingestion-rate-limit
[native_histograms_ingestion_rate: <float> | default = 25000]
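For illustration, a per-tenant override of the new limit could look like the snippet below in the runtime overrides file (assumption: Cortex's standard per-tenant limits override mechanism applies, and the burst field name mirrors the flag added elsewhere in this PR):

# Hypothetical runtime-overrides snippet: cap one tenant's native histogram
# ingestion while other tenants keep the default.
overrides:
  tenant-a:
    native_histograms_ingestion_rate: 10000
    native_histograms_ingestion_burst_size: 20000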
I am worried that we are shipping this with a 25k default. This can impact users who are using NH and have overridden their ingestion_rate but are not aware of the new limit when deploying. I think I'd prefer this to default to 0, meaning disabled. If disabled, NH ingestion will still be limited by ingestion_rate.
Thanks very much Daniel. I have now implemented this.
Signed-off-by: Paurush Garg <[email protected]>
Signed-off-by: Paurush Garg <[email protected]>
pkg/util/validation/limits.go (Outdated)
@@ -240,8 +242,10 @@ func (l *Limits) RegisterFlags(f *flag.FlagSet) {
f.IntVar(&l.IngestionTenantShardSize, "distributor.ingestion-tenant-shard-size", 0, "The default tenant's shard size when the shuffle-sharding strategy is used. Must be set both on ingesters and distributors. When this setting is specified in the per-tenant overrides, a value of 0 disables shuffle sharding for the tenant.")
f.Float64Var(&l.IngestionRate, "distributor.ingestion-rate-limit", 25000, "Per-user ingestion rate limit in samples per second.")
f.Float64Var(&l.NativeHistogramsIngestionRate, "distributor.native-histograms-ingestion-rate-limit", 0, "Per-user nativeHistograms ingestion rate limit in samples per second. 0 to disable the limit")
Nit: let's not use nativeHistograms in camel case, as it looks weird. Let's use Native Histograms or native histograms.
Thanks. I have updated now.
pkg/util/validation/limits.go (Outdated)
f.StringVar(&l.IngestionRateStrategy, "distributor.ingestion-rate-limit-strategy", "local", "Whether the ingestion rate limit should be applied individually to each distributor instance (local), or evenly shared across the cluster (global).")
f.IntVar(&l.IngestionBurstSize, "distributor.ingestion-burst-size", 50000, "Per-user allowed ingestion burst size (in number of samples).")
f.IntVar(&l.NativeHistogramsIngestionBurstSize, "distributor.native-histograms-ingestion-burst-size", 0, "Per-user allowed nativeHistograms ingestion burst size (in number of samples). 0 to disable the limit")
ditto
Thanks. I have updated now.
pkg/distributor/distributor.go (Outdated)
// Return a 429 here to tell the client it is going too fast.
// Client may discard the data or slow down and re-send.
// Prometheus v2.26 added a remote-write option 'retry_on_http_429'.

return nil, httpgrpc.Errorf(http.StatusTooManyRequests, "ingestion rate limit (%v) exceeded while adding %d samples and %d metadata", d.ingestionRateLimiter.Limit(now, userID), totalSamples, len(validatedMetadata))
Unrelated to this PR. We should mention number of dropped exemplars as well.
pkg/distributor/distributor.go (Outdated)

return nil, httpgrpc.Errorf(http.StatusTooManyRequests, "nativeHistograms ingestion rate limit (%v) exceeded while adding %d samples and %d metadata", d.nativeHistogramsIngestionRateLimiter.Limit(now, userID), totalSamples, len(validatedMetadata))
Let's not use camel case for nativeHistograms. How about: `native histogram ingestion rate limit (%v) exceeded while adding %d native histogram samples`
Thanks. I have updated now.
pkg/distributor/distributor.go (Outdated)
if limits.NativeHistogramsIngestionRate > 0 && limits.NativeHistogramsIngestionBurstSize > 0 {
	nhRateLimited = !d.nativeHistogramsIngestionRateLimiter.AllowN(now, userID, validatedHistogramSamples)
}
Hm, I am not sure this is how I was imagining it. I think the limiter could handle 0 as disabled.
Looking at the rate documentation, 0 is a valid value meaning everything is blocked, while MaxFloat disables the limit. We could make the default maxFloat and see if it works:
https://github.com/cortexproject/cortex/blob/master/vendor/golang.org/x/time/rate/rate.go#L40
@yeya24 opinion?
That's a good catch.
If the ingestion rate is set to 0, then we just don't run d.nativeHistogramsIngestionRateLimiter.AllowN. Or, as Daniel said, we use a very big value.
Let's set the default to max float.
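A self-contained sketch of the semantics under discussion, calling golang.org/x/time/rate directly (the same package as the linked vendored file):

package main

import (
	"fmt"
	"math"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	now := time.Now()

	// rate.Inf (math.Inf(1)) truly disables limiting: AllowN always
	// returns true, even with a zero burst.
	off := rate.NewLimiter(rate.Inf, 0)
	fmt.Println(off.AllowN(now, 1000000)) // true

	// A zero limit blocks everything once the burst is spent, which is
	// why 0 cannot double as a "disabled" sentinel inside the limiter.
	blocked := rate.NewLimiter(rate.Limit(0), 0)
	fmt.Println(blocked.AllowN(now, 1)) // false

	// A huge finite limit (e.g. math.MaxFloat64) behaves like "disabled"
	// only while n <= burst, so the burst default matters as well.
	huge := rate.NewLimiter(rate.Limit(math.MaxFloat64), 10)
	fmt.Println(huge.AllowN(now, 5), huge.AllowN(now, 11)) // true false
}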
Thanks. I have updated now.
Signed-off-by: Paurush Garg <[email protected]>
pkg/distributor/distributor.go (Outdated)
if !d.ingestionRateLimiter.AllowN(now, userID, totalN) {

nhRateLimited := false
if limits.NativeHistogramsIngestionRate != math.MaxFloat64 {
We can remove the if. Let's let the library do its job; I don't expect it to be resource intensive.
Signed-off-by: Paurush Garg <[email protected]>
@@ -765,6 +772,18 @@ func (d *Distributor) Push(ctx context.Context, req *cortexpb.WriteRequest) (*co
d.receivedExemplars.WithLabelValues(userID).Add(float64(validatedExemplars))
d.receivedMetadata.WithLabelValues(userID).Add(float64(len(validatedMetadata)))

nhRateLimited := false
if limits.NativeHistogramsIngestionRate != math.MaxFloat64 {
Just a nit: maybe we can just call d.nativeHistogramsIngestionRateLimiter.AllowN and skip this check. I understand that we want to avoid calling the limiter if it is max float, but that should be similar to calling AllowN with the infinity strategy.
}

if nhRateLimited {
	d.validateMetrics.DiscardedSamples.WithLabelValues(validation.NativeHistogramsRateLimited, userID).Add(float64(validatedHistogramSamples))
If native histograms are rate limited, do we still include validated histogram samples on Line 794? Maybe you should reset validatedHistogramSamples to 0, as we are not going to ingest histogram samples:
totalSamples := validatedFloatSamples + validatedHistogramSamples
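A hedged sketch of that suggestion, reusing the names from the quoted diff (exact placement may differ in the real code):

if nhRateLimited {
	d.validateMetrics.DiscardedSamples.WithLabelValues(validation.NativeHistogramsRateLimited, userID).Add(float64(validatedHistogramSamples))
	// Zero the NH count once those samples are dropped, so totals computed
	// afterwards (e.g. totalSamples) no longer include them.
	validatedHistogramSamples = 0
}
totalSamples := validatedFloatSamples + validatedHistogramSamples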
Let's also add a log to provide more information in addition to the metric.
} else {
	seriesKeys = append(seriesKeys, nhSeriesKeys...)
	validatedTimeseries = append(validatedTimeseries, nhValidatedTimeseries...)
}
I think we need more test coverage. Let's try to at least add this test case, since it seems related to the issue I mentioned above (see the sketch below):
- If NH samples hit the NH rate limit, other series and samples should still succeed if they are under the rate limit
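As a self-contained illustration of the requested behaviour (this does not use the Cortex distributor test harness; two golang.org/x/time/rate limiters stand in for the NH and global limiters):

package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	now := time.Now()

	// Hypothetical setup: a tight NH limiter next to a roomy global limiter.
	nhLimiter := rate.NewLimiter(rate.Limit(10), 10)         // NH: 10 samples/s
	globalLimiter := rate.NewLimiter(rate.Limit(1000), 1000) // global: 1000 samples/s

	nhSamples, floatSamples := 50, 100

	// NH samples exceed their own limit and should be dropped...
	nhAllowed := nhLimiter.AllowN(now, nhSamples) // false

	// ...but the whole request should still pass the global limiter,
	// with totalN intentionally unchanged, as discussed above.
	totalN := nhSamples + floatSamples
	globalAllowed := globalLimiter.AllowN(now, totalN) // true

	fmt.Println(nhAllowed, globalAllowed)
}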
What this PR does:
Ingesting native histogram samples is much more CPU intensive than ingesting float samples, so this PR adds a native-histogram-specific ingestion rate limit to protect the service and to allow clients to adjust their NH series ingestion.
Checklist
- CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]