Add nativeHistograms IngestionRate limit #6794

Open. PaurushGarg wants to merge 7 commits into master from native-histograms-ingestion-rate-limit.

Conversation

PaurushGarg
Contributor

@PaurushGarg PaurushGarg commented Jun 5, 2025

What this PR does:
Ingesting native histogram (NH) samples is much more CPU-intensive than ingesting float samples. This PR adds an ingestion rate limit specific to native histogram samples, to protect the service and to let clients adjust their NH series ingestion.

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@PaurushGarg PaurushGarg force-pushed the native-histograms-ingestion-rate-limit branch from 6c2fcba to e7d86bc Compare June 5, 2025 14:19
@danielblando
Contributor

danielblando commented Jun 5, 2025

Why do we need new limits for native histograms?
Aren't histograms already counted toward the ingestion rate?

totalSamples := validatedFloatSamples + validatedHistogramSamples

@harry671003
Contributor

harry671003 commented Jun 5, 2025

Why do we need new limits for native histograms?
Aren't histograms already counted toward the ingestion rate?

NH samples are much more expensive than float samples.
A float sample is always 8 bytes.
An NH sample with 160 buckets and 160 spans will be ~2.5 KB in size.
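
The ~2.5 KB figure can be sanity-checked with a rough estimate. The sketch below assumes an in-memory layout similar to Prometheus's histogram types, where a span is an `int32` offset plus a `uint32` length and each bucket is stored as an `int64` delta. The `span` type and `estimateBytes` helper are illustrative names, not Cortex code, and the estimate ignores slice headers, fixed fields such as count and sum, and any wire or chunk encoding.

```go
package main

import (
	"fmt"
	"unsafe"
)

// span mirrors the assumed shape of a native histogram span: an offset to
// the next populated bucket range and the number of buckets in that range.
type span struct {
	Offset int32
	Length uint32
}

// estimateBytes returns the size of the span and bucket payload only.
// It ignores slice headers, count/sum/schema fields, and any encoding
// applied on the wire or in storage.
func estimateBytes(spans, buckets int) int {
	var s span
	var delta int64 // assumption: one int64 delta per bucket
	return spans*int(unsafe.Sizeof(s)) + buckets*int(unsafe.Sizeof(delta))
}

func main() {
	const floatSampleBytes = 8 // a float64 value
	nh := estimateBytes(160, 160)
	fmt.Printf("float sample: %d bytes\n", floatSampleBytes)
	fmt.Printf("NH sample (160 spans, 160 buckets): %d bytes (~%.1f KiB)\n", nh, float64(nh)/1024)
}
```

Under these assumptions the payload is 160 × 8 + 160 × 8 = 2560 bytes, about 320 times the size of a single float value.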

@danielblando
Contributor

NH samples are much more expensive than float samples.
A float sample is always 8 bytes.
An NH sample with 160 buckets and 160 spans will be ~2.5 KB in size.

I think this is more of a time series limit, no? I am not sure memory size impacts the ingestion rate.

But @PaurushGarg talked to me offline. It seems there are good reasons for this. Let's just update the description.

@harry671003
Contributor

There is one test failure related to this change.

 --- FAIL: TestPush_QuorumError (17.76s)
    distributor_test.go:878: 
        	Error Trace:	/__w/cortex/cortex/pkg/distributor/distributor_test.go:878
        	Error:      	Received unexpected error:
        	            	rpc error: code = Code(429) desc = nativeHistograms ingestion rate limit (25000) exceeded while adding 40 samples and 20 metadata
        	Test:       	TestPush_QuorumError
FAIL

Contributor

@harry671003 harry671003 left a comment

Overall the change looks good.

Signed-off-by: Paurush Garg <[email protected]>
@PaurushGarg PaurushGarg force-pushed the native-histograms-ingestion-rate-limit branch from bb248fa to 0b4cb0d Compare June 6, 2025 05:54
@PaurushGarg PaurushGarg marked this pull request as ready for review June 6, 2025 06:05
Contributor

@harry671003 harry671003 left a comment

LGTM!

Comment on lines +791 to +793
d.validateMetrics.DiscardedSamples.WithLabelValues(validation.NativeHistogramsRateLimited, userID).Add(float64(totalSamples))
d.validateMetrics.DiscardedExemplars.WithLabelValues(validation.NativeHistogramsRateLimited, userID).Add(float64(validatedExemplars))
d.validateMetrics.DiscardedMetadata.WithLabelValues(validation.NativeHistogramsRateLimited, userID).Add(float64(len(validatedMetadata)))
Contributor

Are we always returning NativeHistogramsRateLimited? Can't this be triggered by rateLimited only?

Contributor Author

Thanks very much.
I needed to set the label value validation.RateLimited when the request is rate limited by the IngestionRate limit (rateLimited), and validation.NativeHistogramsRateLimited when it is rate limited by the nativeHistogramsIngestionRate limit (nhRateLimited).
Updated now.
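
The label selection described above could look roughly like the following sketch. The constant values and the `discardReason` helper are hypothetical stand-ins for the constants in Cortex's validation package, and giving the general limit precedence when both limiters reject is an assumption, not something the thread confirms.

```go
package main

import "fmt"

// Hypothetical label values; the real constants live in Cortex's validation
// package and may be spelled differently.
const (
	reasonRateLimited   = "rate_limited"
	reasonNHRateLimited = "native_histograms_rate_limited"
)

// discardReason picks the reason label for the discarded-samples metrics.
// Assumption: the general ingestion limit takes precedence when both
// limiters reject the request.
func discardReason(rateLimited, nhRateLimited bool) (string, bool) {
	switch {
	case rateLimited:
		return reasonRateLimited, true
	case nhRateLimited:
		return reasonNHRateLimited, true
	default:
		return "", false
	}
}

func main() {
	for _, c := range [][2]bool{{true, false}, {false, true}, {false, false}} {
		reason, limited := discardReason(c[0], c[1])
		fmt.Printf("rateLimited=%v nhRateLimited=%v -> %q (limited=%v)\n", c[0], c[1], reason, limited)
	}
}
```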

Contributor

Why do we need to drop all samples, exemplars, and metadata if native histograms are rate limited?
The default ingestion rate limit drops everything today because it passes everything to the rate limiter. But the NH limiter only checks native histograms, so it should only throttle native histograms.

I don't think it makes sense for this limit to affect the existing ingestion rate limit when the NH limit is set very low but there is still plenty of room under the default ingestion rate.

Contributor

+1. we can just block NH
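
One way to read the behavior requested in this thread is sketched below: the general limiter still rejects the whole request, while the NH limiter drops only the native histogram samples. The `request` and `outcome` types and the `applyLimits` function are hypothetical simplifications for illustration, not the distributor's actual data structures.

```go
package main

import "fmt"

// request is a hypothetical, simplified view of a validated write request.
type request struct {
	FloatSamples     int
	HistogramSamples int
}

// outcome reports what is forwarded to ingesters and what is dropped.
type outcome struct {
	Accepted request
	Dropped  request
}

// applyLimits drops everything when the general ingestion limiter rejects
// the request, and only the native histogram samples when just the NH
// limiter rejects it.
func applyLimits(req request, rateLimited, nhRateLimited bool) outcome {
	switch {
	case rateLimited:
		return outcome{Dropped: req}
	case nhRateLimited:
		return outcome{
			Accepted: request{FloatSamples: req.FloatSamples},
			Dropped:  request{HistogramSamples: req.HistogramSamples},
		}
	default:
		return outcome{Accepted: req}
	}
}

func main() {
	req := request{FloatSamples: 100, HistogramSamples: 40}
	fmt.Printf("%+v\n", applyLimits(req, false, true))
}
```

With this split, a very low NH limit cannot starve float ingestion that still fits under the general ingestion rate.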

// Return a 429 here to tell the client it is going too fast.
// Client may discard the data or slow down and re-send.
// Prometheus v2.26 added a remote-write option 'retry_on_http_429'.
if nhRateLimited {
Contributor

nit: can't this stay inside the previous if that sends the metrics? Any reason to move it out?

Contributor Author

Thanks. Updated.

Signed-off-by: Paurush Garg <[email protected]>
@@ -3427,6 +3427,10 @@ The `limits_config` configures default and per-tenant limits imposed by Cortex s
# CLI flag: -distributor.ingestion-rate-limit
[ingestion_rate: <float> | default = 25000]

# Per-user nativeHistograms ingestion rate limit in samples per second.
# CLI flag: -distributor.native-histograms-ingestion-rate-limit
[native_histograms_ingestion_rate: <float> | default = 25000]
Contributor

I am worried that we are deploying this with a 25k default. This can impact users who use NH and override their ingestion_rate but are not aware of the new limit when deploying. I would prefer this to be 0, meaning disabled, by default. When it is disabled, NH samples are still limited by ingestion_rate.

Contributor Author

Thanks very much Daniel. I have now implemented this.

Signed-off-by: Paurush Garg <[email protected]>
Signed-off-by: Paurush Garg <[email protected]>
@PaurushGarg PaurushGarg requested a review from danielblando June 12, 2025 16:39
@@ -240,8 +242,10 @@ func (l *Limits) RegisterFlags(f *flag.FlagSet) {

f.IntVar(&l.IngestionTenantShardSize, "distributor.ingestion-tenant-shard-size", 0, "The default tenant's shard size when the shuffle-sharding strategy is used. Must be set both on ingesters and distributors. When this setting is specified in the per-tenant overrides, a value of 0 disables shuffle sharding for the tenant.")
f.Float64Var(&l.IngestionRate, "distributor.ingestion-rate-limit", 25000, "Per-user ingestion rate limit in samples per second.")
f.Float64Var(&l.NativeHistogramsIngestionRate, "distributor.native-histograms-ingestion-rate-limit", 0, "Per-user nativeHistograms ingestion rate limit in samples per second. 0 to disable the limit")
Contributor

Nit: let's not write nativeHistograms in camel case, as it looks weird. Let's use "Native Histograms" or "native histograms".

Contributor Author

Thanks. I have updated now.

f.StringVar(&l.IngestionRateStrategy, "distributor.ingestion-rate-limit-strategy", "local", "Whether the ingestion rate limit should be applied individually to each distributor instance (local), or evenly shared across the cluster (global).")
f.IntVar(&l.IngestionBurstSize, "distributor.ingestion-burst-size", 50000, "Per-user allowed ingestion burst size (in number of samples).")
f.IntVar(&l.NativeHistogramsIngestionBurstSize, "distributor.native-histograms-ingestion-burst-size", 0, "Per-user allowed nativeHistograms ingestion burst size (in number of samples). 0 to disable the limit")
Contributor

ditto

Contributor Author

Thanks. I have updated now.

// Return a 429 here to tell the client it is going too fast.
// Client may discard the data or slow down and re-send.
// Prometheus v2.26 added a remote-write option 'retry_on_http_429'.

return nil, httpgrpc.Errorf(http.StatusTooManyRequests, "ingestion rate limit (%v) exceeded while adding %d samples and %d metadata", d.ingestionRateLimiter.Limit(now, userID), totalSamples, len(validatedMetadata))
Contributor

Unrelated to this PR: we should mention the number of dropped exemplars as well.


return nil, httpgrpc.Errorf(http.StatusTooManyRequests, "nativeHistograms ingestion rate limit (%v) exceeded while adding %d samples and %d metadata", d.nativeHistogramsIngestionRateLimiter.Limit(now, userID), totalSamples, len(validatedMetadata))
Contributor

Let's not use camel case for nativeHistograms. How about:

native histogram ingestion rate limit (%v) exceeded while adding %d native histogram samples

Contributor Author

Thanks. I have updated now.
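
For reference, the wording proposed in this thread could be produced as in the sketch below. The `nhRateLimitMessage` helper and its parameters are illustrative only; the actual change presumably keeps using httpgrpc.Errorf with http.StatusTooManyRequests, as in the diff above.

```go
package main

import "fmt"

// nhRateLimitMessage formats the wording proposed in the review. The name
// and parameters are illustrative, not taken from the Cortex code base.
func nhRateLimitMessage(limit float64, histogramSamples int) string {
	return fmt.Sprintf(
		"native histogram ingestion rate limit (%v) exceeded while adding %d native histogram samples",
		limit, histogramSamples,
	)
}

func main() {
	fmt.Println(nhRateLimitMessage(25000, 40))
}
```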

Comment on lines 785 to 787
if limits.NativeHistogramsIngestionRate > 0 && limits.NativeHistogramsIngestionBurstSize > 0 {
nhRateLimited = !d.nativeHistogramsIngestionRateLimiter.AllowN(now, userID, validatedHistogramSamples)
}
Contributor

Hm, I am not sure this is how I was imagining it. I think the limiter itself could handle 0 as disabled.
Looking at the rate documentation, 0 is a valid value meaning everything will be blocked, and MaxFloat disables the limit. Maybe we can set the default to MaxFloat and see if that works.
https://github.com/cortexproject/cortex/blob/master/vendor/golang.org/x/time/rate/rate.go#L40

@yeya24 opinion?

Contributor

@yeya24 yeya24 Jun 12, 2025

That's a good catch.
If ingestion rate is set to 0 then we just don't run d.nativeHistogramsIngestionRateLimiter.AllowN. Or as Daniel said we use a very big value.

Let's set default to max float.

Contributor Author

Thanks. I have updated now.
