Add nativeHistograms IngestionRate limit #6794

PaurushGarg · 2025-06-05T14:16:50Z

What this PR does:
Ingesting native histogram (NH) samples is much more CPU intensive than ingesting float samples. This PR adds an ingestion rate limit specific to native histogram samples, to protect the service and to let clients adjust their NH series ingestion.

Checklist

Tests updated
Documentation added
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Signed-off-by: Paurush Garg <[email protected]>

danielblando · 2025-06-05T14:59:18Z

Why do we need new limits for nativeHistogram?
Isn't histogram already counted toward the ingestionRate?

cortex/pkg/distributor/distributor.go

Line 775 in 844fa55

totalSamples := validatedFloatSamples + validatedHistogramSamples

harry671003 · 2025-06-05T16:00:59Z

Why do we need new limits for nativeHistogram?
Isn't histogram already counted toward the ingestionRate?

NH samples are much more expensive than float samples.
A float sample is always 8 bytes.
A NH sample with 160 buckets and 160 spans will be ~2.5 KB in size.
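
The ~2.5 KB figure can be reproduced with a back-of-the-envelope calculation. The sketch below assumes 8 bytes per span (an int32 offset plus a uint32 length) and 8 bytes per bucket (an int64 delta), and it ignores the fixed histogram fields such as count and sum. The helper name `approxHistogramBytes` is hypothetical and the sizes are assumptions, not measurements from Cortex.

```go
package main

import "fmt"

// Assumed in-memory sizes; fixed fields such as count and sum are ignored.
const (
	bytesPerSpan   = 8 // int32 offset + uint32 length
	bytesPerBucket = 8 // int64 bucket delta
	bytesPerFloat  = 8 // float64 sample value
)

// approxHistogramBytes is a hypothetical helper that estimates the payload
// size of one native histogram sample from its span and bucket counts.
func approxHistogramBytes(spans, buckets int) int {
	return spans*bytesPerSpan + buckets*bytesPerBucket
}

func main() {
	size := approxHistogramBytes(160, 160)
	fmt.Printf("%d bytes, %dx a float sample\n", size, size/bytesPerFloat)
}
```

Under these assumptions, 160 spans and 160 buckets come to 2560 bytes, which is 320 times the 8 bytes of a float sample value.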

pkg/distributor/distributor.go

pkg/util/validation/limits.go

danielblando · 2025-06-05T17:49:12Z

NH samples are much more expensive than float samples.
Float sample is always 8 bytes.
A NH sample with 160 buckets and 160 spans will be ~2.5 KB in size.

I think this is more of a timeseries limit, no? I am not sure the memory size is relevant to ingestionRate.

But @PaurushGarg talked to me offline. It seems there are good reasons for this. Let's just update the description.

Signed-off-by: Paurush Garg <[email protected]>

harry671003 · 2025-06-06T05:03:50Z

There is one test failure related to this change.

 --- FAIL: TestPush_QuorumError (17.76s)
    distributor_test.go:878: 
        	Error Trace:	/__w/cortex/cortex/pkg/distributor/distributor_test.go:878
        	Error:      	Received unexpected error:
        	            	rpc error: code = Code(429) desc = nativeHistograms ingestion rate limit (25000) exceeded while adding 40 samples and 20 metadata
        	Test:       	TestPush_QuorumError
FAIL

harry671003

Overall the change looks good.

Signed-off-by: Paurush Garg <[email protected]>

harry671003

LGTM!

CHANGELOG.md

danielblando · 2025-06-09T17:02:40Z

pkg/distributor/distributor.go

+		d.validateMetrics.DiscardedSamples.WithLabelValues(validation.NativeHistogramsRateLimited, userID).Add(float64(totalSamples))
+		d.validateMetrics.DiscardedExemplars.WithLabelValues(validation.NativeHistogramsRateLimited, userID).Add(float64(validatedExemplars))
+		d.validateMetrics.DiscardedMetadata.WithLabelValues(validation.NativeHistogramsRateLimited, userID).Add(float64(len(validatedMetadata)))


Are we always returning NativeHistogramsRateLimited? Can't this be triggered by rateLimited only?

Thanks very much.
I needed to set the label value validation.RateLimited when the request is rate limited by the IngestionRate limit, and validation.NativeHistogramsRateLimited when it is rate limited by the nativeHistogramsIngestionRate limit.
Updated now.

Why do we need to drop all samples, exemplars and metadata if native histograms are rate limited?
The default ingestion rate limit drops everything today because it passes everything to the rate limiter. But with the NH limiter we only check native histograms, so it should only throttle native histograms.

I think it doesn't make sense for this limit to impact the existing ingestion rate limit if the NH limit is set very small but there is still plenty of room under the default ingestion rate.

+1. We can just block NH.

This is still the same, no? Ben suggested we drop only validatedHistogramSamples and don't fail the request right away.

Is there a use case for ingesting partial samples? I feel it's simpler to drop everything.

Also, we don't do partial ingestion for float samples. For example, in a remote write request with 10K samples, even if ingesting 9K samples is within the rate limit, we still reject the entire request.

For me this seems a different type of rejection. The ones we have today use a global limiter: if we wanted to accept samples up to the limit, we would need to "choose" what goes through and what is rejected. That seems more complicated.

For a new limit specific to NH, we can just reject all NH samples, even if we could accept part of them, and allow the rest to go through.

Thanks. Updated now.
I want to confirm the two points below:

Even when NH samples are dropped by the NHRateLimiter, this PR does not reduce the totalN value by the number of dropped NH samples.
This keeps the existing rate limiter behaviour untouched.
So where NH samples were discarded by the NHRateLimiter, the existing rateLimiter still rate limits on the basis of the total samples received in the request, regardless of whether they were already discarded by the NHRateLimiter. Similarly, the dropped NH samples still consume tokens from the existing rateLimiter.

In case of NH rate limiting, this change doesn't return any error message or 429 to the client. It only publishes the discarded samples metric with a label value exclusive to the NHRateLimiter.
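
The two points above can be sketched with a pair of toy limiters. This is a minimal illustration, not the distributor code: `bucket`, `push`, and `result` are hypothetical names, the bucket never refills, and exemplars and metadata are left out of totalN for brevity.

```go
package main

import "fmt"

// bucket is a hypothetical, refill-free stand-in for a token-bucket limiter.
type bucket struct{ tokens int }

// allowN takes n tokens only if all n are available (all-or-nothing).
func (b *bucket) allowN(n int) bool {
	if n > b.tokens {
		return false
	}
	b.tokens -= n
	return true
}

// result summarises what happened to one push request.
type result struct {
	droppedNH int  // NH samples discarded by the NH limiter
	rejected  bool // whole request rejected by the general limiter
}

// push mirrors the behaviour described above: the NH limiter only drops NH
// samples, while the general limiter is still charged for every sample,
// including NH samples that were already dropped.
func push(general, nh *bucket, floats, histograms int) result {
	var r result
	if !nh.allowN(histograms) {
		r.droppedNH = histograms // metric only, no error returned to the client
	}
	totalN := floats + histograms // deliberately not reduced by droppedNH
	r.rejected = !general.allowN(totalN)
	return r
}

func main() {
	general, nh := &bucket{tokens: 100}, &bucket{tokens: 10}
	r := push(general, nh, 30, 20)
	fmt.Printf("droppedNH=%d rejected=%v generalTokensLeft=%d\n",
		r.droppedNH, r.rejected, general.tokens)
}
```

In this run the 20 NH samples exceed the NH budget of 10 and are dropped, yet the general limiter is still charged for all 50 samples and ends with 50 tokens left.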

danielblando · 2025-06-09T17:10:16Z

pkg/distributor/distributor.go

+	// Return a 429 here to tell the client it is going too fast.
+	// Client may discard the data or slow down and re-send.
+	// Prometheus v2.26 added a remote-write option 'retry_on_http_429'.
+	if nhRateLimited {


nit: can this still be inside the previous if for sending metrics? Any reason to move it out of it?

Thanks. Updated.

pkg/distributor/ingestion_rate_strategy.go

Signed-off-by: Paurush Garg <[email protected]>

danielblando · 2025-06-10T02:08:54Z

docs/configuration/config-file-reference.md

@@ -3427,6 +3427,10 @@ The `limits_config` configures default and per-tenant limits imposed by Cortex s
 # CLI flag: -distributor.ingestion-rate-limit
 [ingestion_rate: <float> | default = 25000]

+# Per-user nativeHistograms ingestion rate limit in samples per second.
+# CLI flag: -distributor.native-histograms-ingestion-rate-limit
+[native_histograms_ingestion_rate: <float> | default = 25000]


I am worried that we are deploying this with a 25k default. This can impact users who are using NH and have overridden their ingestion_rate but are not aware of the new limit when deploying. I think I prefer this being 0, meaning disabled by default. When disabled, NH samples will still be limited by ingestion_rate.

Thanks very much Daniel. I have now implemented this.

Signed-off-by: Paurush Garg <[email protected]>

yeya24 · 2025-06-12T16:53:28Z

pkg/distributor/distributor.go

+		d.validateMetrics.DiscardedSamples.WithLabelValues(validation.NativeHistogramsRateLimited, userID).Add(float64(totalSamples))
+		d.validateMetrics.DiscardedExemplars.WithLabelValues(validation.NativeHistogramsRateLimited, userID).Add(float64(validatedExemplars))
+		d.validateMetrics.DiscardedMetadata.WithLabelValues(validation.NativeHistogramsRateLimited, userID).Add(float64(len(validatedMetadata)))


Why do we need to drop all samples, exemplars and metadata if native histograms are rate limited?
The default ingestion rate limit drops everything today because it passes everything to the rate limiter. But with the NH limiter we only check native histograms, so it should only throttle native histograms.

I think it doesn't make sense for this limit to impact the existing ingestion rate limit if the NH limit is set very small but there is still plenty of room under the default ingestion rate.

yeya24 · 2025-06-12T16:56:03Z

pkg/util/validation/limits.go

@@ -240,8 +242,10 @@ func (l *Limits) RegisterFlags(f *flag.FlagSet) {

 	f.IntVar(&l.IngestionTenantShardSize, "distributor.ingestion-tenant-shard-size", 0, "The default tenant's shard size when the shuffle-sharding strategy is used. Must be set both on ingesters and distributors. When this setting is specified in the per-tenant overrides, a value of 0 disables shuffle sharding for the tenant.")
 	f.Float64Var(&l.IngestionRate, "distributor.ingestion-rate-limit", 25000, "Per-user ingestion rate limit in samples per second.")
+	f.Float64Var(&l.NativeHistogramsIngestionRate, "distributor.native-histograms-ingestion-rate-limit", 0, "Per-user nativeHistograms ingestion rate limit in samples per second. 0 to disable the limit")


Nit. Let's not use nativeHistograms in camel case as it looks weird. Let's use Native Histograms or native histograms

Thanks. I have updated now.

yeya24 · 2025-06-12T16:56:13Z

pkg/util/validation/limits.go

 	f.StringVar(&l.IngestionRateStrategy, "distributor.ingestion-rate-limit-strategy", "local", "Whether the ingestion rate limit should be applied individually to each distributor instance (local), or evenly shared across the cluster (global).")
 	f.IntVar(&l.IngestionBurstSize, "distributor.ingestion-burst-size", 50000, "Per-user allowed ingestion burst size (in number of samples).")
+	f.IntVar(&l.NativeHistogramsIngestionBurstSize, "distributor.native-histograms-ingestion-burst-size", 0, "Per-user allowed nativeHistograms ingestion burst size (in number of samples). 0 to disable the limit")


Thanks. I have updated now.

yeya24 · 2025-06-12T16:58:13Z

pkg/distributor/distributor.go

-		// Return a 429 here to tell the client it is going too fast.
-		// Client may discard the data or slow down and re-send.
-		// Prometheus v2.26 added a remote-write option 'retry_on_http_429'.
+
 		return nil, httpgrpc.Errorf(http.StatusTooManyRequests, "ingestion rate limit (%v) exceeded while adding %d samples and %d metadata", d.ingestionRateLimiter.Limit(now, userID), totalSamples, len(validatedMetadata))


Unrelated to this PR. We should mention number of dropped exemplars as well.

yeya24 · 2025-06-12T16:59:24Z

pkg/distributor/distributor.go


+		return nil, httpgrpc.Errorf(http.StatusTooManyRequests, "nativeHistograms ingestion rate limit (%v) exceeded while adding %d samples and %d metadata", d.nativeHistogramsIngestionRateLimiter.Limit(now, userID), totalSamples, len(validatedMetadata))


Let's not use camel case for nativeHistograms. How about

native histogram ingestion rate limit (%v) exceeded while adding %d native histogram samples

Thanks. I have updated now.

danielblando · 2025-06-12T17:12:56Z

pkg/distributor/distributor.go

+	if limits.NativeHistogramsIngestionRate > 0 && limits.NativeHistogramsIngestionBurstSize > 0 {
+		nhRateLimited = !d.nativeHistogramsIngestionRateLimiter.AllowN(now, userID, validatedHistogramSamples)
+	}


Hm, I am not sure this is how I was imagining it. I think the limiter could handle 0 as disabled.
Looking at the rate documentation, 0 is a valid value meaning everything will be blocked, and MaxFloat disables the limit. We could set the default to MaxFloat and see if it works.
https://github.com/cortexproject/cortex/blob/master/vendor/golang.org/x/time/rate/rate.go#L40

@yeya24 opinion?

That's a good catch.
~~If ingestion rate is set to 0 then we just don't run d.nativeHistogramsIngestionRateLimiter.AllowN. Or as Daniel said we use a very big value.~~

Let's set default to max float.

Thanks. I have updated now.

Signed-off-by: Paurush Garg <[email protected]>

danielblando · 2025-06-18T20:23:26Z

pkg/distributor/distributor.go

-	if !d.ingestionRateLimiter.AllowN(now, userID, totalN) {
+
+	nhRateLimited := false
+	if limits.NativeHistogramsIngestionRate != math.MaxFloat64 {


We can remove the if. Let's let the library do its job. I don't expect it to be resource intensive.

danielblando · 2025-06-18T20:28:21Z

pkg/distributor/distributor.go

+		d.validateMetrics.DiscardedSamples.WithLabelValues(validation.NativeHistogramsRateLimited, userID).Add(float64(totalSamples))
+		d.validateMetrics.DiscardedExemplars.WithLabelValues(validation.NativeHistogramsRateLimited, userID).Add(float64(validatedExemplars))
+		d.validateMetrics.DiscardedMetadata.WithLabelValues(validation.NativeHistogramsRateLimited, userID).Add(float64(len(validatedMetadata)))


This is still the same, no? Ben suggested we drop only validatedHistogramSamples and don't fail the request right away.

Signed-off-by: Paurush Garg <[email protected]>

yeya24 · 2025-06-20T22:01:00Z

pkg/distributor/distributor.go

@@ -765,6 +772,18 @@ func (d *Distributor) Push(ctx context.Context, req *cortexpb.WriteRequest) (*co
 	d.receivedExemplars.WithLabelValues(userID).Add(float64(validatedExemplars))
 	d.receivedMetadata.WithLabelValues(userID).Add(float64(len(validatedMetadata)))

+	nhRateLimited := false
+	if limits.NativeHistogramsIngestionRate != math.MaxFloat64 {


Just a nit. Maybe we can just call d.nativeHistogramsIngestionRateLimiter.AllowN and skip this check. I understand that we want to avoid calling the limiter if the rate is max float, but that should be similar to calling AllowN with the infinity strategy.

yeya24 · 2025-06-20T22:02:37Z

pkg/distributor/distributor.go

+	}
+
+	if nhRateLimited {
+		d.validateMetrics.DiscardedSamples.WithLabelValues(validation.NativeHistogramsRateLimited, userID).Add(float64(validatedHistogramSamples))


If native histograms are rate limited, do we still include validated histogram samples on Line 794? Maybe you should reset validatedHistogramSamples to 0, as we are not going to ingest histogram samples.

totalSamples := validatedFloatSamples + validatedHistogramSamples

Let's also add a log, to provide more information in addition to the metric

yeya24 · 2025-06-20T22:07:06Z

pkg/distributor/distributor.go

+	} else {
+		seriesKeys = append(seriesKeys, nhSeriesKeys...)
+		validatedTimeseries = append(validatedTimeseries, nhValidatedTimeseries...)
+	}


I think we need more test coverage. Let's at least add this test case, since it seems related to the issue I mentioned above.

If NH samples hit the NH rate limit, other series and samples should still succeed if they are under the rate limit.
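
The requested case can be sketched as follows. This is an illustration of the merge step in the diff above, not the actual test: `mergeSeries` is a hypothetical helper and the series names are made up.

```go
package main

import "fmt"

// mergeSeries is a hypothetical helper mirroring the diff above: float series
// are always kept, and NH series are appended only when the NH limiter did
// not fire.
func mergeSeries(floatSeries, nhSeries []string, nhRateLimited bool) []string {
	out := append([]string(nil), floatSeries...)
	if !nhRateLimited {
		out = append(out, nhSeries...)
	}
	return out
}

func main() {
	floats := []string{"http_requests_total", "up"}
	nh := []string{"request_duration_seconds"}

	fmt.Println(mergeSeries(floats, nh, true))  // NH limited: float series only
	fmt.Println(mergeSeries(floats, nh, false)) // NH allowed: everything
}
```

A real test would additionally assert that the push succeeds for the float series and that the discarded samples metric counts only the NH samples.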

pull-request-size bot added the size/L label Jun 5, 2025

Add native histograms ingestion rate limit

e7d86bc

Signed-off-by: Paurush Garg <[email protected]>

PaurushGarg force-pushed the native-histograms-ingestion-rate-limit branch from 6c2fcba to e7d86bc Compare June 5, 2025 14:19

harry671003 reviewed Jun 5, 2025

Adding Tests and Updating Docs

be1d61b

Signed-off-by: Paurush Garg <[email protected]>

harry671003 approved these changes Jun 6, 2025

Resolving failed testcase

0b4cb0d

Signed-off-by: Paurush Garg <[email protected]>

PaurushGarg force-pushed the native-histograms-ingestion-rate-limit branch from bb248fa to 0b4cb0d Compare June 6, 2025 05:54

PaurushGarg marked this pull request as ready for review June 6, 2025 06:05

dosubot bot added the component/distributor label Jun 6, 2025

harry671003 approved these changes Jun 6, 2025

danielblando reviewed Jun 9, 2025

Resolving comments

822cb4d

Signed-off-by: Paurush Garg <[email protected]>

danielblando reviewed Jun 10, 2025

PaurushGarg added 2 commits June 11, 2025 20:19

Resolving comments

f2e4c1f

Signed-off-by: Paurush Garg <[email protected]>

Updating doc

55896c8

Signed-off-by: Paurush Garg <[email protected]>

PaurushGarg requested a review from danielblando June 12, 2025 16:39

yeya24 reviewed Jun 12, 2025

danielblando reviewed Jun 12, 2025

Changing NativeHistograms default ingestion limits

ff78e59

Signed-off-by: Paurush Garg <[email protected]>

PaurushGarg requested review from danielblando, harry671003 and yeya24 June 17, 2025 21:01

harry671003 approved these changes Jun 17, 2025

danielblando reviewed Jun 18, 2025

Discard only NH Samples for NH Rate Limiter

ab7a85c

Signed-off-by: Paurush Garg <[email protected]>

PaurushGarg requested a review from danielblando June 20, 2025 19:22

yeya24 reviewed Jun 20, 2025
