Improve rate limit error message for traces size exceeds rate limit#4986
Improve rate limit error message for traces size exceeds rate limit#4986electron0zero merged 7 commits intografana:mainfrom
Conversation
| globalLimit = limit * d.DistributorRing.InstancesCount() | ||
| } | ||
|
|
||
| if tracesSize > limit || tracesSize > globalLimit { |
There was a problem hiding this comment.
does this mean that their batch size is too large? should we advise they reduce it?
There was a problem hiding this comment.
yes, and yes but decided not to put that in error message because we don't give advice in other error messages as well.
There was a problem hiding this comment.
i think it's fine to give more details here. i don't find the new error message any more useful than the old one. you could mention that a single batch exceeded the rate limit.
also, i don't think this check is correct. the push is limited if it exceeds local limit or global limit depending on how tempo is configured. the local limit is just the limit. the global limit is limit / numDistributors. the logic is bit tough to unwind but you can see some if it in ingestion_rate_strategy.go.
There was a problem hiding this comment.
I'm wondering if what we need is to include the remaining burst size in the error message. I think that is an area that is unclear, because a push can exceed the local limit as a burst. But not when it's consistent.
There was a problem hiding this comment.
updated to add some more tests and comments to better explain the local and global ingestion rate strategy.
also, i don't think this check is correct. the push is limited if it exceeds local limit or global limit depending on how tempo is configured. the local limit is just the limit. the global limit is limit / numDistributors. the logic is bit tough to unwind but you can see some if it in ingestion_rate_strategy.go.
limit is fetched from ingestionRateLimiter so it will be the local or global limit based on the configured IngestionRateStrategy.
I also updated the check to not include globalLimit because that's not correct, instead we should be checking on limit AND burst to only return the batch size error when the batch is bigger then both limit.
if the batch size is more then limit but less then burst, it would allowed by burst and we should return the ingestion rate limit instead of batch size error.
9b0bdd1 to
171c74d
Compare
171c74d to
5c64d11
Compare
zalegrala
left a comment
There was a problem hiding this comment.
Looks reasonable to me.
6658e07 to
85424d0
Compare
What this PR does:
I saw errors that reads as
pusher failed to consume trace data" err="rpc error: code = ResourceExhausted desc = RATE_LIMITED: ingestion rate limit (local: 50000 bytes, global: 500000 bytes) exceeded while adding 966953 bytes for user xxxxxx"This error message had me confused because the ingestion rate for this user was way under the ingestion rate limit.
After some investigation I noticed that this push failed because the batch size of 966953 bytes is more then the global limit of 500000 bytes (my cluster is using
globalingestion rate limiting strategy)This PR improves the error message when the push failure is due to batch size being more then rate limit and not because user is hitting the ingestion rate limit.
Which issue(s) this PR fixes:
Fixes #
Checklist
CHANGELOG.mdupdated - the order of entries should be[CHANGE],[FEATURE],[ENHANCEMENT],[BUGFIX]