
Fix distributor rebatch bug #5186

Merged
mdisibio merged 3 commits into grafana:main from mdisibio:distributor-rebatch-fix
May 29, 2025

Conversation

Contributor

@mdisibio mdisibio commented May 29, 2025

What this PR does:
Distributors rebatch incoming writes by trace ID to ensure that all spans for a given trace end up on the same ingesters over time. The rebatching uses a 32-bit hash, which can lead to collisions. When two traces in the incoming write request have the same hash (collide), their spans get intermixed, i.e. bad data.
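To make the failure mode concrete, here is a minimal sketch of how bucketing spans by a 32-bit hash alone mixes two traces. Everything in it is an assumption for illustration: the `span` type, `hash32` (FNV-1a from the Go standard library, standing in for whichever 32-bit hash the distributor uses), `findCollision32`, and `groupByHash32` are hypothetical names, not Tempo code.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// span is a hypothetical, stripped-down span: a trace ID and a name.
type span struct {
	traceID string
	name    string
}

// hash32 is a stand-in 32-bit hash (FNV-1a). Assumption: the real
// distributor may use a different 32-bit function.
func hash32(traceID string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(traceID))
	return h.Sum32()
}

// findCollision32 draws random 16-byte trace IDs until two distinct IDs
// share a 32-bit hash. By the birthday bound this typically takes on the
// order of 2^16 to 2^17 draws, so it finishes quickly.
func findCollision32(seed int64) (string, string) {
	r := rand.New(rand.NewSource(seed))
	seen := make(map[uint32]string)
	buf := make([]byte, 16)
	for {
		r.Read(buf)
		id := string(buf)
		h := hash32(id)
		if prev, ok := seen[h]; ok && prev != id {
			return prev, id
		}
		seen[h] = id
	}
}

// groupByHash32 mimics the buggy rebatch: the 32-bit hash alone decides
// which bucket a span lands in, so colliding traces share a bucket.
func groupByHash32(spans []span) map[uint32][]span {
	out := make(map[uint32][]span)
	for _, s := range spans {
		k := hash32(s.traceID)
		out[k] = append(out[k], s)
	}
	return out
}

func main() {
	a, b := findCollision32(1)
	spans := []span{{a, "a-root"}, {b, "b-root"}, {a, "a-child"}}
	for key, bucket := range groupByHash32(spans) {
		fmt.Printf("bucket %08x holds %d spans from two different traces\n", key, len(bucket))
	}
}
```

The two trace IDs are different, yet all three spans end up in one bucket, which is the intermixing described above.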

The core issue is that the distributor logic was conflating hashing for the ring with hashing for dedupe. These don't need to be the same hash, nor should they be. The ring requires 32-bit hashes, and collisions there don't matter: it just means two traces go to the same ingesters (normal and unavoidable).
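A rough sketch of that separation, under stated assumptions: spans are grouped by a 64-bit key derived from the trace ID, and a 32-bit ring token is computed once per resulting batch. The types and helpers here (`span`, `traceBatch`, `ringToken`, `dedupeKey`, `rebatch`) are hypothetical, and the FNV functions are stand-ins; this is not the actual `requestsByTraceID` implementation.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// span and traceBatch are hypothetical, stripped-down types.
type span struct {
	traceID []byte
	name    string
}

type traceBatch struct {
	traceID []byte
	spans   []span
}

// ringToken is the 32-bit value used only for ring sharding.
// Collisions here are harmless: two traces share ingesters.
func ringToken(tenant string, traceID []byte) uint32 {
	h := fnv.New32()
	h.Write([]byte(tenant))
	h.Write(traceID)
	return h.Sum32()
}

// dedupeKey is the 64-bit value used only to group spans by trace.
func dedupeKey(traceID []byte) uint64 {
	h := fnv.New64a()
	h.Write(traceID)
	return h.Sum64()
}

// rebatch groups spans by the 64-bit key and returns one ring token per
// batch, in matching order.
func rebatch(tenant string, spans []span) ([]uint32, []*traceBatch) {
	var ringTokens []uint32
	var batches []*traceBatch
	byKey := make(map[uint64]*traceBatch)
	for _, s := range spans {
		k := dedupeKey(s.traceID)
		b, ok := byKey[k]
		if !ok {
			// First span of this trace: open a batch and record its token.
			b = &traceBatch{traceID: s.traceID}
			byKey[k] = b
			batches = append(batches, b)
			ringTokens = append(ringTokens, ringToken(tenant, s.traceID))
		}
		b.spans = append(b.spans, s)
	}
	return ringTokens, batches
}

func main() {
	a := []byte{0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1}
	b := []byte{0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2}
	tokens, batches := rebatch("tenant-1", []span{{a, "a-root"}, {b, "b-root"}, {a, "a-child"}})
	for i, bt := range batches {
		fmt.Printf("trace %x: %d spans, ring token %d\n", bt.traceID, len(bt.spans), tokens[i])
	}
}
```

Because the ring token is derived per batch rather than used as the map key, two batches may legitimately carry the same token without their spans ever being merged.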

But collisions for dedupe must be avoided, and we do this by swapping that part to a 64-bit hash. This is the same approach as in the trace combiner for spans. There are test cases for known trace ID collisions under the old method, and a test to check the collision rate of the new method.
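As a back-of-the-envelope check on why 64 bits is enough, the birthday bound puts the expected number of colliding pairs among n random IDs hashed to b bits at roughly n^2 / 2^(b+1). For one million trace IDs that is about a hundred collisions at 32 bits and effectively zero at 64 bits. The sketch below measures this empirically; `countCollisions` is a hypothetical helper and FNV-1a is an assumed stand-in, so it illustrates the idea rather than reproducing the PR's test.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// countCollisions hashes n random 16-byte trace IDs and counts how many
// of them land on an already-seen 32-bit and 64-bit hash value.
func countCollisions(n int, seed int64) (c32, c64 int) {
	r := rand.New(rand.NewSource(seed))
	seen32 := make(map[uint32]struct{}, n)
	seen64 := make(map[uint64]struct{}, n)
	id := make([]byte, 16)
	for i := 0; i < n; i++ {
		r.Read(id)

		h32 := fnv.New32a()
		h32.Write(id)
		k32 := h32.Sum32()
		if _, ok := seen32[k32]; ok {
			c32++
		}
		seen32[k32] = struct{}{}

		h64 := fnv.New64a()
		h64.Write(id)
		k64 := h64.Sum64()
		if _, ok := seen64[k64]; ok {
			c64++
		}
		seen64[k64] = struct{}{}
	}
	return c32, c64
}

func main() {
	c32, c64 := countCollisions(1000000, 1)
	fmt.Printf("32-bit collisions: %d, 64-bit collisions: %d\n", c32, c64)
}
```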

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Comment thread modules/distributor/distributor.go
maxAttributeBytes := d.getMaxAttributeBytes(userID)

- keys, rebatchedTraces, truncatedAttributeCount, err := requestsByTraceID(batches, userID, spanCount, maxAttributeBytes)
+ ringTokens, rebatchedTraces, truncatedAttributeCount, err := requestsByTraceID(batches, userID, spanCount, maxAttributeBytes)
Collaborator

+1 to the rename

@mdisibio mdisibio marked this pull request as ready for review May 29, 2025 22:30
@mdisibio mdisibio merged commit b9a611e into grafana:main May 29, 2025
20 checks passed
@mdisibio mdisibio added the type/bug (Something isn't working) and backport release-v2.8 labels Jun 3, 2025
mdisibio added a commit to mdisibio/tempo that referenced this pull request Jun 3, 2025
* Fix distributor rebatch bug, by not using a 32-bit hash for deduping, only for ring sharding (as required)

* lint

* changelog
mdisibio added a commit that referenced this pull request Jun 3, 2025
* Fix distributor rebatch bug (#5186)

* Fix distributor rebatch bug, by not using a 32-bit hash for deduping, only for ring sharding (as required)

* lint

* changelog

* changelog
carles-grafana pushed a commit to carles-grafana/tempo that referenced this pull request Jun 4, 2025
* Fix distributor rebatch bug, by not using a 32-bit hash for deduping, only for ring sharding (as required)

* lint

* changelog
Labels

backport release-v2.8, type/bug (Something isn't working)

2 participants