Skip to content

Bug fix: Avoid race condition by checking for duplicate entries in bucket stores user scan #6863

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

justinjung04
Copy link
Contributor

@justinjung04 justinjung04 commented Jul 4, 2025

What this PR does:

We saw store gateways terminating due to map concurrent write:

fatal error: concurrent map writes

I was able to reproduce the race condition by forcing scanUsers to return duplicate user IDs in the list:

func TestBucketStores_RaceCheck(t *testing.T) {
	t.Parallel()

	type job struct {
		userID string
		store  *store.BucketStore
	}

	wg := &sync.WaitGroup{}
	jobs := make(chan job)

	const userID = "user-1"
	ctx := context.Background()
	cfg := prepareStorageConfig(t)
	storageDir := t.TempDir()

	bkt, err := filesystem.NewBucketClient(filesystem.Config{Directory: storageDir})
	require.NoError(t, err)

	reg := prometheus.NewPedanticRegistry()
	stores, err := NewBucketStores(cfg, NewNoShardingStrategy(log.NewNopLogger(), nil), objstore.WithNoopInstr(bkt), defaultLimitsOverrides(t), mockLoggingLevel(), log.NewNopLogger(), reg)
	require.NoError(t, err)

	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()

			for job := range jobs {
				job.store.SyncBlocks(ctx)
			}
		}()
	}

	for i := 0; i < 10; i++ {
		bs, _ := stores.getOrCreateStore(userID)
		jobs <- job{userID: userID, store: bs}
	}

	close(jobs)
	wg.Wait()
}
==================
WARNING: DATA RACE
Write at 0x00c0008962c0 by goroutine 80:
  github.com/cortexproject/cortex/pkg/storegateway.(*shardingMetadataFilterAdapter).Filter()
      cortex/pkg/storegateway/sharding_strategy.go:280 +0xfc
  github.com/thanos-io/thanos/pkg/block.(*BaseFetcher).fetch()
      cortex/vendor/github.com/thanos-io/thanos/pkg/block/fetcher.go:617 +0x54c
  github.com/thanos-io/thanos/pkg/block.(*MetaFetcher).Fetch()
      cortex/vendor/github.com/thanos-io/thanos/pkg/block/fetcher.go:658 +0xd4
  github.com/thanos-io/thanos/pkg/store.(*BucketStore).SyncBlocks()
      cortex/vendor/github.com/thanos-io/thanos/pkg/store/bucket.go:737 +0x60
  github.com/cortexproject/cortex/pkg/storegateway.TestBucketStores_RaceCheck.func1()
      cortex/pkg/storegateway/bucket_stores_test.go:487 +0xb8

Previous write at 0x00c0008962c0 by goroutine 78:
  github.com/cortexproject/cortex/pkg/storegateway.(*shardingMetadataFilterAdapter).Filter()
      cortex/pkg/storegateway/sharding_strategy.go:280 +0xfc
  github.com/thanos-io/thanos/pkg/block.(*BaseFetcher).fetch()
      cortex/vendor/github.com/thanos-io/thanos/pkg/block/fetcher.go:617 +0x54c
  github.com/thanos-io/thanos/pkg/block.(*MetaFetcher).Fetch()
      cortex/vendor/github.com/thanos-io/thanos/pkg/block/fetcher.go:658 +0xd4
  github.com/thanos-io/thanos/pkg/store.(*BucketStore).SyncBlocks()
      cortex/vendor/github.com/thanos-io/thanos/pkg/store/bucket.go:737 +0x60
  github.com/cortexproject/cortex/pkg/storegateway.TestBucketStores_RaceCheck.func1()
      cortex/pkg/storegateway/bucket_stores_test.go:487 +0xb8

Goroutine 80 (running) created at:
  github.com/cortexproject/cortex/pkg/storegateway.TestBucketStores_RaceCheck()
      cortex/pkg/storegateway/bucket_stores_test.go:483 +0x3cc
  testing.tRunner()
      /opt/homebrew/Cellar/go/1.24.1/libexec/src/testing/testing.go:1792 +0x180
  testing.(*T).Run.gowrap1()
      /opt/homebrew/Cellar/go/1.24.1/libexec/src/testing/testing.go:1851 +0x40

Goroutine 78 (running) created at:
  github.com/cortexproject/cortex/pkg/storegateway.TestBucketStores_RaceCheck()
      cortex/pkg/storegateway/bucket_stores_test.go:483 +0x3cc
  testing.tRunner()
      /opt/homebrew/Cellar/go/1.24.1/libexec/src/testing/testing.go:1792 +0x180
  testing.(*T).Run.gowrap1()
      /opt/homebrew/Cellar/go/1.24.1/libexec/src/testing/testing.go:1851 +0x40
==================

To avoid such fatal error, I've added a check for duplicate users.

Which issue(s) this PR fixes:
n/a

Checklist

  • Tests updated
  • [n/a] Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@justinjung04 justinjung04 changed the title Make scanUser always return unique user IDs to avoid concurrent map w… Bug fix: Check for duplicate entries in bucket store user scan Jul 4, 2025
@justinjung04 justinjung04 changed the title Bug fix: Check for duplicate entries in bucket store user scan Bug fix: Check for duplicate entries in bucket stores user scan Jul 4, 2025
@justinjung04 justinjung04 changed the title Bug fix: Check for duplicate entries in bucket stores user scan Bug fix: Check for duplicate entries in bucket stores user scan to avoid race condition Jul 4, 2025
@justinjung04 justinjung04 changed the title Bug fix: Check for duplicate entries in bucket stores user scan to avoid race condition Bug fix: Avoid race condition by checking for duplicate entries in bucket stores user scan Jul 4, 2025
Signed-off-by: Justin Jung <[email protected]>
@SungJin1212
Copy link
Member

Thanks for catching!
How could this scenario arise?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants