[Rhythm] Live-store: downscale endpoint #5600

Merged
ruslan-mikhailov merged 17 commits into grafana:main from ruslan-mikhailov:enhancement/live-store-downscale
Oct 7, 2025

Conversation

Contributor

@ruslan-mikhailov ruslan-mikhailov commented Sep 2, 2025

What this PR does: adds the endpoint /live-store/prepare-partition-downscale, which prepares a live-store for downscaling. Live-store downscale flow:

  1. Send a POST request to /live-store/prepare-partition-downscale.
  2. The partition is set to the INACTIVE state. It is now read-only and does not receive new records.
  3. The live-store keeps running for N minutes, until requests for the traces it stores start hitting the backend and it is safe to remove it from load.
  4. Send a POST request to /live-store/prepare-downscale, which allows the live-store to remove itself from the partition owners.
  5. Shut down the pod.
  6. When the partition has no owners left, it is removed from the ring.

The implementation of the endpoint is similar to the Mimir ingester's endpoint, with slight changes.

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

//
// - DELETE
// Sets partition back from INACTIVE to ACTIVE state.
func (s *LiveStore) PreparePartitionDownscaleHandler(w http.ResponseWriter, r *http.Request) {
Contributor

Is there a risk of multiple parallel invocations causing issues here? I.e., in the HTTP POST path, where the partition state might change between line 53 and line 60?

Contributor

This endpoint is called by the rollout-operator, which runs as a singleton. This handler is not thread-safe, but I'm not sure it's a real problem.

@ruslan-mikhailov ruslan-mikhailov marked this pull request as draft September 16, 2025 14:31
@ruslan-mikhailov ruslan-mikhailov force-pushed the enhancement/live-store-downscale branch 2 times, most recently from 5038522 to 6771900 Compare September 19, 2025 08:37
pvc_storage_class: error 'Must specify a live-store pvc storage class',
replicas: 0,
max_unavailable: 25,
downscale_delay: '30m',
Contributor Author

@ruslan-mikhailov ruslan-mikhailov Sep 19, 2025

I picked the value based on the default of query_backend_after (30 min). Should we increase it to 45 min to avoid edge cases?

Contributor Author

Updated to 35m

@ruslan-mikhailov ruslan-mikhailov force-pushed the enhancement/live-store-downscale branch 2 times, most recently from 93d945e to 7553798 Compare September 19, 2025 09:00
@ruslan-mikhailov ruslan-mikhailov force-pushed the enhancement/live-store-downscale branch 3 times, most recently from 990f442 to 824f995 Compare September 29, 2025 07:30
@ruslan-mikhailov ruslan-mikhailov marked this pull request as ready for review September 29, 2025 14:01
@ruslan-mikhailov ruslan-mikhailov force-pushed the enhancement/live-store-downscale branch from 99e955f to d7961f9 Compare September 29, 2025 14:04
Comment on lines +11 to +15
live_store: {
  partition_ring: {
    delete_inactive_partition_after: $._config.live_store.downscale_delay,
  },
},
Contributor

Can you change this to tempo_live_store_config (line 112) so the blast radius is smaller? Otherwise this is propagated to all components.

Contributor Author

Changed, should be better now

@ruslan-mikhailov ruslan-mikhailov force-pushed the enhancement/live-store-downscale branch 2 times, most recently from 2c2f37d to c28ef4b Compare October 1, 2025 09:47
@ruslan-mikhailov
Contributor Author

+ review and doc fixes
+ rebase from main

Comment on lines +116 to +119
if !primary then {
  // Disable updates of ReplicaTemplate/live-store from non-primary (zone-b) statefulsets.
  'grafana.com/rollout-mirror-replicas-from-resource-write-back': 'false',
}
Contributor

Can you briefly explain how this works? I thought that replicas were updated from the ReplicaTemplate to the StatefulSet. Does the ReplicaTemplate follow the replicas in one zone instead?

Contributor Author

I copied this from Mimir's implementation.

Here is the PR where it was introduced: grafana/rollout-operator#169

My understanding is that if it is set to true (the default), changes to a statefulset's replica count are written back to the ReplicaTemplate. If I understand correctly, a change to zone-a's sts (due to autoscaling, for example) changes the ReplicaTemplate, which in turn changes zone-b.

Contributor

What's the strategy for updating replicas then? Update the primary sts?

Contributor

@mapno mapno left a comment

Nice work!

@ruslan-mikhailov ruslan-mikhailov force-pushed the enhancement/live-store-downscale branch from c28ef4b to 4eaf471 Compare October 7, 2025 10:44
@ruslan-mikhailov
Contributor Author

+ rebase from latest main to resolve conflicts

@ruslan-mikhailov ruslan-mikhailov enabled auto-merge (squash) October 7, 2025 10:56
@ruslan-mikhailov ruslan-mikhailov merged commit 74a3532 into grafana:main Oct 7, 2025
23 checks passed
@ruslan-mikhailov ruslan-mikhailov deleted the enhancement/live-store-downscale branch October 7, 2025 10:58

3 participants