[Ingester] Create one goroutine per tenant to flush traces to disk #4483
joe-elliott merged 13 commits into grafana:main
Conversation
Signed-off-by: Joe Elliott <number101010@gmail.com>
// Flush triggers a flush of all in memory traces to disk. This is called
// by the lifecycler on shutdown and will put our traces in the WAL to be
// replayed.
func (i *Ingester) Flush() {
This is the old way we flushed traces to disk on shutdown. It was difficult to find and was driven through an obsolete ring mechanic, "FlushTransferer", so I removed it and moved the logic to "stopping". I think the new way is more easily discoverable and clear.
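For context, a minimal sketch of what that could look like in the stopping hook, reusing names that appear elsewhere in this diff (pushErr, ErrShuttingDown, cutAllInstancesToWal); the actual body in the PR may differ:

```go
// sketch only: flush-on-shutdown driven from the dskit service "stopping" hook
// rather than the lifecycler's FlushTransferer callback
func (i *Ingester) stopping(_ error) error {
	// refuse new pushes while we drain
	i.pushErr.Store(ErrShuttingDown)

	// cut every tenant's live traces to the WAL so they can be replayed on restart
	i.cutAllInstancesToWal()

	return nil
}
```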
	i.cutOneInstanceToWal(instance, true)
} else {
-	i.sweepAllInstances(true)
+	i.cutAllInstancesToWal()
renamed these funcs for clarity
-func (i *Ingester) sweepAllInstances(immediate bool) {
+// cutToWalLoop kicks off a goroutine for the passed instance that will periodically cut traces to WAL.
+// it signals completion through cutToWalWg, waits for cutToWalStart and stops on cutToWalStop.
+func (i *Ingester) cutToWalLoop(instance *instance) {
The new per-tenant loop that drives flushing live traces to disk.
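To make the shape of that loop concrete, here is a rough sketch; it assumes cutToWalStart and cutToWalStop are channels and that the cut interval comes from a config field (FlushCheckPeriod is an assumption), so treat it as an illustration rather than the PR code:

```go
// per-tenant loop: cut this tenant's live traces to the WAL on a ticker until stopped
func (i *Ingester) cutToWalLoop(instance *instance) {
	defer i.cutToWalWg.Done()

	// wait for the ingester to finish starting (e.g. WAL replay) before cutting anything
	select {
	case <-i.cutToWalStart:
	case <-i.cutToWalStop:
		return
	}

	ticker := time.NewTicker(i.cfg.FlushCheckPeriod) // assumed config field
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			i.cutOneInstanceToWal(instance, false)
		case <-i.cutToWalStop:
			return
		}
	}
}
```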
@@ -256,7 +275,6 @@ func (i *Ingester) handleComplete(ctx context.Context, op *flushOp) (retry bool,
	}

	start := time.Now()
	level.Info(log.Logger).Log("msg", "completing block", "tenant", op.userID, "blockID", op.blockID)
This was being logged twice and the first was actually wrong. Fixed the first and removed the second log.
}

	i.pushErr.Store(ErrStarting)

	i.local = store.WAL().LocalBackend()

-	lc, err := ring.NewLifecycler(cfg.LifecyclerConfig, i, "ingester", cfg.OverrideRingKey, true, log.Logger, prometheus.WrapRegistererWithPrefix("tempo_", reg))
+	lc, err := ring.NewLifecycler(cfg.LifecyclerConfig, nil, "ingester", cfg.OverrideRingKey, true, log.Logger, prometheus.WrapRegistererWithPrefix("tempo_", reg))
We no longer register ourselves as a "FlushTransferer":
https://github.com/grafana/dskit/blob/main/ring/lifecycler.go#L181
This is deprecated logic that we were only using to drive the flush-to-disk-on-shutdown behavior. Removed in favor of doing it clearly in the stopping func.
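For reference, the dskit interface in question looks roughly like this (the linked source is authoritative); passing nil means the lifecycler never invokes either method:

```go
// approximate shape of dskit's ring.FlushTransferer
type FlushTransferer interface {
	Flush()
	TransferOut(ctx context.Context) error
}
```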
@@ -351,19 +362,6 @@ func (i *Ingester) getInstances() []*instance {
	return instances
}

-// stopIncomingRequests implements ring.Lifecycler.
only called in one spot so removed
	i.pushErr.Store(ErrShuttingDown)
}

-// TransferOut implements ring.Lifecycler.
only existed to satisfy the FlushTransferer interface
This reverts commit 5e58675.
What this PR does:
We have identified a failure mode in the ingesters caused by using a single goroutine to flush all live traces to disk. If a heavy query or other event causes resource starvation, this goroutine falls behind, the ingester's memory balloons, and it starts refusing traces with a LIVE_TRACES_EXCEEDED error. In the extreme case it will OOM.
This PR creates one goroutine per tenant that manages flushing that tenant's traces to disk. This is similar to the local blocks processor in the metrics generator. I would have preferred to have the goroutine lifecycle managed by the instance itself, but that would run counter to the current design and would have required more changes.
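As a standalone illustration of the concurrency pattern only (not Tempo code; every name below is made up), "one goroutine per tenant" boils down to a ticker-driven loop per tenant, a shared stop channel, and a WaitGroup so shutdown can wait for all of them:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// perTenantFlusher owns one flush goroutine per tenant instead of a single global one.
type perTenantFlusher struct {
	wg   sync.WaitGroup
	stop chan struct{}
}

func (f *perTenantFlusher) startTenant(tenantID string) {
	f.wg.Add(1)
	go func() {
		defer f.wg.Done()
		ticker := time.NewTicker(100 * time.Millisecond)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				// in the real ingester this would cut the tenant's live traces to the WAL
				fmt.Println("flushing live traces for", tenantID)
			case <-f.stop:
				return
			}
		}
	}()
}

func main() {
	f := &perTenantFlusher{stop: make(chan struct{})}
	f.startTenant("tenant-a")
	f.startTenant("tenant-b")

	time.Sleep(300 * time.Millisecond)
	close(f.stop) // shutdown: signal every per-tenant goroutine and wait for them
	f.wg.Wait()
}
```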
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]