
generator: back off when instance creation fails to avoid resource exhaustion #6142

Merged

carles-grafana merged 1 commit into grafana:main from carles-grafana:fix-generator-oom on Jan 8, 2026

Conversation

@carles-grafana (Contributor) commented Jan 7, 2026

What this PR does:

When processor validation fails and instance creation fails, the generator retries creating the instance indefinitely, potentially causing OOM errors.

This change caches failed instances and backs off to prevent the issue.
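
A minimal sketch of the approach, for illustration only (not the actual Tempo code; the type, field, and function names such as getOrCreateInstance and newInstance are assumptions, and the 10-minute backoff matches this PR's initial revision):

```go
package generator

import (
	"fmt"
	"sync"
	"time"
)

// failureBackoff is the duration to wait before retrying failed tenant
// instance creation (10 minutes in the initial revision of this PR).
const failureBackoff = 10 * time.Minute

// instance is a stand-in for the per-tenant metrics-generator instance.
type instance struct{}

// Generator holds the per-tenant instances plus the tenants whose creation
// recently failed.
type Generator struct {
	instancesMtx    sync.RWMutex
	instances       map[string]*instance
	failedInstances map[string]time.Time // tenant -> when creation last failed
}

// getOrCreateInstance returns the tenant's instance, backing off instead of
// retrying creation for tenants whose last attempt failed recently.
func (g *Generator) getOrCreateInstance(tenantID string) (*instance, error) {
	g.instancesMtx.Lock()
	defer g.instancesMtx.Unlock()

	if inst, ok := g.instances[tenantID]; ok {
		return inst, nil
	}

	// Back off: if creation failed within the backoff window, don't retry yet.
	if failedAt, ok := g.failedInstances[tenantID]; ok {
		if time.Since(failedAt) < failureBackoff {
			return nil, fmt.Errorf("not retrying instance creation for tenant %s: last attempt failed %s ago",
				tenantID, time.Since(failedAt).Round(time.Second))
		}
		delete(g.failedInstances, tenantID)
	}

	inst, err := g.newInstance(tenantID)
	if err != nil {
		// Cache the failure so subsequent pushes back off instead of
		// re-running processor validation (and allocating) per message.
		g.failedInstances[tenantID] = time.Now()
		return nil, err
	}

	g.instances[tenantID] = inst
	return inst, nil
}

// newInstance is a placeholder for the real constructor, which runs processor
// validation and can fail.
func (g *Generator) newInstance(tenantID string) (*instance, error) {
	return &instance{}, nil
}
```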

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@carles-grafana carles-grafana force-pushed the fix-generator-oom branch 2 times, most recently from 01f17aa to a4a5eef Compare January 7, 2026 15:39
@carles-grafana carles-grafana changed the title from "wip" to "generator: back off when instance fails to avoid resource exhaustion" Jan 7, 2026
@carles-grafana carles-grafana changed the title from "generator: back off when instance fails to avoid resource exhaustion" to "generator: back off when instance creation fails to avoid resource exhaustion" Jan 7, 2026
@carles-grafana carles-grafana marked this pull request as ready for review January 7, 2026 16:13
Comment thread on modules/generator/generator.go (Outdated)
NoGenerateMetricsContextKey = "no-generate-metrics"

// failureBackoff is the duration to wait before retrying failed tenant instance creation.
failureBackoff = 10 * time.Minute
Contributor

I think this means that after fixing the config issue, the operator must wait up to 10 minutes to know it's working. That seems like a long time; what do you think about something shorter, like 1 or 5 minutes? When we found this, the failure rate was tens to hundreds of times per second (every message received from the queue), so I think even a 1-minute backoff is a huge improvement and guarantees stability.

Or is there a way to respond when the configuration is reloaded, so we could clear failedInstances to make the fix quicker?

Contributor (Author)


When the processor is created successfully, the config is reloaded every 10 seconds: https://github.com/carles-grafana/tempo/blob/fix-generator-oom/modules/generator/instance.go#L144

So 1 minute for failed instances sounds good; changed.
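
For reference, the revised constant could read roughly as follows (a sketch; only the 1-minute value and the 10-second reload cadence come from this thread):

```go
// failureBackoff is the duration to wait before retrying failed tenant
// instance creation. One minute is enough to stop the per-message retry storm
// while keeping recovery after a config fix quick; successfully created
// instances already reload their config every 10 seconds.
failureBackoff = 1 * time.Minute
```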

Comment thread on modules/generator/generator.go (Outdated)
instancesMtx  sync.RWMutex
instances     map[string]*instance
failedTenants map[string]time.Time // tenant -> when creation last failed
Contributor

Although I like the word tenant the best, everything else in the generator is called instance. Rename to failedInstances?

Contributor (Author)

agree, changed
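
A hedged sketch of a test exercising the backoff, reusing the hypothetical Generator and getOrCreateInstance names from the earlier sketch (same package, plus the testing and time imports):

```go
func TestInstanceCreationBacksOffAfterFailure(t *testing.T) {
	g := &Generator{
		instances:       map[string]*instance{},
		failedInstances: map[string]time.Time{},
	}

	// Simulate a tenant whose last creation attempt failed just now.
	g.failedInstances["tenant-1"] = time.Now()

	// Within the backoff window, creation must not be retried.
	if _, err := g.getOrCreateInstance("tenant-1"); err == nil {
		t.Fatal("expected instance creation to be skipped during the backoff window")
	}

	// Once the backoff window has elapsed, creation is attempted again.
	g.failedInstances["tenant-1"] = time.Now().Add(-2 * failureBackoff)
	if _, err := g.getOrCreateInstance("tenant-1"); err != nil {
		t.Fatalf("expected instance creation to be retried after backoff, got: %v", err)
	}
}
```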

- When processor validation fails and instance creation fails, the generator retries creating the instance indefinitely, potentially causing OOM errors.

- This change caches failed instances and backs off to prevent the issue.
@carles-grafana carles-grafana merged commit b343b0f into grafana:main Jan 8, 2026
22 checks passed
