generator: back off when instance creation fails to avoid resource exhaustion #6142
NoGenerateMetricsContextKey = "no-generate-metrics"

// failureBackoff is the duration to wait before retrying failed tenant instance creation.
failureBackoff = 10 * time.Minute
I think this means that after fixing the config issue, the operator must wait up to 10 minutes to know it's working. That seems like a long time. What do you think about something shorter, like 1 or 5 minutes? When we found this, the failure rate was tens or hundreds of times per second (once for every message received from the queue), so I think even a 1-minute backoff is a huge improvement and guarantees stability.
Or is there a way to respond when the configuration is reloaded, so that we could clear failedInstances and make the fix take effect sooner?
When the processor is created successfully, the config is reloaded every 10 seconds: https://github.com/carles-grafana/tempo/blob/fix-generator-oom/modules/generator/instance.go#L144
So 1 minute for failed instances sounds good. Changed.
instances map[string]*instance
instancesMtx sync.RWMutex
instances map[string]*instance
failedTenants map[string]time.Time // tenant -> when creation last failed
Although I like the word tenant the best, everything else in the generator is called instance. Rename to failedInstances?
Agreed, changed.
- When a processor fails validation and instance creation fails, the generator retries the creation indefinitely, potentially causing OOM errors.
- This change caches failed instances and backs off to prevent the issue.
What this PR does:
When a processor fails validation and instance creation fails, the generator retries the creation indefinitely,
potentially causing OOM errors.
This change caches failed instances and backs off before retrying, which prevents the issue.
Which issue(s) this PR fixes:
Fixes #
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]