-
Notifications
You must be signed in to change notification settings - Fork 220
Read the Primary from the Cache on Reconcile Start #1246
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Can you detail under which circumstances, there would be such a high load that the SDK wouldn't process the primary soon enough? Slow dependent reconciliation? Load on the operator process or the Kubernetes API server? |
With the current default settings there does not needs to such high load. Basically now the fixed thread pool for executor has size of 5 (maybe double this at least for default?) : If the reconciliation takes long (calls for external APIs) then those 5 threads could be easily used up. The execution scope is created here before submit: https://github.com/java-operator-sdk/java-operator-sdk/blob/f15f948e5af18ac2e3642c739b3e309af5090187/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/processing/event/EventProcessor.java#L139-L139 So it's actual execution could be delayed. Also seen related issue with The double processing as mentioned is a problem, but that we can handle with an in memory state. So at the end this issue is an optimization (not about correctness), but might help actually in significant number of cases. |
Maybe we could add contention detection on the executor and issue a warning if processing takes longer than a given amount of time to let users know they might need to increase the thread pool size? |
I think we should expose some metrics about it. Contention detection could work too. But maybe an better approach is to delegate it to tools like Prometheus. So there is history also about that. Since having some temporal spikes is could be normal, and not sure if an issues is just temporal the thread pool should be increased, probably not. So probably doing such decision is the best by observing history not a point in time. |
Yes, was thinking about using metrics for that in a first step, indeed. Coupled with alerting (for people who are interested in such signal) would work nicely enough, imo. |
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. |
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. |
This issue was closed because it has been stalled for 14 days with no activity. |
Uh oh!
There was an error while loading. Please reload this page.
Currently the primary resource is read from the cache before the resource is submitted to the executor before the reconciliation. If there is high load the actual reconciliation process might start significantly later. So when if primary resource is updated in between, we will process an older version of the primary while the new version is already in the cache. Reading the resource from the cache when the reconciliation actually starts will make to work on up to date primary resource (well that is known to the operator).
On the other hand this might lead to a double processing of the resource. Since the event from the primary change will mark the processor to make it reconcile again. We might however handle this situation, thus not marking the resource to reconcile if a resource with a same version is already submitted for reconciliation.
The text was updated successfully, but these errors were encountered: