Read the Primary from the Cache on Reconcile Start #1246

Closed
csviri opened this issue May 27, 2022 · 8 comments · Fixed by #1640

Comments

@csviri
Collaborator

csviri commented May 27, 2022

Currently the primary resource is read from the cache before it is submitted to the executor, i.e. before the reconciliation. Under high load the actual reconciliation might start significantly later, so if the primary resource is updated in between, we process an older version of the primary while the newer version is already in the cache. Reading the resource from the cache when the reconciliation actually starts would make us work on the most up-to-date primary resource (at least, the most up-to-date one known to the operator).
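
To illustrate the idea with a minimal, self-contained sketch (the `primaryCache`, `Resource` and executor here are hypothetical stand-ins, not the actual JOSDK types):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ReadOnStartSketch {

  record Resource(String uid, String resourceVersion) {}

  // stands in for the informer-backed cache of primary resources
  private final Map<String, Resource> primaryCache = new ConcurrentHashMap<>();
  private final ExecutorService executor = Executors.newFixedThreadPool(5);

  // current behaviour: the primary is captured at submit time and may already be
  // stale by the time a worker thread picks the task up
  void submitWithSnapshot(String uid) {
    Resource snapshot = primaryCache.get(uid);
    executor.submit(() -> reconcile(snapshot));
  }

  // proposed behaviour: only the key is submitted; the primary is read from the
  // cache when the reconciliation actually starts
  void submitWithLateRead(String uid) {
    executor.submit(() -> {
      Resource fresh = primaryCache.get(uid);
      if (fresh != null) {
        reconcile(fresh);
      }
    });
  }

  void reconcile(Resource primary) {
    // actual reconciliation logic
  }
}
```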

On the other hand this might lead to double processing of the resource, since the event from the primary change will mark the resource for another reconciliation. We could handle this situation by not marking the resource for reconciliation if a resource with the same version has already been submitted for reconciliation.
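
A rough sketch of that guard (names are hypothetical, this is not the real event-marking code): an incoming primary event does not mark the resource for another reconciliation if the same resourceVersion has already been submitted.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class SubmittedVersionGuard {

  // uid -> resourceVersion currently submitted or under execution
  private final Map<String, String> submittedVersions = new ConcurrentHashMap<>();

  /** Record the version that is handed to the executor. */
  void markSubmitted(String uid, String resourceVersion) {
    submittedVersions.put(uid, resourceVersion);
  }

  /** For an incoming primary event: true if another reconciliation should be marked. */
  boolean shouldMarkForReconciliation(String uid, String eventResourceVersion) {
    // skip re-marking when the event only carries the version we are already processing
    return !eventResourceVersion.equals(submittedVersions.get(uid));
  }

  /** Called when the reconciliation has finished. */
  void clearSubmitted(String uid) {
    submittedVersions.remove(uid);
  }
}
```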

@csviri csviri changed the title From the situation in the logs now it seems that when the reconciliation process starts, usually the up-to-date event/resource is already arrived into the cache, now the resource is submitted to the executor, but we could actually read the resource from the cache when the executor starts processing the resource. In most of the cases (not just for cleanup) this likely will lead to less stale requests. Read the Primary from the Cache on Reconcile Start May 27, 2022
@csviri csviri added the kind/feature Categorizes issue or PR as related to a new feature. label May 27, 2022
@csviri csviri self-assigned this May 27, 2022
@metacosm
Collaborator

Can you detail under which circumstances there would be such high load that the SDK wouldn't process the primary soon enough? Slow dependent reconciliation? Load on the operator process or the Kubernetes API server?

@csviri
Collaborator Author

csviri commented May 31, 2022

With the current default settings it doesn't take that much load. The fixed thread pool for the executor currently has a size of 5 (maybe we should at least double this default?):
https://github.com/java-operator-sdk/java-operator-sdk/blob/f15f948e5af18ac2e3642c739b3e309af5090187/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/api/config/ConfigurationService.java#L83-L83

If reconciliations take long (e.g. calls to external APIs), those 5 threads can easily be used up. The execution scope is created here, before submit: https://github.com/java-operator-sdk/java-operator-sdk/blob/f15f948e5af18ac2e3642c739b3e309af5090187/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/processing/event/EventProcessor.java#L139-L139

So its actual execution could be delayed.
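
A plain-JDK illustration of how quickly a fixed pool of 5 saturates when reconciliations block on external calls (the numbers are made up; only the submit-to-start delay matters here):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PoolSaturationDemo {

  public static void main(String[] args) throws InterruptedException {
    ExecutorService executor = Executors.newFixedThreadPool(5); // current default size

    for (int i = 0; i < 20; i++) {
      long submittedAt = System.currentTimeMillis();
      int id = i;
      executor.submit(() -> {
        long waited = System.currentTimeMillis() - submittedAt;
        System.out.printf("task %d started after waiting %d ms in the queue%n", id, waited);
        sleep(2_000); // simulates a slow external API call inside reconcile()
      });
    }
    executor.shutdown();
    executor.awaitTermination(1, TimeUnit.MINUTES);
  }

  private static void sleep(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
```

Tasks beyond the first 5 wait in the queue for multiples of the reconciliation time, so anything captured at submit time can be well out of date when the task finally runs.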

We have also seen a related issue with patch: since we don't put the resource into the temporary cache after reconcile, and events arrived for the resource in the meantime (from event sources or the custom resource), the execution starts without the fresh resource. However, the Flink operator logs typically show that by the time the executor service actually runs the dispatcher, the resource event from the patch has already arrived and is in the cache.

The double processing mentioned above is a problem, but we can handle it with in-memory state.

So in the end this issue is an optimization (not about correctness), but it might actually help in a significant number of cases.

@metacosm
Collaborator

Maybe we could add contention detection on the executor and issue a warning if processing takes longer than a given amount of time to let users know they might need to increase the thread pool size?
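
Something along these lines could work as a first cut (a sketch only, nothing like this exists in the SDK today; the threshold and logging are placeholders):

```java
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.logging.Logger;

class ContentionWarningExecutor {

  private static final Logger log = Logger.getLogger(ContentionWarningExecutor.class.getName());

  private final ExecutorService delegate;
  private final Duration warnThreshold;

  ContentionWarningExecutor(ExecutorService delegate, Duration warnThreshold) {
    this.delegate = delegate;
    this.warnThreshold = warnThreshold;
  }

  void submit(Runnable task) {
    long submittedAt = System.nanoTime();
    delegate.submit(() -> {
      // measure how long the task sat in the queue waiting for a free thread
      Duration waited = Duration.ofNanos(System.nanoTime() - submittedAt);
      if (waited.compareTo(warnThreshold) > 0) {
        log.warning("Reconciliation waited " + waited.toMillis()
            + " ms for a free thread; consider increasing the reconciliation thread pool size");
      }
      task.run();
    });
  }
}
```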

@csviri
Collaborator Author

csviri commented May 31, 2022

I think we should expose some metrics about it. Contention detection could work too, but maybe a better approach is to delegate this to tools like Prometheus, so there is also a history of it. Temporary spikes can be normal, and if an issue is only temporary, the thread pool probably shouldn't be increased.

So such a decision is probably best made by observing history, not a single point in time.
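
For the metrics part, one possible first step, assuming Micrometer is on the classpath (the registry and naming here are illustrative, not an existing SDK integration):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.binder.jvm.ExecutorServiceMetrics;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

class ReconcilerExecutorMetrics {

  static ExecutorService instrumentedPool() {
    // in practice this would be a Prometheus-backed registry so history is kept
    MeterRegistry registry = new SimpleMeterRegistry();
    ExecutorService pool = Executors.newFixedThreadPool(5);
    // publishes pool metrics (queued, active, completed tasks) under the given name
    return ExecutorServiceMetrics.monitor(registry, pool, "reconciler-executor");
  }
}
```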

@metacosm
Collaborator

Yes, I was indeed thinking about using metrics for that as a first step. Coupled with alerting (for people interested in such a signal) that would work nicely enough, imo.

@github-actions

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 31, 2022
@csviri csviri removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 31, 2022
@github-actions

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 30, 2022
@github-actions

This issue was closed because it has been stalled for 14 days with no activity.

@github-actions github-actions bot closed this as not planned Oct 15, 2022
@csviri csviri reopened this Oct 15, 2022
@csviri csviri removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 15, 2022
@csviri csviri modified the milestones: 4.3, 4.2 Nov 9, 2022
@csviri csviri modified the milestones: 4.3, 4.2 Dec 1, 2022
@csviri csviri linked a pull request Dec 1, 2022 that will close this issue
@csviri csviri closed this as completed Dec 12, 2022