Read the Primary from the Cache on Reconcile Start #1246

Closed
csviri opened this issue May 27, 2022 · 8 comments · Fixed by #1640

Comments

@csviri
Collaborator

csviri commented May 27, 2022

Currently the primary resource is read from the cache before it is submitted to the executor, i.e. before the reconciliation. Under high load the actual reconciliation might start significantly later, so if the primary resource is updated in between, we process an older version of the primary while the newer version is already in the cache. Reading the resource from the cache when the reconciliation actually starts would make us work on the most up-to-date primary resource (at least, the most up-to-date one known to the operator).
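
To illustrate the idea with a minimal, self-contained sketch (the `primaryCache`, `Resource` and executor here are hypothetical stand-ins, not the actual JOSDK types):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ReadOnStartSketch {

  record Resource(String uid, String resourceVersion) {}

  // stands in for the informer-backed cache of primary resources
  private final Map<String, Resource> primaryCache = new ConcurrentHashMap<>();
  private final ExecutorService executor = Executors.newFixedThreadPool(5);

  // current behaviour: the primary is captured at submit time and may already be
  // stale by the time a worker thread picks the task up
  void submitWithSnapshot(String uid) {
    Resource snapshot = primaryCache.get(uid);
    executor.submit(() -> reconcile(snapshot));
  }

  // proposed behaviour: only the key is submitted; the primary is read from the
  // cache when the reconciliation actually starts
  void submitWithLateRead(String uid) {
    executor.submit(() -> {
      Resource fresh = primaryCache.get(uid);
      if (fresh != null) {
        reconcile(fresh);
      }
    });
  }

  void reconcile(Resource primary) {
    // actual reconciliation logic
  }
}
```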

On the other hand this might lead to double processing of the resource, since the event from the primary change will mark the resource for another reconciliation. We could handle this situation by not marking the resource for reconciliation if a resource with the same version has already been submitted for reconciliation.
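
A rough sketch of that guard (names are hypothetical, this is not the real event-marking code): an incoming primary event does not mark the resource for another reconciliation if the same resourceVersion has already been submitted.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class SubmittedVersionGuard {

  // uid -> resourceVersion currently submitted or under execution
  private final Map<String, String> submittedVersions = new ConcurrentHashMap<>();

  /** Record the version that is handed to the executor. */
  void markSubmitted(String uid, String resourceVersion) {
    submittedVersions.put(uid, resourceVersion);
  }

  /** For an incoming primary event: true if another reconciliation should be marked. */
  boolean shouldMarkForReconciliation(String uid, String eventResourceVersion) {
    // skip re-marking when the event only carries the version we are already processing
    return !eventResourceVersion.equals(submittedVersions.get(uid));
  }

  /** Called when the reconciliation has finished. */
  void clearSubmitted(String uid) {
    submittedVersions.remove(uid);
  }
}
```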

@csviri csviri changed the title From the situation in the logs now it seems that when the reconciliation process starts, usually the up-to-date event/resource is already arrived into the cache, now the resource is submitted to the executor, but we could actually read the resource from the cache when the executor starts processing the resource. In most of the cases (not just for cleanup) this likely will lead to less stale requests. Read the Primary from the Cache on Reconcile Start May 27, 2022
@csviri csviri added the kind/feature Categorizes issue or PR as related to a new feature. label May 27, 2022
@csviri csviri self-assigned this May 27, 2022
@metacosm
Collaborator

Can you detail under which circumstances there would be such high load that the SDK wouldn't process the primary soon enough? Slow dependent reconciliation? Load on the operator process or the Kubernetes API server?

@csviri
Collaborator Author

csviri commented May 31, 2022

With the current default settings it doesn't take that much load. The fixed thread pool for the executor currently has a size of 5 (maybe we should at least double this default?):
https://github.com/java-operator-sdk/java-operator-sdk/blob/f15f948e5af18ac2e3642c739b3e309af5090187/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/api/config/ConfigurationService.java#L83-L83

If reconciliations take long (e.g. calls to external APIs), those 5 threads can easily be used up. The execution scope is created here, before submit: https://github.com/java-operator-sdk/java-operator-sdk/blob/f15f948e5af18ac2e3642c739b3e309af5090187/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/processing/event/EventProcessor.java#L139-L139

So its actual execution could be delayed.
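
A plain-JDK illustration of how quickly a fixed pool of 5 saturates when reconciliations block on external calls (the numbers are made up; only the submit-to-start delay matters here):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PoolSaturationDemo {

  public static void main(String[] args) throws InterruptedException {
    ExecutorService executor = Executors.newFixedThreadPool(5); // current default size

    for (int i = 0; i < 20; i++) {
      long submittedAt = System.currentTimeMillis();
      int id = i;
      executor.submit(() -> {
        long waited = System.currentTimeMillis() - submittedAt;
        System.out.printf("task %d started after waiting %d ms in the queue%n", id, waited);
        sleep(2_000); // simulates a slow external API call inside reconcile()
      });
    }
    executor.shutdown();
    executor.awaitTermination(1, TimeUnit.MINUTES);
  }

  private static void sleep(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
```

Tasks beyond the first 5 wait in the queue for multiples of the reconciliation time, so anything captured at submit time can be well out of date when the task finally runs.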

We have also seen a related issue with patch: since we don't put the resource into the temporary cache after reconcile, and events arrived for the resource in the meantime (from event sources or the custom resource), the execution starts without the fresh resource. However, the Flink operator logs typically show that by the time the executor service actually runs the dispatcher, the resource event from the patch has already arrived and is in the cache.

The double processing mentioned above is a problem, but we can handle it with in-memory state.

So in the end this issue is an optimization (not about correctness), but it might actually help in a significant number of cases.

@metacosm
Collaborator

Maybe we could add contention detection on the executor and issue a warning if processing takes longer than a given amount of time to let users know they might need to increase the thread pool size?
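
Something along these lines could work as a first cut (a sketch only, nothing like this exists in the SDK today; the threshold and logging are placeholders):

```java
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.logging.Logger;

class ContentionWarningExecutor {

  private static final Logger log = Logger.getLogger(ContentionWarningExecutor.class.getName());

  private final ExecutorService delegate;
  private final Duration warnThreshold;

  ContentionWarningExecutor(ExecutorService delegate, Duration warnThreshold) {
    this.delegate = delegate;
    this.warnThreshold = warnThreshold;
  }

  void submit(Runnable task) {
    long submittedAt = System.nanoTime();
    delegate.submit(() -> {
      // measure how long the task sat in the queue waiting for a free thread
      Duration waited = Duration.ofNanos(System.nanoTime() - submittedAt);
      if (waited.compareTo(warnThreshold) > 0) {
        log.warning("Reconciliation waited " + waited.toMillis()
            + " ms for a free thread; consider increasing the reconciliation thread pool size");
      }
      task.run();
    });
  }
}
```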

@csviri
Collaborator Author

csviri commented May 31, 2022

I think we should expose some metrics about it. Contention detection could work too, but maybe a better approach is to delegate this to tools like Prometheus, so there is also a history of it. Temporary spikes can be normal, and if an issue is only temporary, the thread pool probably shouldn't be increased.

So such a decision is probably best made by observing history, not a single point in time.
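
For the metrics part, one possible first step, assuming Micrometer is on the classpath (the registry and naming here are illustrative, not an existing SDK integration):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.binder.jvm.ExecutorServiceMetrics;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

class ReconcilerExecutorMetrics {

  static ExecutorService instrumentedPool() {
    // in practice this would be a Prometheus-backed registry so history is kept
    MeterRegistry registry = new SimpleMeterRegistry();
    ExecutorService pool = Executors.newFixedThreadPool(5);
    // publishes pool metrics (queued, active, completed tasks) under the given name
    return ExecutorServiceMetrics.monitor(registry, pool, "reconciler-executor");
  }
}
```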

@metacosm
Collaborator

Yes, I was indeed thinking about using metrics for that as a first step. Coupled with alerting (for people interested in such a signal) that would work nicely enough, imo.

@github-actions

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 31, 2022
@csviri csviri removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 31, 2022
@github-actions

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 30, 2022
@github-actions

This issue was closed because it has been stalled for 14 days with no activity.

@github-actions github-actions bot closed this as not planned Oct 15, 2022
@csviri csviri reopened this Oct 15, 2022
@csviri csviri removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 15, 2022
@csviri csviri modified the milestones: 4.3, 4.2 Nov 9, 2022
@csviri csviri modified the milestones: 4.3, 4.2 Dec 1, 2022
@csviri csviri linked a pull request Dec 1, 2022 that will close this issue
@csviri csviri closed this as completed Dec 12, 2022