InformerEventSource cannot find resource after some time #1723

Not planned
@gyfora

Description

Bug Report

What did you do?

We are using a simple label selector based informer in the Flink Kubernetes Operator: https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/utils/EventSourceUtils.java#L45
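For readers unfamiliar with label selectors, here is a minimal, self-contained sketch of how an equality-based selector decides which objects an informer would see. It is a toy model, not the JOSDK or fabric8 API: the class `ToyLabelSelector` and the label names used below are hypothetical, and only comma-separated `key=value` terms are handled.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy equality-based label selector; hypothetical, not a JOSDK or fabric8 class. */
final class ToyLabelSelector {
    // Every key/value pair here must be present on an object for it to match.
    private final Map<String, String> required = new HashMap<>();

    /** Parses a selector of the form "k1=v1,k2=v2" (the only syntax this sketch supports). */
    ToyLabelSelector(String selector) {
        for (String part : selector.split(",")) {
            String term = part.trim();
            if (term.isEmpty()) {
                continue; // tolerate empty terms such as a trailing comma
            }
            int eq = term.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Unsupported selector term: " + term);
            }
            required.put(term.substring(0, eq).trim(), term.substring(eq + 1).trim());
        }
    }

    /** Returns true only if the object carries every required label with the required value. */
    boolean matches(Map<String, String> labels) {
        return required.entrySet().stream()
                .allMatch(e -> e.getValue().equals(labels.get(e.getKey())));
    }
}
```

With a hypothetical selector such as `managed-by=example-operator`, an object labeled `managed-by=example-operator, app=demo` matches, while an object without the `managed-by` label does not.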

In some cases, after a while the informer could no longer find the target object (a Deployment), even though it definitely existed in Kubernetes (verified manually). Restarting the operator solved the problem.

Based on this we suspect that the informer simply stopped receiving new events after a while and never recovered.

Environment

JOSDK: 4.1.1
Java 11

Activity

csviri (Collaborator) commented on Jan 19, 2023

So if I understand correctly, the resource was found before; it is not the case that the informer never received it.

I checked, but this part of the code is very simple on our side: it basically just reads the resource from the informer cache.
I will still add some logging, to make sure the problem is not in JOSDK.

@manusa @shawkins have you encountered this problem before?

shawkins (Collaborator) commented on Jan 19, 2023

This was likely captured as fabric8io/kubernetes-client#4781 as well. We can work from the upstream side first, based on the comment over there.

csviri (Collaborator) commented on Jan 19, 2023

I discussed this with @gyfora before, and it seems to be a different issue. TBH I can't imagine how a resource is removed from the cache (ItemStore) without a delete event. But yes, let's continue on the fabric8 client side.

shawkins (Collaborator) commented on Jan 19, 2023

> TBH I can't imagine how a resource is removed from the cache (ItemStore) without a delete event. But yes, let's continue on the fabric8 client side.

It shouldn't be possible for it to have existed and then not exist without a delete event being emitted, at least at the informer level. The only circumstances in which an entry is removed are a delete event from the watch, and a relist (which should be rare in an environment where bookmarks are supported) in which the item no longer exists. Is it possible that the item was known / cached by the operator SDK and was never populated in the informer cache to begin with?
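The two removal paths described in this comment can be sketched with a toy cache. This is only an illustration under stated assumptions, not the fabric8 informer implementation: `ToyInformerCache` and all of its method names are hypothetical, resources are plain strings, and keys are assumed to be `namespace/name`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Toy model of an informer cache; hypothetical, not the fabric8 implementation. */
final class ToyInformerCache {
    // key ("namespace/name") -> cached resource (a plain String in this sketch)
    private final Map<String, String> store = new HashMap<>();
    // keys for which this toy cache emitted a delete notification
    private final List<String> deleteNotifications = new ArrayList<>();

    /** An ADDED or MODIFIED watch event puts or overwrites the entry. */
    void onAddOrUpdate(String key, String resource) {
        store.put(key, resource);
    }

    /** Removal path 1: a DELETED event from the watch. */
    void onWatchDelete(String key) {
        if (store.remove(key) != null) {
            deleteNotifications.add(key);
        }
    }

    /** Removal path 2: a relist in which a previously cached item no longer appears. */
    void onRelist(Map<String, String> listed) {
        Set<String> stale = new HashSet<>(store.keySet());
        stale.removeAll(listed.keySet());
        for (String key : stale) {
            store.remove(key);
            deleteNotifications.add(key); // still notified, even though no watch event arrived
        }
        store.putAll(listed);
    }

    Optional<String> get(String key) {
        return Optional.ofNullable(store.get(key));
    }

    List<String> deleteNotifications() {
        return List.copyOf(deleteNotifications);
    }
}
```

In this model there is no third way for an entry to leave `store`, which is the point of the comment: a lookup that used to succeed can only start failing after one of these two paths has run, and both leave a trace.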

csviri (Collaborator) commented on Jan 19, 2023

> Is it possible that the item was known / cached by the operator sdk and was never populated in the informer cache to begin with?

No, that is not possible. JOSDK reads resources from the informer cache. There is another layer, which maps between the primary custom resource and the secondary resource in this case (this is where I added logging). But if the resource was found before, that is also not possible.
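The two layers mentioned here (a primary-to-secondary mapping in front of the informer cache) can be sketched as follows. This is a simplified illustration, not JOSDK code: `ToySecondaryLookup`, its methods, and the `Result` values are all hypothetical names, and resources are plain strings. The `diagnose` method shows the kind of distinction that extra logging at the mapping layer could make visible.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy two-layer lookup: mapping index in front of a cache. Hypothetical, not JOSDK code. */
final class ToySecondaryLookup {
    enum Result { FOUND, NO_MAPPING, NOT_IN_CACHE }

    // Layer 1 (assumed): the informer cache, secondary key -> resource
    private final Map<String, String> informerCache = new HashMap<>();
    // Layer 2 (assumed): index from primary resource key -> keys of its secondary resources
    private final Map<String, Set<String>> primaryToSecondary = new HashMap<>();

    /** Records a secondary resource in both the cache and the mapping index. */
    void onSecondaryAdded(String primaryKey, String secondaryKey, String resource) {
        informerCache.put(secondaryKey, resource);
        primaryToSecondary.computeIfAbsent(primaryKey, k -> new HashSet<>()).add(secondaryKey);
    }

    /** Simulates an entry vanishing from the cache while the index still points at it. */
    void evictFromCacheOnly(String secondaryKey) {
        informerCache.remove(secondaryKey);
    }

    /** Reports which layer, if any, caused a lookup for the primary's secondary to fail. */
    Result diagnose(String primaryKey) {
        Set<String> keys = primaryToSecondary.get(primaryKey);
        if (keys == null || keys.isEmpty()) {
            return Result.NO_MAPPING; // the mapping layer never saw a secondary for this primary
        }
        for (String key : keys) {
            if (informerCache.containsKey(key)) {
                return Result.FOUND;
            }
        }
        return Result.NOT_IN_CACHE; // mapping exists, but the cache no longer holds the resource
    }
}
```

Logging the equivalent of `NO_MAPPING` versus `NOT_IN_CACHE` would indicate whether a missing secondary resource points at the mapping layer or at the underlying cache.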

reopened this on Jan 20, 2023
linked a pull request that will close this issue: "improvement: logging on informer stopping" #1726, on Jan 20, 2023
csviri (Collaborator) commented on Jan 24, 2023

I think we will need logs for this; I was not able to think of any scenario where this could happen.

reopened this on Jan 24, 2023
self-assigned this on Jan 27, 2023
github-actions commented on Apr 5, 2023

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days.

added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) on Apr 5, 2023

      InformerEventSource cannot find resource after some time · Issue #1723 · operator-framework/java-operator-sdk