Operator crashing after sometime #188

Closed
@SaikiranDaripelli

Description

Hi,
We have an operator written using this SDK, and the operator pod restarts every few hours with the exception below:

2020-09-16 07:51:39,953 i.f.k.c.d.i.WatchConnectionManager [DEBUG] Current reconnect backoff is 1000 milliseconds (T0)
2020-09-16 07:51:40,953 i.f.k.c.d.i.WatchConnectionManager [DEBUG] Connecting websocket ... io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager@71e4b308
2020-09-16 07:51:41,003 i.f.k.c.d.i.WatchConnectionManager [DEBUG] WebSocket successfully opened
2020-09-16 07:51:41,018 c.g.c.o.p.EventScheduler       [ERROR] Error:
io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 22472056 (22832853)
	at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:257)
	at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323)
	at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
	at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
	at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
	at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)

The code I am using is:

        KubernetesClient client = new DefaultKubernetesClient();
        Operator operator = new Operator(client);
        operator.registerController(new KafkaTopicController(client));

Am I using it wrong?

Activity

adam-sandor commented on Sep 16, 2020

Collaborator

Hi Saikiran, I don't think you're doing anything wrong. The error is happening at the fabric8 level. Let us get back to you ASAP with some ideas.

csviri commented on Sep 16, 2020

Collaborator

Hi @SaikiranDaripelli,
Thank you for the issue; you are using it correctly. Unfortunately this is a known issue, not in our code but in the Kubernetes client we use: fabric8io/kubernetes-client#1800. It is not handled yet. See also:
spring-cloud/spring-cloud-kubernetes#557
fabric8io/kubernetes-client#1318

In this case the restart is a simple workaround from our side, see:

https://github.com/ContainerSolutions/java-operator-sdk/blob/39107a309514a75f1c9fed745f7aa1de1bf4301c/operator-framework/src/main/java/com/github/containersolutions/operator/processing/EventScheduler.java#L142-L148

We will try to take a look at this soon.

csviri commented on Sep 16, 2020

Collaborator

@adam-sandor we could try to reconnect automatically from our side, but that should wait until the changes currently in progress are done.
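
For illustration, here is a minimal sketch of what reconnecting instead of exiting could look like. It is not the SDK's code and does not use the fabric8 API: `WatchSource` and `ReconnectingWatcher` are hypothetical names, and the assumption is that a closed watch (for example "too old resource version") surfaces as an exception from `open`.

```java
import java.util.function.Consumer;

// Hypothetical stand-in for a watch API (an assumption, not fabric8):
// open() streams events to the handler and throws if the watch is closed by the server.
interface WatchSource {
    void open(Consumer<String> onEvent) throws Exception;
}

class ReconnectingWatcher {
    // Re-opens the watch after a failure instead of exiting the process.
    // Returns the number of reconnects performed; gives up after maxReconnects.
    static int watch(WatchSource source, Consumer<String> onEvent,
                     int maxReconnects, long backoffMillis) {
        int reconnects = 0;
        while (true) {
            try {
                source.open(onEvent);
                return reconnects; // the watch ended normally
            } catch (Exception e) {
                if (reconnects >= maxReconnects) {
                    throw new IllegalStateException(
                            "watch failed after " + reconnects + " reconnects", e);
                }
                reconnects++;
                try {
                    Thread.sleep(backoffMillis); // fixed backoff before reconnecting
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while backing off", ie);
                }
            }
        }
    }
}
```

A real implementation would probably also need to list the resources again on reconnect to obtain a fresh resource version, since events may have been missed while the watch was down.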

adam-sandor commented on Sep 16, 2020

Collaborator

Yeah it would be great if we could do something about this. I guess many users of the KubernetesClient don't have this problem as they don't watch things for a long time, but an operator does that by definition.

SaikiranDaripelli commented on Sep 16, 2020

Author

Thanks for answering my query. I went through the fabric8 issue, and they seem to suggest handling it on the client end.
Retries would definitely help. With restarts, I see that every controller's createOrUpdate gets called after each restart even though the resource has not changed. Is that expected, and will it happen even with retries?
I am using the status sub-resource and storing the last successfully processed version in the status to avoid reprocessing. Is that the suggested way to work around duplicate events?
Will this improvement address it? https://github.com/ContainerSolutions/java-operator-sdk/issues/38

csviri commented on Sep 16, 2020

Collaborator

@SaikiranDaripelli in short, no. By default we check whether the generation has increased, and in this case it won't. (That check can be turned off, in which case the resource will be reprocessed, because we cannot know whether the event happened during an execution of the controller or not.) In this case we maintain the state (last processed generation) in memory.
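
To make the in-memory check concrete, here is a rough sketch under stated assumptions; it is not the SDK's actual implementation, and `GenerationFilter` and its method names are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: remembers the last processed generation per resource in memory.
class GenerationFilter {
    private final Map<String, Long> lastProcessed = new ConcurrentHashMap<>();

    // True if this generation is newer than the last one processed for the resource,
    // or if the resource has not been seen yet.
    boolean shouldProcess(String resourceUid, long generation) {
        Long last = lastProcessed.get(resourceUid);
        return last == null || generation > last;
    }

    // Records a successfully processed generation, never moving backwards.
    void markProcessed(String resourceUid, long generation) {
        lastProcessed.merge(resourceUid, generation, Long::max);
    }
}
```

Because the map lives in the process, a restart starts with an empty map and every resource is processed once more, which matches the behaviour reported above.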

The issue https://github.com/ContainerSolutions/java-operator-sdk/issues/38
is a tougher one: we cannot keep that state in memory, since the process gets restarted. It could be stored somewhere else, such as a ConfigMap or some data store. We don't plan to implement this in the short term, although we are open to any suggestions and/or contributions.

SaikiranDaripelli commented on Sep 16, 2020

Author

Thanks, then retries without a restart will solve my current issue.
Reprocessing would then happen only occasionally, on a restart, which is fine for my use case.

Regarding storing state, can't the SDK itself do what I am doing right now as a workaround? That is, store the last successfully processed generation in the status sub-resource after a successful controller execution, and discard the event if the current generation matches the one in the status sub-resource.
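
Roughly, the workaround looks like the sketch below. The class and field names (`TopicResource`, `TopicStatus`, `observedGeneration`) are made up for illustration and are not SDK or fabric8 types; writing the status back to the API server is left out.

```java
// Hypothetical minimal model of a custom resource; the field names are assumptions.
class TopicStatus {
    Long observedGeneration; // null until the first successful execution
}

class TopicResource {
    long generation; // stands in for metadata.generation
    TopicStatus status = new TopicStatus();
}

class StatusGenerationCheck {
    // True if the current generation was already handled successfully.
    static boolean alreadyProcessed(TopicResource r) {
        return r.status.observedGeneration != null
                && r.status.observedGeneration == r.generation;
    }

    // Runs the action unless the generation was already handled,
    // and records the generation in the status on success.
    static boolean reconcile(TopicResource r, Runnable action) {
        if (alreadyProcessed(r)) {
            return false; // duplicate event, discard
        }
        action.run();
        r.status.observedGeneration = r.generation;
        return true;
    }
}
```

Since the marker is stored on the resource itself, it survives an operator restart, unlike state kept in memory.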

csviri commented on Sep 16, 2020

Collaborator

@SaikiranDaripelli this could be done, though it would be nicer if we could do it transparently. With the approach you are suggesting, we should probably provide some interface for getting the latest processed generation from the resource (the name of the field can differ between users). So this is definitely one of the ways to go.

We will take a look after the current changes we are working on.

adam-sandor commented on Sep 16, 2020

Collaborator

How about putting that into an annotation?

PookiPok commented on Sep 23, 2020

Hi, I am encountering the same issue with the resource version not matching. @SaikiranDaripelli, can you please share how you solved this on your end? Is there any other solution for this?

SaikiranDaripelli commented on Sep 23, 2020

Author

@PookiPok Right now there is no way to stop the operator controller from restarting.

PookiPok commented on Sep 29, 2020

@SaikiranDaripelli - So is there any workaround for this for now?

csviri commented on Sep 29, 2020

Collaborator

@PookiPok @SaikiranDaripelli restarting the controller is basically the workaround (so it restarts, but at least the system does not stop working) :(

We can try to improve on this in the current version, but we are working on a big change now, and it will be easier to fix there.

PookiPok commented on Sep 29, 2020

Thank you, waiting for this fix in the next release.
Gil

ankeetj commented on Nov 10, 2020

@csviri

I'm facing a similar issue with my operator. Because of the restarts, the pod is ending up in crash loop status. Is there any update on the fix, or any workaround we can use?

csviri commented on Nov 10, 2020

Collaborator

@adam-sandor @charlottemach @kirek007 We should consider fixing this in the current version (before the event sources are released, since that might take a long time).
@ankeetj not at this moment. We will discuss it and might provide a patch sooner than planned.

linked a pull request that will close this issue: Event sources M1 #235, on Nov 25, 2020

Metadata

Labels: kind/bug (Categorizes issue or PR as related to a bug.)
Participants: @SaikiranDaripelli, @csviri, @ankeetj, @adam-sandor, @charlottemach

        Operator crashing after sometime · Issue #188 · operator-framework/java-operator-sdk