Closed
Description
Hi,
We have an operator written using this SDK, and the operator pod is restarting every few hours with the exception below:
2020-09-16 07:51:39,953 i.f.k.c.d.i.WatchConnectionManager [DEBUG] Current reconnect backoff is 1000 milliseconds (T0)
2020-09-16 07:51:40,953 i.f.k.c.d.i.WatchConnectionManager [DEBUG] Connecting websocket ... io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager@71e4b308
2020-09-16 07:51:41,003 i.f.k.c.d.i.WatchConnectionManager [DEBUG] WebSocket successfully opened
2020-09-16 07:51:41,018 c.g.c.o.p.EventScheduler [ERROR] Error:
io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 22472056 (22832853)
at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:257)
at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323)
at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
The code I am using is:
KubernetesClient client = new DefaultKubernetesClient();
Operator operator = new Operator(client);
operator.registerController(new KafkaTopicController(client));
Am I using it wrong?
Activity
adam-sandor commented on Sep 16, 2020
Hi Saikiran, I don't think you're doing anything wrong. The error is happening at the fabric8 level. Let us get back to you asap with some ideas.
csviri commented on Sep 16, 2020
Hi @SaikiranDaripelli,
Thank you for the issue; you are using it right. Unfortunately this is a known issue, not in our code but in the Kubernetes client we are using: fabric8io/kubernetes-client#1800 - it's not handled yet. See also
spring-cloud/spring-cloud-kubernetes#557
fabric8io/kubernetes-client#1318
In this case the restart is a simple workaround from our side, see:
https://github.com/ContainerSolutions/java-operator-sdk/blob/39107a309514a75f1c9fed745f7aa1de1bf4301c/operator-framework/src/main/java/com/github/containersolutions/operator/processing/EventScheduler.java#L142-L148
We will try to take a look at this soon.
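Just for illustration, a minimal sketch of what this restart-on-watch-error workaround looks like against the fabric8 4.x Watcher API (not the actual EventScheduler code; the class name is made up):

import io.fabric8.kubernetes.api.model.HasMetadata;
import io.fabric8.kubernetes.client.KubernetesClientException;
import io.fabric8.kubernetes.client.Watcher;

public class RestartingWatcher implements Watcher<HasMetadata> {

    @Override
    public void eventReceived(Action action, HasMetadata resource) {
        // dispatch the event to the registered controller as usual
    }

    @Override
    public void onClose(KubernetesClientException cause) {
        if (cause != null) {
            // The watch was closed abnormally, e.g. "too old resource version" (HTTP 410).
            // Exiting lets Kubernetes restart the pod, which re-establishes the watch
            // with a fresh resourceVersion.
            System.exit(1);
        }
    }
}

Reconnecting in place instead of exiting would of course be the nicer long-term fix.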
csviri commented on Sep 16, 2020
@adam-sandor we could try to reconnect automatically from our side, but that should be done after the current changes in progress are finished.
adam-sandor commented on Sep 16, 2020
Yeah it would be great if we could do something about this. I guess many users of the KubernetesClient don't have this problem as they don't watch things for a long time, but an operator does that by definition.
SaikiranDaripelli commented on Sep 16, 2020
Thanks for answering my query. I went through the fabric8 issue and they seem to suggest handling it on the client end.
Retries would definitely help. With restarts I am seeing that every controller's createOrUpdate gets called after a restart even though there is no change to the resource; is that expected, and will it also happen with retries?
I am using the status sub-resource and storing the version of the last successfully processed resource inside the status to avoid reprocessing. Is that the suggested way to work around duplicate events?
Will this improvement address it? https://github.com/ContainerSolutions/java-operator-sdk/issues/38
csviri commented on Sep 16, 2020
@SaikiranDaripelli in short, no, because by default we check whether the generation increased, and in this case it won't increase. (This check can be turned off, in which case it will reprocess, because we cannot know whether the event happened during an execution of the controller or not.) For this check we maintain the state (the last processed generation) in memory.
The issue https://github.com/ContainerSolutions/java-operator-sdk/issues/38
is a tougher one: we cannot keep that state in memory, since the process gets restarted. It would have to be stored somewhere else, like a ConfigMap or some data store. We don't plan to implement this issue in the short term, although we are open to any suggestions and/or contributions.
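For clarity, a simplified sketch of that in-memory generation check (not the actual SDK code):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import io.fabric8.kubernetes.api.model.HasMetadata;

public class GenerationCache {

    // last processed metadata.generation per resource UID, kept only in memory
    private final Map<String, Long> lastProcessedGeneration = new ConcurrentHashMap<>();

    // returns true if the event carries a generation we have not processed yet
    public boolean isNewGeneration(HasMetadata resource) {
        Long generation = resource.getMetadata().getGeneration();
        if (generation == null) {
            return true; // no generation information, process to be safe
        }
        Long last = lastProcessedGeneration.get(resource.getMetadata().getUid());
        return last == null || generation > last;
    }

    // called after a successful controller execution
    public void markProcessed(HasMetadata resource) {
        Long generation = resource.getMetadata().getGeneration();
        if (generation != null) {
            lastProcessedGeneration.put(resource.getMetadata().getUid(), generation);
        }
    }
}

After a pod restart this map is empty, which is why every custom resource gets reprocessed once.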
SaikiranDaripelli commented on Sep 16, 2020
Thanks, then retries without a restart will solve my current issue,
with occasional reprocessing only on restart, which is fine for my use case.
Regarding storing state: couldn't the SDK itself do what I am doing right now as a workaround, i.e. store the last successfully processed generation in the status sub-resource upon successful controller execution, and discard an event if its generation matches the one in the status sub-resource?
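Something like this simplified sketch (KafkaTopic and KafkaTopicStatus stand in for my real CRD classes, and lastProcessedGeneration is a status field I added myself):

class KafkaTopicStatus {
    private Long lastProcessedGeneration;
    public Long getLastProcessedGeneration() { return lastProcessedGeneration; }
    public void setLastProcessedGeneration(Long generation) { this.lastProcessedGeneration = generation; }
}

class KafkaTopic extends io.fabric8.kubernetes.client.CustomResource {
    private KafkaTopicStatus status;
    public KafkaTopicStatus getStatus() { return status; }
    public void setStatus(KafkaTopicStatus status) { this.status = status; }
}

class KafkaTopicReconciler {

    // called from the controller's createOrUpdateResource
    KafkaTopic reconcile(KafkaTopic topic) {
        Long generation = topic.getMetadata().getGeneration();
        KafkaTopicStatus status = topic.getStatus() != null ? topic.getStatus() : new KafkaTopicStatus();

        // discard the event if this generation was already processed successfully
        if (generation != null && generation.equals(status.getLastProcessedGeneration())) {
            return topic;
        }

        // ... actual reconciliation of the Kafka topic goes here ...

        // record the processed generation in the status sub-resource so that
        // a restarted operator can also skip it
        status.setLastProcessedGeneration(generation);
        topic.setStatus(status);
        return topic;
    }
}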
csviri commented on Sep 16, 2020
@SaikiranDaripelli this could be done; it would be nicer if we could do it transparently. For the approach you are suggesting we should probably provide some interface for getting the latest processed generation from the resource (the name of the field can be different for different users). So this is definitely one of the ways to go.
We will take a look after the current changes we are working on.
adam-sandor commented on Sep 16, 2020
How about putting that into an annotation?
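For example, roughly like this (the annotation key here is just made up for illustration):

import java.util.HashMap;
import java.util.Map;

import io.fabric8.kubernetes.api.model.HasMetadata;

public class GenerationAnnotation {

    private static final String LAST_GENERATION = "operator.example.com/last-processed-generation";

    // true if the resource's current generation matches the one recorded in the annotation
    public static boolean alreadyProcessed(HasMetadata resource) {
        Map<String, String> annotations = resource.getMetadata().getAnnotations();
        Long generation = resource.getMetadata().getGeneration();
        if (annotations == null || generation == null) {
            return false;
        }
        return String.valueOf(generation).equals(annotations.get(LAST_GENERATION));
    }

    // record the processed generation; the caller still has to update the resource on the cluster
    public static void markProcessed(HasMetadata resource) {
        Long generation = resource.getMetadata().getGeneration();
        if (generation == null) {
            return; // nothing to record for resources without a generation
        }
        Map<String, String> annotations = resource.getMetadata().getAnnotations();
        if (annotations == null) {
            annotations = new HashMap<>();
            resource.getMetadata().setAnnotations(annotations);
        }
        annotations.put(LAST_GENERATION, String.valueOf(generation));
    }
}

That would keep the SDK independent of how each user's status class looks, since metadata.annotations is available on every resource.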
PookiPok commented on Sep 23, 2020
Hi, I am encountering the same issue with the resource version not matching. @SaikiranDaripelli - can you please share how you solved this issue on your end? Is there any other solution for this?
SaikiranDaripelli commented on Sep 23, 2020
@PookiPok Right now there is no way to stop the operator controller from restarting.
PookiPok commented on Sep 29, 2020
@SaikiranDaripelli - So is there any workaround for this for now?
csviri commented on Sep 29, 2020
@PookiPok @SaikiranDaripelli restarting the controller is basically the workaround (it restarts, but at least the system does not stop working) :(
We can try to improve on this in the current version, but we are working on a big change now, where it will be easier to fix.
PookiPok commented on Sep 29, 2020
Thank you, waiting for this fix in the next release.
Gil
ankeetj commented on Nov 10, 2020
@csviri
I'm facing a similar issue with my operator. Because of the restarts the pod is ending up in a crash loop status. Is there any update on the fix, or any workaround we can use?
csviri commented on Nov 10, 2020
@adam-sandor @charlottemach @kirek007 We should consider fixing this in the current version (before the event sources are released, since that might take a long time).
@ankeetj not at this moment; we will discuss it and might provide a patch sooner than planned.