After updating, the operator doesn't seem to reconnect properly on Kubernetes clusters #87

Closed
@snowdrop-bot

Description

Bug Report

We have been using the Operator SDK in realistic, production-like scenarios, with more than 200,000 secrets on a single Kubernetes cluster.
In that scenario we noticed that the operator's connection can be unstable and drop very often.

@secondsun implemented a fix that restarts the operator when the connection is dropped, but in recent versions this code does not appear to be triggered, because the underlying watcher keeps the connection open. The problem is that the watchers sit idle and unresponsive: the Java Operator SDK operators keep running but do not respond properly to any requests.
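To illustrate the failure mode, here is a minimal sketch of a liveness check that would flag a watch that is still "connected" but has gone silent. It uses only the JDK. `WatchLivenessMonitor`, its methods, and the timeout are hypothetical names for illustration and are not part of the Java Operator SDK or the fabric8 client. Silence alone does not prove the watch is broken, since a quiet cluster produces no events, so treat a stale result as a reason to re-list or restart, not as proof of a bug.

```java
import java.time.Duration;
import java.util.function.LongSupplier;

/**
 * Hypothetical helper: remembers when a watch last delivered an event
 * and reports it as stale once a timeout has passed without one.
 */
final class WatchLivenessMonitor {
    private final long timeoutNanos;
    private final LongSupplier nanoClock; // injectable so the check can be tested
    private volatile long lastEventNanos;

    WatchLivenessMonitor(Duration timeout, LongSupplier nanoClock) {
        this.timeoutNanos = timeout.toNanos();
        this.nanoClock = nanoClock;
        this.lastEventNanos = nanoClock.getAsLong();
    }

    /** Call this from the watcher's event callback for every event received. */
    void recordEvent() {
        lastEventNanos = nanoClock.getAsLong();
    }

    /** True when no event has been recorded for longer than the timeout. */
    boolean isStale() {
        return nanoClock.getAsLong() - lastEventNanos > timeoutNanos;
    }
}
```

In an operator this would be constructed with `System::nanoTime` as the clock and polled from a scheduled task; when `isStale()` returns true, the task could recreate the watch or restart the operator.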

This is quite challenging with the Java Operator SDK: we have seen "data loss". Golang-based operators work better on such clusters, mainly because their architecture checks for the CRs in an event loop rather than relying on the watch alone.
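The idea of checking the CRs in a loop can be sketched as a periodic re-list that is compared against what the watch already delivered. The following is an assumption-laden illustration using only the JDK; it is not the Go controllers' actual implementation, nor an existing Java Operator SDK API. `ResyncTracker` and its methods are hypothetical, and the listing is modeled as a plain map of resource name to `resourceVersion`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical helper: finds resources whose changes the watch never
 * delivered by comparing a full listing with the versions already processed.
 */
final class ResyncTracker {
    // resource name -> last resourceVersion handed to the reconciler
    private final Map<String, String> seen = new HashMap<>();

    /** Record a version that arrived through the watch and was processed. */
    void markProcessed(String name, String resourceVersion) {
        seen.put(name, resourceVersion);
    }

    /**
     * Compare a fresh listing (name -> resourceVersion) with the processed state.
     * Returns the sorted names that are new or changed, and forgets resources
     * that no longer exist. Returned names are assumed to be reconciled by the caller.
     */
    List<String> resync(Map<String, String> listed) {
        List<String> missed = new ArrayList<>();
        for (Map.Entry<String, String> e : listed.entrySet()) {
            if (!e.getValue().equals(seen.get(e.getKey()))) {
                missed.add(e.getKey());
            }
        }
        seen.keySet().retainAll(listed.keySet()); // drop deleted resources
        seen.putAll(listed);                      // listing is now the baseline
        Collections.sort(missed);
        return missed;
    }
}
```

Run on a fixed interval, a loop like this bounds how long a missed event can go unnoticed, at the cost of a full list call per interval, which is not free on a cluster of this size.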

According to @secondsun:

https://github.com/java-operator-sdk/java-operator-sdk/blob/0cc051237f1639b9a419f9b0beaf3d1c8cb0e31d/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/processing/event/internal/CustomResourceEventSource.java#L68 is a candidate for the bug. It might be that namespaces aren't being watched properly.

Logs:
https://gist.github.com/secondsun/7abd69a12e5a393841c0edd8156dcc1d

You can see the difference in versions in the PR that downgrades them:
https://github.com/redhat-developer/app-services-operator/pull/288/files


operator-framework#657

