Bug Report
Hi all and thanks for the amazing project!
I was looking at real-world edge cases where the functionality of the operator gets compromised because the Informers crash in the background.
A little playing with RBAC resources while an operator is running renders it completely unresponsive to any CR event.
What did you do?
1. Start a minikube cluster
2. kubectl apply -f sample-operators/tomcat-operator/k8s/tomcat-sample1.yaml
3. kubectl delete serviceaccount/tomcat-operator -n tomcat-operator

Now the operator becomes completely unresponsive:
- it doesn't react to changes on the existing test-tomcat1 CR
- it doesn't react to the creation of a new CR, e.g. kubectl apply -f sample-operators/tomcat-operator/k8s/tomcat-sample2.yaml
What did you expect to see?
The operator pod should (probably) restart in case it loses access to the API, in order to be able to restore the communication.
Alternatively, the situation should be handled and the connection of the SharedInformers somehow restored.
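For illustration, a minimal sketch of the "restart the pod" idea (the informersHealthy flag and whatever flips it are made up here, nothing like this exists today): the operator could expose the informer state on a small health endpoint, and a standard Kubernetes livenessProbe pointed at it would make the kubelet restart the pod once an informer dies.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.util.concurrent.atomic.AtomicBoolean;

public class OperatorHealthEndpoint {

    // Hypothetical flag: whatever detects a dead/stopped informer would flip this to false.
    static final AtomicBoolean informersHealthy = new AtomicBoolean(true);

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/healthz", exchange -> {
            boolean healthy = informersHealthy.get();
            byte[] body = (healthy ? "ok" : "informers down").getBytes();
            // A non-2xx answer makes the livenessProbe fail, so the kubelet restarts the pod,
            // which re-creates the SharedInformers and restores the watches.
            exchange.sendResponseHeaders(healthy ? 200 : 500, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }
}
```

The operator's Deployment would then simply declare a livenessProbe against /healthz on port 8080.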
What did you see instead? Under which circumstances?
The operator remains unresponsive but alive.
Environment
Kubernetes cluster type: minikube

$ Mention java-operator-sdk version from pom.xml file
main

$ java -version
openjdk version "11.0.15" 2022-04-19
OpenJDK Runtime Environment Temurin-11.0.15+10 (build 11.0.15+10)
OpenJDK 64-Bit Server VM Temurin-11.0.15+10 (build 11.0.15+10, mixed mode)

$ kubectl version
Possible Solution
The best would be to have a callback endpoint in the Controller that gets called if an error happens with the SharedInformers, so that the user can decide what to do.
At a very minimum, in this specific situation, I do believe that crashing the Operator is the correct behavior, but it would be nice to have a more generic mechanism for handling SharedInformer failures that currently happen in the background.
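Roughly something along these lines (just a sketch; the interface name and signature are invented and not part of the SDK):

```java
// Hypothetical opt-in hook: a Controller/Reconciler could implement this to be
// notified when one of its SharedInformers stops working (e.g. watch forbidden
// after an RBAC change). Name and signature are made up for illustration.
public interface InformerErrorHandler {

    /** Called when the informer watching resourceClass fails and cannot recover. */
    void onInformerError(Class<?> resourceClass, Throwable error);
}

// Example: a user who prefers fail-fast behavior could simply crash the process,
// so the pod gets restarted and the watches are re-established from scratch.
class FailFastHandler implements InformerErrorHandler {

    @Override
    public void onInformerError(Class<?> resourceClass, Throwable error) {
        System.err.printf("Informer for %s failed: %s%n", resourceClass.getName(), error);
        System.exit(1);
    }
}
```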
Additional context
During my test, I verified that the communication with the API server gets restored if the API server becomes temporarily unavailable; that's great work 👍