From d8aaadc625b1cc0805e90ed95d0e476e73b3f10d Mon Sep 17 00:00:00 2001
From: csviri <csviri@gmail.com>
Date: Fri, 10 Mar 2023 16:13:17 +0100
Subject: [PATCH 1/2] docs: improvements on reschedule and retry behavior

---
 docs/documentation/features.md | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/docs/documentation/features.md b/docs/documentation/features.md
index 1b7c91c2d1..203d6dd13c 100644
--- a/docs/documentation/features.md
+++ b/docs/documentation/features.md
@@ -368,7 +368,15 @@ these features:
 
 1. A successful execution resets a retry and the rescheduled executions which were present
    before the reconciliation. However, a new rescheduling can be instructed from the reconciliation
-   outcome (`UpdateControl` or `DeleteControl`).
+   outcome (`UpdateControl` or `DeleteControl`). 
+
+   For example if there was an execution scheduled in 5 minutes, but an event triggered the
+   reconciliation (or cleanup) the scheduled execution is automatically cancelled, but it
+   can be of course scheduled again at the end of the reconciliation.
+
+   Similarly, if there was a retry scheduled, but an event received (that triggers the execution, see next point)
+   which results in a successful execution the retry is cancelled.
+
 2. In case an exception happened, a retry is initiated. However, if an event is received
    meanwhile, it will be reconciled instantly, and this execution won't count as a retry attempt.
 3. If the retry limit is reached (so no more automatic retry would happen), but a new event

From 77af8755f44c5ba1688fb7717425e4afec89c519 Mon Sep 17 00:00:00 2001
From: Chris Laprun <metacosm@gmail.com>
Date: Fri, 10 Mar 2023 17:42:41 +0100
Subject: [PATCH 2/2] docs: improve wording, re-format

---
 docs/documentation/features.md | 61 +++++++++++++++++++---------------
 1 file changed, 35 insertions(+), 26 deletions(-)

diff --git a/docs/documentation/features.md b/docs/documentation/features.md
index 203d6dd13c..ac4f8dfc69 100644
--- a/docs/documentation/features.md
+++ b/docs/documentation/features.md
@@ -368,14 +368,16 @@ these features:
 
 1. A successful execution resets a retry and the rescheduled executions which were present
    before the reconciliation. However, a new rescheduling can be instructed from the reconciliation
-   outcome (`UpdateControl` or `DeleteControl`). 
-
-   For example if there was an execution scheduled in 5 minutes, but an event triggered the
-   reconciliation (or cleanup) the scheduled execution is automatically cancelled, but it
-   can be of course scheduled again at the end of the reconciliation.
+   outcome (`UpdateControl` or `DeleteControl`).
 
-   Similarly, if there was a retry scheduled, but an event received (that triggers the execution, see next point)
-   which results in a successful execution the retry is cancelled.
+   For example, if a reconciliation had previously been re-scheduled after some amount of time, but an event triggered
+   the reconciliation (or cleanup) in the meantime, the scheduled execution would be automatically cancelled. In other
+   words, re-scheduling a reconciliation does not guarantee that one will occur exactly at that time; it guarantees
+   that one will occur at that time at the latest, triggering one only if no event from the cluster triggered one first.
+   Of course, it's always possible to re-schedule a new reconciliation at the end of that "automatic" reconciliation.
+
+   Similarly, if a retry was scheduled, any event from the cluster triggering a successful execution in the meantime
+   would cancel the scheduled retry (because there's now no point in retrying something that already succeeded).
 2. In case an exception happened, a retry is initiated. However, if an event is received
    meanwhile, it will be reconciled instantly, and this execution won't count as a retry attempt.
 
@@ -384,6 +386,12 @@ these features:
    marked as the last attempt in the retry info. The point (1) still holds, but in case of an
    error, no retry will happen.
 
+The thing to keep in mind when it comes to retrying or rescheduling is that JOSDK tries to avoid unnecessary work.
+When you reschedule an operation, you instruct JOSDK to perform that operation at the latest by the end of the
+rescheduling delay. If something occurred on the cluster that triggered that particular operation (reconciliation or
+cleanup) in the meantime, JOSDK considers that there's no point in attempting that operation again at the end of the
+specified delay. The same idea applies to retries.
+
 ## Rate Limiting
 
 It is possible to rate limit reconciliation on a per-resource basis. The rate limit also takes
@@ -619,15 +627,15 @@ Logging is enhanced with additional contextual information using
 [MDC](http://www.slf4j.org/manual.html#mdc).
 The following attributes are available in most parts of reconciliation logic and during the
 execution of the controller:
 
-| MDC Key | Value added from primary resource |
-| :--- |:----------------------------------|
-| `resource.apiVersion` | `.apiVersion` |
-| `resource.kind` | `.kind` |
-| `resource.name` | `.metadata.name` |
-| `resource.namespace` | `.metadata.namespace` |
-| `resource.resourceVersion` | `.metadata.resourceVersion` |
-| `resource.generation` | `.metadata.generation` |
-| `resource.uid` | `.metadata.uid` |
+| MDC Key                    | Value added from primary resource |
+|:---------------------------|:----------------------------------|
+| `resource.apiVersion`      | `.apiVersion`                     |
+| `resource.kind`            | `.kind`                           |
+| `resource.name`            | `.metadata.name`                  |
+| `resource.namespace`       | `.metadata.namespace`             |
+| `resource.resourceVersion` | `.metadata.resourceVersion`       |
+| `resource.generation`      | `.metadata.generation`            |
+| `resource.uid`             | `.metadata.uid`                   |
 
 For more information about MDC see this [link](https://www.baeldung.com/mdc-in-log4j-2-logback).
 
@@ -696,27 +704,28 @@ for this feature.
 
 ## Leader Election
 
-Operators are generally deployed with a single running or active instance. However, it is 
-possible to deploy multiple instances in such a way that only one, called the "leader", processes the 
-events. This is achieved via a mechanism called "leader election". While all the instances are 
-running, and even start their event sources to populate the caches, only the leader will process 
-the events. This means that should the leader change for any reason, for example because it 
-crashed, the other instances are already warmed up and ready to pick up where the previous 
+Operators are generally deployed with a single running or active instance. However, it is
+possible to deploy multiple instances in such a way that only one, called the "leader", processes the
+events. This is achieved via a mechanism called "leader election". While all the instances are
+running, and even start their event sources to populate the caches, only the leader will process
+the events. This means that should the leader change for any reason, for example because it
+crashed, the other instances are already warmed up and ready to pick up where the previous
 leader left off should one of them become elected leader.
 
-See sample configuration in the [E2E test](https://github.com/java-operator-sdk/java-operator-sdk/blob/8865302ac0346ee31f2d7b348997ec2913d5922b/sample-operators/leader-election/src/main/java/io/javaoperatorsdk/operator/sample/LeaderElectionTestOperator.java#L21-L23)
+See sample configuration in
+the [E2E test](https://github.com/java-operator-sdk/java-operator-sdk/blob/8865302ac0346ee31f2d7b348997ec2913d5922b/sample-operators/leader-election/src/main/java/io/javaoperatorsdk/operator/sample/LeaderElectionTestOperator.java#L21-L23)
 .
 
 ## Runtime Info
 
-[RuntimeInfo](https://github.com/java-operator-sdk/java-operator-sdk/blob/main/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/RuntimeInfo.java#L16-L16) 
-is used mainly to check the actual health of event sources. Based on this information it is easy to implement custom 
+[RuntimeInfo](https://github.com/java-operator-sdk/java-operator-sdk/blob/main/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/RuntimeInfo.java#L16-L16)
+is used mainly to check the actual health of event sources. Based on this information it is easy to implement custom
 liveness probes.
 
 [stopOnInformerErrorDuringStartup](https://github.com/java-operator-sdk/java-operator-sdk/blob/main/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/api/config/ConfigurationService.java#L168-L168)
 setting, where this flag usually needs to be set to false, in order to control the exact liveness properties.
 
-See also an example implementation in the 
+See also an example implementation in the
 [WebPage sample](https://github.com/java-operator-sdk/java-operator-sdk/blob/3e2e7c4c834ef1c409d636156b988125744ca911/sample-operators/webpage/src/main/java/io/javaoperatorsdk/operator/sample/WebPageOperator.java#L38-L43)
 
 ## Automatic Generation of CRDs
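The reschedule/retry semantics documented by the two patches above can be summarized in a small, self-contained model. The sketch below is illustrative only — it is not JOSDK code, and the class and method names (`ExecutionState`, `onEvent`, `onSuccess`, and so on) are invented for this example. In a real reconciler, rescheduling is requested through the reconciliation outcome (`UpdateControl` or `DeleteControl`) and retries are configured on the controller.

```java
import java.util.Optional;

// Illustrative model (NOT JOSDK source) of the documented semantics:
// - a successful execution resets the retry state and any pending scheduled execution
// - an event cancels a pending scheduled execution and triggers an execution immediately
// - an event-triggered execution does not count as a retry attempt
class ExecutionState {
    private Optional<Long> scheduledDelayMs = Optional.empty(); // pending reschedule, if any
    private int retryAttempt = 0;
    private final int maxRetries;

    ExecutionState(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    /** Reconciliation succeeded; the outcome may request a new schedule (point 1). */
    void onSuccess(Optional<Long> rescheduleAfterMs) {
        retryAttempt = 0;                     // success resets the retry state
        scheduledDelayMs = rescheduleAfterMs; // only the new outcome can re-schedule
    }

    /** Reconciliation threw an exception; a retry is initiated (point 2). */
    void onError() {
        if (!isLastAttempt()) {
            retryAttempt++;
        }
    }

    /** An event from the cluster triggers an execution now, cancelling any pending schedule. */
    void onEvent() {
        scheduledDelayMs = Optional.empty(); // scheduled execution is automatically cancelled
        // the execution triggered here does not count as a retry attempt
    }

    boolean hasPendingSchedule() { return scheduledDelayMs.isPresent(); }

    int retryAttempt() { return retryAttempt; }

    /** Point 3: the retry info marks the execution as the last attempt once the limit is hit. */
    boolean isLastAttempt() { return retryAttempt >= maxRetries; }
}
```

For instance, an execution scheduled in 5 minutes (`onSuccess(Optional.of(300_000L))`) is dropped as soon as `onEvent()` fires — matching the "avoid unnecessary work" rule the second patch spells out — and a later successful execution resets the retry counter to zero.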