Commit a20cdea

wip
Signed-off-by: Attila Mészáros <[email protected]>
1 parent 491b23a commit a20cdea

4 files changed

+129
-124
lines changed

Lines changed: 115 additions & 0 deletions
@@ -0,0 +1,115 @@
1+
---
2+
title: Error handling and retries
3+
weight: 45
4+
---
5+
6+
## Automatic Retries on Error
7+
8+
JOSDK will schedule an automatic retry of the reconciliation whenever an exception is thrown by
9+
your `Reconciler`. The retry behavior is configurable, but a default implementation is provided
10+
covering most of the typical use-cases, see
11+
[GenericRetry](https://github.com/java-operator-sdk/java-operator-sdk/blob/master/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/processing/retry/GenericRetry.java)
12+
.
13+
14+
```java
15+
GenericRetry.defaultLimitedExponentialRetry()
16+
.setInitialInterval(5000)
17+
.setIntervalMultiplier(1.5D)
18+
.setMaxAttempts(5);
19+
```
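
To make the effect of these settings concrete, the following is a minimal, self-contained sketch of the arithmetic behind an exponential backoff with the values used above (initial interval 5000 ms, multiplier 1.5, five attempts). It is not JOSDK's implementation: the class `BackoffSketch`, the method `delaysMillis`, and the truncation of each delay to whole milliseconds are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of JOSDK: it only illustrates how the
// delay grows when each interval is multiplied by a constant factor.
public class BackoffSketch {

  // Returns the delay in milliseconds that precedes each retry attempt.
  public static List<Long> delaysMillis(long initialInterval, double multiplier, int maxAttempts) {
    List<Long> delays = new ArrayList<>();
    long current = initialInterval;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      delays.add(current);
      // Assumption: fractional milliseconds are truncated.
      current = (long) (current * multiplier);
    }
    return delays;
  }

  public static void main(String[] args) {
    // prints [5000, 7500, 11250, 16875, 25312]
    System.out.println(delaysMillis(5000, 1.5, 5));
  }
}
```

With these values the five retries would be spread over roughly 66 seconds in total, which gives a feel for how long a failing resource keeps being retried before the limit is reached.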
20+
21+
You can also configure the default retry behavior using the `@GradualRetry` annotation.
22+
23+
It is possible to provide a custom implementation using the `retry` field of the
24+
`@ControllerConfiguration` annotation and specifying the class of your custom implementation.
25+
Note that this class will need to provide an accessible no-arg constructor for automated
26+
instantiation. Additionally, your implementation can be automatically configured from an
27+
annotation that you can provide by having your `Retry` implementation implement the
28+
`AnnotationConfigurable` interface, parameterized with your annotation type. See the
29+
`GenericRetry` implementation for more details.
30+
31+
Information about the current retry state is accessible from
32+
the [Context](https://github.com/java-operator-sdk/java-operator-sdk/blob/master/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/api/Context.java)
33+
object. Of particular interest is the `isLastAttempt` method, which allows your
34+
`Reconciler` to behave differently on the final attempt, for example by setting an error message
35+
in your resource's status.
36+
37+
Note, though, that reaching the retry limit won't prevent new events from being processed. New
38+
reconciliations will happen for new events as usual. However, if an error also occurs that
39+
would normally trigger a retry, the SDK won't schedule one at this point since the retry limit
40+
is already reached.
41+
42+
A successful execution resets the retry state.
43+
44+
### Setting Error Status After Last Retry Attempt
45+
46+
In order to facilitate error reporting, a `Reconciler` can implement the
47+
[ErrorStatusHandler](https://github.com/java-operator-sdk/java-operator-sdk/blob/main/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/api/reconciler/ErrorStatusHandler.java)
48+
interface:
49+
50+
```java
51+
public interface ErrorStatusHandler<P extends HasMetadata> {
52+
53+
ErrorStatusUpdateControl<P> updateErrorStatus(P resource, Context<P> context, Exception e);
54+
55+
}
56+
```
57+
58+
The `updateErrorStatus` method is called in case an exception is thrown from the `Reconciler`. It is
59+
also called even if no retry policy is configured, just after the reconciler execution.
60+
`RetryInfo.getAttemptCount()` is zero after the first reconciliation attempt, since it is not a
61+
result of a retry (regardless of whether a retry policy is configured or not).
62+
63+
`ErrorStatusUpdateControl` is used to tell the SDK what to do and how to perform the status
64+
update on the primary resource, always performed as a status sub-resource request. Note that
65+
this update request will also produce an event, and will result in a reconciliation if the
66+
controller is not generation aware.
67+
68+
This feature is only available for the `reconcile` method of the `Reconciler` interface, since
69+
there should not be updates to resources that have been marked for deletion.
70+
71+
Retry can be skipped in cases of unrecoverable errors:
72+
73+
```java
74+
ErrorStatusUpdateControl.patchStatus(customResource).withNoRetry();
75+
```
76+
77+
### Correctness and Automatic Retries
78+
79+
While it is possible to deactivate automatic retries, this is not desirable except for very
80+
specific reasons. Errors naturally occur, whether it be transient network errors or conflicts
81+
when a given resource is handled by a `Reconciler` but is modified at the same time by a user in
82+
a different process. Automatic retries handle these cases nicely and will usually result in a
83+
successful reconciliation.
84+
85+
## Retry and Rescheduling and Event Handling Common Behavior
86+
87+
Retry, reschedule and standard event processing form a relatively complex system, each of these
88+
functionalities interacting with the others. In the following, we describe the interplay of
89+
these features:
90+
91+
1. A successful execution resets a retry and the rescheduled executions which were present before
92+
the reconciliation. However, a new rescheduling can be instructed from the reconciliation
93+
outcome (`UpdateControl` or `DeleteControl`).
94+
95+
For example, if a reconciliation had previously been re-scheduled after some amount of time, but an event triggered
96+
the reconciliation (or cleanup) in the meantime, the scheduled execution would be automatically cancelled. In other words,
97+
re-scheduling a reconciliation does not guarantee that one will occur exactly at that time; it simply guarantees that
98+
one reconciliation will occur at that time at the latest, triggering one if no event from the cluster triggered one.
99+
Of course, it's always possible to re-schedule a new reconciliation at the end of that "automatic" reconciliation.
100+
101+
Similarly, if a retry was scheduled, any event from the cluster triggering a successful execution in the meantime
102+
would cancel the scheduled retry (because there's now no point in retrying something that already succeeded).
103+
104+
2. If an exception occurs, a retry is initiated. However, if an event is received
105+
in the meantime, it will be reconciled instantly, and this execution won't count as a retry attempt.
106+
3. If the retry limit is reached (so no more automatic retries would happen) but a new event is
107+
received, the reconciliation will still happen, but won't reset the retry, and will still be
108+
marked as the last attempt in the retry info. Point (1) still holds, but in case of an
109+
error, no retry will happen.
110+
111+
The thing to keep in mind when it comes to retrying or rescheduling is that JOSDK tries to avoid unnecessary work. When
112+
you reschedule an operation, you instruct JOSDK to perform that operation at the latest by the end of the rescheduling
113+
delay. If something occurred on the cluster that triggers that particular operation (reconciliation or cleanup), then
114+
JOSDK considers that there's no point in attempting that operation again at the end of the specified delay.
115+
The same idea also applies to retries.
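
The "at the latest" semantics of rescheduling described above can be sketched with the standard `ScheduledExecutorService`. This is an illustration of the idea only, not how JOSDK implements it: the class `RescheduleSketch` and all of its methods are hypothetical names introduced for this example. An incoming event cancels the pending scheduled execution and runs the reconciliation immediately.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model, not JOSDK code: a rescheduled execution is a deadline,
// and any event arriving before that deadline replaces it.
public class RescheduleSketch {

  // Daemon thread so that a pending long delay never keeps the JVM alive.
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
      });
  private final AtomicInteger reconciliations = new AtomicInteger();
  private ScheduledFuture<?> pending; // currently scheduled execution, if any

  // Asks for a reconciliation to happen after the given delay at the latest.
  public synchronized void reschedule(long delay, TimeUnit unit) {
    cancelPending();
    pending = scheduler.schedule(this::reconcile, delay, unit);
  }

  // An event triggers the reconciliation right away and makes the
  // previously scheduled execution unnecessary.
  public synchronized void onEvent() {
    cancelPending();
    reconcile();
  }

  public synchronized boolean hasPendingExecution() {
    return pending != null && !pending.isDone();
  }

  public int reconciliationCount() {
    return reconciliations.get();
  }

  public void shutdown() {
    scheduler.shutdownNow();
  }

  private void reconcile() {
    reconciliations.incrementAndGet();
  }

  private synchronized void cancelPending() {
    if (pending != null) {
      pending.cancel(false);
      pending = null;
    }
  }

  public static void main(String[] args) {
    RescheduleSketch sketch = new RescheduleSketch();
    sketch.reschedule(1, TimeUnit.HOURS);
    sketch.onEvent(); // the event arrives long before the hour is over
    System.out.println(sketch.hasPendingExecution()); // prints false
    System.out.println(sketch.reconciliationCount()); // prints 1
    sketch.shutdown();
  }
}
```

In this model, a reconciliation that wants to run again periodically has to call `reschedule` again at the end of each execution, which mirrors the remark above that a new rescheduling can always be requested from the reconciliation outcome.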

docs/content/en/docs/documentation/features.md

Lines changed: 0 additions & 122 deletions
@@ -8,8 +8,6 @@ facilitating the implementation of Kubernetes operators. The features are by def
88
the best practices in an opinionated way. However, feature flags and other configuration options
99
are provided to fine tune or turn off these features.
1010

11-
12-
1311
## Generation Awareness and Event Filtering
1412

1513
A best practice when an operator starts up is to reconcile all the associated resources because
@@ -73,116 +71,7 @@ or any non-positive number.
7371
The automatic retries are not affected by this feature so a reconciliation will be re-triggered
7472
on error, according to the specified retry policy, regardless of this maximum interval setting.
7573

76-
## Automatic Retries on Error
77-
78-
JOSDK will schedule an automatic retry of the reconciliation whenever an exception is thrown by
79-
your `Reconciler`. The retry is behavior is configurable but a default implementation is provided
80-
covering most of the typical use-cases, see
81-
[GenericRetry](https://github.com/java-operator-sdk/java-operator-sdk/blob/master/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/processing/retry/GenericRetry.java)
82-
.
83-
84-
```java
85-
GenericRetry.defaultLimitedExponentialRetry()
86-
.setInitialInterval(5000)
87-
.setIntervalMultiplier(1.5D)
88-
.setMaxAttempts(5);
89-
```
90-
91-
You can also configure the default retry behavior using the `@GradualRetry` annotation.
92-
93-
It is possible to provide a custom implementation using the `retry` field of the
94-
`@ControllerConfiguration` annotation and specifying the class of your custom implementation.
95-
Note that this class will need to provide an accessible no-arg constructor for automated
96-
instantiation. Additionally, your implementation can be automatically configured from an
97-
annotation that you can provide by having your `Retry` implementation implement the
98-
`AnnotationConfigurable` interface, parameterized with your annotation type. See the
99-
`GenericRetry` implementation for more details.
100-
101-
Information about the current retry state is accessible from
102-
the [Context](https://github.com/java-operator-sdk/java-operator-sdk/blob/master/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/api/Context.java)
103-
object. Of note, particularly interesting is the `isLastAttempt` method, which could allow your
104-
`Reconciler` to implement a different behavior based on this status, by setting an error message
105-
in your resource' status, for example, when attempting a last retry.
106-
107-
Note, though, that reaching the retry limit won't prevent new events to be processed. New
108-
reconciliations will happen for new events as usual. However, if an error also occurs that
109-
would normally trigger a retry, the SDK won't schedule one at this point since the retry limit
110-
is already reached.
111-
112-
A successful execution resets the retry state.
113-
114-
### Setting Error Status After Last Retry Attempt
115-
116-
In order to facilitate error reporting, `Reconciler` can implement the
117-
[ErrorStatusHandler](https://github.com/java-operator-sdk/java-operator-sdk/blob/main/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/api/reconciler/ErrorStatusHandler.java)
118-
interface:
119-
120-
```java
121-
public interface ErrorStatusHandler<P extends HasMetadata> {
122-
123-
ErrorStatusUpdateControl<P> updateErrorStatus(P resource, Context<P> context, Exception e);
124-
125-
}
126-
```
127-
128-
The `updateErrorStatus` method is called in case an exception is thrown from the `Reconciler`. It is
129-
also called even if no retry policy is configured, just after the reconciler execution.
130-
`RetryInfo.getAttemptCount()` is zero after the first reconciliation attempt, since it is not a
131-
result of a retry (regardless of whether a retry policy is configured or not).
132-
133-
`ErrorStatusUpdateControl` is used to tell the SDK what to do and how to perform the status
134-
update on the primary resource, always performed as a status sub-resource request. Note that
135-
this update request will also produce an event, and will result in a reconciliation if the
136-
controller is not generation aware.
137-
138-
This feature is only available for the `reconcile` method of the `Reconciler` interface, since
139-
there should not be updates to resource that have been marked for deletion.
140-
141-
Retry can be skipped in cases of unrecoverable errors:
142-
143-
```java
144-
ErrorStatusUpdateControl.patchStatus(customResource).withNoRetry();
145-
```
146-
147-
### Correctness and Automatic Retries
148-
149-
While it is possible to deactivate automatic retries, this is not desirable, unless for very
150-
specific reasons. Errors naturally occur, whether it be transient network errors or conflicts
151-
when a given resource is handled by a `Reconciler` but is modified at the same time by a user in
152-
a different process. Automatic retries handle these cases nicely and will usually result in a
153-
successful reconciliation.
15474

155-
## Retry and Rescheduling and Event Handling Common Behavior
156-
157-
Retry, reschedule and standard event processing form a relatively complex system, each of these
158-
functionalities interacting with the others. In the following, we describe the interplay of
159-
these features:
160-
161-
1. A successful execution resets a retry and the rescheduled executions which were present before
162-
the reconciliation. However, a new rescheduling can be instructed from the reconciliation
163-
outcome (`UpdateControl` or `DeleteControl`).
164-
165-
For example, if a reconciliation had previously been re-scheduled after some amount of time, but an event triggered
166-
the reconciliation (or cleanup) in the mean time, the scheduled execution would be automatically cancelled, i.e.
167-
re-scheduling a reconciliation does not guarantee that one will occur exactly at that time, it simply guarantees that
168-
one reconciliation will occur at that time at the latest, triggering one if no event from the cluster triggered one.
169-
Of course, it's always possible to re-schedule a new reconciliation at the end of that "automatic" reconciliation.
170-
171-
Similarly, if a retry was scheduled, any event from the cluster triggering a successful execution in the mean time
172-
would cancel the scheduled retry (because there's now no point in retrying something that already succeeded)
173-
174-
2. In case an exception happened, a retry is initiated. However, if an event is received
175-
meanwhile, it will be reconciled instantly, and this execution won't count as a retry attempt.
176-
3. If the retry limit is reached (so no more automatic retry would happen), but a new event
177-
received, the reconciliation will still happen, but won't reset the retry, and will still be
178-
marked as the last attempt in the retry info. The point (1) still holds, but in case of an
179-
error, no retry will happen.
180-
181-
The thing to keep in mind when it comes to retrying or rescheduling is that JOSDK tries to avoid unnecessary work. When
182-
you reschedule an operation, you instruct JOSDK to perform that operation at the latest by the end of the rescheduling
183-
delay. If something occurred on the cluster that triggers that particular operation (reconciliation or cleanup), then
184-
JOSDK considers that there's no point in attempting that operation again at the end of the specified delay since there
185-
is now no point to do so anymore. The same idea also applies to retries.
18675

18776
## Rate Limiting
18877

@@ -512,17 +401,6 @@ See sample configuration in
512401
the [E2E test](https://github.com/java-operator-sdk/java-operator-sdk/blob/8865302ac0346ee31f2d7b348997ec2913d5922b/sample-operators/leader-election/src/main/java/io/javaoperatorsdk/operator/sample/LeaderElectionTestOperator.java#L21-L23)
513402
.
514403

515-
## Runtime Info
516-
517-
[RuntimeInfo](https://github.com/java-operator-sdk/java-operator-sdk/blob/main/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/RuntimeInfo.java#L16-L16)
518-
is used mainly to check the actual health of event sources. Based on this information it is easy to implement custom
519-
liveness probes.
520-
521-
[stopOnInformerErrorDuringStartup](https://github.com/java-operator-sdk/java-operator-sdk/blob/main/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/api/config/ConfigurationService.java#L168-L168)
522-
setting, where this flag usually needs to be set to false, in order to control the exact liveness properties.
523-
524-
See also an example implementation in the
525-
[WebPage sample](https://github.com/java-operator-sdk/java-operator-sdk/blob/3e2e7c4c834ef1c409d636156b988125744ca911/sample-operators/webpage/src/main/java/io/javaoperatorsdk/operator/sample/WebPageOperator.java#L38-L43)
526404

527405
## Automatic Generation of CRDs
528406

docs/content/en/docs/documentation/observability.md

Lines changed: 13 additions & 1 deletion
@@ -1,8 +1,20 @@
11
---
22
title: Observability
3-
weight: 44
3+
weight: 47
44
---
55

6+
## Runtime Info
7+
8+
[RuntimeInfo](https://github.com/java-operator-sdk/java-operator-sdk/blob/main/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/RuntimeInfo.java#L16-L16)
9+
is used mainly to check the actual health of event sources. Based on this information it is easy to implement custom
10+
liveness probes.
11+
12+
[stopOnInformerErrorDuringStartup](https://github.com/java-operator-sdk/java-operator-sdk/blob/main/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/api/config/ConfigurationService.java#L168-L168)
13+
is a related setting; this flag usually needs to be set to false in order to control the exact liveness properties.
14+
15+
See also an example implementation in the
16+
[WebPage sample](https://github.com/java-operator-sdk/java-operator-sdk/blob/3e2e7c4c834ef1c409d636156b988125744ca911/sample-operators/webpage/src/main/java/io/javaoperatorsdk/operator/sample/WebPageOperator.java#L38-L43)
17+
618
## Contextual Info for Logging with MDC
719

820
Logging is enhanced with additional contextual information using

docs/content/en/docs/documentation/reconciler.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
11
---
22
title: Implementing a reconciler
3-
weight: 42
3+
weight: 43
44
---
55

66
## Reconciliation Execution in a Nutshell
