fix(cloudflare): improve handling of rate limiting errors#5524
Conversation
|
Welcome @Hackatosh! |
|
Hi @Hackatosh. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
I can't say whether this will be accepted, but regarding the code, I would suggest adding a new function |
|
I agree with you about making a function for this duplicated code. But can you expand on why this wouldn't be accepted? This is a huge problem for multi-cluster multi-domain setups using external-dns, it's very likely you'll run into rate limit errors with the default polling interval. That means the only solution is to increase the polling interval with each new cluster or domain that external-dns manages records for, which I'd argue isn't a great solution or user experience. With Cloudflare cold shouldering the fix on their end, it seems silly to not at least temporarily account for a known quirk of their API library here. |
|
@Hackatosh any chance you sign EasyCLA so that we could review it? |
|
This PR make sense, but not sure if it will be accepted as well There was previous attempts #5233 to do something similar and similar issue #5225 (comment) No need to close this; the idea itself makes sense. I'm very much on the fence about the order of resolution as well. Adding soft errors first feels like treating symptoms instead of the root cause. External DSN currently abusing Cloudflare API, which is not correct. We should be good citizens and improve not just crash/no-crash, but actually start adding retry/backkoff etc to cloudflare provider. Example config, err = cloudflare.NewWithAPIToken(token, []cloudflare.Option{
cloudflare.UsingRateLimit(rateLimit),
cloudflare.UsingRetryPolicy(retry, minDelaySeconds, maxDelaySeconds), // 3 retries, min 1s delay, max 5s delay
cloudflare.UserAgent(externaldns.UserAgent()),
}...)Something like that required as well. As at the moment, cloudflare API as it looks, uses some defaults |
|
@ivankatliarchuk I have signed the CLA, you can review the PR :) We could improve retry/backoff policy if you want. What do you think would be appropriate ? Maybe we can make this configurable using environnement variables |
|
/ok-to-test |
|
I can see there were an attempt to fix #4437 and agree with the comment that the issue is with the upstream library. As temp solution I think it should be acceptable to implement the fix. Just follow the suggestion #5524 (comment) and make sure to add unit tests for all added lines.
If you have time and interest, you could try to improve retry/backoff in follow-up for sure |
|
How do you know, the fix is actually solves the problem? |
|
I factorized error handling in To know if the fix works correctly, I am going to build external-dns with this fix and try it on a staging environnement where external-dns crashes everyday due to this issue, and see what happens. Is that OK for you ? |
| return provider.NewSoftError(err) | ||
| } | ||
| } | ||
| if strings.Contains(err.Error(), "exceeded available rate limit retries") { |
There was a problem hiding this comment.
Maybe add a comment as a remainder that this is a workaround for an issue with the cloudflare client ?
So we know that we can remove this when fixed ?
There was a problem hiding this comment.
I added a comment
|
I tried the fix on our staging environnement and got logs like this instead of crash : For me it works correctly :D Is it ok for you to merge this PR ? 🙏 |
|
Hi @vflaux anything left on your side ? |
|
@vflaux: changing LGTM is restricted to collaborators DetailsIn response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Co-authored-by: vflaux <38909103+vflaux@users.noreply.github.com>
Co-authored-by: vflaux <38909103+vflaux@users.noreply.github.com>
Co-authored-by: vflaux <38909103+vflaux@users.noreply.github.com>
|
/lgtm |
|
@vflaux: changing LGTM is restricted to collaborators DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
Hello ! @ivankatliarchuk Does someone else need to approve this PR ? 🤔 |
|
cc @mloiseleur for final review |
|
Many thanks for your review @vflaux 👍 |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mloiseleur, vflaux The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
Thank you for the merge ! :) Do you know when this fix will be released in the official image ? |
|
You can use the staging image to use it before official release. |
|
Thank you very much ! Thank you for your reactivity on this PR, it was a nice experience to contribute :) |
What does it do ?
When a rate limiting error is encountered while using Cloudflare provider, external dns crashes. This PR fixes this behavior.
Motivation
Cloudflare library return a generic error when a rate limiting error is encountered (see #4876 (comment)). I added a condition to handle this type of error better and return a soft error.
An issue has been opened on Cloudflare side (cloudflare/cloudflare-go#4155) and a fix has been submitted (cloudflare/cloudflare-go#4156), but unfortunately there was no reaction from the library maintainers. That's why I try to fix the issue on external-dns side, even if it feels more like a "hack"
More