Drop retry strategy and up cache time #461
Conversation
…urce when it's rate limited
inc/HttpClient/HttpClient.php
Outdated
@@ -124,7 +124,7 @@ public static function retry_decider( int $retries, RequestInterface $request, ?
 	$should_retry = false;

-	if ( $response && $response->getStatusCode() >= 500 ) {
+	if ( $response && ( $response->getStatusCode() >= 500 || $response->getStatusCode() === 429 ) ) {
Added status code 429 here so we retry when we are being rate limited as well.
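For context, a hedged sketch of how these callbacks typically get wired into Guzzle's retry middleware. The HandlerStack/Middleware calls are Guzzle's API; the HttpClient references assume the class in inc/HttpClient/HttpClient.php, with its namespace import omitted:

```php
use GuzzleHttp\Client;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Middleware;

// Push a retry middleware that asks retry_decider() whether to retry
// (5xx or 429) and retry_delay() how long to wait before doing so.
$stack = HandlerStack::create();
$stack->push(
	Middleware::retry(
		[ HttpClient::class, 'retry_decider' ],
		[ HttpClient::class, 'retry_delay' ]
	)
);
$client = new Client( [ 'handler' => $stack ] );
```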
Test this PR in WordPress Playground.
inc/HttpClient/HttpClient.php
Outdated
@@ -154,6 +160,7 @@ public static function retry_delay( int $retries, ?ResponseInterface $response )
 		}
 	}

+	// Convert it to milliseconds.
I kept it this way so that both the calculated value and the "retry-after" value are converted from s to ms.
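For clarity, a minimal sketch of the shape this conversion takes, assuming Guzzle's retry middleware (which expects the delay in milliseconds). The Retry-After handling shown here is an assumption, not a copy of the plugin's code:

```php
use Psr\Http\Message\ResponseInterface;

public static function retry_delay( int $retries, ?ResponseInterface $response ): int {
	// Exponential backoff capped at 10s; with only 3 retries this peaks at 8s.
	$retry_after = min( 10, 1 * pow( 2, $retries ) );

	// If the server sent a "Retry-After" header, prefer its value (in seconds).
	if ( $response && $response->hasHeader( 'Retry-After' ) ) {
		$retry_after = (int) $response->getHeaderLine( 'Retry-After' );
	}

	// Convert it to milliseconds, since Guzzle's retry middleware expects ms.
	return $retry_after * 1000;
}
```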
inc/HttpClient/HttpClient.php
Outdated
-	$retry_after = $retries;
+	// Implement an exponential backoff strategy, with the delay capped at 10s and a minimum of 1s.
+	// Given we only retry 3 times, this will really never exceed 8s.
+	$retry_after = min( 10, 1 * pow( 2, $retries ) );
Considering we only allow 3 retry attempts, and that's not overridable, the maximum possible delay is 8s. I kept the cap at 10s just in case.
So with this, the retry times would be
2s
4s
8s
instead of
1s
2s
3s
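For illustration only, a throwaway snippet showing where those numbers come from, assuming $retries counts up from 1:

```php
foreach ( [ 1, 2, 3 ] as $retries ) {
	$linear      = $retries;                          // 1s, 2s, 3s (previous linear strategy)
	$exponential = min( 10, 1 * pow( 2, $retries ) ); // 2s, 4s, 8s (this PR)
	printf( "retry %d: linear %ds, exponential %ds\n", $retries, $linear, $exponential );
}
```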
This is a nice improvement over the existing linear delay. Was there a reason the change to use jitter was reverted?
The Airtable docs around rate limiting say the API will start working again 30 seconds after hitting the rate limit, so even 8 seconds of delay might not be sufficient. The Airtable SDK itself starts the delay at 10s and goes from there to 20s, 40s, etc.
It felt like overkill to use jitter considering how unpredictable the times could be. The wp_rand call started giving me problems, which got me to question whether we needed it in the first place.
> The Airtable SDK itself starts the delay from 10s and goes from there to 20s, 40s, etc.
The value before jitter will be 10s, 20s, 40s, and 60s. After jitter, it will be some value between 1s and the pre-jitter value, so they have cases where it will be below 30s as well. The idea is that, because each request is spread out exponentially, it will succeed on retry 1 or 2. I kept our values conservative to avoid blowing up the request times, as they can be overridden within AirtableIntegration if we need to anyway.
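For reference, the reverted jitter variant would have looked roughly like this. This is a sketch only; the bounds are an assumption based on the discussion above, and wp_rand() is WordPress's random integer helper:

```php
// "Full jitter": wait a random number of seconds between 1s and the exponential ceiling.
$ceiling     = min( 10, 1 * pow( 2, $retries ) );
$retry_after = wp_rand( 1, $ceiling );
```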
Note: I did experiment with these values and fired off 50, 100, and 150 requests. The most retries any request needed was 2 (and only for 2 requests), while a small number needed 1, thanks to the exponential backoff.
I couldn't actually trigger this code outside of tests, despite disabling caching and duplicating blocks so that 100 API calls were fired off. So I made a Node script that replicated the behaviour I was expecting in order to test this. What this tells me is that the likelihood of our calls being rate limited at the moment, at least to start with, is low. We do have caching in place, which will be helpful, and the tweak to our retry strategy should further help with this.
Our retry strategy only exists in the context of a single API request, so it does not protect the API client. It is not applied across several API requests within the same render cycle, nor does it apply across several render cycles happening at about the same time across multiple workers.
Our retry strategy has not received a lot of scrutiny since it was written. I am not sure we should be retrying requests at all:
- Retries will block the render cycle. An exponential back-off strategy could easily result in a request timeout for the end user.
- We execute the exact same request multiple times for each block binding inside of a remote data block. For successful requests, repeated requests resolve from the object or in-memory cache. For unsuccessful requests, repeated requests miss the cache and are executed again. This is effectively a primitive retry strategy, and it would circumvent any sophisticated retry strategy we implemented in our Guzzle middleware.
- Under load, a site may be executing the same request nearly simultaneously across multiple workers, all of which would effectively retry the same request.
We might be better served by disabling retries entirely and falling back to the serialized block content. Separately, we might want to implement error caching with short TTLs to prevent hammering during rate-limiting or when the API is experiencing errors.
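A minimal sketch of what short-TTL error caching could look like with the WordPress object cache. The function names, cache group, key prefix, and 30s TTL are placeholders for illustration, not decisions made in this PR:

```php
use Psr\Http\Message\RequestInterface;
use Psr\Http\Message\ResponseInterface;

/**
 * Sketch: cache the status code of a failed request for a short TTL so that
 * repeated renders do not hammer a rate-limited or erroring API.
 */
function rdb_remember_error( RequestInterface $request, ResponseInterface $response ): void {
	if ( $response->getStatusCode() !== 429 && $response->getStatusCode() < 500 ) {
		return;
	}
	$cache_key = 'rdb_request_error_' . md5( (string) $request->getUri() );
	wp_cache_set( $cache_key, $response->getStatusCode(), 'remote-data-blocks', 30 ); // 30s TTL.
}

/**
 * Sketch: returns true if this exact request failed recently, so the caller
 * can skip the HTTP call and fall back to the serialized block content.
 */
function rdb_recently_failed( RequestInterface $request ): bool {
	$cache_key = 'rdb_request_error_' . md5( (string) $request->getUri() );
	return false !== wp_cache_get( $cache_key, 'remote-data-blocks' );
}
```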
Summary of the discussion with @chriszarate on the above:
Summary of the changes:
What I haven't done:
Description
In trying to figure out how to add rate limiting to the plugin, I wanted to first see whether our retry strategy was good enough, especially in cases where a retry might be needed. This would cover the case where rate limiting is starting to occur, letting us stop it from worsening. It wouldn't cover the case where rate limiting may start happening but hasn't yet. The Airtable SDK caught my eye, as it implements an exponential backoff strategy with jitter to try to cover the case where rate limiting is starting to occur.
I experimented with a script that fired off 50 and 100 requests to see rate limiting in action. I then added in linear backoff (what we have), followed by exponential backoff, to see if they would solve it. Both options did solve it, with linear sometimes needing an extra request or two as the overall number of requests scaled up, unlike exponential. Exponential does add a bit more delay compared to linear backoff, but it gives a much better guarantee that requests succeed, since requests don't go out at a consistent interval.
Between this and the fact that API calls are cached for 60s, we should be in an even better place.
Testing
I made a Node script that fired requests at https://api.airtable.com/v0/base_id/table_id with an exponential backoff retry strategy. It would log when a request was retried, which only happened on 429s, though it could be expanded to 500s as well.
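Roughly what that harness did, sketched here in PHP with Guzzle for consistency with the rest of this PR rather than Node. The request count, endpoint, and token are placeholders taken from this description:

```php
use GuzzleHttp\Client;
use GuzzleHttp\Promise\Utils;

// http_errors is disabled so 429s come back as responses instead of exceptions;
// in the real harness the retry middleware sketched above would also be pushed
// onto the handler stack.
$client   = new Client( [ 'http_errors' => false ] );
$promises = [];

// Fire ~100 concurrent requests to provoke Airtable's rate limiter.
for ( $i = 0; $i < 100; $i++ ) {
	$promises[] = $client->getAsync( 'https://api.airtable.com/v0/base_id/table_id', [
		'headers' => [ 'Authorization' => 'Bearer <token>' ],
	] );
}

// Settle everything and count how many requests came back rate limited.
$results      = Utils::settle( $promises )->wait();
$rate_limited = 0;
foreach ( $results as $result ) {
	if ( 'fulfilled' === $result['state'] && 429 === $result['value']->getStatusCode() ) {
		$rate_limited++;
	}
}
printf( "%d of %d requests were rate limited\n", $rate_limited, count( $results ) );
```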