
Drop retry strategy and up cache time #461


Merged: 13 commits into trunk, Apr 18, 2025

Conversation

Contributor

@ingeniumed ingeniumed commented Apr 15, 2025

Description

In trying to figure out how to add rate limiting to the plugin, I first wanted to see whether our retry strategy was good enough, especially in cases where a retry might be needed. That would cover the case where rate limiting has started to occur and we want to keep it from worsening; it wouldn't cover the case where rate limiting is about to happen but hasn't yet. The Airtable SDK caught my eye, as it implements an exponential backoff strategy with jitter to handle the case where rate limiting is starting to occur.

I experimented with a script that fired off 50 and then 100 requests to see rate limiting in action. I then added linear backoff (what we have today), followed by exponential backoff, to see whether they would solve it. Both options did, though linear sometimes needed an extra retry or two as the overall request count scaled up, unlike exponential. Exponential adds a bit more delay than linear backoff, but it gives a much better guarantee that requests succeed, since retries don't go out at a consistent interval.

Between this and the fact that API calls are cached for 60s, we should be in an even better place.

Testing

  • I tested this using a Node script to isolate the retry behaviour without any caching. The script made 50 GET requests to https://api.airtable.com/v0/base_id/table_id with an exponential backoff retry strategy and logged whenever a request was retried. Retries were triggered only on 429s, though this could be expanded to 500s as well.

@@ -124,7 +124,7 @@ public static function retry_decider( int $retries, RequestInterface $request, ?

$should_retry = false;

-if ( $response && $response->getStatusCode() >= 500 ) {
+if ( $response && ( $response->getStatusCode() >= 500 || $response->getStatusCode() === 429 ) ) {
Contributor Author

Added status code 429 here so we retry when we are being rate limited as well.
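
For context, here is a minimal, hedged sketch of how a Guzzle retry decider with this condition fits together. The class name, the ConnectException branch, and placing the hard-coded cap of 3 retries inside the decider are assumptions standing in for code not shown in this diff:

<?php
use GuzzleHttp\Exception\ConnectException;
use Psr\Http\Message\RequestInterface;
use Psr\Http\Message\ResponseInterface;

class Retry_Sketch {
	public static function retry_decider( int $retries, RequestInterface $request, ?ResponseInterface $response = null, ?\Exception $exception = null ): bool {
		// Stop once the retry budget is exhausted (3 attempts, per the discussion below).
		if ( $retries >= 3 ) {
			return false;
		}

		// Retry transport-level failures (an assumption; not part of this diff).
		if ( $exception instanceof ConnectException ) {
			return true;
		}

		// Retry server errors, and now 429s, so rate-limited requests get another attempt.
		return null !== $response && ( $response->getStatusCode() >= 500 || $response->getStatusCode() === 429 );
	}
}

A decider like this would be wired into the client with GuzzleHttp\Middleware::retry(), which consults it before each retry.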


@@ -154,6 +160,7 @@ public static function retry_delay( int $retries, ?ResponseInterface $response )
}
}

// Convert it to milliseconds.
Contributor Author

I kept it this way so that both the calculated value and the "retry-after" value are converted from seconds to milliseconds.

@ingeniumed ingeniumed marked this pull request as draft April 15, 2025 05:33
-$retry_after = $retries;
+// Implement an exponential backoff strategy, with the delay capped at 10s and a minimum of 1s.
+// Given we only retry 3 times, this will really never exceed 8s.
+$retry_after = min( 10, 1 * pow( 2, $retries ) );
Contributor Author

Considering we only allow 3 retry attempts, and that's not overridable, the maximum delay possible is 8s. I kept the cap at 10s just in case.

So with this, the retry times would be

2s
4s
8s

instead of

1s
2s
3s
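
Pulling the fragments above together (the "retry-after" handling mentioned earlier, the exponential calculation, and the conversion to milliseconds), a hedged sketch of the full delay callback might look like the following. The Retry-After branch and the exact surrounding structure are assumptions, since only fragments appear in the diff:

<?php
use Psr\Http\Message\ResponseInterface;

class Retry_Sketch {
	public static function retry_delay( int $retries, ?ResponseInterface $response ): int {
		if ( $response && $response->hasHeader( 'Retry-After' ) ) {
			// Honor the server-provided Retry-After value, which is given in seconds.
			$retry_after = (int) $response->getHeaderLine( 'Retry-After' );
		} else {
			// Implement an exponential backoff strategy, with the delay capped at 10s and a minimum of 1s.
			// Given we only retry 3 times, this will really never exceed 8s.
			$retry_after = min( 10, 1 * pow( 2, $retries ) );
		}

		// Convert it to milliseconds, which is the unit Guzzle's retry delay expects.
		return $retry_after * 1000;
	}
}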

Contributor

This is a nice improvement over the existing linear delay. Was there a reason the change to use jitter was reverted?

The Airtable docs around rate limiting say the API starts working again 30 seconds after the rate limit is hit, so even 8 seconds of delay might not be sufficient. The Airtable SDK itself starts the delay at 10s and goes from there to 20s, 40s, etc.

Contributor Author

It felt like overkill to use jitter considering how unpredictable the resulting times could be. wp_rand started giving me problems, which got me thinking about whether we needed it in the first place.

The Airtable SDK itself starts the delay from 10s and goes from there to 20s, 40s, etc.

The value before jitter will be 10s, 20s, 40s, and 60s. After jitter, it will be some value between 1s and the pre-jitter value, so they have cases where the delay falls below 30s as well. The idea is that, because the requests are spread out exponentially, a request will succeed on retry 1 or 2. I kept our values conservative to avoid blowing up the request times, and they can be overridden within AirtableIntegration if we need to anyway.

Note: I did experiment with these values and fired off 50, 100, and 150 requests. The maximum retry count was 2 (for 2 requests), while a small number needed 1 retry, thanks to exponential backoff.
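
For illustration, the jittered backoff being discussed boils down to something like the sketch below. The function name is made up, retries are assumed to start at 0, and the 10/20/40/60 second pre-jitter values are the ones quoted above:

<?php
// Hedged sketch of exponential backoff with full jitter, in seconds.
function exponential_backoff_with_jitter( int $retries ): int {
	// Pre-jitter values of 10s, 20s, 40s, capped at 60s.
	$backoff = min( 60, 10 * pow( 2, $retries ) );

	// Full jitter: pick a random delay between 1s and the pre-jitter value, so clients
	// that hit the rate limit at the same moment don't retry in lockstep.
	// wp_rand() is WordPress' RNG helper; random_int() would do the same outside WordPress.
	return wp_rand( 1, $backoff );
}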

@ingeniumed
Contributor Author

I couldn't actually trigger this code outside of tests, despite disabling caching and duplicating blocks so that 100 API calls were fired off. So I made a Node script that replicated the behaviour I was expecting in order to test this.

What this tells me is that the likelihood of our calls being rate limited, at least to start with, is low. We do have caching in place, which will be helpful, and the tweak to our retry strategy should help further.

@ingeniumed ingeniumed self-assigned this Apr 15, 2025
@ingeniumed ingeniumed marked this pull request as ready for review April 15, 2025 06:04
@ingeniumed ingeniumed requested a review from shekharnwagh April 15, 2025 06:04
Member

@chriszarate chriszarate left a comment


Our retry strategy only exists in the context of a single API request, so it does not protect the API client. It is not applied across several API requests within the same render cycle, nor does it apply across several render cycles happening at about the same time across multiple workers.

Our retry strategy has not received a lot of scrutiny since it was written. I am not sure we should be retrying requests at all:

  1. Retries will block the render cycle. An exponential back-off strategy could easily result in a request timeout for the end user.
  2. We execute the exact same request multiple times for each block binding inside of a remote data block. For successful requests, repeated requests resolve from the object or in-memory cache. For unsuccessful requests, repeated requests miss the cache and are executed again. This is effectively a primitive retry strategy, and it would circumvent any sophisticated retry strategy we implemented in our Guzzle middleware.
  3. Under load, a site may be executing the same request nearly simultaneously across multiple workers, all of which would effectively retry the same request.

We might be better served by disabling retries entirely and falling back to the serialized block content. Separately, we might want to implement error caching with short TTLs to prevent hammering during rate-limiting or when the API is experiencing errors.
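
One way to approximate the error caching suggested here, sketched with WordPress transients rather than the plugin's Guzzle cache middleware; the function name, key prefix, and 10-second TTL are hypothetical:

<?php
use GuzzleHttp\Client;
use GuzzleHttp\Exception\RequestException;
use Psr\Http\Message\ResponseInterface;

function fetch_with_error_cache( Client $client, string $url, int $error_ttl = 10 ): ?ResponseInterface {
	$cache_key = 'rdb_error_' . md5( $url );

	// If this URL failed recently, fail fast instead of hammering the API again.
	if ( false !== get_transient( $cache_key ) ) {
		return null;
	}

	try {
		return $client->request( 'GET', $url );
	} catch ( RequestException $e ) {
		// Remember the failure for a short window so concurrent renders back off.
		set_transient( $cache_key, true, $error_ttl );
		return null;
	}
}

A caller that gets null back could then fall back to the serialized block content, as suggested above.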

@ingeniumed
Contributor Author

Summary of the discussion with @chriszarate on the above:

  • We are going to remove the retry strategy entirely. Given the way PHP works, there isn't a way to re-create a lazy-loading model when rendering a post that would let us make the calls async.
  • We are going to cache error responses for a few seconds, and increase the cache time on successful responses.

@ingeniumed ingeniumed changed the title Change the backoff strategy to exponential Drop retry strategy and up cache time Apr 17, 2025
@ingeniumed
Contributor Author

ingeniumed commented Apr 17, 2025

Summary of the changes:

  1. Drop the retry strategy entirely
  2. Up the cache time for successful API requests to 5 mins

What I haven't done:

  1. Added caching for errors - Guzzle throws exceptions for error responses, so the caching layer never gets to cache those requests. Separately, the caching middleware we use doesn't accept multiple TTLs, and we don't want problematic responses to be cached as long as successful ones. This is going to require some experimentation to see how we can achieve it, so I'm punting it to another PR.
  2. Tracking requests and metrics - The work to send out notifications when a rate limit is being reached, or to provide analysis based on request metrics, hasn't been done. There is already an action for when a query response is available, so we can hook into that. There's also another PR in review, Custom query monitor panel, improve validation issue reporting #465, which will help with this work. Hence, this has also been punted to another PR.

@ingeniumed ingeniumed requested a review from chriszarate April 17, 2025 04:00
@chriszarate chriszarate merged commit 52d8caa into trunk Apr 18, 2025
13 checks passed
@chriszarate chriszarate deleted the add/rate-limiting-to-plugin branch April 18, 2025 13:27