Problem
We rely on staging to verify that new code deployments are zero-downtime safe. Zero-downtime safety can only be verified if the service is actually handling and responding to requests: it is precisely the question of whether the service continues to serve requests correctly while two versions of the application are running at the same time. A migration judged not zero-downtime safe may never cause a visible problem in a service that is not handling requests, so nothing is verified if the two versions receive no traffic during that period.
Because the staging API sees virtually no traffic, and certainly no consistent traffic, we cannot rely on staging deployments to verify zero-downtime safety from this perspective.
Description
Write k6 load tests that can run on a timer and exercise all non-deprecated media endpoints: search, thumbnail, waveform, related, and single result. We will rely on HMAC request signing to bypass caching and rate limiting.
Ideally, the tests would also register new OAuth applications and make authenticated requests. However, there is currently no way to register and verify an application programmatically. Add a new option to the staging API that auto-verifies OAuth applications when the registration request carries a valid HMAC signature; we can then add tests that exercise the authentication workflow and make authenticated requests. (This will require adding the HMAC signing secret as an environment variable to the staging API.)
The tests must be able to run using one of the constant timed k6 executors: probably `constant-vus`, but maybe `ramping-vus` if the staging API needs to warm up for request handling before the deployment rather than jumping straight to the peak traffic level of the test.
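The two executor configurations could look like the following sketch. VU counts and durations are placeholders, not tuned values; a real script would pick one scenario and `export const options`:

```javascript
// Illustrative k6 scenario options for the two executors mentioned above.
// In an actual k6 script this would be `export const options = {...}`,
// and only one scenario would be active; both are shown for comparison.
const options = {
  scenarios: {
    // constant-vus: jump straight to peak load for the whole window.
    constant_load: {
      executor: "constant-vus",
      vus: 10, // placeholder VU count
      duration: "15m",
    },
    // ramping-vus: warm the API up before holding peak load.
    ramping_load: {
      executor: "ramping-vus",
      startVUs: 0,
      stages: [
        { duration: "2m", target: 10 }, // warm-up head before the deployment
        { duration: "13m", target: 10 }, // hold peak through deploy and tail
      ],
    },
  },
};
```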
Unlike the frontend k6 staging tests, which execute post-deployment, the API tests will execute during the deployment. Initiate the k6 tests as a parallel task to dispatching the staging deployment workflow. The staging API typically takes 8–10 minutes to deploy, so the k6 tests should execute for a sufficient period before and after the deployment to give a head and tail of peak traffic around the deployment period. For example, k6 could be started at least 2 minutes before triggering the staging deployment and allowed to run for 15 minutes total, resulting in a 2-minute head (plus or minus the time it takes the deployment GitHub Workflow to start and reach the point of deploying) and a 3–5 minute tail of traffic relative to the deployment period.
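The head/tail arithmetic can be double-checked in a couple of lines, using the numbers from the paragraph above:

```javascript
// Check the head/tail timing arithmetic for the 15-minute k6 run.
const totalMinutes = 15; // total k6 run time
const headMinutes = 2; // k6 starts this long before the deployment is triggered
const deployMinutes = [8, 10]; // typical staging API deployment time

// Tail = time left after the head and the deployment itself.
const tail = deployMinutes.map((d) => totalMinutes - headMinutes - d);
console.log(tail); // [5, 3] — i.e. a 3–5 minute tail
```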
Steps, to be done in separate PRs:
- Add HMAC signature verification to the OAuth registration route and auto-verify when a valid signature is supplied.
- Add the HMAC signing secret to the staging API environment variables to support the above.
- Write the API k6 tests and sign all requests with the HMAC signing secret. Follow the example from the frontend tests, which already implement this using the `http.ts` utility. No work needs to be done to enable HMAC signing other than using the custom `http.ts` wrapper utility instead of k6's `http` directly.
- Run the API k6 tests against the local API in CI.
- Run the API k6 tests during staging deployments, according to the process described above.
Additional context
I've written this issue in response to a recent incident that highlighted the differences between staging and production as a weakness in our confidence in staging as a representative environment we can trust to validate changes to the fullest possible extent.