
Write and run API k6 load tests during staging API deployments #5000

Open
@sarayourfriend


Problem

We rely on staging to verify that new code deployments are zero-downtime safe. Zero-downtime safety can only be verified while the service is actually handling and responding to requests. Some migrations that are judged not zero-downtime safe would never cause problems in a service that is not handling requests, because zero-downtime safety is precisely about whether the service can handle and respond to requests while two versions of the application are running at the same time. Nothing is verified if the versions do not handle requests during that period.

Because the staging API sees virtually no traffic, and certainly no consistent traffic, we cannot rely on staging deployments to verify zero-downtime safety from this perspective.

Description

Write k6 load tests that can run on a timer and exercise all non-deprecated media requests: search, thumbnail, waveform, related, single result. We will rely on HMAC request signing to bypass caching and rate limiting.
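As a rough sketch of how the request kinds listed above could be enumerated in a test script, the helper below builds one path per kind. The `/v1/...` path shapes, the `mediaPath` name, and the assumption that waveforms exist only for audio are illustrative and not taken from this issue; check them against the actual API routes before use.

```typescript
// Hypothetical route helper. Path shapes are assumptions, not confirmed by this issue.
type MediaType = "images" | "audio";
type RequestKind = "search" | "single" | "thumbnail" | "related" | "waveform";

function mediaPath(
  media: MediaType,
  kind: RequestKind,
  id: string = "",
  query: string = "",
): string {
  const base = `/v1/${media}/`;
  switch (kind) {
    case "search":
      // Search takes a free-text query; encode it so spaces and symbols are URL-safe.
      return `${base}?q=${encodeURIComponent(query)}`;
    case "single":
      return `${base}${id}/`;
    case "thumbnail":
      return `${base}${id}/thumb/`;
    case "related":
      return `${base}${id}/related/`;
    case "waveform":
      // Assumption: waveforms are only served for audio results.
      if (media !== "audio") {
        throw new Error("waveform is assumed to exist only for audio");
      }
      return `${base}${id}/waveform/`;
  }
}

// Build every path a single iteration would request for one audio result.
const audioKinds: RequestKind[] = ["search", "single", "thumbnail", "related", "waveform"];
const audioPaths: string[] = audioKinds.map((kind) =>
  mediaPath("audio", kind, "some-id", "piano solo"),
);
```

In a k6 script, each path would be appended to the target base URL and requested through the signing `http.ts` wrapper rather than k6's `http` module directly.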

Ideally, the tests would also register new OAuth applications and make authenticated requests. However, there is currently no way to register and verify an application programmatically. Enable a new option in the staging API that auto-verifies OAuth applications if the request has a valid HMAC signature. With that in place, we can add tests that exercise the authentication workflow and make authenticated requests. (This will require adding the HMAC signing secret as an environment variable to the staging API.)

The tests must be able to run using one of the fixed-duration k6 executors (probably constant-vus, but possibly ramping-vus if the staging API needs to warm up before the deployment rather than jumping straight to the test's peak traffic level).
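To illustrate the choice between the two executors, here is a minimal sketch of a helper that returns a k6 scenario definition. The `deploymentScenario` name, the VU counts, and the warm-up length are hypothetical. The `executor`, `vus`, `duration`, `startVUs`, and `stages` keys are the standard k6 scenario options for `constant-vus` and `ramping-vus`.

```typescript
// Shapes of the two scenario kinds discussed above (a subset of k6's scenario options).
type Scenario =
  | { executor: "constant-vus"; vus: number; duration: string }
  | {
      executor: "ramping-vus";
      startVUs: number;
      stages: { duration: string; target: number }[];
    };

// Hypothetical helper: hold peak traffic for the whole run, or ramp up first
// when a warm-up period is requested.
function deploymentScenario(
  peakVUs: number,
  totalMinutes: number,
  warmupMinutes: number = 0,
): Scenario {
  if (warmupMinutes <= 0) {
    // No warm-up: jump straight to peak and hold it for the full duration.
    return { executor: "constant-vus", vus: peakVUs, duration: `${totalMinutes}m` };
  }
  if (warmupMinutes >= totalMinutes) {
    throw new Error("warm-up must be shorter than the total run");
  }
  return {
    executor: "ramping-vus",
    startVUs: 0,
    stages: [
      // Ramp from 0 to peak during the warm-up...
      { duration: `${warmupMinutes}m`, target: peakVUs },
      // ...then hold peak for the remainder of the run.
      { duration: `${totalMinutes - warmupMinutes}m`, target: peakVUs },
    ],
  };
}

// In a k6 script this object would be exported as `options`.
const options = { scenarios: { staging_deploy: deploymentScenario(10, 15, 1) } };
```

Both variants run for a fixed wall-clock duration, which is what allows the test window to be lined up with the deployment.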

Like the frontend k6 local tests, they should also run against the local API during CI on pull requests.

Unlike the frontend k6 staging tests, which execute post-deployment, the API tests will execute during deployment. Initiate the k6 tests as a parallel task to dispatching the staging deployment workflow. The staging API typically takes 8–10 minutes to deploy, so the k6 tests should run long enough before and after the deployment to give a head and a tail to the peak traffic levels relative to the deployment period. For example, k6 could be started at least 2 minutes before triggering the staging deployment and allowed to run for 15 minutes total, resulting in a 2-minute head (+/- the time it takes the deployment GitHub workflow to start and reach the point of deploying) and a 3–5 minute tail of traffic around the deployment period.
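The head and tail arithmetic can be checked with a small helper. This is only a sketch: the `trafficWindow` name and its parameters are hypothetical, and it ignores the time the deployment workflow needs to start, which lengthens the head and shortens the tail by the same amount.

```typescript
interface TrafficWindow {
  headMinutes: number;
  tailMinMinutes: number; // tail if the deployment takes the longest expected time
  tailMaxMinutes: number; // tail if the deployment takes the shortest expected time
}

// Hypothetical helper: given the total k6 run time, the head before the deployment
// is triggered, and the expected deployment duration range, compute the tail range.
function trafficWindow(
  totalMinutes: number,
  headMinutes: number,
  deployMinMinutes: number,
  deployMaxMinutes: number,
): TrafficWindow {
  const tailMinMinutes = totalMinutes - headMinutes - deployMaxMinutes;
  const tailMaxMinutes = totalMinutes - headMinutes - deployMinMinutes;
  if (tailMinMinutes < 0) {
    throw new Error("the test would end before the slowest expected deployment finishes");
  }
  return { headMinutes, tailMinMinutes, tailMaxMinutes };
}

// 15-minute run, 2-minute head, 8 to 10 minute deployment.
const plan = trafficWindow(15, 2, 8, 10);
```

With a 15-minute run, a 2-minute head, and an 8–10 minute deployment, this yields a tail of 3 to 5 minutes before accounting for workflow startup time.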

Steps, to be done in separate PRs:

  • Add HMAC signature verification to OAuth registration route and auto-verify when a valid signature is supplied.
  • Add the HMAC signing secret to the staging API environment variables to support the above.
  • Write API k6 tests and sign all requests with the HMAC signing secret. Follow the example of the frontend tests, which already implement this using the http.ts utility. Enabling HMAC signing requires no work beyond using the custom http.ts wrapper utility instead of k6's http module directly.
  • Run API k6 tests against the local API in CI.
  • Run API k6 tests according to the process described above during staging deployments.

Additional context

I've written this issue in response to a recent incident which highlighted the differences between staging and production as a vulnerability to our confidence in staging as a representative environment that we can trust to validate changes to the fullest possible extent.

Metadata

    Labels

      • 🌟 goal: addition (Addition of new feature)
      • 💻 aspect: code (Concerns the software code in the repository)
      • 🔒 staff only (Restricted to staff members)
      • 🟨 priority: medium (Not blocking but should be addressed soon)
      • 🧱 stack: infra (Related to the Terraform config and other infrastructure)

    Projects

    Status

    📋 Backlog
