
tests: reduce integ test flakiness + improve debugability #5282


Merged: 5 commits merged into aws:main from integ_fixes, May 7, 2025

Conversation

@lrstewart (Contributor) commented May 2, 2025

Release Summary:

Resolved issues:

Description of changes:

Our integ tests have been failing more often lately. I suspect it's a CodeBuild noisy neighbor problem. The main tests I see fail are "test_renegotiate" and "test_serialization".

I ran the integ tests on an EC2 instance that was also running stress -c 4 -i 4 -m 4 -d 4. It was almost comically effective: I could consistently reproduce the failures from the CI, plus more. The downside is that DHE becomes completely unusable under that load.

Initially, I focused my investigation on test_renegotiate. I attempted to explain the fix for test_renegotiate in a code comment; see if that makes sense.

Call-outs:

See #5282 (comment) for an attempt to explain what I'm doing with the stdout / stderr printing logic.

Testing:

The integ tests remain flaky because "test_serialization" is still flaky, but they should no longer fail for "test_renegotiate". test_renegotiate passes locally with stress running (and DHE cipher suites disabled), and I have not seen a test_renegotiate failure in the CI.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@lrstewart changed the title from "tests: reduce integ test flakiness + debugability" to "tests: reduce integ test flakiness + improve debugability" May 2, 2025
@github-actions github-actions bot added the s2n-core team label May 2, 2025
Comment on lines +98 to +99:

    # Always print the results
    if p.results:
@lrstewart (Contributor, Author) commented May 5, 2025
Okay so bear with me: ManagedProcess is async. We don't print stdout / stderr until after the results for a process are marked ready. Once the results for one process are marked ready, a test can assert the success of that process. If that assertion fails, we may exit the test without printing stdout / stderr. Usually this just results in failure to print stdout / stderr for one process, but because we mark the results ready BEFORE doing the print, theoretically we could fail to print stdout / stderr for both processes.

Rather than trying to clean up that async mess, I moved the printing here. This logic should ALWAYS trigger, regardless of failed assertions. Handling teardown logic is one of the specific responsibilities of fixtures.
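
To make the idea concrete, here is a minimal pytest sketch of the fixture-teardown approach, using a plain subprocess stand-in rather than the real ManagedProcess fixture (all names here are hypothetical):

    import subprocess
    import pytest

    @pytest.fixture
    def managed_processes():
        # Hypothetical stand-in for the real ManagedProcess fixture.
        processes = []

        def start(cmd):
            p = subprocess.Popen(
                cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
            )
            processes.append(p)
            return p

        yield start

        # Fixture teardown runs even if an assertion in the test body failed,
        # so stdout/stderr are always printed for every process.
        for p in processes:
            out, err = p.communicate()
            print(out)
            print(err)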

@lrstewart (Contributor, Author) commented

Code references to attempt to explain (see the sketch after this list):

  • get_results() waits for _results_ready(): link
  • _results_ready() waits for results to exist: link
  • results are set in run(): link
  • stdout/stderr are printed later in run(): link
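
A toy illustration of that ordering hazard (not the real ManagedProcess code, just the shape of the race): results are marked ready before the print, so a failing assertion can end the run before the output ever appears.

    import threading
    import time

    results_ready = threading.Event()
    results = {}

    def run():
        results["success"] = False   # results set...
        results_ready.set()          # ...and marked ready BEFORE printing
        time.sleep(1)                # (scheduling gap)
        print("stdout/stderr would be printed here")

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    results_ready.wait()
    # This assertion fails and the interpreter exits before the daemon
    # thread reaches its print, so the output is lost.
    assert results["success"]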

@lrstewart lrstewart requested review from goatgoose and maddeleine May 5, 2025 19:38
@lrstewart lrstewart marked this pull request as ready for review May 5, 2025 19:38
Comment on lines 447 to 448:

    4. Client s2n_recv reads ApplicationData
    5. Client s2n_renegotiate sends ClientHello
@lrstewart (Contributor, Author) commented

This is why I had to remove the select from the renegotiate method in echo.c. That select assumes there will be ApplicationData to read, but the previous call to s2n_recv that read the HelloRequest might also have read the ApplicationData.
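
The hazard is reproducible in plain Python, independent of s2n: once one large recv() has drained both records from the kernel buffer, a later select() has nothing to wait on.

    import select
    import socket

    a, b = socket.socketpair()
    # The peer writes two "records" back to back, like a HelloRequest
    # followed by ApplicationData.
    a.sendall(b"HELLO_REQUEST")
    a.sendall(b"APPLICATION_DATA")

    # One large recv() drains BOTH records from the kernel buffer, just as
    # s2n_recv may buffer the ApplicationData while reading the HelloRequest.
    buffered = b.recv(4096)

    # select() now sees an empty kernel buffer and times out, even though
    # unprocessed ApplicationData is sitting in `buffered`.
    readable, _, _ = select.select([b], [], [], 1.0)
    print("select saw readable data:", bool(readable))  # False
    print("already buffered:", buffered)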

Comment on lines -468 to -472:

    # In order to test the case where application data is received during renegotiation,
    # we must verify that the data was received after renegotiation started but before the new handshake finished.
    reneg_starts = stdout_str.find(S2N_RENEG_START_MARKER)
    reneg_finishes = stdout_str.find(S2N_RENEG_SUCCESS_MARKER)
    assert to_marker(first_server_app_data) in stdout_str[reneg_starts:reneg_finishes]
A contributor commented

If we can't reliably assert that we're testing what we think we are, I'm not sure I see much value in keeping the test. If breaking this behavior only results in flakiness, it seems unlikely that we'll actually catch the issue. Although I guess it doesn't hurt to keep the test anyway?

I wonder how often it's testing the wrong behavior, though. Like, could we keep running this test until this assertion is met, and fail the test if it isn't met after a few runs or something?

@lrstewart (Contributor, Author) commented May 5, 2025

I mean, it basically never failed in CI before this week. But with stress running, I put the check back and ran the tests a few times:
5 passed before first failure
9 passed before first failure
1 passed before first failure
0 passed before first failure
2 passed before first failure
1 passed before first failure
16 passed before first failure
There are 171 test cases :)

So the answer is that it usually works under normal conditions. But when the conditions get weird, the order of messages can get weird, and it has a high chance of failing.

> Like, could we keep running this test until this assertion is met, and fail the test if it isn't met after a few runs or something?

That would make sense to me if there were a well-defined X percent chance of failure, so we could choose a number of test runs that makes the actual failure chance known and acceptably small. But this is entirely "whims of the network"; the assertion could fail 100% of the time.
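
For contrast, the retry math under the (here unrealistic) assumption of a fixed, independent per-run failure probability p would look like this:

    import math

    # With independent per-run failure probability p, n retries all fail with
    # probability p**n, so n >= log(target) / log(p) runs are needed.
    p = 0.10        # hypothetical per-run failure chance
    target = 1e-6   # acceptable overall failure chance
    runs = math.ceil(math.log(target) / math.log(p))
    print(runs)     # 6, since 0.1**6 is about 1e-6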

I'd be fine with just deleting the test though. @maddeleine thoughts?

A contributor commented

Yeah, I don't like this. We're essentially flying blind without these asserts. This is unit tested, so we should just delete the test.
We need a different type of integ testing that is deterministic (e.g. not fiddling around with processes). Any of our sufficiently complex IO tests could go there, because we literally cannot get this right with our current framework.
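
A toy sketch of that deterministic style (an illustration only, not a proposal for a specific framework): both peers live in one process and every message hop is explicit, so tricky interleavings are forced rather than hoped for.

    from collections import deque

    class Peer:
        # Toy peer: reads from an inbox deque, writes to an outbox deque.
        def __init__(self, inbox, outbox):
            self.inbox, self.outbox = inbox, outbox

        def send(self, msg):
            self.outbox.append(msg)

        def recv(self):
            return self.inbox.popleft() if self.inbox else None

    client_to_server, server_to_client = deque(), deque()
    client = Peer(server_to_client, client_to_server)
    server = Peer(client_to_server, server_to_client)

    # The test drives every step, so orderings like "ApplicationData
    # coalesced with the HelloRequest" are deterministic, not scheduler luck.
    server.send("HelloRequest")
    server.send("ApplicationData")
    assert client.recv() == "HelloRequest"
    assert client.recv() == "ApplicationData"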

@lrstewart lrstewart requested a review from goatgoose May 6, 2025 16:26
@lrstewart lrstewart requested a review from goatgoose May 6, 2025 23:17
@lrstewart lrstewart added this pull request to the merge queue May 7, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks May 7, 2025
@lrstewart lrstewart added this pull request to the merge queue May 7, 2025
Merged via the queue into aws:main with commit cc8fb18 May 7, 2025
46 checks passed
@lrstewart lrstewart deleted the integ_fixes branch May 7, 2025 18:47
dougch pushed a commit to dougch/s2n-tls that referenced this pull request May 9, 2025