tests: reduce integ test flakiness + improve debugability #5282
Conversation
# Always print the results
if p.results:
Okay so bear with me: ManagedProcess is async. We don't print stdout / stderr until after the results for a process are marked ready. Once the results for one process are marked ready, a test can assert the success of that process. If that assertion fails, we may exit the test without printing stdout / stderr. Usually this just results in failure to print stdout / stderr for one process, but because we mark the results ready BEFORE doing the print, theoretically we could fail to print stdout / stderr for both processes.
Rather than trying to clean up that async mess, I moved the printing here. This logic should ALWAYS trigger, regardless of failed assertions. Handling teardown logic is one of the specific responsibilities of fixtures.
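A minimal sketch of the fixture idea, using subprocess.Popen as a stand-in for ManagedProcess (the names and structure here are illustrative, not the actual s2n-tls fixture):

```python
import subprocess

import pytest


@pytest.fixture
def managed_process():
    procs = []

    def _start(cmd):
        p = subprocess.Popen(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
        )
        procs.append(p)
        return p

    yield _start

    # Fixture teardown runs even if the test body raised an AssertionError,
    # so stdout / stderr are always printed for every managed process.
    for p in procs:
        out, err = p.communicate()
        if out:
            print(out)
        if err:
            print(err)
```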
4. Client s2n_recv reads ApplicationData
5. Client s2n_renegotiate sends ClientHello
This is why I had to remove the select from the renegotiate method in echo.c. That select assumes there will be ApplicationData to read, but the previous call to s2n_recv that read the HelloRequest might also have read the ApplicationData.
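To make that failure mode concrete, here is a hedged sketch using plain sockets (not the actual echo.c logic): if the peer coalesces the HelloRequest and the ApplicationData into one flight, the first recv() can consume both, and a later select() waiting for the ApplicationData blocks until it times out.

```python
import select
import socket


def read_two_messages(sock: socket.socket) -> list[bytes]:
    # The removed pattern, sketched: wait for readability before each
    # expected message. If the first recv() already pulled both messages
    # off the wire, the second select() never fires even though the
    # data was delivered.
    messages = []
    for _ in range(2):
        readable, _, _ = select.select([sock], [], [], 5.0)
        if not readable:
            raise TimeoutError("data may have been consumed by an earlier recv()")
        messages.append(sock.recv(4096))
    return messages
```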
# In order to test the case where application data is received during renegotiation,
# we must verify that the data was received after renegotiation started but before the new handshake finished.
reneg_starts = stdout_str.find(S2N_RENEG_START_MARKER)
reneg_finishes = stdout_str.find(S2N_RENEG_SUCCESS_MARKER)
assert to_marker(first_server_app_data) in stdout_str[reneg_starts:reneg_finishes]
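One debugability note on the marker check above (a sketch, not part of this diff): str.find() returns -1 when a marker is missing, which silently shrinks the slice to a near-empty window instead of producing a useful failure. Asserting the indices first makes a missing marker obvious:

```python
reneg_starts = stdout_str.find(S2N_RENEG_START_MARKER)
reneg_finishes = stdout_str.find(S2N_RENEG_SUCCESS_MARKER)
# find() returns -1 for a missing marker; fail loudly rather than
# letting the slice below quietly become (almost) empty.
assert reneg_starts != -1, "renegotiation start marker not in stdout"
assert reneg_finishes != -1, "renegotiation success marker not in stdout"
assert to_marker(first_server_app_data) in stdout_str[reneg_starts:reneg_finishes]
```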
If we can't reliably assert that we're testing what we think we are, I'm not sure I see much value in keeping the test. If we cause this test to break and it only results in flakiness, it seems unlikely that we'll actually catch the issue? Although I guess it doesn't hurt to keep the test anyway?
I wonder how often it's testing the wrong behavior though. Like, could we keep running this test until this assertion is met, and fail the test if it isn't met after a few runs or something?
I mean, it basically never failed in CI before this week. But with stress running, I put the check back and ran the tests a few times:
5 passed before first failure
9 passed before first failure
1 passed before first failure
0 passed before first failure
2 passed before first failure
1 passed before first failure
16 passed before first failure
There are 171 test cases :)
So the answer is that it usually works under normal conditions. But when the conditions get weird, the order of messages can get weird, and it has a high chance of failing.
Like, could we keep running this test until this assertion is met, and fail the test if it isn't met after a few runs or something?
That would make sense to me if there was a well-defined X percent chance of failure so we could choose the number of test runs such that the actual failure chance is known and acceptably small. But this is entirely "whims of the network". The assertion could fail 100% of the time.
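(To make that concrete: if each run independently failed a fixed 10% of the time, retrying five times would leave only a 0.1^5 = 0.001% chance of a spurious failure. With no stable failure rate, no retry count gives a bound like that.)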
I'd be fine with just deleting the test though. @maddeleine thoughts?
Yeah, I don't like this. We're essentially flying blind without these asserts. This is unit tested, so we should just delete the test.
We need a different type of integ testing that is deterministic (e.g., not fiddling around with processes). Any of our sufficiently complex IO tests could go there, because we literally cannot get this right with our current framework.
Release Summary:
Resolved issues:
Description of changes:
Our integ tests have been failing more often lately. I suspect it's a Codebuild noisy neighbor problem. The main tests I see fail are "test_renegotiate" and "test_serialization".
I ran the integ tests on an ec2 instance also running stress -c 4 -i 4 -m 4 -d 4. It was almost comically effective: I was able to consistently repro the failures from the CI, plus more. The downside is we're completely unable to use DHE (:
Initially, I focused my investigation on test_renegotiate. I attempted to explain my fix for test_renegotiate in a code comment; see if that makes sense.
Call-outs:
See #5282 (comment) for an attempt to explain what I'm doing with the stdout / stderr printing logic.
Testing:
The integ tests are still flaky because "test_serialization" is still flaky. But they should not fail for "test_renegotiate". test_renegotiate passes locally with stress running (and DHE cipher suites disabled), and I have not seen a test_renegotiate failure in the CI.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.