tests: reduce integ test flakiness + improve debugability #5282

Merged — merged 5 commits on May 7, 2025

Changes from 2 commits
8 changes: 4 additions & 4 deletions bin/echo.c
@@ -314,11 +314,11 @@ int renegotiate(struct s2n_connection *conn, int fd, bool wait_for_more_data)
     fprintf(stdout, "RENEGOTIATE\n");
     fflush(stdout);

-    /* Do not proceed with renegotiation until we receive more data from the server */
+    /* Pause before beginning renegotiation, giving the server the chance to
+     * potentially send more application data.
+     */
     if (wait_for_more_data) {
-        fd_set fds = { 0 };
-        FD_SET(fd, &fds);
-        select(FD_SETSIZE, &fds, NULL, NULL, NULL);
+        sleep(1);
     }

     while (s2n_renegotiate(conn, buffer, sizeof(buffer), &data_read, &blocked) != S2N_SUCCESS) {
23 changes: 22 additions & 1 deletion tests/integrationv2/fixtures.py
@@ -94,8 +94,29 @@ def _fn(
                 # is nothing to capture here.
                 pass
             finally:
-                # Whether the processes succeeded or not, clean then up.
                 for p in processes:
+                    # Always print the results
+                    if p.results:
Comment on lines +98 to +99
@lrstewart (Contributor Author) — May 5, 2025:
Okay so bear with me: ManagedProcess is async. We don't print stdout / stderr until after the results for a process are marked ready. Once the results for one process are marked ready, a test can assert the success of that process. If that assertion fails, we may exit the test without printing stdout / stderr. Usually this just results in failure to print stdout / stderr for one process, but because we mark the results ready BEFORE doing the print, theoretically we could fail to print stdout / stderr for both processes.

Rather than trying to clean up that async mess, I moved the printing here. This logic should ALWAYS trigger, regardless of failed assertions. Handling teardown logic is one of the specific responsibilities of fixtures.
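The move can be sketched with a toy teardown (hypothetical names, not the real integv2 fixture code): a `finally` block runs even when the test body fails an assertion, so results are never silently dropped.

```python
def run_with_teardown(test_body, processes, log):
    """Run a test body, but always log each process's results afterwards."""
    try:
        test_body()
    finally:
        # Mirrors the fixture teardown described above: this loop runs
        # whether or not test_body raised an AssertionError.
        for p in processes:
            log.append(f"results for {p}")

def failing_test():
    # Stand-in for a test that asserts on one process's results and fails.
    raise AssertionError("process exited nonzero")

log = []
try:
    run_with_teardown(failing_test, ["s2nc", "openssl"], log)
except AssertionError:
    pass

# Both processes' results were logged despite the failed assertion.
assert log == ["results for s2nc", "results for openssl"]
```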

@lrstewart (Contributor Author):

Code references to attempt to explain:

  • get_results() waits for _results_ready(): link
  • _results_ready() waits for results to exist: link
  • results are set in run(): link
  • stdout/stderr are printed later in run(): link
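The wait/ready handshake those references describe can be reduced to a toy sketch (simplified hypothetical class, not the real `ManagedProcess` API): results are marked ready via an event before stdout/stderr are printed, which is exactly the window where a failing assertion can cut the print off.

```python
import threading

class SketchProcess:
    """Simplified stand-in for ManagedProcess (hypothetical, for illustration)."""

    def __init__(self, cmd):
        self.cmd = cmd
        self.results = None
        self._results_ready = threading.Event()

    def run(self):
        # run() publishes the results and marks them ready BEFORE printing.
        # A test blocked in get_results() wakes up at set(), can fail an
        # assertion, and tear the test down before the print below runs.
        self.results = (0, b"stdout", b"stderr")
        self._results_ready.set()
        print(f"{self.cmd}: exit {self.results[0]}")

    def get_results(self):
        # Mirrors _results_ready(): block until run() has set the results.
        self._results_ready.wait()
        yield self.results

worker = SketchProcess("s2nc")
t = threading.Thread(target=worker.run)
t.start()
for exit_code, stdout, stderr in worker.get_results():
    assert exit_code == 0
t.join()
```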

+                        width = 90
+                        padchar = "#"
+
+                        print(padchar * width)
+                        print(f" {p.cmd_line[0]} ".center(width, padchar))
+                        print(padchar * width)
+
+                        print(f"Command line:\n\t{' '.join(p.cmd_line)}")
+                        print(f"Exit code:\n\t {p.results.exit_code}")
+                        print("")
+
+                        print(" Stdout ".center(width, padchar))
+                        print(p.results.stdout.decode("utf-8", "backslashreplace"))
+                        print("")
+
+                        print(" Stderr ".center(width, padchar))
+                        print(p.results.stderr.decode("utf-8", "backslashreplace"))
+                        print("")
+
+                    # Whether the processes succeeded or not, clean them up.
                     if aborted:
                         p.kill()
                     else:
23 changes: 1 addition & 22 deletions tests/integrationv2/processes.py
@@ -12,7 +12,7 @@

 _PopenSelector = selectors.PollSelector
 _PIPE_BUF = getattr(select, "PIPE_BUF", 512)
-_DEBUG_LEN = 80
+_DEBUG_LEN = 200


class _processCommunicator(object):
@@ -475,27 +475,6 @@ def run(self):
                     self.expect_stderr,
                 )
                 raise ex
-            finally:
-                # This data is dumped to stdout so we capture this
-                # information no matter where a test fails.
-                print("###############################################################")
-                print(
-                    f"####################### {self.cmd_line[0]} #######################"
-                )
-                print("###############################################################")
-
-                print(f"Command line:\n\t{' '.join(self.cmd_line)}")
-                print(f"Exit code:\n\t {proc.returncode}")
-
-                print("##########################################################")
-                print("########################### Stdout #######################")
-                print("##########################################################")
-                print(proc_results[0].decode("utf-8", "backslashreplace"))
-
-                print("#########################################################")
-                print("########################### Stderr #######################")
-                print("#########################################################")
-                print(proc_results[1].decode("utf-8", "backslashreplace"))

     def kill(self):
         self.proc.kill()
57 changes: 47 additions & 10 deletions tests/integrationv2/test_renegotiate.py
@@ -191,6 +191,7 @@ def basic_reneg_test(
     provider,
     messages=RENEG_MESSAGES,
     reneg_option=None,
+    timeout=8,
 ):
     options = ProviderOptions(
         port=next(available_ports),
@@ -217,15 +218,15 @@
         provider,
         server_options,
         send_marker=Msg.send_markers(messages, Provider.ServerMode),
-        timeout=8,
+        timeout=timeout,
     )

     s2n_client = managed_process(
         S2N,
         client_options,
         send_marker=Msg.send_markers(messages, Provider.ClientMode),
         close_marker=Msg.close_marker(messages),
-        timeout=8,
+        timeout=timeout,
     )

     return (s2n_client, server)
@@ -422,6 +423,49 @@ def test_s2n_client_renegotiate_with_client_auth_with_openssl(

"""
The s2n-tls client successfully reads ApplicationData during the renegotiation handshake.

This test is best-effort, and flakiness could be a sign of problems with how
s2n-tls handles ApplicationData during the renegotiation handshake.

Ideally the actions in the test would be:
1. Server sends the HelloRequest
2. Client s2n_recv reads HelloRequest
3. Client s2n_renegotiate sends ClientHello
4. Server sends ApplicationData
5. Server reads ClientHello, begins new handshake
6. Server sends ServerHello
7. Client s2n_renegotiate reads ApplicationData

Notice that the ApplicationData appears between two handshake messages,
so it is read by s2n_renegotiate instead of s2n_recv. That is the scenario
we are attempting to test.

But that means that the actions might occur in this order:
1. Server sends the HelloRequest
2. Server sends ApplicationData
3. Client s2n_recv reads HelloRequest
4. Client s2n_recv reads ApplicationData
5. Client s2n_renegotiate sends ClientHello
@lrstewart (Contributor Author):
This is why I had to remove the select from the renegotiate method in echo.c. That select assumes there will be ApplicationData to read, but the previous call to s2n_recv that read the HelloRequest might also have read the ApplicationData.

6. Server reads ClientHello, begins new handshake
7. Server sends ServerHello

You might think we could avoid this scenario by waiting to send the server
ApplicationData until the server reads the ClientHello. However, at that point
the server cannot send any ApplicationData because it has begun a new handshake.
There is no other viable trigger for the ApplicationData, which is why this test
is only best-effort.

The actions might also occur in this order:
1. Server sends the HelloRequest
2. Client s2n_recv reads HelloRequest
3. Client s2n_renegotiate sends ClientHello
4. Server reads ClientHello, begins new handshake
5. Server sends ServerHello
6. ... rest of the handshake ...
7. Server sends ApplicationData
8. Client s2n_recv reads ApplicationData
We can make this scenario less likely by forcing the client to wait a few seconds
before sending the ClientHello, but it is still theoretically possible.
"""


@@ -440,7 +484,6 @@ def test_s2n_client_renegotiate_with_app_data_with_openssl(
     protocol,
     provider,
 ):
-    first_server_app_data = Msg.expected_output(RENEG_MESSAGES, Provider.ClientMode)[0]
     (s2n_client, server) = basic_reneg_test(
         managed_process,
         cipher,
@@ -449,6 +492,7 @@
         protocol,
         provider,
         reneg_option=S2N_RENEG_WAIT,
+        timeout=30,  # Set the timeout much higher than the wait
     )

     for results in server.get_results():
@@ -463,10 +507,3 @@ def test_s2n_client_renegotiate_with_app_data_with_openssl(
     for output in Msg.expected_output(RENEG_MESSAGES, Provider.ClientMode):
         assert output in results.stdout
     assert renegotiate_was_successful(results)
-    stdout_str = str(results.stdout)
-
-    # In order to test the case where application data is received during renegotiation,
-    # we must verify that the data was received after renegotiation started but before the new handshake finished.
-    reneg_starts = stdout_str.find(S2N_RENEG_START_MARKER)
-    reneg_finishes = stdout_str.find(S2N_RENEG_SUCCESS_MARKER)
-    assert to_marker(first_server_app_data) in stdout_str[reneg_starts:reneg_finishes]
Comment on lines -468 to -472
Contributor:

If we can't reliably assert that we're testing what we think we are, I'm not sure I see much value in keeping the test. If we cause this test to break and it only results in flakiness, it seems unlikely that we'll actually catch the issue? Although I guess it doesn't hurt to keep the test anyway?

I wonder how often it's testing the wrong behavior though. Like, could we keep running this test until this assertion is met, and fail the test if it isn't met after a few runs or something?

@lrstewart (Contributor Author) — May 5, 2025:

I mean, it basically never failed in CI before this week. But with stress running, I put the check back and ran the tests a few times:

  • 5 passed before first failure
  • 9 passed before first failure
  • 1 passed before first failure
  • 0 passed before first failure
  • 2 passed before first failure
  • 1 passed before first failure
  • 16 passed before first failure

There are 171 test cases :)

So the answer is that it usually works under normal conditions. But when the conditions get weird, the order of messages can get weird, and it has a high chance of failing.

> Like, could we keep running this test until this assertion is met, and fail the test if it isn't met after a few runs or something?

That would make sense to me if there was a well-defined X percent chance of failure so we could choose the number of test runs such that the actual failure chance is known and acceptably small. But this is entirely "whims of the network". The assertion could fail 100% of the time.
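To make the retry math concrete (the numbers below are illustrative, not measured failure rates): with an independent per-run failure probability p, n attempts all fail with probability p^n, which only bounds the flakiness when p is actually known and below 1.

```python
def chance_all_attempts_fail(p_fail, attempts):
    """Probability that `attempts` independent runs all fail."""
    return p_fail ** attempts

# A stable 30% per-run failure rate shrinks fast with retries...
assert round(chance_all_attempts_fail(0.30, 4), 4) == 0.0081
# ...but if "whims of the network" push the per-run rate to 100%,
# no retry count helps, which is the objection above.
assert chance_all_attempts_fail(1.0, 100) == 1.0
```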

I'd be fine with just deleting the test though. @maddeleine thoughts?

Contributor:

Yeah I don't like this. We're essentially flying blind without these asserts. This is unit tested so we should just delete the test.
We need a different type of integ testing that is deterministic (e.g. not fiddling around with processes). Any of our sufficiently complex IO tests could go there, because we literally cannot get this right with our current framework.
