Skip to content

Add failfast option #133

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Sep 18, 2024
Merged

Add failfast option #133

merged 18 commits into from
Sep 18, 2024

Conversation

nickrobinson251
Copy link
Collaborator

@nickrobinson251 nickrobinson251 commented Jan 6, 2024

Since for us "running tests" involves running test-items which themselves can run multiple tests, we can "fail fast" at two levels:

  • Stop runtests from running new testitems as soon as one returns as a failure/error (mark that whole run as a failure)
  • Stop @testitem from running the tests inside as soon as there is a failure/error (mark that test-item as a failure)

Originally this PR added just a failfast keyword to runtests: runtests(..., failfast=true) stops as soon as any testitem fails.

But i've extended it to add the ability for an @testitem to set failfast=true: @testitem "foo" failfast=true stops on the first test error/failure. This can be set for all test-items by a second new runtests keyword named testitem_failfast, (i.e. this can be used to set the default for all testitems).

This matches how @testitem "foo" timeout=60 corresponds to runtests(...; testitem_timeout=60)

failfast defaults to false. I've set testitem_failfast to default to the same value as given to failfast (if not set explicitly on a @testitem).

So the proposed behaviour is:

  • runtests(...) => neither runtests nor individual testitems stop early
  • runtests(...; failfast=true) => both stop early
  • runtests(...; testitem_failfast=true) => only testitems stop early
  • runtests(...; failfast=true, testitem_failfast=false) => only runtests stops early

API question

Would it be simpler to have the separate keywords operate independently?

The downside would be you have to set both to true to get "fail fastest", i.e. to both have individual testitems stop when they hit an error/failure, and to have no new testitems run once one has hit an error/failure you have to run with runtests(...; failfast=true, testitem_failfast=true)

Alternative (keywords operate independently):

  • runtests(...;) => neither runtests nor individual testitems stop early
  • runtests(...; failfast=true) => only runtests stops early
  • runtests(...; testitem_failfast=true) => only testitems stop early
  • runtests(...; failfast=true, testitem_failfast=true) => both stop early

Really the question is whether failfast=true should default to turning on both forms of early stopping (as currently proposed by this PR), or if it should mean only runtests fails fast?

I would quite like the seperation of the two... one keyword controls one, the other controls the other... BUT I think in practice it is more ergonomic for users to just pass failfast=true to get the "fastest" failures (hence proposing that) ...i'd love some feedback on this decision!

@nickrobinson251 nickrobinson251 force-pushed the npr-fail-fast branch 4 times, most recently from 8bd277b to 9051fd7 Compare January 12, 2024 00:55
@nickrobinson251 nickrobinson251 changed the title [WIP] Add failfast option Add failfast option Jan 12, 2024
@nickrobinson251 nickrobinson251 marked this pull request as ready for review January 12, 2024 01:18
Copy link
Collaborator

@Drvi Drvi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the feature and I agree with the current API. 👍

I didn't manage to get through all of this today, will continue tomorrow.

Copy link
Collaborator

@Drvi Drvi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me!

It's a bit sad that we can't guarantee actual fast failing when nworkers > 1 since then the current test items scheduled on other test workers might take up to timeout seconds to finish. What if we pretend that the timeout is effectively zero and just kill any other workers to guarantee truly fast failure? The user likely doesn't care about all the test results when they run runtests with failfast... especially if we'd call the kwarg failfastandfurious:)

@@ -278,7 +301,7 @@ end
# By tracking and reusing test environments, we can avoid this issue.
const TEST_ENVS = Dict{String, String}()

function _runtests(ti_filter, paths, nworkers::Int, nworker_threads::String, worker_init_expr::Expr, test_end_expr::Expr, testitem_timeout::Int, retries::Int, memory_threshold::Real, verbose_results::Bool, debug::Int, report::Bool, logs::Symbol, timeout_profile_wait::Int, gc_between_testitems::Bool)
function _runtests(ti_filter, paths, nworkers::Int, nworker_threads::String, worker_init_expr::Expr, test_end_expr::Expr, testitem_timeout::Int, retries::Int, memory_threshold::Real, verbose_results::Bool, debug::Int, report::Bool, logs::Symbol, timeout_profile_wait::Int, gc_between_testitems::Bool, failfast::Bool, testitem_failfast::Bool)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessarily for this PR, but I think we should just bundle all these args into a Context struct

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for sure! #186

@nickrobinson251
Copy link
Collaborator Author

It's a bit sad that we can't guarantee actual fast failing when nworkers > 1 since then the current test items scheduled on other test workers might take up to timeout seconds to finish. What if we pretend that the timeout is effectively zero and just kill any other workers to guarantee truly fast failure?

Yeah, i wonder about that too 🤔 It was so long ago that i did this i can't remember why i didn't go kill the other workers... I'll need to look into it. I suspect/hope it was laziness/simplicity (i.e. this implementation is so simple, because we just have next_testitem check for a Bool, so there's no need to coordinate workers)

In fact, i wonder if it's even worse than "might take up to timeout seconds to finish", because of retries? in which case that could be pretty bad, and we'd have to just name this failfastish or failabitfaster

@nickrobinson251
Copy link
Collaborator Author

nickrobinson251 commented Sep 18, 2024

Buuut, also i've no time to work on this at the minute, and it's a pain to keep rebasing and a bit of a shame not to have it at least in it's current form... so i think i might just update the documentation to call-out that we wait for testitems on others workers to finish in the current implementation, and say this may change in future releases to proactively cancel other running testitems to enable even faster failures, and merge what's here (i.e. let us land that improvement in a follow-up, non-breaking release) -- what do you think?

To make clear this may change in a non-breaking release
@Drvi
Copy link
Collaborator

Drvi commented Sep 18, 2024

Good point about the retries! I wonder if retries could be skipped relatively easily by checking first whether the run has been canceled. But I think it's fine to refine the behavior of this feature in the future 👍

@nickrobinson251 nickrobinson251 enabled auto-merge (squash) September 18, 2024 12:38
@nickrobinson251 nickrobinson251 merged commit fc7845b into main Sep 18, 2024
7 checks passed
@nickrobinson251 nickrobinson251 deleted the npr-fail-fast branch September 18, 2024 12:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature request: Fail-fast option
2 participants