Description
The current specification for the `run_loop::schedule()` operation appears to be to simply wait until the worker thread dequeues the task and only then check whether a stop-request was sent, calling `set_stopped()` if there was a stop-request and `set_value()` otherwise.

This could be problematic when trying to use `run_loop` with the `get_delegation_scheduler` facility and could lead to deadlocks.
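In other words, the worker only looks at the stop-token once it has already dequeued the item. A minimal model of that ordering, using purely illustrative names (this is not the standard's specification):

```cpp
#include <deque>
#include <functional>
#include <stop_token>
#include <utility>

// Hypothetical model of the behaviour described above: the stop-request is
// only observed *after* the worker has dequeued the task, so a cancelled task
// still has to wait its turn in the queue.
struct pending_task {
    std::stop_token st;                 // the receiver's stop-token
    std::function<void()> on_value;     // models set_value()
    std::function<void()> on_stopped;   // models set_stopped()
};

void worker_drain(std::deque<pending_task>& queue) {
    while (!queue.empty()) {
        pending_task task = std::move(queue.front());
        queue.pop_front();
        if (task.st.stop_requested())
            task.on_stopped();          // stop handled only at dequeue time
        else
            task.on_value();
    }
}
```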
For example, say someone schedules some work on another thread and wants to block waiting for that work to complete using `sync_wait()`. This will inject a `run_loop` scheduler as the `get_delegation_scheduler`.

The user, wanting to make use of the `get_delegation_scheduler`, schedules work on a composite scheduler that tries to use a primary scheduler, but also schedules onto the delegation scheduler so that the work can run either on the current thread or on some other context. This way, if all other threads on the other context are busy/blocked, we can still make forward progress on the task using the current thread (see the sketch below).
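For concreteness, that composite pattern might look roughly like this; `when_any` is not a standard algorithm, and every name below is illustrative rather than taken from the issue:

```cpp
#include <execution>

namespace ex = std::execution;

// Illustrative only: submit the same work to the primary scheduler and to the
// delegation scheduler, and let whichever dequeues it first actually run it.
// The losing branch must then be cancellable.
auto race_work(ex::scheduler auto primary,
               ex::scheduler auto delegation,
               auto work)
{
    // when_any is assumed to be a library-provided algorithm (not in the
    // standard) that completes with the first child and cancels the other.
    return when_any(
        ex::schedule(primary)    | ex::then(work),
        ex::schedule(delegation) | ex::then(work));
}
```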
But this approach of scheduling a task on each scheduler and running on whichever completes first only really works if both schedulers support "synchronous cancellation": when a stop-request is sent, either the operation completes inline in the call to `request_stop()` with `set_stopped()`, or it is guaranteed to eventually complete with `set_value()` (i.e. some thread has already dequeued the task and is about to call, or is already calling, `set_value()`).

This property allows whichever scheduler completed first to cancel the schedule operation on the other scheduler and then block waiting for the cancellation to finish before continuing to signal completion.
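A minimal sketch of that completion protocol, with assumed names: each branch's receiver cancels the other branch, and the overall result is only forwarded once both branches have completed.

```cpp
#include <atomic>
#include <stop_token>

// Assumed names throughout. Whichever branch completes first requests a stop
// on the shared stop-source; the parent completion is only forwarded once the
// counter reaches zero, i.e. once the cancelled branch has also completed.
struct race_shared_state {
    std::inplace_stop_source stop_src;   // seen by both schedule operations
    std::atomic<int> remaining{2};

    template <class Notify>
    void on_branch_complete(Notify notify_parent) {
        stop_src.request_stop();          // cancel the other branch
        if (remaining.fetch_sub(1, std::memory_order_acq_rel) == 1)
            notify_parent();              // last branch out forwards the result
    }
};
// If the cancelled run_loop-schedule operation never completes, `remaining`
// never reaches zero and the whole composed operation hangs.
```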
However, the current specification of `run_loop` does not have this behaviour, so there is no guarantee that, if the other scheduler completed first, the cancelled run_loop-schedule operation will complete in a timely manner (or at all).
Activity
LeeHowes commented on Oct 15, 2024
Do you have any thoughts on what the wording change needs to be? This whole design space is difficult to get right so the gap isn't especially surprising.
lewissbaker commented on Oct 15, 2024
The remedy here would be to change the `run_loop::run-loop-opstate-base` to have a `prev` pointer, making it a doubly linked list.

Then have `run-loop-opstate<Rcvr>` have two specialisations: one default one which registers a stop-callback, and one constrained by `unstoppable_token<stop_token_of_t<env_of_t<Rcvr>>>` which does not register a stop-callback.

The one with the stop-callback would call `set_stopped()` inline in the stop-callback if it successfully removed the item from the queue before the worker thread removed it from the queue.
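A rough sketch of what that could look like; the queue layout and names below are guesses at an implementation rather than proposed wording:

```cpp
#include <execution>
#include <mutex>
#include <stop_token>
#include <utility>

// Guessed implementation sketch: a doubly-linked queue lets a stop-callback
// unlink an operation in O(1) and complete it with set_stopped() inline.
struct opstate_base {
    opstate_base* prev = nullptr;          // the proposed back-pointer
    opstate_base* next = nullptr;
    virtual void execute() noexcept = 0;   // invoked by the worker thread
};

struct run_loop_queue {
    std::mutex mtx;
    opstate_base* head = nullptr;
    opstate_base* tail = nullptr;

    // Returns true if the operation was still enqueued and has been removed,
    // i.e. the stop-request beat the worker thread to it. The worker is
    // assumed to null out prev/next (under the lock) when it dequeues.
    bool try_remove(opstate_base* op) {
        std::lock_guard lock{mtx};
        bool enqueued = op == head || op->prev != nullptr || op->next != nullptr;
        if (!enqueued)
            return false;
        (op->prev ? op->prev->next : head) = op->next;
        (op->next ? op->next->prev : tail) = op->prev;
        op->prev = op->next = nullptr;
        return true;
    }
};

// Default specialisation: registers a stop-callback. A second specialisation,
// constrained on unstoppable_token<stop_token_of_t<env_of_t<Rcvr>>>, would
// simply omit the callback.
template <class Rcvr>
struct run_loop_opstate : opstate_base {
    Rcvr rcvr;
    run_loop_queue* queue;

    struct on_stop {
        run_loop_opstate* self;
        void operator()() noexcept {
            // Won the race against the worker: complete inline with set_stopped().
            if (self->queue->try_remove(self))
                std::execution::set_stopped(std::move(self->rcvr));
        }
    };
    // A std::stop_callback for the receiver's stop-token holding an on_stop
    // would be constructed in start() (omitted here).

    void execute() noexcept override {
        // A real implementation must deregister the stop-callback first so it
        // cannot race with this completion.
        std::execution::set_value(std::move(rcvr));
    }
};
```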
LeeHowes commented on Oct 23, 2024

We could remove this for now, or make `sync_wait` do nothing but call `start()` and wait for completion (inline or not).

The actual delegating version could be added later: `sync_wait_with_provided_run_loop_context()`, `tbb_sync_wait()`, etc. Or one that embeds the run loop in a sender.

It would remove functionality, in that it would be harder out of the box to implement concurrency on the current thread in the way that `sync_wait(task());` can work. It would at least be additive in library code or in C++29, without embedding something broken in C++26.

Another baby solution would be to simply rename `sync_wait` to be more explicit about how it behaves. It would be less surprising.
lewissbaker commented on Oct 23, 2024

One concern was that having `sync_wait()` bake in use of `run_loop` prevents it from later being extended to support things like time_schedulers or I/O scheduling.

We discussed several options in the meeting.

We discussed removing `sync_wait` until we have fully fleshed out the forward-progress-delegation facilities/design, but this would leave users without an out-of-the-box way to start asynchronous work without writing their own equivalent to `sync_wait()`.

One option was to make `sync_wait(sender auto&& s)` just connect/start and then block until the operation completes, without running any event-loop internally. This would be simpler, lighter-weight, and less likely to be outdated quickly.
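A sketch of that lighter-weight option, assuming a C++26 `std::execution` implementation; the receiver and synchronisation here are illustrative, and value/error plumbing is omitted (this variant only handles a sender completing with `set_value()`):

```cpp
#include <condition_variable>
#include <execution>
#include <mutex>
#include <utility>

// Illustrative sketch only: connect/start the sender and block on a
// condition_variable until it completes. No run_loop is driven, so no
// delegation scheduler is injected into the environment.
struct blocking_state {
    std::mutex mtx;
    std::condition_variable cv;
    bool done = false;

    void notify() {
        { std::lock_guard lock{mtx}; done = true; }
        cv.notify_one();
    }
};

struct blocking_receiver {
    using receiver_concept = std::execution::receiver_t;
    blocking_state* state;

    void set_value() && noexcept { state->notify(); }
    void set_stopped() && noexcept { state->notify(); }
    template <class E>
    void set_error(E&&) && noexcept { state->notify(); }
};

void simple_sync_wait(std::execution::sender auto&& sndr) {
    blocking_state state;
    auto op = std::execution::connect(
        std::forward<decltype(sndr)>(sndr), blocking_receiver{&state});
    std::execution::start(op);
    std::unique_lock lock{state.mtx};
    state.cv.wait(lock, [&] { return state.done; });
}
```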
Then we could add a subsequent `drivable_context` concept that provides access to a `run(stoppable_token auto st)` member function to drive the execution of the event-loop until a stop-request is sent, and then provide a `sync_wait(sender auto&& s, drivable_context auto& ctx)` that would connect/start the sender and then call `ctx.drive(st)` until the operation completed, which would then trigger a stop-request on `st` that would cause `ctx.drive()` to exit.

This way, users can provide their own execution context to drive when calling `sync_wait()`.
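The interface might have roughly this shape; `drivable_context` and `run(stoppable_token auto)` are names from the discussion above, while `sync_wait_on` and the rest are placeholder assumptions:

```cpp
#include <execution>
#include <stop_token>

namespace ex = std::execution;

// Hypothetical concept from the discussion: a context that can be driven on
// the calling thread until the given stop-token receives a stop-request.
template <class Ctx>
concept drivable_context = requires(Ctx& ctx, std::inplace_stop_token st) {
    ctx.run(st);
};

// Hypothetical two-argument sync_wait: connect/start the sender, drive the
// user-supplied context on this thread until the sender completes, then send
// a stop-request so that run() returns. Declaration only; semantics as
// described in the comment above.
void sync_wait_on(ex::sender auto&& sndr, drivable_context auto& ctx);
```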
We also discussed the possibility of having an `async_drivable_context` that has an `async_drive()` method that returns a sender, which would use the parent scheduler injected in the receiver's environment to schedule driving that context on the parent context, e.g. enqueueing a task to the parent context whenever there is a non-empty queue in the child context. This would allow, e.g., multiple contexts to all be driven from a single synchronously-driven context.

We also discussed another option similar to `async_drivable_context`, which was to represent the `drive()` function as a sender and have it block in `start()` until it received a stop-request. This could be viewed as a degenerate case of the above `async_drivable_context`, but in the case where there is no parent context which can be delegated to, so it must drive the context synchronously inside `start()`.

Concerns were raised about this approach with regards to its safety/composability, e.g. you would have to describe the work using `sync_wait(when_any(work, ctx.async_drive()))` and ensure that the work was started before the `ctx.async_drive()`, as nothing else will start once the async_drive operation enters `start()`. Trying to run multiple event loops this way using `when_all(ctx1.async_drive(), ctx2.async_drive())` would never get to driving `ctx2`.

Ideally a paper needs to explore this more, in particular the drivable_context concept and also forward-progress-delegation in general. Nobody had capacity to look into this at the moment, however.
BenFrantzDale commented on Oct 24, 2024
@lewissbaker what about renaming the current `sync_wait` to `sync_wait_run_loop` or some similar ugly non-ideal name, leaving `sync_wait` available for future use?

Also, related to this, you've mentioned the idea of a synchronously-cancelable scheduler. Do you have an API in mind? I'm picturing a scheduler that produces a move-only sender that has `.handle().sync_cancel() -> bool` and `.handle().sync_cancel_requested() -> bool` member functions, so you could do something like the sketch below.
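A hypothetical sketch of that kind of usage; only `handle()`, `sync_cancel()`, and `sync_cancel_requested()` come from the description above, and everything else (the scheduler, receiver, and surrounding code) is assumed:

```cpp
#include <execution>
#include <utility>

namespace ex = std::execution;

// Hypothetical usage of a synchronously-cancelable schedule-sender.
void try_cancel_example(auto sched, auto rcvr) {
    auto sndr   = ex::schedule(sched);   // assumed to be a move-only sender
    auto handle = sndr.handle();         // hypothetical: usable after connect
    auto op     = ex::connect(std::move(sndr), std::move(rcvr));
    ex::start(op);

    // Later, when the work has already run somewhere else:
    if (handle.sync_cancel()) {
        // Cancellation succeeded synchronously: set_stopped() has been
        // delivered and nothing will run on the scheduler's thread.
    } else {
        // Too late: the task was already dequeued and will complete with
        // set_value() (or set_error()) on the scheduler's thread.
    }
}
```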
I think that would let us do something similar to `when_any`, where we try to schedule on two sync-cancellable schedulers, but as they complete, atomically count.

Not sure if that's correct. I suspect there's a race in there but that it's fixable. :-D
lewissbaker commented on Oct 24, 2024
I was just thinking about having a query on a sender (in this case the schedule-sender) that lets you ask whether a stop-request is synchronously cancellable. The actual stop-request would still be sent in the same way (via stop-tokens), but the sender is guaranteeing that when a stop-request is sent to it, it will make forward progress and call the completion-handler: either `set_stopped` on the thread sending the stop-request, or `set_value`/`set_error` on some other thread in the case that the operation has already completed and is being processed.
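One possible shape for such a query, purely as an assumption (neither the name nor the mechanism is proposed here beyond the general idea):

```cpp
#include <execution>

namespace ex = std::execution;

// Hypothetical query tag: ask a sender's attributes whether a stop-request
// sent to the connected operation is guaranteed to make forward progress
// (set_stopped() inline on the requesting thread, or set_value()/set_error()
// on a thread that has already dequeued the work).
struct synchronously_cancellable_t {};
inline constexpr synchronously_cancellable_t synchronously_cancellable{};

template <class Sndr>
constexpr bool is_synchronously_cancellable(const Sndr& sndr) noexcept {
    if constexpr (requires { ex::get_env(sndr).query(synchronously_cancellable); })
        return ex::get_env(sndr).query(synchronously_cancellable);
    else
        return false;   // conservative default: no guarantee advertised
}
```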
BenFrantzDale commented on Oct 25, 2024
I see. I really like reusing the existing cancellation API like that. So the way to detect whether sync-canceling worked is to ask it to cancel and then see if the receiver got a `set_stopped()`? Or, for the work-stealing use case, the winning thread cancels the other thread, which either works (and calls `set_stopped()`, which decrements a count) or doesn't (in which case the first thread can let the other thread take over). Something like that, right?