[lighthouse] fast failure on missing heartbeat instead of timeout #164

Open
rualark opened this issue Apr 15, 2025 · 2 comments
Labels
enhancement (New feature or request), lighthouse (Lighthouse and quorum related)

Comments

rualark commented Apr 15, 2025

My understanding is that replication groups always run collective operations among themselves to allreduce the gradients (whenever any form of DDP or HSDP is used). If one node in a replication group fails, all the other groups will time out because their allreduce can never finish. Since the lighthouse already knows that the node failed, should the allreduce be aborted to avoid waiting for the timeout?
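To make "the lighthouse already knows" concrete, here is a minimal sketch of heartbeat-based failure detection. It is not the lighthouse's actual implementation: `HeartbeatTracker`, `record`, `failed`, and the `timeout_s` parameter are hypothetical names used only for illustration.

```python
import time
from typing import Dict, List, Optional


class HeartbeatTracker:
    """Hypothetical lighthouse-side bookkeeping (illustration only)."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        # replica id -> timestamp of the most recent heartbeat
        self._last_seen: Dict[str, float] = {}

    def record(self, replica_id: str, now: Optional[float] = None) -> None:
        """Store the time of the latest heartbeat from a replica."""
        self._last_seen[replica_id] = time.monotonic() if now is None else now

    def failed(self, now: Optional[float] = None) -> List[str]:
        """Return replicas whose last heartbeat is older than timeout_s."""
        current = time.monotonic() if now is None else now
        return sorted(
            rid
            for rid, seen in self._last_seen.items()
            if current - seen > self.timeout_s
        )
```

With a heartbeat timeout much shorter than the collective timeout, `failed()` would report the dead replica well before the other groups give up on their allreduce, which is the gap this issue is about.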

@WarrenZhu050413
Contributor

Hi Rualark,

As I understand the question: if one node fails, the lighthouse should know from the node's heartbeat information that it failed, so it would be nice to have a mechanism for the lighthouse to send a message to the individual managers to abort the allreduce?

I am indeed working on a solution to this effect. You would need a listening thread that listens for messages from the lighthouse and aborts the NCCL communication group when it receives an error message.
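A minimal sketch of the listening-thread idea, under stated assumptions: the `messages` queue stands in for whatever transport the lighthouse would use, the `"replica_failed"` and `"shutdown"` message types are made up for this example, and `abort_fn` is a placeholder for whatever call tears down the NCCL communication group. None of these names come from the project.

```python
import queue
import threading
from typing import Callable


def start_failure_listener(
    messages: "queue.Queue[dict]",
    abort_fn: Callable[[], None],
) -> threading.Thread:
    """Start a daemon thread that calls abort_fn on the first failure message.

    Hypothetical helper: `messages` is a stand-in for the lighthouse
    channel, and `abort_fn` is a placeholder for aborting the
    communication group.
    """

    def _listen() -> None:
        while True:
            msg = messages.get()  # blocks until a message arrives
            kind = msg.get("type")
            if kind == "shutdown":
                # Clean exit, nothing to abort.
                return
            if kind == "replica_failed":
                # Abort right away instead of waiting for the collective timeout.
                abort_fn()
                return
            # Any other message type is ignored.

    thread = threading.Thread(target=_listen, name="failure-listener", daemon=True)
    thread.start()
    return thread


# Usage: an Event stands in for the real abort call.
aborted = threading.Event()
channel: "queue.Queue[dict]" = queue.Queue()
listener = start_failure_listener(channel, aborted.set)
channel.put({"type": "replica_failed", "replica_id": "replica_1"})
listener.join(timeout=5.0)
```

Running the listener on its own thread matters because the training thread is itself blocked inside the allreduce, so it cannot poll for failure messages.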

rualark commented Apr 23, 2025

Sounds great! Do you want to use this issue for your PR, or do you want to close it as a duplicate of another existing issue?

@d4l3k changed the title from "If one node fails, can workers in all groups stop without waiting for allreduce timeout?" to "[lighthouse] fast failure on missing heartbeat instead of timeout" on Apr 25, 2025
@d4l3k added the enhancement (New feature or request) and lighthouse (Lighthouse and quorum related) labels on Apr 25, 2025