[lighthouse] fast failure on missing heartbeat instead of timeout #164

rualark · 2025-04-15T20:42:20Z

My understanding is that there are always some collective operations between replication groups to allreduce the gradients (if any form of DDP or HSDP is used). If one node fails in a replication group, all other groups will time out because they cannot finish the allreduce. Since the lighthouse already knows that the node failed, should the allreduce be aborted to avoid waiting for the timeout?

WarrenZhu050413 · 2025-04-23T13:12:26Z

Hi Rualark,

If I understand the question correctly: when a node fails, the lighthouse should know from the node's heartbeat information that it failed, so it would be nice to have a mechanism for the lighthouse to send a message to the individual managers to abort the allreduce?
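To make the heartbeat idea concrete, here is a minimal sketch of how a coordinator could decide that a replica has missed its heartbeat. `HeartbeatTracker`, `timeout_s`, and the replica ids are hypothetical names for illustration, not the torchft lighthouse API, and the real lighthouse may track this differently.

```python
import time
from typing import Callable, Dict, List


class HeartbeatTracker:
    """Hypothetical coordinator-side tracker (illustration only, not torchft code)."""

    def __init__(self, timeout_s: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout_s = timeout_s
        self._clock = clock  # injectable so the logic can be exercised without sleeping
        self._last_seen: Dict[str, float] = {}

    def heartbeat(self, replica_id: str) -> None:
        # Record the time of the most recent heartbeat from this replica.
        self._last_seen[replica_id] = self._clock()

    def expired(self) -> List[str]:
        # Replicas whose last heartbeat is older than the timeout, in sorted order.
        now = self._clock()
        return sorted(
            rid for rid, seen in self._last_seen.items() if now - seen > self._timeout_s
        )
```

Under this assumption, the coordinator would poll `expired()` periodically and notify the managers of the remaining replicas about each id it returns.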

I am indeed working on a solution to this effect. You would need a listening thread that listens for these messages and aborts the NCCL communication group when it receives an error message.
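A rough sketch of such a listening thread, assuming failure notices arrive on an in-process queue as strings of the form `failure:<replica_id>`. The message format, `start_failure_listener`, and the `abort` callback are all assumptions made for illustration; in a real setup the callback would abort the NCCL communication group, and the exact call is not shown in this thread.

```python
import queue
import threading
from typing import Callable


def start_failure_listener(
    messages: "queue.Queue[str]",
    abort: Callable[[str], None],
) -> threading.Thread:
    """Start a daemon thread that calls abort(replica_id) for each failure notice.

    Hypothetical helper for illustration; not part of torchft.
    """

    def _run() -> None:
        while True:
            msg = messages.get()  # blocks until a message arrives
            if msg == "shutdown":
                return
            if msg.startswith("failure:"):
                # Hand the failed replica id to the abort callback.
                abort(msg[len("failure:"):])

    thread = threading.Thread(target=_run, name="failure-listener", daemon=True)
    thread.start()
    return thread
```

The training loop keeps running on the main thread; only the listener blocks on the queue, so a stuck allreduce can be interrupted from the side rather than waiting for its own timeout.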

rualark · 2025-04-23T17:08:26Z

Sounds great! Do you want to use this issue for your PR, or do you want to close it as a duplicate of another existing issue?


d4l3k changed the title ~~If one node fails, can workers in all groups stop without waiting for allreduce timeout?~~ [lighthouse] fast failure on missing heartbeat instead of timeout Apr 25, 2025

d4l3k added the enhancement (New feature or request) and lighthouse (Lighthouse and quorum related) labels Apr 25, 2025

WarrenZhu050413 added a commit to WarrenZhu050413/torchft that referenced this issue May 20, 2025

Added proactive heartbeat timeout failure propagation (pytorch#164) (p…

fcc87c3

…ytorch#188)

WarrenZhu050413 added a commit to WarrenZhu050413/torchft that referenced this issue May 21, 2025

Added proactive heartbeat timeout failure propagation (pytorch#164) (p…

ebb3953

…ytorch#188)

WarrenZhu050413 added a commit to WarrenZhu050413/torchft that referenced this issue May 21, 2025

Added proactive heartbeat timeout failure propagation (pytorch#164) (p…

2a7bac7

…ytorch#188)

WarrenZhu050413 added a commit to WarrenZhu050413/torchft that referenced this issue May 21, 2025

Added proactive heartbeat timeout failure propagation (pytorch#164) (p…

a3dae49

…ytorch#188)

WarrenZhu050413 added a commit to WarrenZhu050413/torchft that referenced this issue May 21, 2025

Added proactive heartbeat timeout failure propagation (pytorch#164) (p…

c26d366

…ytorch#188)

WarrenZhu050413 added a commit to WarrenZhu050413/torchft that referenced this issue May 21, 2025

Added proactive heartbeat timeout failure propagation (pytorch#164) (p…

17f44f4

…ytorch#188)

WarrenZhu050413 added a commit to WarrenZhu050413/torchft that referenced this issue May 21, 2025

Added proactive heartbeat timeout failure propagation (pytorch#164) (p…

daa3adf

…ytorch#188)

WarrenZhu050413 added a commit to WarrenZhu050413/torchft that referenced this issue May 21, 2025

Added proactive heartbeat timeout failure propagation (pytorch#164) (p…

7b550aa

…ytorch#188)
