support multiple quorums on the same LighthouseServer #173

Open
@d4l3k

Having the ability to support multiple quorums on the same lighthouse server would make it much easier to deploy torchft in certain scenarios.

With this feature you could deploy a single lighthouse server and use it for all jobs running in that cluster, with each job's quorum keyed by its job ID. This simplifies discovery and would work with most batch job schedulers.
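As a rough illustration of the job-ID idea, each worker could derive its quorum key from the environment variable its scheduler already sets. This is only a sketch: `resolve_room_id` and the `"default"` fallback are hypothetical names, not part of torchft. The variables checked are the job-ID variables set by Slurm, PBS, and LSF.

```python
import os
from typing import Mapping, Optional

# Job-ID environment variables set by common batch schedulers.
_JOB_ID_VARS = ("SLURM_JOB_ID", "PBS_JOBID", "LSB_JOBID")


def resolve_room_id(
    env: Optional[Mapping[str, str]] = None, default: str = "default"
) -> str:
    """Hypothetical helper: pick a quorum key from the scheduler's job ID."""
    if env is None:
        env = os.environ
    for name in _JOB_ID_VARS:
        value = env.get(name)
        if value:
            # First scheduler variable that is set and non-empty wins.
            return value
    # No scheduler detected: fall back to a single shared room.
    return default
```

Every worker in the same job would then resolve the same key without any extra discovery step.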

Two possible designs:

1. room_id outside of gRPC

We likely want to let you create a new Lighthouse client with a given key and have it automatically isolate its requests to that namespace.

Implementing this cleanly on the server side may be a bit tricky: we may need to do some manipulation under the hood to instantiate one LighthouseServer instance per incoming request, which would avoid polluting the API with "room" ids.

https://github.com/pytorch/torchft/blob/main/src/lighthouse.rs#L601
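One way to picture this first design is the in-memory model below. It is a sketch under stated assumptions, not torchft's API: `RoomState`, `MultiRoomLighthouse`, and `NamespacedClient` are hypothetical names, there is no gRPC involved, and the per-room state is reduced to a set of replica IDs. In a real deployment the key would have to travel outside the request messages (for example as request metadata), which is the part this sketch leaves out.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class RoomState:
    """Hypothetical per-room state: replicas that have sent a heartbeat."""

    participants: Set[str] = field(default_factory=set)


class MultiRoomLighthouse:
    """Hypothetical server-side registry holding one isolated state per room."""

    def __init__(self) -> None:
        self._rooms: Dict[str, RoomState] = {}

    def room(self, room_id: str) -> RoomState:
        # Lazily create the room the first time a client asks for it.
        return self._rooms.setdefault(room_id, RoomState())


class NamespacedClient:
    """Hypothetical client bound to one room when it is constructed.

    The method signatures carry no room id, so callers cannot
    accidentally touch another job's state.
    """

    def __init__(self, server: MultiRoomLighthouse, room_id: str) -> None:
        self._state = server.room(room_id)

    def heartbeat(self, replica_id: str) -> None:
        self._state.participants.add(replica_id)

    def participants(self) -> Set[str]:
        # Return a copy so callers cannot mutate the room's state.
        return set(self._state.participants)
```

Two clients created with different keys against the same `MultiRoomLighthouse` see disjoint participant sets, which is the isolation property this design is after.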

2. room_id as a field on the heartbeat + quorum methods

This may be simpler to implement in some ways, as we can just add a room_id field to all the lighthouse requests and then route internally as necessary. No magic with gRPC services is required.
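A minimal sketch of this second design, assuming simplified request types: `HeartbeatRequest`, `QuorumRequest`, and `RoutingLighthouse` are hypothetical stand-ins rather than torchft's actual messages, and the "quorum" here is just the sorted list of replicas seen in a room, with none of the real timing or membership rules.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class HeartbeatRequest:
    """Hypothetical request shape with the added room_id field."""

    room_id: str
    replica_id: str


@dataclass(frozen=True)
class QuorumRequest:
    """Hypothetical request shape with the added room_id field."""

    room_id: str
    replica_id: str


class RoutingLighthouse:
    """Hypothetical single service that routes on the request's room_id."""

    def __init__(self) -> None:
        self._rooms: Dict[str, Set[str]] = {}

    def _members(self, room_id: str) -> Set[str]:
        # Internal routing step: look up (or create) the room's state.
        return self._rooms.setdefault(room_id, set())

    def heartbeat(self, req: HeartbeatRequest) -> None:
        self._members(req.room_id).add(req.replica_id)

    def quorum(self, req: QuorumRequest) -> List[str]:
        members = self._members(req.room_id)
        members.add(req.replica_id)
        # Simplification: report every replica seen so far in this room.
        return sorted(members)
```

The trade-off relative to the first design is visible in the types: every request carries `room_id` explicitly, so the routing is trivial but the field shows up throughout the API.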

Labels: enhancement (New feature or request), lighthouse (Lighthouse and quorum related)