Description
Having the ability to support multiple quorums on the same lighthouse server would make it much easier to deploy torchft in certain scenarios.
With this feature, you could deploy a single lighthouse server per cluster and share it across every job running there, using each job's ID to select its quorum. This simplifies discovery and would work with most batch job schedulers.
Two possible designs:
1. room_id outside of gRPC
We likely want to let you create a new Lighthouse client with a certain key and have it automatically isolate its requests to that namespace.
Implementing this cleanly on the server side may be a bit tricky. We may need to do some manipulation under the hood to instantiate one LighthouseServer instance per incoming request, so that "room" ids don't pollute the API.
https://github.com/pytorch/torchft/blob/main/src/lighthouse.rs#L601
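A rough sketch of how design 1 could look, in plain Python with no gRPC dependency. Every name here (`RoomState`, `RoomDispatcher`, `NamespacedClient`, the `room-id` metadata key) is hypothetical and not part of torchft; the metadata dict stands in for whatever out-of-band channel (e.g. request headers) would carry the key, and the heartbeat logic is a placeholder for the real lighthouse state.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class RoomState:
    """Hypothetical stand-in for one isolated lighthouse instance."""

    participants: Set[str] = field(default_factory=set)

    def heartbeat(self, replica_id: str) -> int:
        # Record the replica and report how many replicas this room has seen.
        self.participants.add(replica_id)
        return len(self.participants)


class RoomDispatcher:
    """Hypothetical server-side shim: picks a room from request metadata."""

    def __init__(self) -> None:
        self._rooms: Dict[str, RoomState] = {}

    def handle_heartbeat(self, metadata: Dict[str, str], replica_id: str) -> int:
        # The room key travels outside the request body, so the request
        # message itself never mentions rooms.
        room_id = metadata.get("room-id", "default")
        room = self._rooms.setdefault(room_id, RoomState())
        return room.heartbeat(replica_id)


class NamespacedClient:
    """Hypothetical client that is bound to one room at construction time."""

    def __init__(self, dispatcher: RoomDispatcher, room_id: str) -> None:
        self._dispatcher = dispatcher
        self._metadata = {"room-id": room_id}

    def heartbeat(self, replica_id: str) -> int:
        # Callers use the same method signature as a non-namespaced client.
        return self._dispatcher.handle_heartbeat(self._metadata, replica_id)
```

The point of the sketch is that two clients created with different keys never see each other's replicas, while the method signatures stay free of any room argument.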
2. room_id as a field on the heartbeat + quorum methods
This may be simpler to implement in some ways, as we can just add a room_id field to all the lighthouse requests and then route internally as necessary. No magic with gRPC services is required.
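For comparison, a minimal sketch of design 2, again in plain Python rather than the actual protobuf definitions. The message and class names (`HeartbeatRequest`, `QuorumRequest`, `MultiRoomLighthouse`), the `"default"` fallback room, and the simplified quorum result (just the sorted list of replicas seen in the room) are all assumptions made for illustration; the real quorum logic is considerably more involved.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class HeartbeatRequest:
    """Hypothetical request message carrying an explicit room_id field."""

    replica_id: str
    room_id: str = "default"  # assumed fallback for clients that omit the field


@dataclass(frozen=True)
class QuorumRequest:
    """Hypothetical quorum request with the same extra field."""

    replica_id: str
    room_id: str = "default"


class MultiRoomLighthouse:
    """Hypothetical single server that routes on the request's room_id."""

    def __init__(self) -> None:
        # room_id -> replicas that have checked in to that room
        self._rooms: Dict[str, List[str]] = {}

    def heartbeat(self, req: HeartbeatRequest) -> None:
        members = self._rooms.setdefault(req.room_id, [])
        if req.replica_id not in members:
            members.append(req.replica_id)

    def quorum(self, req: QuorumRequest) -> List[str]:
        # Requesting a quorum also registers the requester in its room.
        self.heartbeat(HeartbeatRequest(req.replica_id, req.room_id))
        return sorted(self._rooms[req.room_id])
```

Here the routing is an ordinary dictionary lookup inside each handler, which is why this option needs no changes to how the gRPC service is registered; the cost is that every request type grows a room_id field.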