torchrun defaults for concurrent distributed training jobs #2015
Previously it was not possible to launch more than one distributed training job on the same node at the same time, because torchrun tries to use the same port for both of them by default. You can manually pass the `--rdzv-backend` and `--rdzv-endpoint` flags to torchrun whenever you kick off a second run, but this is annoying (and not obvious). Instead, we can simply default to letting torchrun find a free port automatically.
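For reference, here is a sketch of the manual workaround next to what the new default presumably amounts to: pointing the c10d rendezvous at port 0, which tells the store to bind to any free ephemeral port. The script name and proc count are hypothetical, and this is not the exact diff from this PR:

```bash
# Manual workaround: explicitly pick a distinct rendezvous port for the second job.
torchrun --nproc-per-node=2 --rdzv-backend=c10d --rdzv-endpoint=localhost:29501 train.py

# What "find a free port automatically" looks like: port 0 means "bind to any
# free port", so concurrent jobs on one node no longer collide on the default.
torchrun --nproc-per-node=2 --rdzv-backend=c10d --rdzv-endpoint=localhost:0 train.py
```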
Test plan:

Run both of the following commands on the same node.
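The exact commands from the PR are not preserved in this excerpt; with plain torchrun, an illustrative pair (script name and flags hypothetical) would look like this, run concurrently in two terminals:

```bash
# Terminal 1: first 2-GPU job on the node.
torchrun --nproc-per-node=2 train.py --output-dir /tmp/run1

# Terminal 2: second 2-GPU job on the same node, started while the first is running.
torchrun --nproc-per-node=2 train.py --output-dir /tmp/run2
```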
Before this PR, the second job will fail with an error message like:
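(The failing log itself is not reproduced in this excerpt; the characteristic failure is torchrun's c10d server socket being unable to bind to the default port 29500, i.e. `errno: 98 - Address already in use`.)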
After this PR, both jobs are able to train successfully: