
torchrun defaults for concurrent distributed training jobs #2015


Merged
2 commits merged into pytorch:main on Nov 16, 2024

Conversation

ebsmothers
Contributor

Previously it was not possible to launch more than one distributed training job on the same node at the same time, because torchrun tries to use the same default rendezvous port (29500) for both of them. It's possible to manually pass --rdzv-backend and --rdzv-endpoint flags to torchrun whenever you kick off a second run, but this is annoying (and not obvious). Instead we can just default to letting torchrun find a free port automatically.
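For reference, the manual workaround looks something like the command below (localhost:0 asks the c10d rendezvous backend to bind any free port; this assumes, as the description implies, that tune run forwards these flags through to torchrun):

    CUDA_VISIBLE_DEVICES=2,3 tune run --rdzv-backend c10d --rdzv-endpoint localhost:0 \
        --nproc_per_node 2 lora_finetune_distributed --config llama3_2/1B_lora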

Test plan:

Run both of the following commands on the same node.

CUDA_VISIBLE_DEVICES=0,1 tune run --nproc_per_node 2 lora_finetune_distributed --config llama3_2/1B_lora
CUDA_VISIBLE_DEVICES=2,3 tune run --nproc_per_node 2 lora_finetune_distributed --config llama3_2/1B_lora

Before this PR, the second job would fail with an error message like:

  File "/home/ebs/.conda/envs/tt-alt-10-24/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 683, in _initialize_workers
    self._rendezvous(worker_group)
  File "/home/ebs/.conda/envs/tt-alt-10-24/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "/home/ebs/.conda/envs/tt-alt-10-24/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 500, in _rendezvous
    rdzv_info = spec.rdzv_handler.next_rendezvous()
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ebs/.conda/envs/tt-alt-10-24/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/static_tcp_rendezvous.py", line 67, in next_rendezvous
    self._store = TCPStore(  # type: ignore[call-arg]
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The server socket has failed to listen on any local network address. port: 29500, useIpv6: 0, code: -98, name: EADDRINUSE, message: address already in use

After this PR, both jobs are able to train successfully:

1|79|Loss: 1.0284696817398071:  10%|█████████▉                                                                                            | 79/808 [01:44<15:26,  1.27s/it]
1|63|Loss: 1.0152941942214966:   8%|███████▋                                                                                           | 63/808 [01:24<16:20,  1.32s/it]
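The root cause is an ordinary port collision on torchrun's default rendezvous port (29500). It can be reproduced directly with the TCPStore class from the traceback above (a minimal sketch, assuming nothing else on the machine is holding port 29500):

    from torch.distributed import TCPStore

    # The first "job" binds torchrun's default rendezvous port and succeeds.
    store_a = TCPStore("localhost", 29500, is_master=True)
    # The second "job" tries the same port and raises
    # RuntimeError: ... EADDRINUSE ... address already in use
    store_b = TCPStore("localhost", 29500, is_master=True)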


pytorch-bot bot commented Nov 15, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2015


✅ No Failures

As of commit 8af1708 with merge base bca5899:
💚 Looks good so far! There are no failures yet. 💚


@facebook-github-bot added the CLA Signed label Nov 15, 2024
@RdoubleA
Collaborator

Can users still have the option to pass in a specific port? Wondering whether, for production environments / high-compute settings, this degree of control is preferable to auto-selecting.

@ebsmothers
Contributor Author

@RdoubleA yeah fair point. I guess we could wrap the endpoint definition in an if statement or something to check if it's already been passed by the user. The annoying thing is that there are already other torchrun defaults for these values (e.g. torchrun uses “static” for the backend by default instead of “c10d”, so we'd need to override that too).
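A rough sketch of that if-statement idea (a hypothetical helper, not the actual torchtune code, assuming the user's extra torchrun args arrive as a list of strings):

    # Hypothetical sketch: inject our defaults only when the user didn't set an endpoint.
    def rdzv_defaults(user_args: list[str]) -> list[str]:
        if any(a.startswith("--rdzv-endpoint") for a in user_args):
            return user_args  # respect the user's explicit endpoint
        # torchrun defaults --rdzv-backend to "static", which won't pick a free
        # port, so the backend must be overridden alongside the endpoint.
        return ["--rdzv-backend=c10d", "--rdzv-endpoint=localhost:0", *user_args]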

@ebsmothers
Contributor Author

OK @RdoubleA let me know if the latest updates look reasonable to you. I realized there is a --standalone torchrun flag that will do the same thing as --rdzv-backend=c10d --rdzv-endpoint=localhost:0 (see the description in this commit message). So I will enable this by default, but only if --rdzv-endpoint is not passed. That way we still give the ability to set the endpoint explicitly and don't have to muck around with any other defaults like --rdzv-backend ourselves to do it.
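A sketch of that --standalone variant (again hypothetical, not the merged diff):

    # Hypothetical sketch: --standalone is torchrun shorthand for
    # --rdzv-backend=c10d --rdzv-endpoint=localhost:0, so it should only be
    # added when the user hasn't pinned an endpoint themselves.
    def standalone_default(user_args: list[str]) -> list[str]:
        if any(a.startswith("--rdzv-endpoint") for a in user_args):
            return user_args
        return ["--standalone", *user_args]

With that default in place, a user who wants a fixed port can still pass e.g. --rdzv-endpoint localhost:29501 and the standalone flag is skipped.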

@ebsmothers merged commit ac14e96 into pytorch:main Nov 16, 2024
17 checks passed
@ebsmothers mentioned this pull request Nov 26, 2024