
torchrun defaults for concurrent distributed training jobs #2015


Merged
2 commits merged into pytorch:main on Nov 16, 2024

Conversation

ebsmothers
Contributor

Previously it was not possible to launch more than one distributed training job on the same node at the same time, because torchrun tries to use the same default rendezvous port (29500) for both of them. It's possible to manually pass --rdzv-backend and --rdzv-endpoint flags to torchrun whenever you kick off a second run, but this is annoying (and not obvious). Instead we can just default to letting torchrun find a free port automatically.
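For reference, the manual workaround looks something like the command below (localhost:0 asks the c10d rendezvous backend to bind any free port; this assumes, as the description implies, that tune run forwards these flags through to torchrun):

    CUDA_VISIBLE_DEVICES=2,3 tune run --rdzv-backend c10d --rdzv-endpoint localhost:0 \
        --nproc_per_node 2 lora_finetune_distributed --config llama3_2/1B_lora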

Test plan:

Run both of the following commands on the same node.

CUDA_VISIBLE_DEVICES=0,1 tune run --nproc_per_node 2 lora_finetune_distributed --config llama3_2/1B_lora
CUDA_VISIBLE_DEVICES=2,3 tune run --nproc_per_node 2 lora_finetune_distributed --config llama3_2/1B_lora

Before this PR, the second job would fail with an error message like:

  File "/home/ebs/.conda/envs/tt-alt-10-24/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 683, in _initialize_workers
    self._rendezvous(worker_group)
  File "/home/ebs/.conda/envs/tt-alt-10-24/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "/home/ebs/.conda/envs/tt-alt-10-24/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 500, in _rendezvous
    rdzv_info = spec.rdzv_handler.next_rendezvous()
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ebs/.conda/envs/tt-alt-10-24/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/static_tcp_rendezvous.py", line 67, in next_rendezvous
    self._store = TCPStore(  # type: ignore[call-arg]
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The server socket has failed to listen on any local network address. port: 29500, useIpv6: 0, code: -98, name: EADDRINUSE, message: address already in use

After this PR, both jobs are able to train successfully:

1|79|Loss: 1.0284696817398071:  10%|█████████▉                                                                                            | 79/808 [01:44<15:26,  1.27s/it]
1|63|Loss: 1.0152941942214966:   8%|███████▋                                                                                           | 63/808 [01:24<16:20,  1.32s/it]
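The root cause is an ordinary port collision on torchrun's default rendezvous port (29500). It can be reproduced directly with the TCPStore class from the traceback above (a minimal sketch, assuming nothing else on the machine is holding port 29500):

    from torch.distributed import TCPStore

    # The first "job" binds torchrun's default rendezvous port and succeeds.
    store_a = TCPStore("localhost", 29500, is_master=True)
    # The second "job" tries the same port and raises
    # RuntimeError: ... EADDRINUSE ... address already in use
    store_b = TCPStore("localhost", 29500, is_master=True)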


pytorch-bot bot commented Nov 15, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2015


✅ No Failures

As of commit 8af1708 with merge base bca5899:
💚 Looks good so far! There are no failures yet. 💚


@facebook-github-bot added the CLA Signed label Nov 15, 2024
@RdoubleA
Collaborator

Can users still have the option to pass in a specific port? Wondering whether, for production environments / high-compute settings, this degree of control is preferable to auto-selecting.

@ebsmothers
Contributor Author

@RdoubleA yeah fair point. I guess we could wrap the endpoint definition in an if statement or something to check if it's already been passed by the user. The annoying thing is that there are already other torchrun defaults for these values (e.g. torchrun uses “static” for the backend by default instead of “c10d”, so we'd need to override that too).
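A rough sketch of that if-statement idea (a hypothetical helper, not the actual torchtune code, assuming the user's extra torchrun args arrive as a list of strings):

    # Hypothetical sketch: inject our defaults only when the user didn't set an endpoint.
    def rdzv_defaults(user_args: list[str]) -> list[str]:
        if any(a.startswith("--rdzv-endpoint") for a in user_args):
            return user_args  # respect the user's explicit endpoint
        # torchrun defaults --rdzv-backend to "static", which won't pick a free
        # port, so the backend must be overridden alongside the endpoint.
        return ["--rdzv-backend=c10d", "--rdzv-endpoint=localhost:0", *user_args]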

@ebsmothers
Contributor Author

OK @RdoubleA let me know if the latest updates look reasonable to you. I realized there is a --standalone torchrun flag that will do the same thing as --rdzv-backend=c10d --rdzv-endpoint=localhost:0 (see the description in this commit message). So I will enable this by default, but only if --rdzv-endpoint is not passed. That way we still give the ability to set the endpoint explicitly and don't have to muck around with any other defaults like --rdzv-backend ourselves to do it.
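A sketch of that --standalone variant (again hypothetical, not the merged diff):

    # Hypothetical sketch: --standalone is torchrun shorthand for
    # --rdzv-backend=c10d --rdzv-endpoint=localhost:0, so it should only be
    # added when the user hasn't pinned an endpoint themselves.
    def standalone_default(user_args: list[str]) -> list[str]:
        if any(a.startswith("--rdzv-endpoint") for a in user_args):
            return user_args
        return ["--standalone", *user_args]

With that default in place, a user who wants a fixed port can still pass e.g. --rdzv-endpoint localhost:29501 and the standalone flag is skipped.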

@ebsmothers merged commit ac14e96 into pytorch:main Nov 16, 2024
17 checks passed
@ebsmothers mentioned this pull request Nov 26, 2024