Improve dp attention port assignment scheme #5889

jokerwyt · 2025-04-29T13:30:29Z

Motivation

When DP attention is enabled across many GPUs (for example, 64), the number of ports needed on node 0 equals the DP size. Port space is often shared with other processes (for example, in containers using host networking, or on bare metal), so the probability of a port conflict is quite high.

Modifications

We obtain free ports on node 0 and broadcast them to the other nodes via dist_init_addr before the DP controller assigns the ZMQ ports to the schedulers with attn_tp_rank=0. We also move the port binding right next to get_free_port, which narrows the window in which another process can take the port and so reduces the chance of a conflict.
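To illustrate why binding right next to the free-port lookup helps, here is a minimal sketch using only the Python standard library. It is not the code in this PR: plain TCP sockets stand in for the ZMQ sockets, and `reserve_ports` is a hypothetical helper name. The idea is that asking the OS for port 0 and keeping the socket bound leaves no gap between "find a free port" and "bind it".

```python
import socket


def reserve_ports(count, host="127.0.0.1"):
    """Bind `count` sockets to OS-chosen ports and keep them open.

    Hypothetical helper for illustration only. Returns (ports, sockets);
    the caller must keep the sockets alive for as long as the ports
    should stay reserved, and close them afterwards.
    """
    sockets = []
    ports = []
    try:
        for _ in range(count):
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            # Port 0 asks the OS to pick a free port and bind it in one step,
            # so there is no window between lookup and bind.
            sock.bind((host, 0))
            sockets.append(sock)
            ports.append(sock.getsockname()[1])
    except OSError:
        # Release anything already reserved if one bind fails.
        for sock in sockets:
            sock.close()
        raise
    return ports, sockets


if __name__ == "__main__":
    ports, socks = reserve_ports(4)
    print("reserved ports:", ports)
    for s in socks:
        s.close()
```

In a multi-node setup, the list of reserved port numbers would then be sent to the other nodes (the PR description says this goes through dist_init_addr); that step is omitted here.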

Test

More tests on other configurations are welcome, especially those without PD disaggregation.

2025-04-29 06:19:28,796 - pdutils - INFO - runCommand remotely: ssh -o StrictHostKeyChecking=no  ytw0 "PS1=[] source ~/.bashrc  && ( UCX_TLS=rc,gdr_copy,rc_x,cuda_copy,cuda_ipc UCX_NET_DEVICES=mlx5_bond_1:1,mlx5_bond_2:1,mlx5_bond_3:1,mlx5_bond_4:1,mlx5_bond_5:1,mlx5_bond_6:1,mlx5_bond_7:1,mlx5_bond_8:1 UCX_LOG_LEVEL=info NCCL_DEBUG=WARN SGLANG_PD_NIXL_DEBUG_TRANSFER_TIME=1 SGL_ENABLE_JIT_DEEPGEMM=0 python3.10 -m sglang.launch_server --host 0.0.0.0 --nnodes 2 --node-rank 0 --dist-init-addr ytw0:44725 --tp 16 --model-path /mnt/gemininjceph2/geminicephfs/mm-base-plt2/opensource_model/DeepSeek-R1_with_draft/DeepSeek-R1 --trust-remote-code --disable-radix-cache --schedule-policy fcfs --mem-fraction-static 0.70 --disable-overlap-schedule --chunked-prefill-size 32768  --log-level debug --enable-metrics --page-size 64 --disaggregation-mode prefill --disaggregation-transfer-backend nixl --disaggregation-bootstrap-port 5813 --max-running-requests 32 --port 40081 )"
2025-04-29 06:19:28,797 - pdutils - INFO - runCommand remotely: ssh -o StrictHostKeyChecking=no  ytw1 "PS1=[] source ~/.bashrc  && ( UCX_TLS=rc,gdr_copy,rc_x,cuda_copy,cuda_ipc UCX_NET_DEVICES=mlx5_bond_1:1,mlx5_bond_2:1,mlx5_bond_3:1,mlx5_bond_4:1,mlx5_bond_5:1,mlx5_bond_6:1,mlx5_bond_7:1,mlx5_bond_8:1 UCX_LOG_LEVEL=info NCCL_DEBUG=WARN SGLANG_PD_NIXL_DEBUG_TRANSFER_TIME=1 SGL_ENABLE_JIT_DEEPGEMM=0 python3.10 -m sglang.launch_server --nnodes 2 --node-rank 1 --dist-init-addr ytw0:44725 --tp 16 --model-path /mnt/gemininjceph2/geminicephfs/mm-base-plt2/opensource_model/DeepSeek-R1_with_draft/DeepSeek-R1 --trust-remote-code --disable-radix-cache --schedule-policy fcfs --mem-fraction-static 0.70 --disable-overlap-schedule --chunked-prefill-size 32768  --log-level debug --enable-metrics --page-size 64 --disaggregation-mode prefill --disaggregation-transfer-backend nixl --disaggregation-bootstrap-port 5813 --max-running-requests 32 --port 8417 )"
2025-04-29 06:19:28,797 - pdutils - INFO - runCommand remotely: ssh -o StrictHostKeyChecking=no  ytw2 "PS1=[] source ~/.bashrc  && ( UCX_TLS=rc,gdr_copy,rc_x,cuda_copy,cuda_ipc UCX_NET_DEVICES=mlx5_bond_1:1,mlx5_bond_2:1,mlx5_bond_3:1,mlx5_bond_4:1,mlx5_bond_5:1,mlx5_bond_6:1,mlx5_bond_7:1,mlx5_bond_8:1 UCX_LOG_LEVEL=info NCCL_DEBUG=WARN SGLANG_PD_NIXL_DEBUG_TRANSFER_TIME=1 SGL_ENABLE_JIT_DEEPGEMM=0 python3.10 -m sglang.launch_server --host 0.0.0.0 --nnodes 2 --node-rank 0 --dist-init-addr ytw2:16187 --enable-dp-attention --dp-size 16 --tp 16 --model-path /mnt/gemininjceph2/geminicephfs/mm-base-plt2/opensource_model/DeepSeek-R1_with_draft/DeepSeek-R1 --trust-remote-code --disable-radix-cache --schedule-policy fcfs --mem-fraction-static 0.70 --disable-overlap-schedule --chunked-prefill-size 32768  --log-level debug --enable-metrics --page-size 64 --disaggregation-mode decode --disaggregation-transfer-backend nixl --max-running-requests 32 --port 63339 )"
2025-04-29 06:19:28,797 - pdutils - INFO - runCommand remotely: ssh -o StrictHostKeyChecking=no  ytw3 "PS1=[] source ~/.bashrc  && ( UCX_TLS=rc,gdr_copy,rc_x,cuda_copy,cuda_ipc UCX_NET_DEVICES=mlx5_bond_1:1,mlx5_bond_2:1,mlx5_bond_3:1,mlx5_bond_4:1,mlx5_bond_5:1,mlx5_bond_6:1,mlx5_bond_7:1,mlx5_bond_8:1 UCX_LOG_LEVEL=info NCCL_DEBUG=WARN SGLANG_PD_NIXL_DEBUG_TRANSFER_TIME=1 SGL_ENABLE_JIT_DEEPGEMM=0 python3.10 -m sglang.launch_server --nnodes 2 --node-rank 1 --dist-init-addr ytw2:16187 --enable-dp-attention --dp-size 16 --tp 16 --model-path /mnt/gemininjceph2/geminicephfs/mm-base-plt2/opensource_model/DeepSeek-R1_with_draft/DeepSeek-R1 --trust-remote-code --disable-radix-cache --schedule-policy fcfs --mem-fraction-static 0.70 --disable-overlap-schedule --chunked-prefill-size 32768  --log-level debug --enable-metrics --page-size 64 --disaggregation-mode decode --disaggregation-transfer-backend nixl --max-running-requests 32 --port 23093 )"
2025-04-29 06:19:28,797 - __main__ - INFO - waiting for instance with log path /tmp/sgl-prefill-0-0.log to be ready...
2025-04-29 06:19:28,798 - pdutils - INFO - wait_server: ytw0:40081
2025-04-29 06:20:46,880 - __main__ - INFO - waiting for instance with log path /tmp/sgl-decode-0-0.log to be ready...
2025-04-29 06:20:46,880 - pdutils - INFO - wait_server: ytw2:63339
2025-04-29 06:21:13,911 - __main__ - INFO - All instances are ready! Wait some seconds to let the server warm up.
2025-04-29 06:21:23,921 - pdutils - INFO - runCommand remotely: ssh -o StrictHostKeyChecking=no  ytw0 "PS1=[] source ~/.bashrc  && ( python3.10 -m sglang.srt.disaggregation.mini_lb --prefill http://ytw0:40081 --decode http://ytw2:63339 --host 0.0.0.0 --port 11441 --prefill-bootstrap-ports 5813 )"

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests.
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and [Accuracy Results](https://docs.sglang.ai/references/accuracy_evaluation.html). This PR does not affect performance.
For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.

python/sglang/srt/managers/data_parallel_controller.py

Qinyu-Xu · 2025-05-13T07:08:54Z

Do you have any progress on this pr? @merrymercy @zhyncs @ByronHsu @jokerwyt

jokerwyt · 2025-05-13T07:49:04Z

@Qinyu-Xu I have just resolved the conflicts with the latest main. #6258 is blocking me from testing. Once that issue is resolved, I think we can test and merge this PR. You are welcome to try this PR in your use case and share your experience.

jokerwyt · 2025-05-14T06:45:09Z

Tested okay. Ready for review and merge.

python/sglang/srt/managers/data_parallel_controller.py

jokerwyt · 2025-05-28T07:06:42Z

Can we merge this? It's a little bit time-consuming...

@fzyzcjy @ch-wan

fzyzcjy · 2025-05-29T02:19:52Z

The general idea LGTM, but I have no time to review the details now :( If you can find someone to review then it can usually be merged.

ch-wan · 2025-05-29T05:18:49Z

@jokerwyt This part was implemented by @merrymercy and @ispobock. I have added them to the review list.

python/sglang/srt/utils.py

jokerwyt · 2025-06-07T07:10:29Z

@merrymercy @ispobock @zhyncs
Added a command-line argument and a test. Waiting for approval to run CI/CD and merge.

jokerwyt added 8 commits April 28, 2025 16:32

feat: dynamic DP controller port dispatch

f96d599

Merge remote-tracking branch 'gh/main' into dp-port-dispatch

bfecdbb

Fix completions endpoint bootstrap port passing

60f8a55

[WIP] dynamic DP port

3e5d6ed

Dynamic DP port assignment

68fdf09

Better dynamic port, lower conflict

2865267

small fix

6f9eea5

NIXL DP support (sgl-project#5681)

c96c1b0

jokerwyt requested review from merrymercy, Ying1123, hnyls2002, xiezhq-hermann, zhyncs, ispobock, HaiShaw and ByronHsu as code owners April 29, 2025 13:30

jokerwyt commented Apr 29, 2025

python/sglang/srt/managers/data_parallel_controller.py (outdated)

Remove some debug print

e221b28

Merge branch 'main' of https://github.com/sgl-project/sglang into dp-port-dispatch

6468136

jokerwyt added 2 commits May 14, 2025 06:37

Merge branch 'main' of https://github.com/sgl-project/sglang into dp-port-dispatch

8636070

Merge branch 'main' into dp-port-dispatch

0fa37d3

fzyzcjy reviewed May 16, 2025

python/sglang/srt/managers/data_parallel_controller.py (outdated)

jokerwyt added 4 commits May 16, 2025 03:19

Atomic assignment of dp attention scheduler ports

ad828c1

Merge branch 'main' of https://github.com/sgl-project/sglang into dp-port-dispatch

ac7662b

Merge branch 'dp-port-dispatch' of github.com:jokerwyt/sglang-public into dp-port-dispatch

f5930a0

Refine

dc379a6

Merge branch 'main' into dp-port-dispatch

b21222a

ShangmingCai requested review from ch-wan and removed request for HaiShaw May 29, 2025 02:45

ch-wan assigned ch-wan, ispobock and merrymercy and unassigned ch-wan May 29, 2025

ispobock and others added 2 commits May 31, 2025 23:08

Merge branch 'main' into dp-port-dispatch

0872328

Merge branch 'main' into dp-port-dispatch

ee343ac

ispobock reviewed Jun 1, 2025

python/sglang/srt/utils.py

jokerwyt added 4 commits June 7, 2025 05:51

Merge branch 'main' of https://github.com/sgl-project/sglang into dp-port-dispatch

b23a42f

Add cmd args and test

50fc888

Merge branch 'dp-port-dispatch' of github.com:jokerwyt/sglang-public into dp-port-dispatch

96ed813

Merge branch 'main' into dp-port-dispatch

2a7f2d0
