Improve DP attention #4390

Merged 3 commits into main on Mar 13, 2025

Conversation

@merrymercy (Contributor) commented Mar 13, 2025

  • Use a better padding strategy for CUDA graphs. With TP=8 and DP=8, the previous implementation padded a batch of size 1 to a global batch size of 8. The new implementation allows running a global batch size of 1, so it is faster in the low-batch-size range: it is 1.15x faster than the old implementation for deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct at TP=8 and bs=1. A minimal sketch of the two padding strategies follows the example command below.
  • Support TP != DP. You now need to explicitly specify --dp and --tp, with the constraint that --dp should be smaller than --tp. You can first set --tp to the total number of GPUs you have, then tune --dp to trade off latency against KV cache capacity (or throughput). For example, to achieve better latency at small batch sizes, use --tp 8 --dp 2; to allow more KV cache capacity at larger batch sizes, use --tp 8 --dp 8. An example command:
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --trust-remote-code --tp 8 --enable-dp-attention --dp 2
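
A minimal sketch (not the actual sglang code) of the padding change described in the first bullet. It assumes local_bs holds the per-DP-rank batch sizes and that CUDA graphs are captured for the sizes in capture_sizes:

def old_padded_global_bs(local_bs):
    # Old strategy: every DP rank is padded to the same per-rank size,
    # so 8 ranks with at most one request become a global batch of 8.
    per_rank = max(local_bs)
    return per_rank * len(local_bs)

def new_padded_global_bs(local_bs, capture_sizes):
    # New strategy: only the summed global batch is padded, up to the
    # smallest captured CUDA graph size that fits it.
    total = sum(local_bs)
    return min(s for s in sorted(capture_sizes) if s >= total)

# With TP=8, DP=8 and a single request on one rank:
# old_padded_global_bs([1, 0, 0, 0, 0, 0, 0, 0])                 -> 8
# new_padded_global_bs([1, 0, 0, 0, 0, 0, 0, 0], [1, 2, 4, 8])   -> 1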

99% of the code was done by @dhou-xai.

Co-authored-by: dhou-xai <[email protected]>
Co-authored-by: SangBin Cho <[email protected]>

@merrymercy merged commit 8e66fbe into main Mar 13, 2025
36 of 39 checks passed
@merrymercy deleted the pr-dp branch March 13, 2025 15:23
@xihuai18 (Contributor)

Can we run 671B models with --dp 2 --tp 8 on 16 x H100?

hebiao064 pushed a commit to hebiao064/sglang that referenced this pull request Mar 13, 2025
Co-authored-by: dhou-xai <[email protected]>
Co-authored-by: SangBin Cho <[email protected]>
@merrymercy (Contributor, Author)

@xihuai18 Yes. You can use --tp 16 --dp 2 on 16 x H100.
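
A hypothetical two-node launch for that setup (a minimal sketch: the model path, address, and port are placeholders, not taken from this thread, and it assumes sglang's multi-node flags --nnodes, --node-rank, and --dist-init-addr):

# node 0
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --trust-remote-code --tp 16 --enable-dp-attention --dp 2 --nnodes 2 --node-rank 0 --dist-init-addr 10.0.0.1:5000
# node 1
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --trust-remote-code --tp 16 --enable-dp-attention --dp 2 --nnodes 2 --node-rank 1 --dist-init-addr 10.0.0.1:5000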

@merrymercy (Contributor, Author)

Please share the command here and in the docs once you finish testing.

@binarycrayon (Contributor)

We should add a hyperparameter-tuning best-practices section to the documentation:

"""
You can first set --tp to the number of total GPUs you have, then tune --dp to trade off between latency and KV cache capacity (or throughput). For example, to achieve better latency for small bs, you can do --tp 8 --dp 2. To allow more KV cache capacity for larger bs, you can do --tp 8 --dp 8.
"""

@Wesley-Jzy commented Mar 13, 2025

I also tried to run it with the tp16 dp2 setting. I found that capturing the CUDA graph causes a segmentation fault. I can run it with --disable-cuda-graph or by updating NCCL. Also, for older sglang versions, a lower NCCL version is fine. Is it necessary to update NCCL for this version update?

@xihuai18 (Contributor)

[2025-03-14 14:50:57 DP0 TP4] Scheduler hit an exception: Traceback (most recent call last):
File "/path/to/sglang/python/sglang/srt/managers/scheduler.py", line 1748, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
File "/path/to/sglang/python/sglang/srt/managers/scheduler.py", line 230, in init
self.draft_worker = EAGLEWorker(
File "/path/to/sglang/python/sglang/srt/speculative/eagle_worker.py", line 102, in init
self.init_cuda_graphs()
File "/path/to/sglang/python/sglang/srt/speculative/eagle_worker.py", line 153, in init_cuda_graphs
self.cuda_graph_runner = EAGLEDraftCudaGraphRunner(self)
File "/path/to/sglang/python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 78, in init
self.capture()
File "/path/to/sglang/python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 99, in capture
CudaGraphRunner.capture(self)
File "/path/to/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 336, in capture
) = self.capture_one_batch_size(bs, forward)
File "/path/to/sglang/python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 162, in capture_one_batch_size
run_once()
File "/path/to/sglang/python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 152, in run_once
ret = self.eagle_worker.draft_forward(forward_batch)
File "/path/to/sglang/python/sglang/srt/speculative/eagle_worker.py", line 325, in draft_forward
logits_output = self.model_runner.model.forward(
File "/usr/local/conda/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/path/to/sglang/python/sglang/srt/models/deepseek_nextn.py", line 154, in forward
hidden_states = self.model(input_ids, positions, forward_batch)
File "/usr/local/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/path/to/sglang/python/sglang/srt/models/deepseek_nextn.py", line 105, in forward
hidden_states, residual = self.decoder(
File "/usr/local/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/path/to/sglang/python/sglang/srt/models/deepseek_v2.py", line 951, in forward
forward_batch.gathered_buffer[: forward_batch.input_ids.shape[0]],
TypeError: 'NoneType' object is not subscriptable

Not compatible with MTP; will it be supported in the future?

@xihuai18 (Contributor)

Please share the command here and in the docs once you finish testing.

--tp 16 --enable-dp-attention --dp 2 works for running on 16 x H100 (fp8) or 16 x A100 (int8).

The following options were tested but failed:

  • --enable-torch-compile: always OOM
  • --speculative-algo EAGLE --speculative-draft $NEXTN_PATH (MTP): not compatible

@Wesley-Jzy

I also hit OOM.

@jokerwyt (Contributor)

Do we still need self.chunked_prefill_size = self.chunked_prefill_size // self.dp_size now? @merrymercy
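
For reference, a minimal illustration (the numbers are made up, not from this PR) of what that line does: it splits the configured chunked prefill budget evenly across DP ranks.

chunked_prefill_size = 8192                         # illustrative value passed via --chunked-prefill-size
dp_size = 2
per_rank_chunk = chunked_prefill_size // dp_size    # each DP-rank scheduler then chunks prefill at 4096 tokens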
