feat: add dp attention support for Qwen 2/3 MoE models, fixes #6088 #6121

Merged: 7 commits from the fix-dp-for-qwen branch merged into sgl-project:main on May 16, 2025

Conversation

@Fr4nk1inCs (Contributor) commented May 8, 2025

This is a prerequisite for EP, which is introduced in PR #5917.

Motivation

As described in #6088, DP attention is not supported for Qwen MoE models, while #5917 introduces EP MoE for them. This PR adds DP attention support for these models.

Modifications

Following the DP attention implementation in deepseek_v2.py, I modified qwen2_moe.py and qwen3_moe.py accordingly.
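
For readers unfamiliar with the pattern: with DP attention, each DP rank runs attention only on its own slice of the batch, and hidden states are gathered across DP ranks before the tensor/expert-parallel MoE block and scattered back afterwards. The snippet below is only a conceptual, single-process sketch of that data flow (plain tensors, with concatenation/splitting standing in for the real gather/scatter collectives); it is not the actual sglang code.

import torch

# Per-DP-rank token shards after attention; one rank happens to hold no tokens.
dp_size, hidden = 4, 8
local_tokens = [torch.randn(n, hidden) for n in (3, 2, 4, 0)]

# "Gather" before the MoE block: the TP/EP MoE sees the full concatenated batch.
gathered = torch.cat(local_tokens, dim=0)        # shape [9, hidden]
moe_out = gathered * 2.0                         # stand-in for the sparse MoE block

# "Scatter" after the MoE block: each DP rank keeps only its own slice again.
splits = [t.shape[0] for t in local_tokens]      # [3, 2, 4, 0]
scattered = torch.split(moe_out, splits, dim=0)
assert [t.shape[0] for t in scattered] == splits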

Benchmark Results and Accuracy Results

The following benchmark results contain non-negligible measurement error; see #6121 (comment).

Using 4xA40 (PCIe).

Baseline: TP=4 before this PR (commit f1ff736).

Commands:

# bench_one_batch
python -m sglang.bench_one_batch \
    --model-path Qwen/Qwen3-30B-A3B \
    --mem-fraction-static 0.6 \
    --batch 64 \
    --input-len 256 \
    --output-len 32 \
    --tp-size 4 ...

# bench_offline_throughput
python -m sglang.bench_offline_throughput \
    --model-path Qwen/Qwen3-30B-A3B \
    --mem-fraction-static 0.6 \
    --num-prompts 16 \
    --tp-size 4 ...

# bench_serving server
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-30B-A3B \
    --trust-remote-code \
    --mem-fraction-static 0.6 \
    --max-prefill-tokens 2048 \
    --chunked-prefill-size 2048 \
    --disable-radix-attention \
    --tp-size 4

# bench_serving client
python3 -m sglang.bench_serving \
    --backend sglang \
    --dataset-name random \
    --dataset-path ShareGPT.json \
    --model Qwen3-30B-A3B \
    --host 127.0.0.1 \
    --random-range-ratio 1.0 \
    --num-prompts 512 \
    --random-input-len 512 \
    --random-output-len 256 \
    --max-concurrency 128
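
(Note: the trailing ... in the bench_one_batch and bench_offline_throughput commands above is elided in the original; it hides the per-configuration flags. For orientation only, the DP and EP variants are typically enabled through sglang's --enable-dp-attention, --dp-size, and --enable-ep-moe server arguments; the command below is a hypothetical reconstruction, not one quoted from this PR.)

# hypothetical: TP = DP = EP = 4 variant of bench_one_batch
python -m sglang.bench_one_batch \
    --model-path Qwen/Qwen3-30B-A3B \
    --mem-fraction-static 0.6 \
    --batch 64 \
    --input-len 256 \
    --output-len 32 \
    --tp-size 4 \
    --enable-dp-attention --dp-size 4 \
    --enable-ep-moe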
  • Benchmark Results (Qwen3-30B-A3B):
    • sglang.bench_one_batch

      Configuration          Prefill Throughput   Decode Throughput
      TP = 4 (baseline)      17397.76             2628.55
      TP = 4, No DP, No EP   17677.90             2487.72
      TP = DP = 4, No EP     4561.55              600.84
      TP = DP = EP = 4       2114.99              649.72
    • sglang.bench_offline_throughput

      Configuration          Duration (s)   Total Token Throughput
      TP = 4 (baseline)      13.60          833.01
      TP = 4, No DP, No EP   13.85          817.86
      TP = DP = 4, No EP     27.97          405.01
      TP = DP = EP = 4       31.49          359.66
    • sglang.bench_serving

      Configuration          Duration (s)   Median TTFT (ms)   Median ITL (ms)
      TP = 4 (baseline)      64.81          2678.38            45.51
      TP = 4, No DP, No EP   68.40          2979.11            45.00
      TP = DP = 4, No EP     75.51          3188.26            48.65
      TP = DP = EP = 4       93.75          4617.38            55.41
      Detailed results
      • TP = 4 (baseline)
        ============ Serving Benchmark Result ============
        Backend:                                 sglang
        Traffic request rate:                    inf
        Max request concurrency:                 128
        Successful requests:                     512
        Benchmark duration (s):                  64.81
        Total input tokens:                      262144
        Total generated tokens:                  131072
        Total generated tokens (retokenized):    131071
        Request throughput (req/s):              7.90
        Input token throughput (tok/s):          4044.97
        Output token throughput (tok/s):         2022.48
        Total token throughput (tok/s):          6067.45
        Concurrency:                             127.54
        ----------------End-to-End Latency----------------
        Mean E2E Latency (ms):                   16143.09
        Median E2E Latency (ms):                 16095.80
        ---------------Time to First Token----------------
        Mean TTFT (ms):                          2667.26
        Median TTFT (ms):                        2678.38
        P99 TTFT (ms):                           4914.29
        ---------------Inter-Token Latency----------------
        Mean ITL (ms):                           52.85
        Median ITL (ms):                         45.51
        P95 ITL (ms):                            55.12
        P99 ITL (ms):                            62.50
        Max ITL (ms):                            4268.96
        ==================================================
        
      • TP = 4, No DP, No EP
        ============ Serving Benchmark Result ============
        Backend:                                 sglang
        Traffic request rate:                    inf
        Max request concurrency:                 128
        Successful requests:                     512
        Benchmark duration (s):                  68.40
        Total input tokens:                      262144
        Total generated tokens:                  131072
        Total generated tokens (retokenized):    131071
        Request throughput (req/s):              7.48
        Input token throughput (tok/s):          3832.30
        Output token throughput (tok/s):         1916.15
        Total token throughput (tok/s):          5748.46
        Concurrency:                             127.54
        ----------------End-to-End Latency----------------
        Mean E2E Latency (ms):                   17038.84
        Median E2E Latency (ms):                 16078.40
        ---------------Time to First Token----------------
        Mean TTFT (ms):                          2994.61
        Median TTFT (ms):                        2979.11
        P99 TTFT (ms):                           6363.70
        ---------------Inter-Token Latency----------------
        Mean ITL (ms):                           55.08
        Median ITL (ms):                         45.00
        P95 ITL (ms):                            55.40
        P99 ITL (ms):                            61.72
        Max ITL (ms):                            7124.37
        ==================================================
        
      • TP = DP = 4, No EP
        ============ Serving Benchmark Result ============
        Backend:                                 sglang
        Traffic request rate:                    inf
        Max request concurrency:                 128
        Successful requests:                     512
        Benchmark duration (s):                  75.51
        Total input tokens:                      262144
        Total generated tokens:                  131072
        Total generated tokens (retokenized):    131070
        Request throughput (req/s):              6.78
        Input token throughput (tok/s):          3471.75
        Output token throughput (tok/s):         1735.88
        Total token throughput (tok/s):          5207.63
        Concurrency:                             127.70
        ----------------End-to-End Latency----------------
        Mean E2E Latency (ms):                   18833.12
        Median E2E Latency (ms):                 17459.89
        ---------------Time to First Token----------------
        Mean TTFT (ms):                          3275.09
        Median TTFT (ms):                        3188.26
        P99 TTFT (ms):                           8554.38
        ---------------Inter-Token Latency----------------
        Mean ITL (ms):                           61.01
        Median ITL (ms):                         48.65
        P95 ITL (ms):                            58.50
        P99 ITL (ms):                            64.15
        Max ITL (ms):                            8558.96
        ==================================================
        
      • TP = DP = EP = 4
        ============ Serving Benchmark Result ============
        Backend:                                 sglang
        Traffic request rate:                    inf
        Max request concurrency:                 128
        Successful requests:                     512
        Benchmark duration (s):                  93.75
        Total input tokens:                      262144
        Total generated tokens:                  131072
        Total generated tokens (retokenized):    131071
        Request throughput (req/s):              5.46
        Input token throughput (tok/s):          2796.26
        Output token throughput (tok/s):         1398.13
        Total token throughput (tok/s):          4194.39
        Concurrency:                             127.70
        ----------------End-to-End Latency----------------
        Mean E2E Latency (ms):                   23382.37
        Median E2E Latency (ms):                 22447.24
        ---------------Time to First Token----------------
        Mean TTFT (ms):                          4567.59
        Median TTFT (ms):                        4617.38
        P99 TTFT (ms):                           8424.25
        ---------------Inter-Token Latency----------------
        Mean ITL (ms):                           73.78
        Median ITL (ms):                         55.41
        P95 ITL (ms):                            64.97
        P99 ITL (ms):                            70.56
        Max ITL (ms):                            11509.58
        ==================================================
        
  • Accuracy Results (MMLU):
    • Qwen2-57B-A13B

      Configuration          Accuracy
      TP = 4 (baseline)      0.728
      TP = 4, No DP, No EP   0.730
      TP = DP = 4, No EP     0.730
      TP = DP = EP = 4       0.733
    • Qwen3-30B-A3B

      Configuration          Accuracy
      TP = 4 (baseline)      0.798
      TP = 4, No DP, No EP   0.797
      TP = DP = 4, No EP     0.797
      TP = DP = EP = 4       0.796

Checklist

@Fr4nk1inCs force-pushed the fix-dp-for-qwen branch 3 times, most recently from cde897f to 85d6d9e on May 8, 2025, 12:22
@Fr4nk1inCs marked this pull request as ready for review on May 8, 2025, 13:51
@yhyang201 (Contributor) commented:

Does it seem like the performance might have decreased?

@ch-wan self-assigned this on May 9, 2025
@yizhang2077 (Collaborator) commented:

I think the TP = 4 baseline and TP = 4, No DP, No EP are the same setting? Why does the decode throughput decline?

@Fr4nk1inCs (Contributor, Author) commented:

I will investigate the performance issue soon.

@Fr4nk1inCs (Contributor, Author) commented:

> I think the TP = 4 baseline and TP = 4, No DP, No EP are the same setting? Why does the decode throughput decline?

This might be due to measurement error. I've re-run bench_one_batch with batch size = 128 (repeated 8 times), input len = 512, and output len = 128 for TP = 4; here are the results.

  • Prefill Throughput (tokens/s)
    Before/After this PR   Min        Median      Avg         Max
    Before                 17680.70   17798.875   17796.655   17879.61
    After                  17739.24   17811.515   17811.445   17865.12
  • Decode Throughput (tokens/s)
    Before/After this PR   Min       Median     Avg         Max
    Before                 3405.03   3504.10    3489.61     3531.65
    After                  3392.91   3491.015   3498.1425   3627.14
Full result
Run   Prefill (Before)   Prefill (After)   Decode (Before)   Decode (After)
1     17879.61           17865.12          3531.65           3523.08
2     17855.86           17855.13          3405.03           3515.60
3     17746.00           17839.14          3511.97           3392.91
4     17818.82           17819.87          3430.00           3466.43
5     17795.70           17739.24          3496.23           3627.14
6     17794.50           17803.16          3531.02           3431.58
7     17802.05           17784.95          3522.43           3432.50
8     17680.70           17784.95          3488.55           3595.90

@Fr4nk1inCs (Contributor, Author) commented May 11, 2025

Re-ran the benchmarks with repetitions (8 runs each; results reported as avg ± std):

  • sglang.bench_one_batch

    Configuration          Prefill Throughput   Decode Throughput
    TP = 4 (baseline)      17778.80 ± 36.54     2530.26 ± 60.64
    TP = 4, No DP, No EP   17775.3825 ± 49.82   2555.96 ± 97.14
    TP = DP = 4, No EP     4561.97 ± 3.83       602.78 ± 73.29
    TP = DP = EP = 4       2115.43 ± 1.51       594.87 ± 68.65
  • sglang.bench_offline_throughput (radix cache disabled, one warm-up run)

    Configuration          Duration (s)   Total Token Throughput
    TP = 4 (baseline)      9.42 ± 0.05    1202.09 ± 6.78
    TP = 4, No DP, No EP   9.38 ± 0.07    1206.99 ± 8.39
    TP = DP = 4, No EP     11.02 ± 0.28   1028.37 ± 24.32
    TP = DP = EP = 4       15.48 ± 0.24   731.94 ± 10.99
  • sglang.bench_serving (two warm-up runs, benchmarked only 4 times to save time)

    Configuration          Duration (s)   Median TTFT (ms)   Median ITL (ms)
    TP = 4 (baseline)      64.72 ± 0.36   2563.87 ± 72.88    45.10 ± 0.09
    TP = 4, No DP, No EP   65.00 ± 0.32   2599.49 ± 46.41    45.08 ± 0.10
    TP = DP = 4, No EP     68.02 ± 0.22   2406.02 ± 51.81    48.66 ± 0.08
    TP = DP = EP = 4       88.76 ± 0.08   4243.19 ± 38.45    55.47 ± 0.03

At least this PR didn't introduce performance degradation for TP = 4 😇.

Raw result
tp_4_baseline = {
    "bench_one_batch": {
        "prefill": [17702.29, 17807.12, 17773.61, 17829.26, 17780.26, 17752.33, 17806.22, 17779.30],
        "decode": [2490.59, 2560.48, 2634.20, 2496.47, 2527.55, 2536.02, 2418.03, 2578.73],
    },
    "bench_offline": {
        "duration": [9.43, 9.44, 9.47, 9.45, 9.34, 9.32, 9.45, 9.47],
        "throughput": [1200.62, 1199.46, 1196.42, 1198.27, 1212.37, 1214.78, 1198.14, 1196.68],
    },
    "bench_serving": {
        "duration": [64.52, 65.13, 64.24, 65.01],
        "median-ttft": [2530.11, 2640.31, 2460.83, 2624.23],
        "median-itl": [45.12, 44.95, 45.17, 45.17],
    },
}

tp_4 = {
    "bench_one_batch": {
        "prefill": [17667.91, 17848.48, 17810.90, 17764.10, 17748.00, 17785.70, 17802.21, 17775.76],
        "decode": [2582.03, 2737.21, 2424.80, 2547.00, 2486.23, 2494.87, 2671.64, 2503.90],
    },
    "bench_offline": {
        "duration": [9.43, 9.34, 9.34, 9.48, 9.33, 9.39, 9.29, 9.47],
        "throughput": [1200.95, 1212.12, 1212.71, 1194.27, 1213.77, 1206.06, 1219.36, 1196.71],
    },
    "bench_serving": {
        "duration": [64.83, 64.66, 65.01, 65.51],
        "median-ttft": [2539.91, 2573.61, 2661.52, 2622.90],
        "median-itl": [45.11, 45.21, 44.93, 45.07],
    },
}

tp_4_dp = {
    "bench_one_batch": {
        "prefill": [4567.77, 4564.49, 4562.66, 4566.43, 4560.84, 4559.56, 4557.41, 4556.60],
        "decode": [593.52, 460.47, 749.52, 626.24, 616.48, 594.94, 590.84, 590.19],
    },
    "bench_offline": {
        "duration": [11.07, 11.73, 10.87, 10.96, 10.89, 10.90, 10.86, 10.89],
        "throughput": [1022.80, 966.05, 1041.73, 1033.83, 1040.47, 1038.91, 1042.65, 1040.52],
    },
    "bench_serving": {
        "duration": [68.01, 67.68, 68.10, 68.28],
        "median-ttft": [2437.41, 2386.20, 2332.08, 2468.41],
        "median-itl": [48.74, 48.53, 48.71, 48.67],
    },
}

tp_4_dp_ep = {
    "bench_one_batch": {
        "prefill": [2115.01, 2115.65, 2114.37, 2117.98, 2116.62, 2115.33, 2115.96, 2112.49],
        "decode": [499.52, 514.55, 659.83, 509.63, 656.42, 645.64, 656.87, 616.47],
    },
    "bench_offline": {
        "duration": [15.82, 15.34, 15.40, 15.34, 15.34, 15.95, 15.29, 15.35],
        "throughput": [716.06, 738.42, 735.38, 738.30, 738.62, 710.37, 740.57, 737.83],
    },
    "bench_serving": {
        "duration": [88.80, 88.87, 88.68, 88.70],
        "median-ttft": [4213.00, 4303.33, 4206.58, 4249.86],
        "median-itl": [55.45, 55.48, 55.43, 55.51],
    },
}
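
As a sanity check, the avg ± std figures in the tables above can be recomputed from these raw dicts with a few lines of Python. This is a minimal sketch (not part of the PR) that assumes the dicts above are defined in the same session; the reported spread appears to match the population standard deviation.

from statistics import mean, pstdev

def summarize(samples):
    # Format a list of measurements as "avg ± std" (population std, which
    # matches the numbers reported above up to rounding).
    return f"{mean(samples):.2f} ± {pstdev(samples):.2f}"

configs = {
    "TP = 4 (baseline)": tp_4_baseline,
    "TP = 4, No DP, No EP": tp_4,
    "TP = DP = 4, No EP": tp_4_dp,
    "TP = DP = EP = 4": tp_4_dp_ep,
}
for name, runs in configs.items():
    prefill = summarize(runs["bench_one_batch"]["prefill"])
    decode = summarize(runs["bench_one_batch"]["decode"])
    print(f"{name}: prefill {prefill}, decode {decode}")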

@yizhang2077 (Collaborator) commented May 14, 2025

I made several changes to keep the DP+TP logic the same as DeepSeek's, and resolved some conflicts with #5657.

mlp_only_layers = (
    [] if not hasattr(config, "mlp_only_layers") else config.mlp_only_layers
)
is_sparse = (layer_id not in mlp_only_layers) and (
    config.num_experts > 0 and (layer_id + 1) % config.decoder_sparse_step == 0
)
Review comment (Collaborator):

Why use layer_id + 1 instead of layer_id?

Reply (Collaborator):

I think layer_id starts from 0, so we need layer_id + 1. In practice this check is a no-op here, though: config.decoder_sparse_step is 1 and mlp_only_layers is an empty list, so is_sparse is always true.
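
To make that concrete, here is a standalone toy (hypothetical config values, not Qwen's defaults) showing which layers the condition marks as sparse; with the real defaults (decoder_sparse_step = 1, empty mlp_only_layers) every layer is sparse, as noted above.

from types import SimpleNamespace

# Hypothetical config: sparse MoE every 2nd layer, layer 0 forced to a dense MLP.
config = SimpleNamespace(num_experts=8, decoder_sparse_step=2, mlp_only_layers=[0])
mlp_only_layers = (
    [] if not hasattr(config, "mlp_only_layers") else config.mlp_only_layers
)
for layer_id in range(6):
    is_sparse = (layer_id not in mlp_only_layers) and (
        config.num_experts > 0 and (layer_id + 1) % config.decoder_sparse_step == 0
    )
    print(layer_id, is_sparse)  # layers 1, 3, 5 are sparse (layer_id is 0-based, hence the +1)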

    hidden_states, residual = self.post_attention_layernorm(
        hidden_states, residual
    )
elif hidden_states.shape[0] != 0:
Review comment (Collaborator):

If tp == 1, is there any case where the input is empty?

Reply (@yizhang2077, Collaborator, May 16, 2025):

I think there is no such case. We actually have a cleaner way to express this logic in #6316; here it just mirrors the current DeepSeek code.
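
For context on the shape check: with DP attention each DP rank holds only its own slice of the batch, so a rank can see zero local tokens in a given step; the fused layernorm is skipped on an empty tensor while the rest of the layer still runs so that collectives stay matched across ranks. Below is a minimal single-process sketch of that guard (hypothetical helper arguments, not the sglang implementation).

import torch

def post_attention_block(hidden_states, residual, post_attention_layernorm, mlp):
    # Skip the fused add+layernorm when this DP rank has no local tokens, but
    # still call the MLP/MoE block so any collectives inside it stay in sync.
    if hidden_states.shape[0] != 0:
        hidden_states, residual = post_attention_layernorm(hidden_states, residual)
    hidden_states = mlp(hidden_states)
    return hidden_states, residual

# Toy usage with plain callables standing in for the real modules.
layernorm = lambda h, r: (h + r, h + r)   # stand-in for fused residual-add + RMSNorm
mlp = lambda h: h * 2.0                   # stand-in for the (sparse) MoE block
empty = torch.empty(0, 8)                 # a DP rank with zero local tokens this step
out, res = post_attention_block(empty, empty.clone(), layernorm, mlp)
print(out.shape)  # torch.Size([0, 8]): the empty rank still flows through the layer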

@merrymercy merged commit 4bd2952 into sgl-project:main on May 16, 2025
55 of 58 checks passed
MiterV1 added a commit to MiterV1/sglang that referenced this pull request May 22, 2025

As described in sgl-project#6121, DP attention is supported for Qwen 2/3 MoE models, but the command-line string isn't updated accordingly, so fix it.

Signed-off-by: miter <[email protected]>
Layssy pushed a commit to Layssy/sglang-iaas that referenced this pull request Jun 9, 2025
xwu-intel pushed a commit to xwu-intel/sglang that referenced this pull request Jun 17, 2025
Labels: none yet. Projects: none yet. 7 participants.