feat: add dp attention support for Qwen 2/3 MoE models, fixes #6088 #6121

Merged: 7 commits from the fix-dp-for-qwen branch merged into sgl-project:main on May 16, 2025

Conversation

@Fr4nk1inCs (Contributor) commented May 8, 2025

This is a prerequisite for EP, which is introduced in PR #5917.

Motivation

As described in #6088, DP attention is not supported for Qwen MoE models, while #5917 introduces EP MoE for them. This PR adds DP attention support for these models.

Modifications

Following the DP attention implementation in deepseek_v2.py, I modified qwen2_moe.py and qwen3_moe.py accordingly.
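
For readers unfamiliar with the pattern: with DP attention, each DP rank runs attention only on its own slice of the batch, and hidden states are gathered across DP ranks before the tensor/expert-parallel MoE block and scattered back afterwards. The snippet below is only a conceptual, single-process sketch of that data flow (plain tensors, with concatenation/splitting standing in for the real gather/scatter collectives); it is not the actual sglang code.

import torch

# Per-DP-rank token shards after attention; one rank happens to hold no tokens.
dp_size, hidden = 4, 8
local_tokens = [torch.randn(n, hidden) for n in (3, 2, 4, 0)]

# "Gather" before the MoE block: the TP/EP MoE sees the full concatenated batch.
gathered = torch.cat(local_tokens, dim=0)        # shape [9, hidden]
moe_out = gathered * 2.0                         # stand-in for the sparse MoE block

# "Scatter" after the MoE block: each DP rank keeps only its own slice again.
splits = [t.shape[0] for t in local_tokens]      # [3, 2, 4, 0]
scattered = torch.split(moe_out, splits, dim=0)
assert [t.shape[0] for t in scattered] == splits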

Benchmark Results and Accuracy Results

The following benchmark results contain non-negligible measurement error; see #6121 (comment).

Using 4xA40 (PCIe).

Baseline: TP=4 before this PR (commit f1ff736).

Commands:

# bench_one_batch
python -m sglang.bench_one_batch \
    --model-path Qwen/Qwen3-30B-A3B \
    --mem-fraction-static 0.6 \
    --batch 64 \
    --input-len 256 \
    --output-len 32 \
    --tp-size 4 ...

# bench_offline_throughput
python -m sglang.bench_offline_throughput \
    --model-path Qwen/Qwen3-30B-A3B \
    --mem-fraction-static 0.6 \
    --num-prompts 16 \
    --tp-size 4 ...

# bench_serving server
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-30B-A3B \
    --trust-remote-code \
    --mem-fraction-static 0.6 \
    --max-prefill-tokens 2048 \
    --chunked-prefill-size 2048 \
    --disable-radix-attention \
    --tp-size 4

# bench_serving client
python3 -m sglang.bench_serving \
    --backend sglang \
    --dataset-name random \
    --dataset-path ShareGPT.json \
    --model Qwen3-30B-A3B \
    --host 127.0.0.1 \
    --random-range-ratio 1.0 \
    --num-prompts 512 \
    --random-input-len 512 \
    --random-output-len 256 \
    --max-concurrency 128
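
(Note: the trailing ... in the bench_one_batch and bench_offline_throughput commands above is elided in the original; it hides the per-configuration flags. For orientation only, the DP and EP variants are typically enabled through sglang's --enable-dp-attention, --dp-size, and --enable-ep-moe server arguments; the command below is a hypothetical reconstruction, not one quoted from this PR.)

# hypothetical: TP = DP = EP = 4 variant of bench_one_batch
python -m sglang.bench_one_batch \
    --model-path Qwen/Qwen3-30B-A3B \
    --mem-fraction-static 0.6 \
    --batch 64 \
    --input-len 256 \
    --output-len 32 \
    --tp-size 4 \
    --enable-dp-attention --dp-size 4 \
    --enable-ep-moe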
  • Benchmark Results (Qwen3-30B-A3B):
    • sglang.bench_one_batch

      Configuration          Prefill Throughput   Decode Throughput
      TP = 4 (baseline)      17397.76             2628.55
      TP = 4, No DP, No EP   17677.90             2487.72
      TP = DP = 4, No EP     4561.55              600.84
      TP = DP = EP = 4       2114.99              649.72
    • sglang.bench_offline_throughput

      Configuration          Duration (s)   Total Token Throughput
      TP = 4 (baseline)      13.60          833.01
      TP = 4, No DP, No EP   13.85          817.86
      TP = DP = 4, No EP     27.97          405.01
      TP = DP = EP = 4       31.49          359.66
    • sglang.bench_serving

      Configuration          Duration (s)   Median TTFT (ms)   Median ITL (ms)
      TP = 4 (baseline)      64.81          2678.38            45.51
      TP = 4, No DP, No EP   68.40          2979.11            45.00
      TP = DP = 4, No EP     75.51          3188.26            48.65
      TP = DP = EP = 4       93.75          4617.38            55.41
      Detailed results
      • TP = 4 (baseline)
        ============ Serving Benchmark Result ============
        Backend:                                 sglang
        Traffic request rate:                    inf
        Max request concurrency:                 128
        Successful requests:                     512
        Benchmark duration (s):                  64.81
        Total input tokens:                      262144
        Total generated tokens:                  131072
        Total generated tokens (retokenized):    131071
        Request throughput (req/s):              7.90
        Input token throughput (tok/s):          4044.97
        Output token throughput (tok/s):         2022.48
        Total token throughput (tok/s):          6067.45
        Concurrency:                             127.54
        ----------------End-to-End Latency----------------
        Mean E2E Latency (ms):                   16143.09
        Median E2E Latency (ms):                 16095.80
        ---------------Time to First Token----------------
        Mean TTFT (ms):                          2667.26
        Median TTFT (ms):                        2678.38
        P99 TTFT (ms):                           4914.29
        ---------------Inter-Token Latency----------------
        Mean ITL (ms):                           52.85
        Median ITL (ms):                         45.51
        P95 ITL (ms):                            55.12
        P99 ITL (ms):                            62.50
        Max ITL (ms):                            4268.96
        ==================================================
        
      • TP = 4, No DP, No EP
        ============ Serving Benchmark Result ============
        Backend:                                 sglang
        Traffic request rate:                    inf
        Max request concurrency:                 128
        Successful requests:                     512
        Benchmark duration (s):                  68.40
        Total input tokens:                      262144
        Total generated tokens:                  131072
        Total generated tokens (retokenized):    131071
        Request throughput (req/s):              7.48
        Input token throughput (tok/s):          3832.30
        Output token throughput (tok/s):         1916.15
        Total token throughput (tok/s):          5748.46
        Concurrency:                             127.54
        ----------------End-to-End Latency----------------
        Mean E2E Latency (ms):                   17038.84
        Median E2E Latency (ms):                 16078.40
        ---------------Time to First Token----------------
        Mean TTFT (ms):                          2994.61
        Median TTFT (ms):                        2979.11
        P99 TTFT (ms):                           6363.70
        ---------------Inter-Token Latency----------------
        Mean ITL (ms):                           55.08
        Median ITL (ms):                         45.00
        P95 ITL (ms):                            55.40
        P99 ITL (ms):                            61.72
        Max ITL (ms):                            7124.37
        ==================================================
        
      • TP = DP = 4, No EP
        ============ Serving Benchmark Result ============
        Backend:                                 sglang
        Traffic request rate:                    inf
        Max request concurrency:                 128
        Successful requests:                     512
        Benchmark duration (s):                  75.51
        Total input tokens:                      262144
        Total generated tokens:                  131072
        Total generated tokens (retokenized):    131070
        Request throughput (req/s):              6.78
        Input token throughput (tok/s):          3471.75
        Output token throughput (tok/s):         1735.88
        Total token throughput (tok/s):          5207.63
        Concurrency:                             127.70
        ----------------End-to-End Latency----------------
        Mean E2E Latency (ms):                   18833.12
        Median E2E Latency (ms):                 17459.89
        ---------------Time to First Token----------------
        Mean TTFT (ms):                          3275.09
        Median TTFT (ms):                        3188.26
        P99 TTFT (ms):                           8554.38
        ---------------Inter-Token Latency----------------
        Mean ITL (ms):                           61.01
        Median ITL (ms):                         48.65
        P95 ITL (ms):                            58.50
        P99 ITL (ms):                            64.15
        Max ITL (ms):                            8558.96
        ==================================================
        
      • TP = DP = EP = 4
        ============ Serving Benchmark Result ============
        Backend:                                 sglang
        Traffic request rate:                    inf
        Max request concurrency:                 128
        Successful requests:                     512
        Benchmark duration (s):                  93.75
        Total input tokens:                      262144
        Total generated tokens:                  131072
        Total generated tokens (retokenized):    131071
        Request throughput (req/s):              5.46
        Input token throughput (tok/s):          2796.26
        Output token throughput (tok/s):         1398.13
        Total token throughput (tok/s):          4194.39
        Concurrency:                             127.70
        ----------------End-to-End Latency----------------
        Mean E2E Latency (ms):                   23382.37
        Median E2E Latency (ms):                 22447.24
        ---------------Time to First Token----------------
        Mean TTFT (ms):                          4567.59
        Median TTFT (ms):                        4617.38
        P99 TTFT (ms):                           8424.25
        ---------------Inter-Token Latency----------------
        Mean ITL (ms):                           73.78
        Median ITL (ms):                         55.41
        P95 ITL (ms):                            64.97
        P99 ITL (ms):                            70.56
        Max ITL (ms):                            11509.58
        ==================================================
        
  • Accuracy Results (MMLU):
    • Qwen2-57B-A13B

      Configuration          Accuracy
      TP = 4 (baseline)      0.728
      TP = 4, No DP, No EP   0.730
      TP = DP = 4, No EP     0.730
      TP = DP = EP = 4       0.733
    • Qwen3-30B-A3B

      Configuration          Accuracy
      TP = 4 (baseline)      0.798
      TP = 4, No DP, No EP   0.797
      TP = DP = 4, No EP     0.797
      TP = DP = EP = 4       0.796

Checklist

@Fr4nk1inCs force-pushed the fix-dp-for-qwen branch 3 times, most recently from cde897f to 85d6d9e on May 8, 2025, 12:22
@Fr4nk1inCs marked this pull request as ready for review on May 8, 2025, 13:51
@yhyang201 (Contributor) commented:

Does it seem like the performance might have decreased?

@ch-wan self-assigned this on May 9, 2025
@yizhang2077 (Collaborator) commented:

I think the TP = 4 baseline and TP = 4, No DP, No EP are the same setting? Why does the decode throughput decline?

@Fr4nk1inCs (Contributor, Author) commented:

I will investigate the performance issue soon.

@Fr4nk1inCs (Contributor, Author) commented:

> I think the TP = 4 baseline and TP = 4, No DP, No EP are the same setting? Why does the decode throughput decline?

This might be due to measurement error. I've re-run bench_one_batch with batch size = 128 (repeated 8 times), input len = 512, and output len = 128 for TP = 4; here are the results.

  • Prefill Throughput (tokens/s)
    Before/After this PR   Min        Median      Avg         Max
    Before                 17680.70   17798.875   17796.655   17879.61
    After                  17739.24   17811.515   17811.445   17865.12
  • Decode Throughput (tokens/s)
    Before/After this PR   Min       Median     Avg         Max
    Before                 3405.03   3504.10    3489.61     3531.65
    After                  3392.91   3491.015   3498.1425   3627.14
Full result
Run   Prefill (Before)   Prefill (After)   Decode (Before)   Decode (After)
1     17879.61           17865.12          3531.65           3523.08
2     17855.86           17855.13          3405.03           3515.60
3     17746.00           17839.14          3511.97           3392.91
4     17818.82           17819.87          3430.00           3466.43
5     17795.70           17739.24          3496.23           3627.14
6     17794.50           17803.16          3531.02           3431.58
7     17802.05           17784.95          3522.43           3432.50
8     17680.70           17784.95          3488.55           3595.90

@Fr4nk1inCs (Contributor, Author) commented May 11, 2025

Re-ran the benchmarks with repetitions (8 runs each; results reported as avg ± std):

  • sglang.bench_one_batch

    Configuration          Prefill Throughput   Decode Throughput
    TP = 4 (baseline)      17778.80 ± 36.54     2530.26 ± 60.64
    TP = 4, No DP, No EP   17775.3825 ± 49.82   2555.96 ± 97.14
    TP = DP = 4, No EP     4561.97 ± 3.83       602.78 ± 73.29
    TP = DP = EP = 4       2115.43 ± 1.51       594.87 ± 68.65
  • sglang.bench_offline_throughput (radix cache disabled, one warm-up run)

    Configuration          Duration (s)   Total Token Throughput
    TP = 4 (baseline)      9.42 ± 0.05    1202.09 ± 6.78
    TP = 4, No DP, No EP   9.38 ± 0.07    1206.99 ± 8.39
    TP = DP = 4, No EP     11.02 ± 0.28   1028.37 ± 24.32
    TP = DP = EP = 4       15.48 ± 0.24   731.94 ± 10.99
  • sglang.bench_serving (two warm-up runs, benchmarked only 4 times to save time)

    Configuration          Duration (s)   Median TTFT (ms)   Median ITL (ms)
    TP = 4 (baseline)      64.72 ± 0.36   2563.87 ± 72.88    45.10 ± 0.09
    TP = 4, No DP, No EP   65.00 ± 0.32   2599.49 ± 46.41    45.08 ± 0.10
    TP = DP = 4, No EP     68.02 ± 0.22   2406.02 ± 51.81    48.66 ± 0.08
    TP = DP = EP = 4       88.76 ± 0.08   4243.19 ± 38.45    55.47 ± 0.03

At least this PR didn't introduce performance degradation for TP = 4 😇.

Raw result
tp_4_baseline = {
    "bench_one_batch": {
        "prefill": [17702.29, 17807.12, 17773.61, 17829.26, 17780.26, 17752.33, 17806.22, 17779.30],
        "decode": [2490.59, 2560.48, 2634.20, 2496.47, 2527.55, 2536.02, 2418.03, 2578.73],
    },
    "bench_offline": {
        "duration": [9.43, 9.44, 9.47, 9.45, 9.34, 9.32, 9.45, 9.47],
        "throughput": [1200.62, 1199.46, 1196.42, 1198.27, 1212.37, 1214.78, 1198.14, 1196.68],
    },
    "bench_serving": {
        "duration": [64.52, 65.13, 64.24, 65.01],
        "median-ttft": [2530.11, 2640.31, 2460.83, 2624.23],
        "median-itl": [45.12, 44.95, 45.17, 45.17],
    },
}

tp_4 = {
    "bench_one_batch": {
        "prefill": [17667.91, 17848.48, 17810.90, 17764.10, 17748.00, 17785.70, 17802.21, 17775.76],
        "decode": [2582.03, 2737.21, 2424.80, 2547.00, 2486.23, 2494.87, 2671.64, 2503.90],
    },
    "bench_offline": {
        "duration": [9.43, 9.34, 9.34, 9.48, 9.33, 9.39, 9.29, 9.47],
        "throughput": [1200.95, 1212.12, 1212.71, 1194.27, 1213.77, 1206.06, 1219.36, 1196.71],
    },
    "bench_serving": {
        "duration": [64.83, 64.66, 65.01, 65.51],
        "median-ttft": [2539.91, 2573.61, 2661.52, 2622.90],
        "median-itl": [45.11, 45.21, 44.93, 45.07],
    },
}

tp_4_dp = {
    "bench_one_batch": {
        "prefill": [4567.77, 4564.49, 4562.66, 4566.43, 4560.84, 4559.56, 4557.41, 4556.60],
        "decode": [593.52, 460.47, 749.52, 626.24, 616.48, 594.94, 590.84, 590.19],
    },
    "bench_offline": {
        "duration": [11.07, 11.73, 10.87, 10.96, 10.89, 10.90, 10.86, 10.89],
        "throughput": [1022.80, 966.05, 1041.73, 1033.83, 1040.47, 1038.91, 1042.65, 1040.52],
    },
    "bench_serving": {
        "duration": [68.01, 67.68, 68.10, 68.28],
        "median-ttft": [2437.41, 2386.20, 2332.08, 2468.41],
        "median-itl": [48.74, 48.53, 48.71, 48.67],
    },
}

tp_4_dp_ep = {
    "bench_one_batch": {
        "prefill": [2115.01, 2115.65, 2114.37, 2117.98, 2116.62, 2115.33, 2115.96, 2112.49],
        "decode": [499.52, 514.55, 659.83, 509.63, 656.42, 645.64, 656.87, 616.47],
    },
    "bench_offline": {
        "duration": [15.82, 15.34, 15.40, 15.34, 15.34, 15.95, 15.29, 15.35],
        "throughput": [716.06, 738.42, 735.38, 738.30, 738.62, 710.37, 740.57, 737.83],
    },
    "bench_serving": {
        "duration": [88.80, 88.87, 88.68, 88.70],
        "median-ttft": [4213.00, 4303.33, 4206.58, 4249.86],
        "median-itl": [55.45, 55.48, 55.43, 55.51],
    },
}
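
As a sanity check, the avg ± std figures in the tables above can be recomputed from these raw dicts with a few lines of Python. This is a minimal sketch (not part of the PR) that assumes the dicts above are defined in the same session; the reported spread appears to match the population standard deviation.

from statistics import mean, pstdev

def summarize(samples):
    # Format a list of measurements as "avg ± std" (population std, which
    # matches the numbers reported above up to rounding).
    return f"{mean(samples):.2f} ± {pstdev(samples):.2f}"

configs = {
    "TP = 4 (baseline)": tp_4_baseline,
    "TP = 4, No DP, No EP": tp_4,
    "TP = DP = 4, No EP": tp_4_dp,
    "TP = DP = EP = 4": tp_4_dp_ep,
}
for name, runs in configs.items():
    prefill = summarize(runs["bench_one_batch"]["prefill"])
    decode = summarize(runs["bench_one_batch"]["decode"])
    print(f"{name}: prefill {prefill}, decode {decode}")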

@yizhang2077 (Collaborator) commented May 14, 2025

I made several changes to keep the DP+TP logic the same as DeepSeek's, and resolved some conflicts with #5657.

mlp_only_layers = (
    [] if not hasattr(config, "mlp_only_layers") else config.mlp_only_layers
)
is_sparse = (layer_id not in mlp_only_layers) and (
    config.num_experts > 0 and (layer_id + 1) % config.decoder_sparse_step == 0
)
Review comment (Collaborator):

Why use layer_id + 1 instead of layer_id?

Reply (Collaborator):

I think layer_id starts from 0, so we need layer_id + 1. In practice this check is a no-op here, though: config.decoder_sparse_step is 1 and mlp_only_layers is an empty list, so is_sparse is always true.
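
To make that concrete, here is a standalone toy (hypothetical config values, not Qwen's defaults) showing which layers the condition marks as sparse; with the real defaults (decoder_sparse_step = 1, empty mlp_only_layers) every layer is sparse, as noted above.

from types import SimpleNamespace

# Hypothetical config: sparse MoE every 2nd layer, layer 0 forced to a dense MLP.
config = SimpleNamespace(num_experts=8, decoder_sparse_step=2, mlp_only_layers=[0])
mlp_only_layers = (
    [] if not hasattr(config, "mlp_only_layers") else config.mlp_only_layers
)
for layer_id in range(6):
    is_sparse = (layer_id not in mlp_only_layers) and (
        config.num_experts > 0 and (layer_id + 1) % config.decoder_sparse_step == 0
    )
    print(layer_id, is_sparse)  # layers 1, 3, 5 are sparse (layer_id is 0-based, hence the +1)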

    hidden_states, residual = self.post_attention_layernorm(
        hidden_states, residual
    )
elif hidden_states.shape[0] != 0:
Review comment (Collaborator):

If tp == 1, is there any case where the input is empty?

Reply (@yizhang2077, Collaborator, May 16, 2025):

I think there is no such case. We actually have a cleaner way to express this logic in #6316; here it just mirrors the current DeepSeek code.
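
For context on the shape check: with DP attention each DP rank holds only its own slice of the batch, so a rank can see zero local tokens in a given step; the fused layernorm is skipped on an empty tensor while the rest of the layer still runs so that collectives stay matched across ranks. Below is a minimal single-process sketch of that guard (hypothetical helper arguments, not the sglang implementation).

import torch

def post_attention_block(hidden_states, residual, post_attention_layernorm, mlp):
    # Skip the fused add+layernorm when this DP rank has no local tokens, but
    # still call the MLP/MoE block so any collectives inside it stay in sync.
    if hidden_states.shape[0] != 0:
        hidden_states, residual = post_attention_layernorm(hidden_states, residual)
    hidden_states = mlp(hidden_states)
    return hidden_states, residual

# Toy usage with plain callables standing in for the real modules.
layernorm = lambda h, r: (h + r, h + r)   # stand-in for fused residual-add + RMSNorm
mlp = lambda h: h * 2.0                   # stand-in for the (sparse) MoE block
empty = torch.empty(0, 8)                 # a DP rank with zero local tokens this step
out, res = post_attention_block(empty, empty.clone(), layernorm, mlp)
print(out.shape)  # torch.Size([0, 8]): the empty rank still flows through the layer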

@merrymercy merged commit 4bd2952 into sgl-project:main on May 16, 2025
55 of 58 checks passed
MiterV1 added a commit to MiterV1/sglang that referenced this pull request May 22, 2025

As described in sgl-project#6121, DP attention is supported for Qwen 2/3 MoE models, but the command-line string isn't updated accordingly, so fix it.

Signed-off-by: miter <[email protected]>
Layssy pushed a commit to Layssy/sglang-iaas that referenced this pull request Jun 9, 2025
xwu-intel pushed a commit to xwu-intel/sglang that referenced this pull request Jun 17, 2025
Labels: none yet. Projects: none yet. 7 participants.