feat: add dp attention support for Qwen 2/3 MoE models, fixes #6088 #6121
Conversation
Force-pushed from cde897f to 85d6d9e
Does it seem like the performance might have decreased?
I think so. I will investigate the performance issue soon.
This might be due to measurement error. I've re-run.
Full result:
Force-pushed from d5e26cc to 39365a2
Re-ran the benchmarks with repetitions (repeated 8 times, results represented as lists below).
At least this PR didn't introduce performance degradation for TP = 4 😇.
Raw result:
tp_4_baseline = {
"bench_one_batch": {
"prefill": [17702.29, 17807.12, 17773.61, 17829.26, 17780.26, 17752.33, 17806.22, 17779.30],
"decode": [2490.59, 2560.48, 2634.20, 2496.47, 2527.55, 2536.02, 2418.03, 2578.73],
},
"bench_offline": {
"duration": [9.43, 9.44, 9.47, 9.45, 9.34, 9.32, 9.45, 9.47],
"throughput": [1200.62, 1199.46, 1196.42, 1198.27, 1212.37, 1214.78, 1198.14, 1196.68],
},
"bench_serving": {
"duration": [64.52, 65.13, 64.24, 65.01],
"median-ttft": [2530.11, 2640.31, 2460.83, 2624.23],
"median-itl": [45.12, 44.95, 45.17, 45.17],
},
}
tp_4 = {
"bench_one_batch": {
"prefill": [17667.91, 17848.48, 17810.90, 17764.10, 17748.00, 17785.70, 17802.21, 17775.76],
"decode": [2582.03, 2737.21, 2424.80, 2547.00, 2486.23, 2494.87, 2671.64, 2503.90],
},
"bench_offline": {
"duration": [9.43, 9.34, 9.34, 9.48, 9.33, 9.39, 9.29, 9.47],
"throughput": [1200.95, 1212.12, 1212.71, 1194.27, 1213.77, 1206.06, 1219.36, 1196.71],
},
"bench_serving": {
"duration": [64.83, 64.66, 65.01, 65.51],
"median-ttft": [2539.91, 2573.61, 2661.52, 2622.90],
"median-itl": [45.11, 45.21, 44.93, 45.07],
},
}
tp_4_dp = {
"bench_one_batch": {
"prefill": [4567.77, 4564.49, 4562.66, 4566.43, 4560.84, 4559.56, 4557.41, 4556.60],
"decode": [593.52, 460.47, 749.52, 626.24, 616.48, 594.94, 590.84, 590.19],
},
"bench_offline": {
"duration": [11.07, 11.73, 10.87, 10.96, 10.89, 10.90, 10.86, 10.89],
"throughput": [1022.80, 966.05, 1041.73, 1033.83, 1040.47, 1038.91, 1042.65, 1040.52],
},
"bench_serving": {
"duration": [68.01, 67.68, 68.10, 68.28],
"median-ttft": [2437.41, 2386.20, 2332.08, 2468.41],
"median-itl": [48.74, 48.53, 48.71, 48.67],
},
}
tp_4_dp_ep = {
"bench_one_batch": {
"prefill": [2115.01, 2115.65, 2114.37, 2117.98, 2116.62, 2115.33, 2115.96, 2112.49],
"decode": [499.52, 514.55, 659.83, 509.63, 656.42, 645.64, 656.87, 616.47],
},
"bench_offline": {
"duration": [15.82, 15.34, 15.40, 15.34, 15.34, 15.95, 15.29, 15.35],
"throughput": [716.06, 738.42, 735.38, 738.30, 738.62, 710.37, 740.57, 737.83],
},
"bench_serving": {
"duration": [88.80, 88.87, 88.68, 88.70],
"median-ttft": [4213.00, 4303.33, 4206.58, 4249.86],
"median-itl": [55.45, 55.48, 55.43, 55.51],
},
}
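To judge whether the TP = 4 numbers actually regressed, the repeated runs above can be summarized as mean ± standard deviation; a minimal sketch using the decode throughputs pasted in `tp_4_baseline` and `tp_4` above (abbreviated copies, standard library only):

```python
from statistics import mean, stdev

# Abbreviated copies of the decode throughputs (tok/s) from the raw result above.
baseline_decode = [2490.59, 2560.48, 2634.20, 2496.47, 2527.55, 2536.02, 2418.03, 2578.73]
tp4_decode = [2582.03, 2737.21, 2424.80, 2547.00, 2486.23, 2494.87, 2671.64, 2503.90]

def summarize(name, xs):
    """Print and return the mean and sample standard deviation of one run set."""
    m, s = mean(xs), stdev(xs)
    print(f"{name}: {m:.1f} ± {s:.1f}")
    return m, s

m0, s0 = summarize("baseline decode", baseline_decode)
m1, s1 = summarize("pr decode", tp4_decode)

# The difference in means is smaller than the run-to-run spread,
# which supports the "measurement error" reading above.
print(f"delta {m1 - m0:+.1f} vs noise ~{max(s0, s1):.1f}")
```

The same summary can be applied to the prefill, offline-throughput, and serving lists to confirm the TP = 4 comparison holds across all three benchmarks.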
…ject#6088 This is the prerequisite of EP
Force-pushed from 39365a2 to 27fe522
I made several changes to keep the dp+tp logic the same as DeepSeek's, and resolved some conflicts with #5657.
Force-pushed from 588b5e8 to 1674d85
    [] if not hasattr(config, "mlp_only_layers") else config.mlp_only_layers
)
is_sparse = (layer_id not in mlp_only_layers) and (
    config.num_experts > 0 and (layer_id + 1) % config.decoder_sparse_step == 0
Why use `layer_id + 1` instead of `layer_id`?
I think `layer_id` starts from 0, so we need `layer_id + 1`. But this check is actually a no-op here: `config.decoder_sparse_step` is 1 and `mlp_only_layers` is an empty list, so `is_sparse` is always true.
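The reviewer's point can be checked directly. A self-contained sketch of the `is_sparse` condition from the diff above, with the values mentioned in the reply (`decoder_sparse_step == 1`, empty `mlp_only_layers`); the `FakeConfig` class here is a stand-in for illustration, not the real model config:

```python
# Stand-in config mirroring the values discussed in the review thread.
class FakeConfig:
    mlp_only_layers = []      # no dense-only layers configured
    num_experts = 60          # > 0, so MoE layers are enabled
    decoder_sparse_step = 1   # every layer is a sparse (MoE) layer

config = FakeConfig()
mlp_only_layers = (
    [] if not hasattr(config, "mlp_only_layers") else config.mlp_only_layers
)

def is_sparse(layer_id: int) -> bool:
    # layer_id is 0-based, hence the +1: "every Nth layer" should count
    # layers 1..N rather than 0..N-1.
    return (layer_id not in mlp_only_layers) and (
        config.num_experts > 0
        and (layer_id + 1) % config.decoder_sparse_step == 0
    )

# With step == 1 and an empty mlp_only_layers, every layer is sparse,
# confirming the condition is always true for this config.
print(all(is_sparse(i) for i in range(28)))
```

With a step of, say, 2, the `+1` would make layers 1, 3, 5, … (0-based) sparse, which matches the "every Nth decoder layer" reading of `decoder_sparse_step`.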
    hidden_states, residual = self.post_attention_layernorm(
        hidden_states, residual
    )
elif hidden_states.shape[0] != 0:
If tp == 1, is there any case where the input is empty?
I think there is no such case. Actually, we have a cleaner way to express this logic in #6316; here it is just kept the same as the current DeepSeek code.
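The guard in the diff exists because, under DP attention, an individual DP rank's local batch can contain zero tokens, so normalization must be skipped for it. A schematic sketch with pure-Python stand-ins (plain lists instead of the real torch tensors, and a toy centering function in place of the real layernorm):

```python
def layernorm(xs):
    # Toy stand-in for post_attention_layernorm: just mean-center.
    m = sum(xs) / len(xs)
    return [x - m for x in xs]

def maybe_norm(hidden_states):
    # Mirrors the `hidden_states.shape[0] != 0` guard: with DP attention,
    # a rank can receive an empty local batch, and there is nothing to do.
    if len(hidden_states) != 0:
        return layernorm(hidden_states)
    return hidden_states

print(maybe_norm([]))          # empty local batch passes through untouched
print(maybe_norm([1.0, 3.0]))  # non-empty batch is normalized
```

With tp == 1 and DP attention disabled, the batch is never split, so the empty case indeed cannot arise, as the reply above notes.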
…els. As described in sgl-project#6121, DP attention is supported for Qwen 2/3 MoE models. But the command-line string isn't updated accordingly. So fix it. Signed-off-by: miter <[email protected]>
…ject#6088 (sgl-project#6121) Co-authored-by: King.Zevin <[email protected]> Co-authored-by: Yi Zhang <[email protected]>
This is a prerequisite of EP, which is introduced in PR #5917.
Motivation
As described in #6088, DP attention is not supported for Qwen MoE models, but #5917 introduces EP MoE for them. This PR introduces DP attention support for them.
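The high-level idea of DP attention, as used for these models, is that each DP rank runs attention over its own slice of the requests, and the per-rank tokens are then gathered so the MoE/MLP layers see the full batch. A hedged, schematic sketch of that split/gather flow (plain Python, not sglang's actual API; function names here are illustrative):

```python
def split_batch(requests, dp_size):
    """Round-robin the requests across dp_size attention ranks."""
    shards = [[] for _ in range(dp_size)]
    for i, req in enumerate(requests):
        shards[i % dp_size].append(req)
    return shards

def gather(shards):
    """Concatenate per-rank outputs back into one batch for the MoE layers."""
    return [tok for shard in shards for tok in shard]

# Ten requests spread over 4 attention ranks; some ranks get 3, some 2.
requests = list(range(10))
shards = split_batch(requests, dp_size=4)
full_batch = gather(shards)
print([len(s) for s in shards])  # per-rank batch sizes
```

This also explains the empty-batch guard discussed in the review thread: with few in-flight requests, some DP ranks can legitimately receive an empty shard.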
Modifications
Following the DP Attention part in deepseek_v2.py, I modified qwen2_moe.py and qwen3_moe.py.
Benchmark Results and Accuracy Results
Using 4xA40 (PCIe).
Baseline: TP=4 before this PR (commit f1ff736).
Commands:
sglang.bench_one_batch
sglang.bench_offline_throughput
sglang.bench_serving
Detailed results
Qwen2-57B-A13B
Qwen3-30B-A3B
Checklist