Add pipeline parallelism for Qwen2 and Qwen3 Model #6250

Merged · 14 commits · May 18, 2025

Conversation

libratiger (Contributor) commented May 13, 2025

Motivation

Modifications

Checklist

libratiger marked this pull request as draft May 13, 2025 00:42
libratiger changed the title from "[Draft] Add pipeline parallelism for Qwen Model" to "Add pipeline parallelism for Qwen2 and Qwen3 Model" May 14, 2025
libratiger marked this pull request as ready for review May 14, 2025 07:41
libratiger (Contributor, Author) commented May 14, 2025

After changing the model to Qwen/Qwen3-8B, here are the results for the pipeline-parallelism test cases.

python3 -m unittest test_bench_serving.TestBenchServing.test_pp_offline_throughput_default_decode
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     16        
Benchmark duration (s):                  25.72     
Total input tokens:                      16        
Total generated tokens:                  7130      
Total generated tokens (retokenized):    7120      
Request throughput (req/s):              0.62      
Input token throughput (tok/s):          0.62      
Output token throughput (tok/s):         277.20    
Total token throughput (tok/s):          277.82    
Concurrency:                             9.21      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   14812.54  
Median E2E Latency (ms):                 14950.47  
---------------Time to First Token----------------
Mean TTFT (ms):                          70.29     
Median TTFT (ms):                        71.59     
P99 TTFT (ms):                           74.12     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           33.16     
Median ITL (ms):                         30.75     
P95 ITL (ms):                            43.64     
P99 ITL (ms):                            45.70     
Max ITL (ms):                            794.43    
==================================================

# without quant and random_input_len=40960
python3 -m unittest test_bench_serving.TestBenchServing.test_pp_long_context_prefill
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     4         
Benchmark duration (s):                  3.56      
Total input tokens:                      66101     
Total generated tokens:                  4         
Total generated tokens (retokenized):    4         
Request throughput (req/s):              1.12      
Input token throughput (tok/s):          18578.15  
Output token throughput (tok/s):         1.12      
Total token throughput (tok/s):          18579.27  
Concurrency:                             1.35      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   1198.87   
Median E2E Latency (ms):                 522.53    
---------------Time to First Token----------------
Mean TTFT (ms):                          1198.85   
Median TTFT (ms):                        522.52    
P99 TTFT (ms):                           3468.37   
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================
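As a sanity check, the aggregate figures in the reports above follow directly from the request counts and the benchmark duration. A minimal sketch using the numbers from the first run (values copied from the report; small differences come only from rounding of the printed duration):

```python
# Sanity-check the aggregate throughput figures from the first benchmark run.
# All inputs are copied verbatim from the report above.
successful_requests = 16
duration_s = 25.72            # "Benchmark duration (s)", rounded in the report
total_input_tokens = 16
total_generated_tokens = 7130

request_throughput = successful_requests / duration_s            # req/s
output_token_throughput = total_generated_tokens / duration_s    # tok/s
total_token_throughput = (total_input_tokens + total_generated_tokens) / duration_s

print(f"{request_throughput:.2f} req/s")        # ~0.62, matching the report
print(f"{output_token_throughput:.1f} tok/s")   # ~277, matching the report
```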

libratiger (Contributor, Author) commented May 14, 2025

@Ying1123 This PR should be quick to review, thanks!

libratiger (Contributor, Author) commented:

Also tested with the Qwen3-30B-A3B model:

python3 -m sglang.launch_server --model Qwen/Qwen3-30B-A3B --pp 2
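With `--pp 2`, pipeline parallelism splits the model's decoder layers into contiguous blocks, one per pipeline stage. The helper below is an illustrative sketch of an even layer partition, not sglang's actual implementation; the function name and the "remainder goes to the earliest stages" policy are assumptions for illustration:

```python
def partition_layers(num_layers: int, pp_size: int, pp_rank: int) -> range:
    """Return the contiguous block of layer indices owned by one pipeline stage.

    Illustrative even split: each stage gets num_layers // pp_size layers,
    and any remainder is assigned to the earliest stages.
    """
    base, rem = divmod(num_layers, pp_size)
    start = pp_rank * base + min(pp_rank, rem)
    end = start + base + (1 if pp_rank < rem else 0)
    return range(start, end)

# Qwen3-30B-A3B has 48 decoder layers, so --pp 2 gives 24 layers per stage.
print(partition_layers(48, 2, 0))  # range(0, 24)
print(partition_layers(48, 2, 1))  # range(24, 48)
```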

libratiger (Contributor, Author) commented:

This is a follow-up improvement to the pipeline-parallelism implementation; we want to verify the Qwen models under PP.

#5724

libratiger (Contributor, Author) commented:

Ping @zhyncs, @merrymercy: please take a look at this small PR when you have time.

Ying1123 (Member) left a comment:

Hi @libratiger, thanks for the PR. Could you also add an accuracy test for PP on these models? Also, could you resolve the conflicts and pass the CI tests?

libratiger (Contributor, Author) commented:

> Hi @libratiger, thanks for the PR. Could you also add an accuracy test for PP on these models? Also, could you resolve the conflicts and pass the CI tests?

I fixed the conflicts and added a new accuracy test case as suggested. Here are the results:

# Qwen/Qwen3-8B
python3 -m unittest test_pp_single_node.TestQwenPPAccuracy.test_pp_consistency
[Qwen PP Comparison] Baseline: {'accuracy': np.float64(0.95), 'latency': 11.482683465001173, 'output_throughput': 2063.977472882257} | PP: {'accuracy': np.float64(0.95), 'latency': 13.306963007024024, 'output_throughput': 1774.4845302069284}

# Qwen/Qwen3-30B-A3B
python3 -m unittest test_pp_single_node.TestQwenPPAccuracy.test_pp_consistency
[Qwen PP Comparison] Baseline: {'accuracy': np.float64(0.93), 'latency': 20.492150178004522, 'output_throughput': 1125.6017450407985} | PP: {'accuracy': np.float64(0.925), 'latency': 22.040848318021744, 'output_throughput': 1047.3281094686959}
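The consistency check behind these comparisons boils down to asserting that PP accuracy stays within a small tolerance of the baseline. A hypothetical sketch of such an assertion (the helper name and the 0.01 threshold are assumptions for illustration, not the actual test code):

```python
def check_pp_consistency(baseline: dict, pp: dict, tol: float = 0.01) -> bool:
    """Pass if the PP accuracy is within `tol` of the baseline accuracy."""
    return abs(baseline["accuracy"] - pp["accuracy"]) <= tol

# Numbers from the Qwen3-30B-A3B run above: 0.93 (baseline) vs 0.925 (PP).
baseline = {"accuracy": 0.93, "latency": 20.49, "output_throughput": 1125.60}
pp = {"accuracy": 0.925, "latency": 22.04, "output_throughput": 1047.33}
print(check_pp_consistency(baseline, pp))  # True: accuracies differ by only 0.005
```

Latency and throughput are reported for information only; PP is expected to trade some per-request latency for the ability to shard the model across devices.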

libratiger (Contributor, Author) commented:

In the previous CI run, I noticed that the only failing tests were flaky timeouts on the DeepSeek-V3 model.

zhaochenyang20 (Collaborator) commented:

@libratiger Great work! Let me rerun the CI and review it. No need to rebase on your own unless we ask. Thanks!

@zhyncs zhyncs merged commit 11553c1 into sgl-project:main May 18, 2025
70 of 79 checks passed
Layssy pushed a commit to Layssy/sglang-iaas that referenced this pull request Jun 9, 2025
xwu-intel pushed a commit to xwu-intel/sglang that referenced this pull request Jun 17, 2025
4 participants