Add pipeline parallelism for Qwen2 and Qwen3 Model #6250

Merged · 14 commits · May 18, 2025

Conversation

libratiger (Contributor) commented May 13, 2025

Motivation

Modifications

Checklist

libratiger marked this pull request as draft May 13, 2025 00:42
libratiger changed the title from "[Draft] Add pipeline parallelism for Qwen Model" to "Add pipeline parallelism for Qwen2 and Qwen3 Model" May 14, 2025
libratiger marked this pull request as ready for review May 14, 2025 07:41
libratiger (Contributor, Author) commented May 14, 2025

After changing the model to Qwen/Qwen3-8B, here are the results for the pipeline-parallelism test cases.

python3 -m unittest test_bench_serving.TestBenchServing.test_pp_offline_throughput_default_decode
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     16        
Benchmark duration (s):                  25.72     
Total input tokens:                      16        
Total generated tokens:                  7130      
Total generated tokens (retokenized):    7120      
Request throughput (req/s):              0.62      
Input token throughput (tok/s):          0.62      
Output token throughput (tok/s):         277.20    
Total token throughput (tok/s):          277.82    
Concurrency:                             9.21      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   14812.54  
Median E2E Latency (ms):                 14950.47  
---------------Time to First Token----------------
Mean TTFT (ms):                          70.29     
Median TTFT (ms):                        71.59     
P99 TTFT (ms):                           74.12     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           33.16     
Median ITL (ms):                         30.75     
P95 ITL (ms):                            43.64     
P99 ITL (ms):                            45.70     
Max ITL (ms):                            794.43    
==================================================

# without quant and random_input_len=40960
python3 -m unittest test_bench_serving.TestBenchServing.test_pp_long_context_prefill
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     4         
Benchmark duration (s):                  3.56      
Total input tokens:                      66101     
Total generated tokens:                  4         
Total generated tokens (retokenized):    4         
Request throughput (req/s):              1.12      
Input token throughput (tok/s):          18578.15  
Output token throughput (tok/s):         1.12      
Total token throughput (tok/s):          18579.27  
Concurrency:                             1.35      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   1198.87   
Median E2E Latency (ms):                 522.53    
---------------Time to First Token----------------
Mean TTFT (ms):                          1198.85   
Median TTFT (ms):                        522.52    
P99 TTFT (ms):                           3468.37   
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================
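As a sanity check, the aggregate figures in the reports above follow directly from the request counts and the benchmark duration. A minimal sketch using the numbers from the first run (values copied from the report; small differences come only from rounding of the printed duration):

```python
# Sanity-check the aggregate throughput figures from the first benchmark run.
# All inputs are copied verbatim from the report above.
successful_requests = 16
duration_s = 25.72            # "Benchmark duration (s)", rounded in the report
total_input_tokens = 16
total_generated_tokens = 7130

request_throughput = successful_requests / duration_s            # req/s
output_token_throughput = total_generated_tokens / duration_s    # tok/s
total_token_throughput = (total_input_tokens + total_generated_tokens) / duration_s

print(f"{request_throughput:.2f} req/s")        # ~0.62, matching the report
print(f"{output_token_throughput:.1f} tok/s")   # ~277, matching the report
```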

libratiger (Contributor, Author) commented May 14, 2025

@Ying1123 This PR should be quick to review, thanks!

libratiger (Contributor, Author) commented:

Also tested with the Qwen3-30B-A3B model:

python3 -m sglang.launch_server --model Qwen/Qwen3-30B-A3B --pp 2
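With `--pp 2`, pipeline parallelism splits the model's decoder layers into contiguous blocks, one per pipeline stage. The helper below is an illustrative sketch of an even layer partition, not sglang's actual implementation; the function name and the "remainder goes to the earliest stages" policy are assumptions for illustration:

```python
def partition_layers(num_layers: int, pp_size: int, pp_rank: int) -> range:
    """Return the contiguous block of layer indices owned by one pipeline stage.

    Illustrative even split: each stage gets num_layers // pp_size layers,
    and any remainder is assigned to the earliest stages.
    """
    base, rem = divmod(num_layers, pp_size)
    start = pp_rank * base + min(pp_rank, rem)
    end = start + base + (1 if pp_rank < rem else 0)
    return range(start, end)

# Qwen3-30B-A3B has 48 decoder layers, so --pp 2 gives 24 layers per stage.
print(partition_layers(48, 2, 0))  # range(0, 24)
print(partition_layers(48, 2, 1))  # range(24, 48)
```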

libratiger (Contributor, Author) commented:

This is a follow-up improvement to the pipeline-parallelism implementation; we want to verify the Qwen models under PP.

#5724

libratiger (Contributor, Author) commented:

Ping @zhyncs, @merrymercy: please take a look at this small PR when you have time.

Ying1123 (Member) left a comment:

Hi @libratiger, thanks for the PR. Could you also add an accuracy test for PP on these models? Also, could you resolve the conflicts and pass the CI tests?

libratiger (Contributor, Author) commented:

> Hi @libratiger, thanks for the PR. Could you also add an accuracy test for PP on these models? Also, could you resolve the conflicts and pass the CI tests?

I fixed the conflicts and added a new accuracy test case as suggested. Here are the results:

# Qwen/Qwen3-8B
python3 -m unittest test_pp_single_node.TestQwenPPAccuracy.test_pp_consistency
[Qwen PP Comparison] Baseline: {'accuracy': np.float64(0.95), 'latency': 11.482683465001173, 'output_throughput': 2063.977472882257} | PP: {'accuracy': np.float64(0.95), 'latency': 13.306963007024024, 'output_throughput': 1774.4845302069284}

# Qwen/Qwen3-30B-A3B
python3 -m unittest test_pp_single_node.TestQwenPPAccuracy.test_pp_consistency
[Qwen PP Comparison] Baseline: {'accuracy': np.float64(0.93), 'latency': 20.492150178004522, 'output_throughput': 1125.6017450407985} | PP: {'accuracy': np.float64(0.925), 'latency': 22.040848318021744, 'output_throughput': 1047.3281094686959}
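The consistency check behind these comparisons boils down to asserting that PP accuracy stays within a small tolerance of the baseline. A hypothetical sketch of such an assertion (the helper name and the 0.01 threshold are assumptions for illustration, not the actual test code):

```python
def check_pp_consistency(baseline: dict, pp: dict, tol: float = 0.01) -> bool:
    """Pass if the PP accuracy is within `tol` of the baseline accuracy."""
    return abs(baseline["accuracy"] - pp["accuracy"]) <= tol

# Numbers from the Qwen3-30B-A3B run above: 0.93 (baseline) vs 0.925 (PP).
baseline = {"accuracy": 0.93, "latency": 20.49, "output_throughput": 1125.60}
pp = {"accuracy": 0.925, "latency": 22.04, "output_throughput": 1047.33}
print(check_pp_consistency(baseline, pp))  # True: accuracies differ by only 0.005
```

Latency and throughput are reported for information only; PP is expected to trade some per-request latency for the ability to shard the model across devices.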

libratiger (Contributor, Author) commented:

In the previous CI run, I noticed that the only failing tests were flaky timeouts on the DeepSeek-V3 model.

zhaochenyang20 (Collaborator) commented:

@libratiger Great work! Let me rerun the CI and review it. No need to rebase on your own unless we ask. Thanks!

@zhyncs zhyncs merged commit 11553c1 into sgl-project:main May 18, 2025
70 of 79 checks passed
Layssy pushed a commit to Layssy/sglang-iaas that referenced this pull request Jun 9, 2025
xwu-intel pushed a commit to xwu-intel/sglang that referenced this pull request Jun 17, 2025
4 participants