
Tiny refactor DeepSeek V3/R1 NextN shared experts fusion #5143


Conversation

lambert0312
Contributor

@lambert0312 lambert0312 commented Apr 8, 2025

Motivation

Ref #4918
Ref #5707
Ref #5793

Modifications

  • Extract the shared helper method compute_shared_experts_fusion_weights and place it in deepseek_v2.py for now.
  • Add the necessary unit tests.
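The extracted helper is not shown in this thread, so here is a minimal sketch of what a shared-experts-fusion weight helper could look like. All names and shapes are assumptions for illustration, not the actual sglang code: the idea is that the shared expert's projection weights are replicated and appended after the routed experts' stacked weights, so the fused MoE kernel can treat the shared expert as extra routed experts.

```python
# Hypothetical sketch of a shared-experts-fusion weight helper; the function
# name matches the PR, but the signature and shapes are assumptions.
import torch

def compute_shared_experts_fusion_weights(
    routed_weight: torch.Tensor,    # [num_routed_experts, out_dim, in_dim]
    shared_weight: torch.Tensor,    # [out_dim, in_dim], one shared expert
    num_fused_shared_experts: int,  # replicas to append (e.g. TP world size)
) -> torch.Tensor:
    """Append replicated shared-expert weights after the routed experts."""
    # expand() creates replica views without copying; torch.cat materializes
    # them into one contiguous stacked weight tensor for the fused kernel.
    replicas = shared_weight.unsqueeze(0).expand(
        num_fused_shared_experts, *shared_weight.shape
    )
    return torch.cat([routed_weight, replicas], dim=0)
```

Under this sketch, a checkpoint with 256 routed experts and 1 shared expert fused 2 ways would yield a stacked weight of 258 "experts", with the last two entries identical to the shared expert.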

Accuracy on A800

python3 benchmark/gsm8k/bench_sglang.py --num-questions 200 --parallel 128 --num-shots 8 

Accuracy: 0.960
Invalid: 0.000
Latency: 14.804 s
Output throughput: 1451.247 token/s

Benchmark on A800

# qps 16
python3 -m sglang.bench_serving --backend sglang --num-prompts 200 --dataset-name random --max-concurrency 16 --random-input 256 --random-output 256 --seed 42

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 16
Successful requests:                     200
Benchmark duration (s):                  57.65
Total input tokens:                      26096
Total generated tokens:                  26874
Total generated tokens (retokenized):    26763
Request throughput (req/s):              3.47
Input token throughput (tok/s):          452.70
Output token throughput (tok/s):         466.20
Total token throughput (tok/s):          918.90
Concurrency:                             15.77
Accept length:                           2.60
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   4546.43
Median E2E Latency (ms):                 4602.09
---------------Time to First Token----------------
Mean TTFT (ms):                          207.83
Median TTFT (ms):                        174.89
P99 TTFT (ms):                           476.63
---------------Inter-Token Latency----------------
Mean ITL (ms):                           32.54
Median ITL (ms):                         19.18
P95 ITL (ms):                            90.16
P99 ITL (ms):                            168.08
Max ITL (ms):                            389.73
==================================================
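The headline throughput figures above follow directly from the raw counts and the benchmark duration; the small residual differences come from the tool using a more precise duration than the two decimals printed. A quick cross-check:

```python
# Cross-check the reported throughput numbers from the raw counts above.
duration_s = 57.65        # Benchmark duration (s)
num_requests = 200        # Successful requests
input_tokens = 26096      # Total input tokens
output_tokens = 26874     # Total generated tokens

req_throughput = num_requests / duration_s            # ~3.47 req/s
in_tok_throughput = input_tokens / duration_s         # ~452.7 tok/s
out_tok_throughput = output_tokens / duration_s       # ~466.2 tok/s
total_throughput = (input_tokens + output_tokens) / duration_s  # ~918.9 tok/s
```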

Checklist

@xihuai18
Contributor

xihuai18 commented Apr 8, 2025

Will fused shared experts still improve performance with NextN?

@lambert0312
Contributor Author

Will fused shared experts still improve performance with NextN?

Yes; I'm still running experiments to measure the effect.

@merrymercy
Contributor

Can you add a test case?

@lambert0312
Contributor Author

lambert0312 commented Apr 21, 2025

Can you add a test case?

OK, I will add it.

@fzyzcjy
Collaborator

fzyzcjy commented Apr 21, 2025

Maybe my PR can be merged first, to make the commit history a bit clearer.

@lambert0312
Contributor Author

Maybe my PR can be merged first, to make the commit history a bit clearer.

Yes, I'm waiting for it to be merged @fzyzcjy

@lambert0312 lambert0312 force-pushed the support_nextn_shared_experts_fusion branch from 668c67c to 5769b91 on April 21, 2025 at 09:40
@xihuai18
Contributor

xihuai18 commented May 7, 2025

any update in this PR?

@lambert0312
Contributor Author

any update in this PR?

No further changes; it can be merged in. @xihuai18

@zhyncs
Member

zhyncs commented Jun 9, 2025

@BBuf @fzyzcjy

@lambert0312 lambert0312 force-pushed the support_nextn_shared_experts_fusion branch from 64e6df1 to 1ee3b6a on June 9, 2025 at 11:49
@lambert0312 lambert0312 force-pushed the support_nextn_shared_experts_fusion branch from 682653d to 6351425 on June 9, 2025 at 11:58

6 participants