Support Page Size > 1 (when top k = 1) for FA3 Spec Decode #5168

hebiao064
Collaborator

@hebiao064 hebiao064 commented Apr 8, 2025

Co-authored with @yubofredwang

Motivation

Before this change, the output was garbled when page size > 1:

python /home/jobuser/sglang/python/sglang/test/send_one.py
 Below is thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst,

After this change, the output is correct:

python /home/jobuser/sglang/python/sglang/test/send_one.py
 Below is an example of a fully functional FastAPI server. This example includes a simple API with a few endpoints to demonstrate how to create, read, update, and delete (CRUD) items in a list.

[Rest of the output omitted]

Benchmark

Server Script:

python3 -m sglang.launch_server --model /shared/public/elr-models/meta-llama/Meta-Llama-3.1-8B-Instruct/07eb05b21d191a58c577b4a45982fe0c049d0693 --speculative-algorithm EAGLE3 --speculative-draft-model-path /shared/public/elr-models/jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B/e5ed08d66f528a95ce89f5d4fd136a28f6def714 --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --mem-fraction 0.6 --dtype float16 --trust-remote-code --attention-backend fa3 --page-size 4
Benchmark Script:

python3 /home/jobuser/sglang/benchmark/mtbench/bench_sglang_eagle.py --num-questions 80 --parallel 4
Result:
page_size = 4, parallel = 1
#questions: 80, Throughput: 167.04 token/s, Acceptance length: 2.91

page_size = 1, parallel = 1
#questions: 80, Throughput: 159.48 token/s, Acceptance length: 2.71

page_size=4, parallel = 4
#questions: 80, Throughput: 1061.05 token/s, Acceptance length: 2.88

page_size = 1, parallel = 4
#questions: 80, Throughput: 1039.26 token/s, Acceptance length: 2.71

This benchmark shows that with page_size == 4 the acceptance length is at least as good as with page_size == 1, and throughput is slightly higher. Similar results hold for page_size == 8.
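To put "slightly higher" in numbers, the relative throughput change can be computed directly from the figures reported above. This is only a small arithmetic sketch; the helper name `relative_gain` is illustrative and not part of SGLang.

```python
def relative_gain(new: float, baseline: float) -> float:
    """Return the relative change of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Throughput in token/s, copied from the results above:
# parallel -> (page_size = 4, page_size = 1)
results = {
    1: (167.04, 159.48),
    4: (1061.05, 1039.26),
}

for parallel, (page4, page1) in results.items():
    gain = relative_gain(page4, page1)
    print(f"parallel={parallel}: {gain:+.2f}% throughput with page_size=4")
```

On these numbers the gain is roughly 4.7% at parallel = 1 and roughly 2.1% at parallel = 4.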

DeepSeek-V3 + DeepSeek-V3-NextN

| Iterations | GSM Latency on H200 | Send One on H200 |
| --- | --- | --- |
| Page Size 1 | 0.960, 550.669 token/s | acc_length=2.64, speed=67.50 token/s |
| Page Size 4 | 0.960, 567.589 token/s | acc_length=2.67, speed=67.21 token/s |
| Page Size 64 | 0.960, 549.955 token/s | acc_length=2.65, speed=64.43 token/s |

Modifications

Checklist

@hebiao064 hebiao064 mentioned this pull request Apr 8, 2025
@hebiao064 hebiao064 marked this pull request as draft April 9, 2025 04:42
@hebiao064
Collaborator Author

Converted to draft for now; this needs a double check.

@hebiao064 hebiao064 marked this pull request as ready for review April 10, 2025 00:13
@hebiao064
Collaborator Author

> converted to draft for now, need some double check

Ready for review again.

Previously we had some doubt about whether this would work, but after some reflection we are more confident that it works for top k = 1, since all the tokens in the page table are consecutive.

For top k > 1, this needs to be updated, but that will be the next PR.
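The reasoning in the comment above can be illustrated with a small sketch. It assumes that KV-cache slots are allocated page by page, so that within a request every `page_size`-th slot starts a new page and the slots inside a page are consecutive. The function name `token_slots_to_page_table` and the plain-list representation are hypothetical and chosen for illustration; this is not the code in this PR.

```python
def token_slots_to_page_table(token_slots, page_size):
    """Convert per-token KV-cache slot indices into page indices.

    Assumption (hypothetical layout): slots are allocated page by page,
    so position i of a request lives at offset i % page_size inside
    page number i // page_size of that request's page table.
    """
    # Take the first slot of every page and map it to its page index.
    page_table = [slot // page_size for slot in token_slots[::page_size]]

    # Check the "consecutive tokens" assumption that makes this valid.
    for pos, slot in enumerate(token_slots):
        expected = page_table[pos // page_size] * page_size + pos % page_size
        if slot != expected:
            raise ValueError(
                f"slot {slot} at position {pos} is not consecutive within its page"
            )
    return page_table


# A request of 10 tokens stored in pages 7, 2 and 5 (page_size = 4).
slots = [28, 29, 30, 31, 8, 9, 10, 11, 20, 21]
print(token_slots_to_page_table(slots, page_size=4))
```

With top k = 1 the draft tokens form a single chain appended to the end of the sequence, so they keep filling the last page in order and the check above still passes. With top k > 1 the draft tokens form a tree, which is why the comment defers that case to a later PR.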

@hebiao064 hebiao064 requested a review from hnyls2002 as a code owner April 12, 2025 00:11
@hebiao064 hebiao064 changed the title from "Support Page Size > 1 for FA3 Spec Decode" to "Support Page Size > 1 (when top k = 1) for FA3 Spec Decode" Apr 12, 2025
@yubofredwang
Contributor

Adding benchmark results for this PR:
Server Script:

python3 -m sglang.launch_server --model /shared/public/elr-models/meta-llama/Meta-Llama-3.1-8B-Instruct/07eb05b21d191a58c577b4a45982fe0c049d0693  --speculative-algorithm EAGLE3     --speculative-draft-model-path /shared/public/elr-models/jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B/e5ed08d66f528a95ce89f5d4fd136a28f6def714 --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4  --mem-fraction 0.6  --dtype float16 --trust-remote-code --attention-backend fa3 --page-size 4

Benchmark Script:

python3 /home/jobuser/sglang/benchmark/mtbench/bench_sglang_eagle.py --num-questions 80 --parallel 4

Result:
page_size = 4, parallel = 1
#questions: 80, Throughput: 167.04 token/s, Acceptance length: 2.91

page_size = 1, parallel = 1
#questions: 80, Throughput: 159.48 token/s, Acceptance length: 2.71

page_size=4, parallel = 4
#questions: 80, Throughput: 1061.05 token/s, Acceptance length: 2.88

page_size = 1, parallel = 4
#questions: 80, Throughput: 1039.26 token/s, Acceptance length: 2.71

This benchmark shows that with page_size == 4 the acceptance length is at least as good as with page_size == 1, and throughput is slightly higher. Similar results hold for page_size == 8.

@hebiao064
Collaborator Author

Closing this, as the change will be merged via #5318 and #5509.

@hebiao064 hebiao064 closed this Apr 19, 2025