Support Page Size > 1 (when top k = 1) for FA3 Spec Decode #5168
Converted to draft for now; this needs a double check.
Ready for review again. Previously we had some doubt about whether this would work, but after some reflection we are more confident that it works for top k = 1, since all the tokens in the page table are consecutive. For top k > 1, this needs to be updated, but that will be a follow-up PR.
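To illustrate why consecutive tokens matter here, the following is a minimal sketch of how per-token KV slots collapse into per-page entries when page size > 1. It is not the SGLang or FA3 implementation; the helper names `token_slots_to_page_table` and `slot_for_position` and the slot layout are assumptions made for illustration only.

```python
from typing import List


def token_slots_to_page_table(token_slots: List[int], page_size: int) -> List[int]:
    """Collapse per-token KV slot indices into per-page indices.

    Hypothetical helper: assumes each page starts at a slot that is a multiple
    of page_size and that the slots inside a page are consecutive.
    """
    pages = []
    for start in range(0, len(token_slots), page_size):
        chunk = token_slots[start:start + page_size]
        first = chunk[0]
        if first % page_size != 0:
            raise ValueError(f"slot {first} is not aligned to page_size={page_size}")
        if chunk != list(range(first, first + len(chunk))):
            raise ValueError(f"slots {chunk} are not consecutive within a page")
        pages.append(first // page_size)
    return pages


def slot_for_position(page_table: List[int], page_size: int, pos: int) -> int:
    """Recover the KV slot of the token at sequence position pos."""
    return page_table[pos // page_size] * page_size + pos % page_size


# With top k = 1 the draft tokens form a single chain, so they extend the
# sequence at consecutive positions and fit the per-page layout.
slots = [8, 9, 10, 11, 20, 21]  # 6 tokens, page_size = 4
table = token_slots_to_page_table(slots, page_size=4)
recovered = [slot_for_position(table, 4, pos) for pos in range(len(slots))]
```

In this sketch a per-page table can only describe the cache when the slots within each page are consecutive, which is the property the top k = 1 case relies on. A tree of draft tokens (top k > 1) would not generally satisfy it, which is why that case is left for a later PR.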
Co-authored-by: Yubo Wang
Co-authored with @yubofredwang
Motivation
Before this change, the output was garbled for page size > 1.
After the change, the output is correct:
python /home/jobuser/sglang/python/sglang/test/send_one.py
Below is an example of a fully functional FastAPI server. This example includes a simple API with a few endpoints to demonstrate how to create, read, update, and delete (CRUD) items in a list. [Ignored the rest]
Benchmark
Server Script:
python3 -m sglang.launch_server --model /shared/public/elr-models/meta-llama/Meta-Llama-3.1-8B-Instruct/07eb05b21d191a58c577b4a45982fe0c049d0693 --speculative-algorithm EAGLE3 --speculative-draft-model-path /shared/public/elr-models/jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B/e5ed08d66f528a95ce89f5d4fd136a28f6def714 --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --mem-fraction 0.6 --dtype float16 --trust-remote-code --attention-backend fa3 --page-size 4
Benchmark Script:
python3 /home/jobuser/sglang/benchmark/mtbench/bench_sglang_eagle.py --num-questions 80 --parallel 4
Result:
page_size = 4, parallel = 1
#questions: 80, Throughput: 167.04 token/s, Acceptance length: 2.91
page_size = 1, parallel = 1
#questions: 80, Throughput: 159.48 token/s, Acceptance length: 2.71
page_size = 4, parallel = 4
#questions: 80, Throughput: 1061.05 token/s, Acceptance length: 2.88
page_size = 1, parallel = 4
#questions: 80, Throughput: 1039.26 token/s, Acceptance length: 2.71
This benchmark shows that with page_size == 4 the acceptance length is at least as good as with page_size == 1, and throughput is slightly higher. Similar results hold for page_size == 8.
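For a quick read of how large the difference is, the relative gains can be computed directly from the numbers reported above. This is only a small arithmetic sketch; the dictionary layout and the `relative_gain` helper are illustrative and not part of the benchmark script.

```python
# (page_size, parallel) -> (throughput in token/s, acceptance length),
# copied from the results reported in this PR.
results = {
    (4, 1): (167.04, 2.91),
    (1, 1): (159.48, 2.71),
    (4, 4): (1061.05, 2.88),
    (1, 4): (1039.26, 2.71),
}


def relative_gain(new: float, base: float) -> float:
    """Percentage change of new relative to base."""
    return (new - base) / base * 100.0


gains = {}
for parallel in (1, 4):
    tput_4, accept_4 = results[(4, parallel)]
    tput_1, accept_1 = results[(1, parallel)]
    gains[parallel] = (
        relative_gain(tput_4, tput_1),      # throughput gain, percent
        relative_gain(accept_4, accept_1),  # acceptance length gain, percent
    )
    print(
        f"parallel={parallel}: throughput {gains[parallel][0]:+.2f}%, "
        f"acceptance length {gains[parallel][1]:+.2f}%"
    )
```

On these numbers, page_size = 4 gives roughly 4.7% more throughput at parallel = 1 and roughly 2.1% more at parallel = 4, from a single run per configuration.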
DeepSeek-V3 + DeepSeek-V3-NextN
Modifications
Checklist