Support Page Size > 1 (when top k = 1) for FA3 Spec Decode #5168

hebiao064 · 2025-04-08T22:09:38Z

Motivation

Support Page Size > 1 for FA3 Spec Decode
Addressed some leftover comments in PR: Refactor and Optimize FA3 Code #5090

Before this change, the output was garbled for page size > 1:

python /home/jobuser/sglang/python/sglang/test/send_one.py
 Below is thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst, thefirst,

After this change, the output is correct:

python /home/jobuser/sglang/python/sglang/test/send_one.py
 Below is an example of a fully functional FastAPI server. This example includes a simple API with a few endpoints to demonstrate how to create, read, update, and delete (CRUD) items in a list.

[Ignored the rest]

Benchmark

Server Script:

python3 -m sglang.launch_server --model /shared/public/elr-models/meta-llama/Meta-Llama-3.1-8B-Instruct/07eb05b21d191a58c577b4a45982fe0c049d0693 --speculative-algorithm EAGLE3 --speculative-draft-model-path /shared/public/elr-models/jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B/e5ed08d66f528a95ce89f5d4fd136a28f6def714 --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --mem-fraction 0.6 --dtype float16 --trust-remote-code --attention-backend fa3 --page-size 4

Benchmark Script:

python3 /home/jobuser/sglang/benchmark/mtbench/bench_sglang_eagle.py --num-questions 80 --parallel 4

Result:

page_size = 4, parallel = 1
#questions: 80, Throughput: 167.04 token/s, Acceptance length: 2.91

page_size = 1, parallel = 1
#questions: 80, Throughput: 159.48 token/s, Acceptance length: 2.71

page_size = 4, parallel = 4
#questions: 80, Throughput: 1061.05 token/s, Acceptance length: 2.88

page_size = 1, parallel = 4
#questions: 80, Throughput: 1039.26 token/s, Acceptance length: 2.71

This benchmark shows that with page_size == 4 the acceptance length is at least as good as with page_size == 1 (2.88 to 2.91 vs. 2.71) and throughput is slightly higher. Similar results hold for page_size == 8.
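
To put "slightly higher" into numbers, the relative throughput gain can be computed from the figures reported above. This is only a small arithmetic sketch; the dictionary layout and the `relative_gain` helper are illustrative names, not part of SGLang or the benchmark script.

```python
# Throughput figures (token/s) copied from the results above,
# keyed by the "parallel" setting of the benchmark run.
results = {
    1: {"page_size_1": 159.48, "page_size_4": 167.04},
    4: {"page_size_1": 1039.26, "page_size_4": 1061.05},
}


def relative_gain(baseline, candidate):
    """Return the percentage change of candidate over baseline."""
    return (candidate - baseline) / baseline * 100.0


# Gain of page_size = 4 over page_size = 1 for each parallel setting.
gains = {
    parallel: relative_gain(r["page_size_1"], r["page_size_4"])
    for parallel, r in results.items()
}

for parallel, gain in gains.items():
    print(f"parallel = {parallel}: page_size 4 is {gain:.1f}% faster than page_size 1")
```

With these numbers the gain is roughly 4.7% at parallel = 1 and roughly 2.1% at parallel = 4, so the difference is small and may be within run-to-run noise.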

DeepSeek-V3 + DeepSeek-V3-NextN

Iterations	GSM Latency on H200	Send One on H200
Page Size 1	0.960, 550.669 token/s	acc_length=2.64, speed=67.50 token/s
Page Size 4	0.960, 567.589 token/s	acc_length=2.67, speed=67.21 token/s
Page Size 64	0.960, 549.955 token/s	acc_length=2.65, speed=64.43 token/s


python/sglang/srt/layers/attention/flashattention_backend.py

hebiao064 · 2025-04-09T04:43:14Z

Converted to draft for now; this needs a double check.

hebiao064 · 2025-04-10T00:15:23Z

converted to draft for now, need some double check

Ready for review again.

Previously we had some doubt about whether this would work, but after some reflection we are more confident that it works for top k = 1, since all the tokens in the page table are consecutive.

For top k > 1, this needs to be updated, but that will come in the next PR.
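
The consecutiveness argument can be illustrated with a toy example. The sketch below is not the code in flashattention_backend.py; `to_page_indices` and `expand_pages` are hypothetical helpers, and the assumption is that a per-token slot table is collapsed to per-page indices by sampling every page_size-th entry and dividing by the page size. That collapse is lossless only when the slots inside each page are consecutive.

```python
def to_page_indices(token_slots, page_size):
    """Collapse per-token slots into per-page indices by sampling every page_size-th slot."""
    return [slot // page_size for slot in token_slots[::page_size]]


def expand_pages(page_indices, page_size, num_tokens):
    """Rebuild per-token slots, assuming tokens fill each page consecutively."""
    slots = [p * page_size + off for p in page_indices for off in range(page_size)]
    return slots[:num_tokens]


page_size = 4

# Six tokens stored consecutively: page 2 (slots 8..11), then page 5 (slots 20, 21).
consecutive = [8, 9, 10, 11, 20, 21]
pages = to_page_indices(consecutive, page_size)
round_trip_ok = expand_pages(pages, page_size, len(consecutive)) == consecutive

# Hypothetical non-consecutive layout: the third and fourth tokens sit in another page.
scattered = [8, 9, 14, 15, 20, 21]
scattered_ok = (
    expand_pages(to_page_indices(scattered, page_size), page_size, len(scattered))
    == scattered
)

print(pages, round_trip_ok, scattered_ok)
```

In the consecutive case the page-level view describes exactly the same slots as the token-level view. In the scattered case the collapse silently drops slots 14 and 15, which is the kind of mismatch that would need separate handling if token slots were not consecutive.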

yubofredwang · 2025-04-15T22:40:00Z

Adding benchmark results for this PR:
Server Script:

python3 -m sglang.launch_server --model /shared/public/elr-models/meta-llama/Meta-Llama-3.1-8B-Instruct/07eb05b21d191a58c577b4a45982fe0c049d0693 --speculative-algorithm EAGLE3 --speculative-draft-model-path /shared/public/elr-models/jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B/e5ed08d66f528a95ce89f5d4fd136a28f6def714 --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --mem-fraction 0.6 --dtype float16 --trust-remote-code --attention-backend fa3 --page-size 4

Benchmark Script:

python3 /home/jobuser/sglang/benchmark/mtbench/bench_sglang_eagle.py --num-questions 80 --parallel 4

Result:
page_size = 4, parallel = 1
#questions: 80, Throughput: 167.04 token/s, Acceptance length: 2.91

page_size = 1, parallel = 1
#questions: 80, Throughput: 159.48 token/s, Acceptance length: 2.71

page_size = 4, parallel = 4
#questions: 80, Throughput: 1061.05 token/s, Acceptance length: 2.88

page_size = 1, parallel = 4
#questions: 80, Throughput: 1039.26 token/s, Acceptance length: 2.71

This benchmark shows that with page_size == 4 the acceptance length is at least as good as with page_size == 1 (2.88 to 2.91 vs. 2.71) and throughput is slightly higher. Similar results hold for page_size == 8.

Co-authored-by: Yubo Wang <[email protected]>

hebiao064 · 2025-04-19T05:46:33Z

Closing this, as the change will be merged via #5318 and #5509.

Support Page Size > 1 for FA3 Spec Decode

6b1d2b8

hebiao064 mentioned this pull request Apr 8, 2025

Refactor and Optimize FA3 Code #5090

Merged

6 tasks

hebiao064 added 2 commits April 8, 2025 22:19

address leftover comment

39633c7

Merge branch 'main' into support_page_size_greater_than_one_spec_deco…

5c8fc5c

…de_fa3

hebiao064 marked this pull request as ready for review April 8, 2025 22:40

hebiao064 requested review from merrymercy, Ying1123, zhyncs, ispobock and HaiShaw as code owners April 8, 2025 22:41

hebiao064 mentioned this pull request Apr 8, 2025

[Roadmap] FlashAttention3 Support as SGLang Attention Backend #4709

Closed

15 tasks

zcnrex reviewed Apr 8, 2025

Merge branch 'main' into support_page_size_greater_than_one_spec_deco…

ca0faee

…de_fa3

hebiao064 marked this pull request as draft April 9, 2025 04:42

hebiao064 marked this pull request as ready for review April 10, 2025 00:13

Merge branch 'main' into support_page_size_greater_than_one_spec_deco…

94258f7

…de_fa3

qingquansong approved these changes Apr 10, 2025

qingquansong and others added 3 commits April 9, 2025 17:26

Merge branch 'main' into support_page_size_greater_than_one_spec_deco…

83fb54d

…de_fa3

Merge branch 'main' into support_page_size_greater_than_one_spec_deco…

a8619b5

…de_fa3

fix

53c41b2

hebiao064 requested a review from hnyls2002 as a code owner April 12, 2025 00:11

hebiao064 changed the title ~~Support Page Size > 1 for FA3 Spec Decode~~ Support Page Size > 1 (when top k = 1) for FA3 Spec Decode Apr 12, 2025

Merge branch 'main' into support_page_size_greater_than_one_spec_deco…

4174e3f

…de_fa3

hebiao064 added 4 commits April 15, 2025 17:56

Merge branch 'main' into support_page_size_greater_than_one_spec_deco…

36bcfe5

…de_fa3

Merge branch 'main' into support_page_size_greater_than_one_spec_deco…

88531b1

…de_fa3

Merge branch 'main' into support_page_size_greater_than_one_spec_deco…

9ce1314

…de_fa3

Merge branch 'main' into support_page_size_greater_than_one_spec_deco…

753c475

…de_fa3

support_page_size_greater_than_one_spec_decode_fa3

3576c80

Co-authored-by: Yubo Wang <[email protected]>

hebiao064 closed this Apr 19, 2025
