[Fix] fix FlashMLA cudagraph config #4591


Closed · wants to merge 1 commit

Conversation

yinfan98 (Collaborator)

Motivation

Fix the FlashMLA CUDA graph configuration.

Below are some benchmark results for FlashMLA decode, compared against FlashInfer:

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --trust-remote-code --tp 8 --attention-backend flashinfer --page-size 64 --enable-flashmla
python3 -m sglang.bench_one_batch_server --model None --base-url http://localhost:30000 --batch-size 1 2 4 8 16 32 --input-len 256 --output-len 256
  • FlashMLA

| batch size | latency (s) | output throughput (token/s) | (input + output) throughput (token/s) |
| --- | --- | --- | --- |
| 1 | 8.38 | 30.56 | 61.13 |
| 2 | 10.36 | 49.43 | 98.86 |
| 4 | 10.17 | 100.70 | 201.39 |
| 8 | 11.70 | 175.04 | 350.08 |
| 16 | 12.70 | 322.57 | 645.14 |
| 32 | 15.25 | 537.16 | 1074.32 |

  • FlashInfer

| batch size | latency (s) | output throughput (token/s) | (input + output) throughput (token/s) |
| --- | --- | --- | --- |
| 1 | 8.82 | 29.02 | 58.04 |
| 2 | 10.17 | 50.37 | 100.74 |
| 4 | 9.94 | 103.06 | 206.12 |
| 8 | 12.07 | 169.72 | 339.44 |
| 16 | 12.87 | 318.35 | 636.70 |
| 32 | 15.26 | 536.66 | 1073.32 |

With longer sequences, the performance advantages of FlashMLA become more significant.
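
For illustration, the same benchmark can be rerun with longer prompts to observe this; the `--input-len` below is an arbitrary example value, not a setting tested in this PR:

```bash
python3 -m sglang.bench_one_batch_server --model None --base-url http://localhost:30000 --batch-size 1 2 4 8 16 32 --input-len 4096 --output-len 256
```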

Modifications

Checklist

Co-authored-by: sleepcoo <[email protected]>
yinfan98 (Collaborator, Author) commented Mar 19, 2025

Avoid the CPU-GPU synchronization introduced by PR #4514. cc @zhyncs @merrymercy
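
For context on why this matters: CUDA graph decode replays a fixed set of kernels on fixed device buffers, so per-step state must be updated on-device rather than through host round-trips. Below is a minimal, hedged sketch of that pattern in plain PyTorch; the function names and buffer layout are illustrative assumptions, not sglang's actual implementation.

```python
import torch

@torch.inference_mode()
def capture_decode_graph(model, max_bs, hidden):
    # Static input buffer: CUDA graph replay reuses fixed device addresses,
    # so inputs are overwritten in place, never re-allocated per step.
    static_in = torch.zeros(max_bs, hidden, device="cuda")

    # Warm up on a side stream before capture, as PyTorch requires.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        static_out = model(static_in)
    torch.cuda.current_stream().wait_stream(s)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_out = model(static_in)  # output buffer captured by the graph
    return graph, static_in, static_out

def replay_decode(graph, static_in, static_out, step_input):
    # Device-to-device copy only: no .item()/.cpu()/.tolist() on the hot
    # path, so the host never blocks waiting for the GPU. A real system
    # would pad step_input up to the captured batch size.
    static_in[: step_input.shape[0]].copy_(step_input)
    graph.replay()
    return static_out
```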

zhyncs (Member) commented Mar 20, 2025

ref #4577

tianchongchong commented:

Hi @yinfan98, which GPU were the above test results run on? And are there any results comparing against the Triton backend?
